Ishan Ojha

Data Scientist & Engineer

MS Data Science candidate at Arizona State University, specializing in machine learning, data pipelines, and advanced analytics. Building data-driven solutions that transform raw information into actionable insights.

Experience

Data Engineer Intern

Rogers Communications

Jan 2023 - Sep 2023

Toronto, Ontario

  • Maintained and extended AWS-based streaming pipelines during Rogers' merger-driven source system consolidation, reducing pipeline latency by 15% and failure rate by 33% across 5M+ subscriber records ingested from Kafka streams, REST APIs, and relational systems into Redshift.
  • Engineered 20+ behavioral features (session depth, feature adoption velocity, inactivity decay) using complex multi-table SQL transformations and stored procedures on raw transaction streams, producing analysis-ready datasets consumed by downstream fraud detection scoring workflows.
  • Built Python batch feature pipelines with idempotent logic, automated failure recovery, and run-level monitoring integrated with CloudWatch alerting for reliable scheduled execution of daily fraud risk scoring workflows.

Data Scientist Intern

Loblaw Companies Limited

May 2022 - Jan 2023

Brampton, Ontario

  • Designed and analyzed customer segmentation across 2.8M+ transactions using K-Means and hierarchical clustering (silhouette scores 0.38–0.44), identifying 5 behaviorally distinct segments by region and promotion responsiveness, driving an 11% lift in campaign response rate and 14% improvement in conversion.
  • Built regression and tree-based models to estimate promotional demand lift, achieving R² of 0.62 on holdout data under 5-fold cross-validation, and surfaced the key feature drivers shaping campaign performance.
  • Conducted statistical analysis including hypothesis testing, confidence intervals, and regression on promotional interaction data to quantify impact across customer segments; results directly informed model updates and deployment to AWS SageMaker testing environment.
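The segmentation workflow above (K-Means with silhouette scoring to pick k) can be sketched as follows. This uses synthetic blobs rather than the actual transaction features, so the scores are illustrative only:

```python
# Hedged sketch: choose k for K-Means by silhouette score, as in the
# segmentation bullet above. Synthetic data, not Loblaw's transactions.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=1000, centers=5, cluster_std=0.6, random_state=42)
X = StandardScaler().fit_transform(X)   # scale features before clustering

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # closer to 1 = tighter clusters

best_k = max(scores, key=scores.get)
```

In practice the reported silhouette range (0.38–0.44) reflects real, overlapping customer behavior rather than clean synthetic blobs.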

Data Analyst Intern

HomeStars

Jun 2021 - Jan 2022

Toronto, Ontario

  • Designed and analyzed an A/B test on business onboarding flow variants across 13K businesses, measuring end-to-end signup conversion rate and identifying a statistically significant 9% lift in the winning variant.
  • Built Tableau dashboards and reports surfacing geographic distribution and business type composition across provinces, directly informing regional marketing strategies for business acquisition.
  • Standardized and cleaned tens of thousands of inconsistent business signup records, resolving duplicate entries, missing fields, and formatting inconsistencies to produce reliable datasets for downstream analysis.
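The significance test behind the A/B result above is a standard two-proportion z-test. The counts below are illustrative stand-ins (not the actual HomeStars data), chosen to show a ~9% relative lift reaching significance at this sample size:

```python
# Hedged sketch: two-proportion z-test for an A/B signup-conversion test.
from math import erf, sqrt

def two_prop_ztest(conv_a, n_a, conv_b, n_b):
    """Return (relative_lift, z, two-sided p) for conversion counts."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p via the normal CDF: Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return (p_b - p_a) / p_a, z, p_value

# illustrative numbers: 20% baseline conversion, 9% relative lift
lift, z, p = two_prop_ztest(1300, 6500, 1417, 6500)  # p ≈ 0.012
```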

Education

Arizona State University

Master of Science, Data Science, Analytics & Engineering

Sep 2024 - May 2026

Tempe, Arizona

GPA: 3.9

Relevant Coursework:

Machine Learning · Big Data Analytics · Artificial Intelligence · Optimization · Data Visualization · Database Management

York University

Bachelor of Arts (Hons), Information Technology

Jan 2020 - Apr 2024

Toronto, Ontario

Technical Skills

Data Science

PyTorch · HuggingFace Transformers · BERT Fine-Tuning · FinBERT · RAG (Retrieval-Augmented Generation) · FAISS · LLM Evaluation · Model Calibration · scikit-learn · Classification · Regression · Tree-Based Models · Clustering · Feature Engineering · Cross Validation · Feature Importance Analysis

Projects

Estimating Incremental Marketing Impact Using Uplift Modeling

Sep 2024 - Dec 2024

Causal Inference · Python · Uplift Modeling · A/B Testing
  • Framed a marketing optimization problem as a causal inference task using the Hillstrom Email Marketing Dataset with 64K customers across randomized treatment and control groups.
  • Engineered behavioral features including recency, frequency, and historical spend, and estimated Average Treatment Effect via difference-in-means and logistic regression with treatment interaction terms, validated through bootstrapped confidence intervals.
  • Implemented T-Learner and uplift decision tree models to estimate heterogeneous treatment effects across customer segments; top-decile segments showed 18% higher incremental conversion versus random targeting.
View on GitHub
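The T-Learner described above can be sketched on synthetic data: fit separate outcome models on the treated and control groups, then score uplift as the difference of predicted probabilities. Feature names and effect sizes here are illustrative, not Hillstrom's:

```python
# Hedged sketch of a T-Learner: two outcome models (treated / control),
# uplift = difference of predicted probabilities. Synthetic data only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=(n, 3))            # stand-ins for recency/frequency/spend
treat = rng.integers(0, 2, size=n)     # randomized treatment assignment
# outcome: baseline driven by X, plus a positive treatment effect on the logit
logits = 0.8 * X[:, 0] + 0.7 * treat
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

m_t = LogisticRegression().fit(X[treat == 1], y[treat == 1])
m_c = LogisticRegression().fit(X[treat == 0], y[treat == 0])
uplift = m_t.predict_proba(X)[:, 1] - m_c.predict_proba(X)[:, 1]
ate_hat = uplift.mean()    # per-customer uplift scores average to the ATE
```

Ranking customers by `uplift` and targeting the top deciles is what yields the incremental-conversion gain over random targeting.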

Banking Risk Models: Credit Risk & AML Detection

Jan 2025 - May 2025

Machine Learning · Python · Risk Modeling · Anomaly Detection
  • Built a probability of default model on Lending Club data using Weight of Evidence binning and Basel II-aligned scorecard methodology, with Gradient Boosting achieving AUC 0.78, Gini 0.56, and KS 0.42 on holdout data.
  • Developed an AML transaction anomaly detection pipeline on synthetic data engineered to reflect real laundering patterns including structuring, rapid wire transfers, and dormant account activity.
  • Addressed severe class imbalance of roughly 2% suspicious transactions using Average Precision as the primary evaluation metric, with the ensemble model improving Average Precision from 0.08 to 0.16.
View on GitHub
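Weight of Evidence binning, the core scorecard transformation named above, can be sketched for a single binned feature. The bin counts below are illustrative, not from the Lending Club model:

```python
# Hedged sketch: Weight of Evidence (WoE) and Information Value (IV)
# for one binned feature. Assumes no empty good/bad bins.
from math import log

def woe_iv(bins):
    """bins: list of (n_good, n_bad) per bin -> (woe_per_bin, total_iv)."""
    tot_good = sum(g for g, b in bins)
    tot_bad = sum(b for g, b in bins)
    woes, iv = [], 0.0
    for g, b in bins:
        dist_g = g / tot_good
        dist_b = b / tot_bad
        w = log(dist_g / dist_b)      # positive WoE = safer-than-average bin
        woes.append(w)
        iv += (dist_g - dist_b) * w   # each IV contribution is >= 0
    return woes, iv
```

In a scorecard, each raw feature value is replaced by its bin's WoE before fitting, and IV is used to screen features for predictive power.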

FinSentEval: LLM Evaluation Framework for Financial Sentiment

Jan 2026 - Apr 2026

NLP · FinBERT · RAG · FAISS · LLM Evaluation · Financial Sentiment
  • Built a financial sentiment evaluation framework comparing FinBERT, zero-shot, few-shot, and RAG-augmented LLM inference across 4,840 labeled financial news sentences — RAG dynamic retrieval from a FAISS-indexed corpus outperformed static few-shot by 6 F1 points on sarcasm and jargon edge cases.
  • Designed a confidence-based cascading classifier routing predictions to FinBERT (48ms, 1× cost) or RAG-LLM (960ms, 40× cost) based on a swept confidence threshold τ — escalating only 19% of predictions while achieving peak F1, reducing average inference cost by 3.2× vs LLM-only.
  • Calibration analysis (ECE = 0.18, MCE = 0.41) revealed FinBERT is systematically overconfident above 0.85, and that naively set thresholds would misroute ~27% of predictions; FAISS retrieval achieved Precision@3 of 0.47 vs a 0.33 random baseline.
View on GitHub
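The calibration metrics cited above (ECE/MCE) can be sketched as an equal-width-bin computation. This is a generic implementation of the metric, not the project's evaluation code, and the test data below is synthetic:

```python
# Hedged sketch: Expected / Maximum Calibration Error over equal-width
# confidence bins -- the diagnostic behind the threshold-setting analysis.
import numpy as np

def calibration_errors(conf, correct, n_bins=10):
    """conf: predicted confidences in [0, 1]; correct: 0/1 outcomes.
    Returns (ECE, MCE): population-weighted and worst-bin |conf - accuracy| gaps."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce = 0.0, 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi) if lo > 0 else (conf >= lo) & (conf <= hi)
        if not mask.any():
            continue
        gap = abs(conf[mask].mean() - correct[mask].mean())
        ece += mask.mean() * gap        # weight gap by bin population
        mce = max(mce, gap)
    return ece, mce
```

A model that claims 95% confidence while being 60% accurate in that bin is exactly the overconfidence pattern that makes a naive routing threshold unsafe.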

Streaming Data Processing Pipeline for Market Analytics

Feb 2024 - Aug 2024

Kafka · Spark Streaming · AWS · Real-time · S3 · Redshift
  • Built an event-driven Kafka and Spark Structured Streaming pipeline consuming a live Alpaca Markets WebSocket feed, processing 2.5M stock tick events/day and reducing data availability latency from end-of-day batch to sub-5-minute intraday windows.
  • Implemented fault tolerance under simulated traffic spikes via checkpointing, offset management, and automated offset replay with consumer lag monitoring, ensuring recoverability without data loss.
  • Designed a unified streaming-to-batch architecture integrating Kafka, S3, and Redshift to serve both sub-5-minute intraday aggregations and daily historical rollups from a single pipeline.
View on GitHub
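The intraday windowing above can be illustrated with a pure-Python stand-in for the Spark Structured Streaming job: ticks are bucketed into 5-minute tumbling windows and rolled up into OHLC-style aggregates per symbol. Field names and window size mirror the description; everything else is illustrative:

```python
# Hedged sketch: 5-minute tumbling-window aggregation over a tick stream.
# Pure-Python stand-in, not the actual Spark Structured Streaming job.
WINDOW_SEC = 300  # 5-minute tumbling windows

def aggregate_ticks(ticks):
    """ticks: iterable of (epoch_sec, symbol, price) ->
    {(window_start, symbol): (open, high, low, close, count)}"""
    out = {}
    for ts, sym, px in ticks:
        key = (ts - ts % WINDOW_SEC, sym)   # floor timestamp to window start
        if key not in out:
            out[key] = (px, px, px, px, 1)
        else:
            o, h, l, _c, n = out[key]
            out[key] = (o, max(h, px), min(l, px), px, n + 1)
    return out
```

In the real pipeline, Spark's checkpointing and Kafka offset management make the same windowed aggregation recoverable after failure without reprocessing gaps.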

E-Commerce Analytics Pipeline

Jan 2025 - May 2025

dbt · Airflow · Azure · PostgreSQL · Star Schema
  • Architected a layered dbt Core transformation pipeline (staging → intermediate → marts) over the Instacart Market Basket dataset (30M+ order line items), producing a fully modeled star schema with 2 fact tables and 3 dimension tables.
  • Engineered idempotent ingestion using Azure Data Factory to land raw CSV files into Azure Blob Storage before loading into PostgreSQL, orchestrated via Airflow on Astronomer Astro CLI with staged layer execution.
  • Extended dbt with 3 custom macros (null-safe division, dynamic metric bucketing, multi-environment schema isolation) and 2 singular business logic tests to validate order sequence continuity and reorder flag consistency.
View on GitHub

NHL Analytics Data Warehouse (Databricks)

Sep 2025 - Dec 2025

Databricks · PySpark · Medallion Architecture · Airflow · Unity Catalog
  • Architected a Medallion Lakehouse (Bronze/Silver/Gold) in Databricks processing 20M+ records with distributed PySpark ETL pipelines, Hive-style partitioning, and tuned shuffle and partition sizing, reducing season-level aggregation query times by 35%.
  • Orchestrated reproducible seasonal data refreshes via Airflow DAGs with dependency management, backfill support, CI/CD-style failure alerting, and automated recovery consistent with production-grade pipeline standards.
  • Enforced end-to-end data lineage and schema contracts using Unity Catalog across all Lakehouse layers, enabling schema evolution without breaking downstream consumers.
View on GitHub