Experience
Data Engineer Intern
Rogers Communications
Jan 2023 - Sep 2023
Toronto, Ontario
- Maintained and extended AWS-based streaming pipelines during Rogers' merger-driven source-system consolidation, reducing pipeline latency by 15% and failure rate by 33% across 5M+ subscriber records ingested from Kafka streams, REST APIs, and relational systems into Redshift.
- Engineered 20+ behavioral features (session depth, feature adoption velocity, inactivity decay) using complex multi-table SQL transformations and stored procedures on raw transaction streams, producing analysis-ready datasets consumed by downstream fraud detection scoring workflows.
- Built Python batch feature pipelines with idempotent logic, automated failure recovery, and run-level monitoring integrated with CloudWatch alerting for reliable scheduled execution of daily fraud risk scoring workflows (pattern sketched below).
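A minimal sketch of the idempotency-plus-alerting pattern from the last bullet, not the actual Rogers code: every name here (`build_features`, the `FraudFeaturePipeline` namespace, the in-memory run registry) is a hypothetical stand-in, while the boto3 `put_metric_data` call is the real CloudWatch API.

```python
import logging
from datetime import date

import boto3

cloudwatch = boto3.client("cloudwatch")
_completed: set[str] = set()  # stand-in for a persistent run-metadata table


def emit_metric(name: str, value: float = 1.0) -> None:
    # Custom CloudWatch metric; an alarm on "RunFailed" drives alerting.
    cloudwatch.put_metric_data(
        Namespace="FraudFeaturePipeline",  # hypothetical namespace
        MetricData=[{"MetricName": name, "Value": value, "Unit": "Count"}],
    )


def build_features(run_date: date) -> list[dict]:
    # Placeholder for the multi-table SQL feature transformations.
    return [{"as_of": run_date.isoformat(), "feature": "session_depth"}]


def run_daily_scoring(run_date: date) -> None:
    run_id = f"fraud_features_{run_date:%Y%m%d}"  # deterministic key => safe reruns
    if run_id in _completed:
        logging.info("run %s already completed; skipping", run_id)
        return
    try:
        build_features(run_date)
        # ...write to a staging table, then swap atomically so retries can't double-write...
        _completed.add(run_id)
        emit_metric("RunSucceeded")
    except Exception:
        emit_metric("RunFailed")  # failed runs page via the CloudWatch alarm
        raise
```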
Data Scientist Intern
Loblaw Companies Limited
May 2022 - Jan 2023
Brampton, Ontario
- Designed and analyzed customer segmentation across 2.8M+ transactions using K-Means and hierarchical clustering (silhouette scores 0.38–0.44), identifying 5 behaviorally distinct segments by region and promotion responsiveness, driving an 11% lift in campaign response rate and a 14% improvement in conversion (evaluation loop sketched below).
- Built regression and tree-based models to estimate promotional demand lift, achieving an R² of 0.62 on holdout data with 5-fold cross-validation and surfacing the key feature drivers shaping campaign performance.
- Conducted statistical analysis including hypothesis testing, confidence intervals, and regression on promotional interaction data to quantify impact across customer segments; results directly informed model updates and deployment to an AWS SageMaker testing environment.
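A minimal sketch of the k-selection loop implied by the silhouette scores above, with synthetic blobs standing in for the proprietary transaction features; the scikit-learn calls are real.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered customer-transaction features.
X, _ = make_blobs(n_samples=5000, centers=5, n_features=8, random_state=42)
X = StandardScaler().fit_transform(X)  # distance-based clustering needs scaled features

# Sweep k and compare silhouette scores to pick the segment count.
for k in range(3, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"k={k}: silhouette={silhouette_score(X, labels):.2f}")
```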
Data Analyst Intern
HomeStars
Jun 2021 - Jan 2022
Toronto, Ontario
- Designed and analyzed an A/B test on business onboarding flow variants across 13K businesses, measuring end-to-end signup conversion rate and identifying a statistically significant 9% lift in the winning variant (significance check sketched below).
- Built Tableau dashboards and reports surfacing geographic distribution and business-type composition across provinces, directly informing regional marketing strategies for business acquisition.
- Standardized and cleaned tens of thousands of inconsistent business signup records, resolving duplicate entries, missing fields, and formatting inconsistencies to produce reliable datasets for downstream analysis.
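A minimal sketch of the significance check behind that A/B test. The counts are made up purely to mirror the ~13K even split and ~9% relative lift, not the actual HomeStars numbers; the statsmodels two-proportion z-test is the real API.

```python
from statsmodels.stats.proportion import proportions_ztest

conversions = [780, 850]   # control, variant (illustrative counts)
exposures = [6500, 6500]   # ~13K businesses split evenly (illustrative)

# Two-proportion z-test on signup conversion rate.
z_stat, p_value = proportions_ztest(count=conversions, nobs=exposures)
lift = conversions[1] / exposures[1] / (conversions[0] / exposures[0]) - 1
print(f"relative lift={lift:.1%}, z={z_stat:.2f}, p={p_value:.4f}")
```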
Education
Arizona State University
Master of Science, Data Science, Analytics & Engineering
Sep 2024 - May 2026
Tempe, Arizona
GPA: 3.9
Relevant Coursework: Machine Learning, Big Data Analytics, Artificial Intelligence, Optimization, Data Visualization, Database Management
York University
Bachelor of Arts (Hons), Information Technology
Jan 2020 - Apr 2024
Toronto, Ontario
Technical Skills
Data Science
PyTorch, HuggingFace Transformers, BERT Fine-Tuning, FinBERT, RAG (Retrieval-Augmented Generation), FAISS, LLM Evaluation, Model Calibration, scikit-learn, Classification, Regression, Tree-Based Models, Clustering, Feature Engineering, Cross-Validation, Feature Importance Analysis
Projects
Causal Inference | Python | Uplift Modeling | A/B Testing
- Framed a marketing optimization problem as a causal inference task using the Hillstrom Email Marketing dataset of 64K customers across randomized treatment and control groups.
- Engineered behavioral features including recency, frequency, and historical spend, and estimated the Average Treatment Effect via difference-in-means and logistic regression with treatment interaction terms, validated through bootstrapped confidence intervals.
- Implemented T-Learner and uplift decision tree models to estimate heterogeneous treatment effects across customer segments; top-decile segments showed 18% higher incremental conversion versus random targeting (T-Learner sketched below).
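A minimal T-Learner sketch of the bullet above, with synthetic data in place of the Hillstrom file: fit separate outcome models on treated and control rows, then score uplift as the difference in predicted conversion probabilities.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=(n, 3))                        # recency, frequency, spend (synthetic)
treated = rng.integers(0, 2, size=n).astype(bool)  # randomized assignment
base = 1 / (1 + np.exp(-X[:, 0]))                  # baseline conversion propensity
effect = 0.05 * (X[:, 1] > 0)                      # heterogeneous treatment effect
y = rng.random(n) < base * 0.2 + treated * effect

# T-Learner: one outcome model per arm.
m_treat = GradientBoostingClassifier().fit(X[treated], y[treated])
m_ctrl = GradientBoostingClassifier().fit(X[~treated], y[~treated])
uplift = m_treat.predict_proba(X)[:, 1] - m_ctrl.predict_proba(X)[:, 1]

# Target the customers with the highest estimated incremental response.
top_decile = uplift >= np.quantile(uplift, 0.9)
print(f"mean predicted uplift, top decile: {uplift[top_decile].mean():.3f}")
```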
Machine Learning | Python | Risk Modeling | Anomaly Detection
- Built a probability-of-default model on Lending Club data using Weight of Evidence binning and Basel II-aligned scorecard methodology, with Gradient Boosting achieving AUC 0.78, Gini 0.56, and KS 0.42 on holdout data.
- Developed an AML transaction anomaly detection pipeline on synthetic data engineered to reflect real laundering patterns, including structuring, rapid wire transfers, and dormant-account activity.
- Addressed severe class imbalance of roughly 2% suspicious transactions by using Average Precision as the primary evaluation metric, with the ensemble model improving Average Precision from 0.08 to 0.16 (metric choice sketched below).
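A minimal sketch of the metric choice in the last bullet, on synthetic data: at ~2% prevalence, ROC AUC can look flattering while Average Precision stays honest, because a random ranker's AP equals the positive rate.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: ~2% positive ("suspicious") class.
X, y = make_classification(n_samples=50_000, weights=[0.98], flip_y=0.01,
                           random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=7)

model = RandomForestClassifier(n_estimators=200, random_state=7).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]

# AP is compared against its chance level (the prevalence), unlike AUC's 0.5.
print(f"AUC={roc_auc_score(y_te, scores):.2f}  "
      f"AP={average_precision_score(y_te, scores):.2f}  "
      f"chance-level AP={y_te.mean():.2f}")
```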
NLP | FinBERT | RAG | FAISS | LLM Evaluation | Financial Sentiment
- Built a financial sentiment evaluation framework comparing FinBERT, zero-shot, few-shot, and RAG-augmented LLM inference across 4,840 labeled financial news sentences; RAG dynamic retrieval from a FAISS-indexed corpus outperformed static few-shot prompting by 6 F1 points on sarcasm and jargon edge cases.
- Designed a confidence-based cascading classifier that routes each prediction to FinBERT (48ms, 1× cost) or a RAG-LLM (960ms, 40× cost) based on a swept confidence threshold τ, escalating only 19% of predictions while achieving peak F1 and reducing average inference cost by 3.2× versus LLM-only (routing rule sketched below).
- Revealed via calibration analysis (ECE=0.18, MCE=0.41) that FinBERT is systematically overconfident above 0.85, finding that naively set thresholds would misroute ~27% of predictions; FAISS retrieval achieved Precision@3 of 0.47 versus a 0.33 random baseline.
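A minimal sketch of the cascade routing rule, with both models stubbed out by simulated confidences; the per-call latency and cost figures are the ones quoted above, and the threshold sweep mirrors how τ was selected.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
finbert_conf = rng.beta(5, 2, size=n)  # stand-in for FinBERT's max softmax score


def route(conf: np.ndarray, tau: float) -> np.ndarray:
    # True => low confidence, escalate to the expensive RAG-LLM.
    return conf < tau


# Sweep tau and trace the cost/latency trade-off of escalation.
for tau in (0.70, 0.80, 0.90):
    escalate = route(finbert_conf, tau)
    avg_cost = 1.0 * (~escalate).mean() + 40.0 * escalate.mean()  # 1x vs 40x cost
    avg_ms = 48 * (~escalate).mean() + 960 * escalate.mean()      # 48ms vs 960ms
    print(f"tau={tau:.2f}: escalated={escalate.mean():.0%}, "
          f"avg cost={avg_cost:.1f}x, avg latency={avg_ms:.0f}ms")
```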
Kafka | Spark Streaming | AWS | Real-time | S3 | Redshift
- Built an event-driven Kafka and Spark Structured Streaming pipeline consuming a live Alpaca Markets WebSocket feed, processing 2.5M stock tick events/day and reducing data-availability latency from end-of-day batch to sub-5-minute intraday windows (skeleton sketched below).
- Implemented fault tolerance under simulated traffic spikes via checkpointing, offset management, and automated offset replay with consumer lag monitoring, ensuring recoverability without data loss.
- Designed a unified streaming-to-batch architecture integrating Kafka, S3, and Redshift to serve both sub-5-minute intraday aggregations and daily historical rollups from a single pipeline.
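A skeleton of the Kafka-to-Spark leg of that pipeline. Broker address, topic, and S3 paths are placeholders, but the Structured Streaming API, watermarking, and the checkpointLocation option (which persists offsets and state so replay is safe after failure) are real.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tick-stream").getOrCreate()

# Consume the tick topic; offsets are tracked via the checkpoint below.
ticks = (spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
         .option("subscribe", "stock-ticks")                   # placeholder topic
         .option("startingOffsets", "latest")
         .load())

# 5-minute intraday windows; the watermark bounds state for late events.
agg = (ticks.selectExpr("CAST(value AS STRING) AS payload", "timestamp")
       .withWatermark("timestamp", "10 minutes")
       .groupBy(F.window("timestamp", "5 minutes"))
       .count())

query = (agg.writeStream.outputMode("append").format("parquet")
         .option("path", "s3a://bucket/ticks/agg/")                 # placeholder path
         .option("checkpointLocation", "s3a://bucket/ticks/_chk/")  # offsets + state
         .start())
```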
dbt | Airflow | Azure | PostgreSQL | Star Schema
- Architected a layered dbt Core transformation pipeline (staging → intermediate → marts) over the Instacart Market Basket dataset (30M+ order line items), producing a fully modeled star schema with 2 fact tables and 3 dimension tables.
- Engineered idempotent ingestion using Azure Data Factory to land raw CSV files in Azure Blob Storage before loading into PostgreSQL, orchestrated via Airflow on the Astronomer Astro CLI with staged layer execution (DAG sketched below).
- Extended dbt with 3 custom macros (null-safe division, dynamic metric bucketing, multi-environment schema isolation) and 2 singular business-logic tests validating order sequence continuity and reorder flag consistency.
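A sketch of the staged-layer orchestration from the second bullet, assuming Airflow 2.x-style APIs and placeholder DAG/task IDs: one task per dbt layer selected by path, run strictly in order, then tests.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="instacart_dbt",          # placeholder DAG id
         start_date=datetime(2024, 1, 1),
         schedule="@daily",
         catchup=False) as dag:
    # One dbt run per layer, selected by model path.
    layers = ["staging", "intermediate", "marts"]
    runs = [BashOperator(task_id=f"dbt_run_{layer}",
                         bash_command=f"dbt run --select path:models/{layer}")
            for layer in layers]
    test = BashOperator(task_id="dbt_test", bash_command="dbt test")

    # Enforce staged execution: staging -> intermediate -> marts -> tests.
    runs[0] >> runs[1] >> runs[2] >> test
```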
Databricks | PySpark | Medallion Architecture | Airflow | Unity Catalog
- Architected a Medallion Lakehouse (Bronze/Silver/Gold) in Databricks processing 20M+ records with distributed PySpark ETL pipelines, Hive-style partitioning, and tuned shuffle and partition sizing, reducing season-level aggregation query times by 35% (Bronze-to-Silver step sketched below).
- Orchestrated reproducible seasonal data refreshes via Airflow DAGs with dependency management, backfill support, CI/CD-style failure alerting, and automated recovery consistent with production-grade pipeline standards.
- Enforced end-to-end data lineage and schema contracts using Unity Catalog across all Lakehouse layers, enabling schema evolution without breaking downstream consumers.
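A minimal Bronze-to-Silver sketch of that Medallion flow, assuming illustrative table and column names and a configured Unity Catalog metastore for the three-part names; the PySpark calls themselves are standard.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("medallion").getOrCreate()

# Bronze: raw records as ingested (illustrative table name).
bronze = spark.read.table("lake.bronze.events")

# Silver: deduplicated, conformed, and typed for downstream consumers.
silver = (bronze
          .dropDuplicates(["event_id"])                  # dedupe on business key
          .withColumn("event_date", F.to_date("event_ts"))
          .filter(F.col("event_id").isNotNull()))

# Hive-style partitioning by date drives the season-level query speedups.
(silver.write.mode("overwrite")
 .partitionBy("event_date")
 .saveAsTable("lake.silver.events"))
```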