Varun Sharma
Software Developer | AI/ML Systems | Distributed Data Pipelines

UEFA Champions League Match Analysis – 2020 to 2023

Nov 2025

Overview: Analyzed match data from the UEFA Champions League across three seasons (2020-21, 2021-22, 2022-23) to uncover trends in team performance, home scoring patterns, possession dominance, and duel outcomes. The analysis combined sports analytics with SQL in Snowflake, providing insights into team strategies and game dynamics.

Key Contributions:
Identified the three teams with the most home goals in the 2020-21 season: PSG (5 goals), Manchester United (5 goals), and Barcelona (5 goals).
Determined that Liverpool had the most games with majority possession in the 2021-22 season (9 matches).
Examined matches from the 2022-23 season to find teams that won duels but still lost, highlighting tactical dynamics across stages: Group stage,

Public Transport Journey Analysis – Transport for London (TfL)

Nov 2025

Overview: Analyzed a dataset containing millions of journeys across various transport types in London, spanning buses, Underground & DLR, Overground, trams, TfL Rail, and the Emirates Airline cable car. The goal was to understand usage patterns, identify the most popular transport modes, and uncover temporal trends.

Key Contributions:
Calculated total journeys by transport type, identifying buses (24,905M journeys) and Underground & DLR (15,020M journeys) as the most heavily used modes.
Analyzed month-by-month trends for Emirates Airline, highlighting peak travel months in May 2012 (0.53M journeys), June 2012 (0.38M), and April 2012 (0.24M).
Examined journey volume trends for Underground & DLR, identifying the five years with the lowest
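The "total journeys by transport type" step in the TfL project can be illustrated with a small aggregation. This is only a sketch in plain Python: the rows, the column layout, and the function name `total_by_type` are assumptions made for the example, the figures are invented, and the project itself may well have used a different tool.

```python
from collections import defaultdict

# Hypothetical rows: (month, transport type, journeys in millions).
# These numbers are invented for illustration; they are not the TfL figures.
rows = [
    ("2012-04", "Bus", 180.0),
    ("2012-04", "Underground & DLR", 95.0),
    ("2012-05", "Bus", 190.0),
    ("2012-05", "Underground & DLR", 100.0),
    ("2012-05", "Emirates Airline", 0.5),
]


def total_by_type(records):
    """Sum journeys per transport type and rank types from most to least used."""
    totals = defaultdict(float)
    for _month, transport_type, journeys in records:
        totals[transport_type] += journeys
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = total_by_type(rows)
for transport_type, journeys in ranking:
    print(f"{transport_type}: {journeys:.1f}M journeys")
```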

Video Game Sales and Ratings Analysis

Nov 2025

Video games are a multi-billion-dollar industry, and understanding both critical and user reception can reveal trends about the quality of games over time. In this project, I analyzed top-selling games, yearly critic scores, and user feedback to identify trends and “golden years” of gaming.

Key Contributions & Insights:
Best-Selling Games: Identified the top 10 highest-selling games, including Wii Sports (82.9M copies) and Super Mario Bros. (40.24M copies).
Critics’ Favorite Years: Determined the top ten years by average critic score for years with at least four releases. 1998 (avg 9.32) and 2004 (avg 9.03) were among the highest-rated years.
Golden Years (Critics & Users Agreement): Highlighted years where critics and users broadly agre
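The "top years by average critic score, restricted to years with at least four releases" rule can be sketched as a group, filter, and sort. The game titles, scores, and the function name `top_years` below are hypothetical and exist only to show the filter at work; they are not the project's data or code.

```python
from collections import defaultdict

# Hypothetical (title, release year, critic score) records for illustration only.
games = [
    ("A", 1998, 9.5), ("B", 1998, 9.0), ("C", 1998, 9.5), ("D", 1998, 9.0),
    ("E", 2004, 9.0), ("F", 2004, 9.0), ("G", 2004, 9.0), ("H", 2004, 9.0),
    ("I", 2010, 10.0),  # a single release, so 2010 is filtered out
]


def top_years(records, min_releases=4, limit=10):
    """Rank years by mean critic score, keeping only years with enough releases."""
    scores_by_year = defaultdict(list)
    for _title, year, score in records:
        scores_by_year[year].append(score)
    averages = [
        (year, sum(scores) / len(scores))
        for year, scores in scores_by_year.items()
        if len(scores) >= min_releases
    ]
    return sorted(averages, key=lambda item: item[1], reverse=True)[:limit]


print(top_years(games))
```

The minimum-release filter matters because a year with one highly rated game would otherwise outrank years with many strong releases.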

EV Charging Station Usage Analysis

Nov 2025

With the rise of electric vehicles, understanding charging station usage in apartment garages is crucial for efficient resource management. In this project, I analyzed EV charging session data to uncover insights about tenant charging habits.

Key Contributions & Insights:
Unique Users per Garage: Identified garages with the highest number of shared users (e.g., Bl2 with 18 users, AsO2 with 17).
Peak Charging Times: Determined the top 10 most popular charging start times, including Sunday at 17:00 (30 sessions) and Friday at 15:00 (28 sessions).
Long-Duration Users: Found users whose average session lasts over 10 hours, including Share-9 (16.85 hrs) and Share-17 (12.89 hrs).

Impact: Provided actionable insights for apartment managers to
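The long-duration-user finding amounts to averaging session length per user and keeping the users above a threshold. A minimal sketch follows, assuming each session is a (user id, duration in hours) pair; the sample sessions and the name `long_duration_users` are hypothetical.

```python
from collections import defaultdict

# Hypothetical (user id, session duration in hours) pairs for illustration only.
sessions = [
    ("Share-1", 12.0), ("Share-1", 10.0),
    ("Share-2", 3.0), ("Share-2", 5.0),
    ("Share-3", 10.0),
]


def long_duration_users(records, threshold_hours=10.0):
    """Return users whose average session is strictly longer than the threshold."""
    durations = defaultdict(list)
    for user_id, hours in records:
        durations[user_id].append(hours)
    return {
        user_id: sum(hours) / len(hours)
        for user_id, hours in durations.items()
        if sum(hours) / len(hours) > threshold_hours
    }


print(long_duration_users(sessions))
```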

Case Study: Netflix Emerging Markets Subscriber Growth Playbook

Oct 2025

Situation: Netflix needed a recommendation playbook to optimize marketing spend across emerging markets for subscriber growth. Task: Identify which markets should receive increased or decreased investment based on efficiency and growth potential. Action: Gathered internal marketing and external market data, defined KPIs (ROI, CAC, CLV, retention, growth), performed exploratory analysis, and scored markets using composite metrics and 2x2 efficiency vs. growth frameworks. Validated recommendations via geo-split/A-B tests. Result: Delivered a data-driven framework guiding smarter marketing investment, improving ROI, lowering CAC, and prioritizing growth opportunities across emerging markets. Skills & Tools: Python, Pandas, NumPy, SQL, Power B
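A 2x2 efficiency vs. growth framework of the kind mentioned in the Netflix case study can be sketched as a simple quadrant rule. Everything specific here is an assumption for illustration: the 0-to-1 score scale, the 0.5 cut-offs, the quadrant labels, and the market names are hypothetical and do not come from the case study.

```python
# Hypothetical composite scores on a 0-to-1 scale: (market, efficiency, growth).
markets = [
    ("Market A", 0.8, 0.9),
    ("Market B", 0.7, 0.2),
    ("Market C", 0.3, 0.8),
    ("Market D", 0.2, 0.1),
]


def classify_market(efficiency, growth, efficiency_cut=0.5, growth_cut=0.5):
    """Place a market in one quadrant of an efficiency vs. growth grid.

    The cut-offs and the action labels are illustrative assumptions.
    """
    high_efficiency = efficiency >= efficiency_cut
    high_growth = growth >= growth_cut
    if high_efficiency and high_growth:
        return "increase investment"
    if high_efficiency:
        return "maintain investment"
    if high_growth:
        return "improve efficiency before scaling"
    return "decrease investment"


for name, efficiency, growth in markets:
    print(name, "->", classify_market(efficiency, growth))
```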

National Economic Intelligence Platform - Transforming Census Data into Strategic Policy Insights

Aug 2025

Situation: Federal agencies faced critical data quality issues in US household income datasets across 32,000+ geographic locations, hindering evidence-based economic policy development and strategic resource allocation for national economic programs. Task: Spearhead development of enterprise-grade national economic intelligence platform with comprehensive data governance frameworks for predictive economic modeling and policy recommendations. Action: Architected sophisticated data engineering ecosystem using advanced MySQL techniques for comprehensive data cleaning, validation, and exploratory analysis across multi-dimensional socioeconomic datasets. Implemented automated duplicate detection, statistical analysis across geographic hierarchie

Public Transportation Performance Intelligence - Optimizing Urban Mobility Through Data-Driven Insights

Aug 2025

Situation: NYC's bus transportation system experienced over 18,000 recorded operational disruptions affecting millions of daily commuters, creating substantial public service delivery gaps and straining municipal resources across multiple boroughs and contractors. Task: Spearhead comprehensive transportation analytics initiative to establish predictive operational intelligence for proactive maintenance strategies and evidence-based resource allocation. Action: Architected enterprise-grade transportation analytics platform integrating multi-year operational data (2019-2023) across 20+ bus companies and multiple service categories. Implemented advanced root cause analysis, contractor performance benchmarking, and predictive modeling for mecha

Customer Analytics & Segmentation Strategy | UFood Brazil Market Analysis

Aug 2025

Situation: UFood, Brazil's leading food delivery platform serving 1M+ consumers across 1,000+ cities, experienced slowing profit growth and needed data-driven insights to optimize marketing performance and customer acquisition in a competitive market. Task: Analyze 1,843 customer records with 39 demographic and purchasing features to identify high-value segments, understand campaign acceptance drivers, and develop actionable segmentation strategies to improve marketing ROI. Action: Performed comprehensive EDA using Python/pandas on demographics, purchase channels, and campaign response data. Conducted correlation analysis and advanced visualizations to uncover relationships between customer traits and spending patterns. Segmented customers

Smart Insights for Insurance Success: Customer Conversion Prediction

Aug 2025

Situation: Insurance companies faced significant challenges in customer acquisition efficiency, with conversion rates below 15%, marketing costs consuming 40% of revenue, and inability to identify high-potential prospects from diverse demographic segments, resulting in wasted resources and suboptimal ROI on annual marketing investments. Task: Develop comprehensive customer conversion prediction system to optimize insurance acquisition strategies, improve profitability, and enable data-driven prospect targeting across multiple customer segments. Action: Built advanced machine learning pipeline using Python with comprehensive data preprocessing including duplicate removal, categorical encoding via LabelEncoder, and class imbalance correction

Global Health Intelligence Platform - Transforming WHO Data into Strategic Health Policy Insights

Aug 2025

Situation: International health organizations faced fragmented data quality issues across 195+ countries spanning decades, hindering evidence-based policy development and strategic resource allocation for global health initiatives. Task: Spearhead development of enterprise-grade global health intelligence platform with comprehensive data governance frameworks and advanced analytical capabilities for predictive health modeling. Action: Architected sophisticated data engineering ecosystem using advanced SQL techniques for comprehensive data cleaning, validation, and exploratory analysis across multi-dimensional health datasets. Implemented automated data quality assessments, statistical correlation analysis between health indicators, and adva

Forecasting Avocado Prices Using ARIMA: Agricultural Market Intelligence

Aug 2025

Situation: Agricultural stakeholders faced significant uncertainty in avocado pricing decisions, with seasonal price volatility reaching 40-60%. The lack of reliable forecasting tools led to suboptimal inventory management, inefficient production planning, and revenue losses from unpredictable market fluctuations across the $2.8B+ annual US avocado industry. Task: Develop sophisticated price forecasting system using advanced time series analysis to enable data-driven agricultural decision-making and optimize market strategies across supply chain stakeholders. Action: Built comprehensive ARIMA forecasting pipeline using Python, Pandas, Matplotlib, and Statsmodels for historical price analysis spanning 2004-2020. Implemented system

Exploring AutoScout24 Car Offers: German Automotive Market Intelligence

Aug 2025

Situation: German automotive market stakeholders faced information gaps in pricing strategies and inventory management, lacking comprehensive analysis of 100,000+ car listings on AutoScout24 platform. Dealerships struggled with optimal pricing decisions, while manufacturers needed competitive intelligence to guide market positioning and product development strategies. Task: Conduct comprehensive market intelligence analysis of German car listings to extract actionable insights for pricing optimization, demand forecasting, and competitive positioning across automotive ecosystem. Action: Built comprehensive data analysis pipeline using Python, Pandas, Seaborn, and Matplotlib to process extensive AutoScout24 dataset. Implemented advanced data

Bank Customer Churn Prediction & Segmentation

Aug 2025

Situation: A European bank was losing customers at an unsustainable rate, with limited visibility into churn patterns and no ability to implement proactive retention strategies, resulting in millions in lost revenue and increased acquisition costs. Task: Lead data science initiative to analyze 10,000 customer records, build predictive models, and create actionable customer segments to improve retention strategies. Action: Executed end-to-end data pipeline including dataset joins, missing value imputation, duplicate removal, and inconsistent labeling corrections. Conducted comprehensive EDA using box plots, histograms, and bar charts to uncover churn patterns by demographics and financial behavior. Engineered critical features including bala

Federal Debt Intelligence Platform - Transforming Economic Data into Strategic Policy Insights

Aug 2025

Situation: With US federal debt reaching $31.45 trillion and complex intergovernmental holdings structures, policymakers and economic research institutions needed sophisticated analytical capabilities to understand debt trajectory patterns and fiscal implications for strategic planning. Task: Spearhead development of comprehensive federal debt intelligence platform to establish predictive economic modeling capabilities and evidence-based policy recommendations. Action: Architected enterprise-grade economic analytics ecosystem integrating multi-year Treasury data (2015-2023) across three debt categories: public holdings ($24.6T), intragovernmental holdings ($6.8T), and total outstanding debt. Implemented advanced time-series analysis, season

Customer Service Performance Analytics Dashboard - Driving Operational Excellence Through Data

Aug 2025

Situation: Customer service team of 8 agents lacked real-time performance visibility, operating reactively rather than strategically with no insights into call patterns, agent efficiency, or resolution metrics, hindering optimal customer experience delivery. Task: Spearhead development of enterprise-grade analytics solution to establish data-driven performance management, enabling proactive decision-making and scalable operational improvements. Action: Architected comprehensive business intelligence platform integrating multiple data streams into unified analytics ecosystem. Built real-time KPI monitoring, predictive analytics for call volume forecasting, and automated performance benchmarking across team members. Established governance fra

Customer Retention Analytics Platform - Transforming Churn Intelligence into Revenue Protection

Aug 2025

Situation: Rising customer acquisition costs and revenue leakage from 2,000+ subscription customers across multiple payment methods highlighted critical need for sophisticated churn prediction and proactive retention strategies. Task: Lead development of enterprise-grade customer intelligence platform for predictive churn analytics, enabling proactive retention and lifecycle optimization. Action: Built unified churn analytics ecosystem integrating transaction data, contract demographics, and behavioral patterns. Implemented cohort analysis by tenure, payment method segmentation across 4 types (Bank transfer: 258, Credit card: 232, Electronic check: 1,071, Mailed check: 308), and predictive modeling for contract-type churn. Collaborated with

Bank Customer Churn Prediction

Aug 2025

Situation: Working with a European bank customer dataset of 10,000 records, I wanted to identify patterns that distinguish customers who leave from those who stay, and understand which factors drive churn decisions. Task: Build predictive models to identify at-risk customers and analyze which demographic and behavioral factors most strongly correlate with churn. Action: Performed data cleaning including joining datasets, handling missing values, removing duplicates, and fixing inconsistent labels. Conducted exploratory analysis using box plots, histograms, and bar charts to uncover churn patterns across demographics (age, geography, gender) and financial behavior (balance, products, tenure). Engineered features including balance-to-income r

Book Recommendation Engine (Goodreads Data)

Jul 2025

Situation: With thousands of books available, readers struggle to find their next good read. I wanted to build a content-based recommendation system using actual reader reviews and book metadata. Task: Create a recommendation engine that suggests books based on content similarity, genre patterns, and sentiment analysis of reader reviews. Action: Processed 13,000+ books and 1 million+ reviews from Goodreads dataset. Engineered 10+ features including sentiment scores from reviews using TextBlob, genre frequency analysis, description polarity, and reader volume signals. Implemented content-based filtering with text preprocessing (cleaning, tokenization, TF-IDF vectorization). Built Random Forest model to predict book ratings based on extracted

AI-Powered Book Recommendation Engine Using Goodreads Data

Jul 2025

Situation: Readers struggled to discover relevant books from massive catalogs, with traditional recommendation systems failing to capture personal preferences, sentiment trends, and hidden gems, leading to poor reading experience and reduced engagement. Task: Build AI-powered personalized book recommender using comprehensive Goodreads data to enable dynamic reading list generation tailored to individual preferences and discovery patterns. Action: Processed 13K+ books and 1M+ reviews, engineering 10+ features including sentiment scores, genre frequency, description polarity, and reader volume signals. Implemented content-based filtering with automated text cleaning and sentiment analysis using TextBlob. Built Random Forest model with compreh

Social Media Sentiment Analysis for Brand Monitoring | Python, spaCy, Machine Learning

Jun 2025

Situation: Brand management teams lacked automated capabilities to monitor social media sentiment at scale, relying on manual processes that couldn't keep pace with real-time brand mentions and potential reputation threats across digital platforms. Task: Develop end-to-end automated sentiment classification system to transform manual social media monitoring into intelligent, data-driven brand management. Action: Processed 9,896 tweets using advanced spaCy NLP pipeline with custom preprocessing for social media text. Engineered 1,466 TF-IDF features combined with linguistic analysis including POS tagging and named entity recognition. Trained and compared 4 ML algorithms with 5-fold cross-validation, selecting Logistic Regression as optimal p

AI Portfolio Risk Management | S&P 500 Deep Learning Strategy

Jun 2025

Situation: Investment firms faced significant losses from delayed market responses and human bias in decision-making, with manual analysis of 500+ S&P stocks creating bottlenecks where profitable opportunities disappeared within hours. Task: Develop enterprise-grade AI trading system integrating deep learning and quantitative finance for real-time portfolio decision-making and risk management in volatile markets. Action: Built neural network architecture using RNN (LSTM) for price forecasting and ANN for trend classification. Engineered 24 advanced features from 7 market indicators including moving averages, RSI, and volatility metrics. Processed 497K+ time-series records across 503 S&P stocks (2014-2017) using PyTorch implementation with G

S&P 500 Stock Price Prediction

Jun 2025

Situation: I wanted to understand if machine learning could predict stock price movements and whether deep learning approaches outperform traditional methods for financial time series data. Task: Build and compare neural network models to forecast S&P 500 stock prices using historical market data and technical indicators, then evaluate their real-world applicability. Action: Processed 497,000+ time-series records across 503 S&P stocks from 2014-2017. Engineered 24 features from 7 market indicators including moving averages (SMA, EMA), RSI, Bollinger Bands, and volatility metrics. Built two neural network architectures: LSTM (RNN) for price forecasting and ANN for trend classification. Implemented using PyTorch with GPU acceleration, dropout
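Three of the indicators named above (SMA, EMA, RSI) can be computed with a few lines of plain Python. This is a sketch rather than the project's feature code: the function names are hypothetical, the EMA is seeded with the first price, and the RSI uses the simple-average variant over the last `period` price changes, which is an assumption since the write-up does not say which RSI smoothing was used.

```python
def sma(prices, window):
    """Simple moving average; one value per full window."""
    return [
        sum(prices[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(prices))
    ]


def ema(prices, span):
    """Exponential moving average with alpha = 2 / (span + 1), seeded with prices[0]."""
    alpha = 2 / (span + 1)
    values = [prices[0]]
    for price in prices[1:]:
        values.append(alpha * price + (1 - alpha) * values[-1])
    return values


def rsi(prices, period=14):
    """RSI over the last `period` price changes (simple-average variant)."""
    if len(prices) <= period:
        raise ValueError("need more than `period` prices")
    changes = [later - earlier for earlier, later in zip(prices, prices[1:])][-period:]
    gains = sum(change for change in changes if change > 0)
    losses = sum(-change for change in changes if change < 0)
    if losses == 0:
        return 100.0  # no down moves in the window
    relative_strength = gains / losses
    return 100 - 100 / (1 + relative_strength)


closes = [1, 2, 1, 2, 1]  # hypothetical closing prices
print(sma(closes, 3), ema(closes, 3), rsi(closes, period=4))
```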

Social Media Sentiment Analysis (Twitter Brand Monitoring)

Jun 2025

Situation: Brand mentions on social media generate massive volumes of text data. I wanted to build an automated sentiment classifier that could process tweets and identify positive, negative, or neutral brand sentiment at scale. Task: Develop NLP pipeline to classify tweet sentiment and extract insights about temporal patterns and key sentiment drivers. Action: Processed 9,896 tweets using spaCy NLP pipeline with custom preprocessing for social media text (handling hashtags, mentions, URLs). Engineered 1,466 TF-IDF features and combined with linguistic features including POS tagging and named entity recognition. Trained and compared 4 machine learning models (Naive Bayes, SVM, Random Forest, Logistic Regression) using 5-fold cross-validatio
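The custom preprocessing for hashtags, mentions, and URLs could look roughly like the following. This is a hedged sketch using only the standard `re` module, not the project's spaCy pipeline; the specific choices (drop URLs and mentions, keep the hashtag word, lowercase) and the name `clean_tweet` are assumptions, and "BrandX" is a made-up handle.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")


def clean_tweet(text):
    """Normalize a tweet before tokenization.

    Assumed policy: remove URLs and @mentions, keep the word behind a
    hashtag, lowercase everything, and collapse repeated whitespace.
    """
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(r"\1", text)
    return " ".join(text.lower().split())


print(clean_tweet("Loving the new @BrandX phone! #Happy https://t.co/abc"))
```

Keeping the hashtag word rather than dropping it preserves sentiment-bearing tokens such as "happy", which is one plausible reason to treat hashtags differently from mentions.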

Concrete Strength Prediction

Mar 2025

Situation: Construction projects wait 28 days to test concrete strength, creating delays and potential waste if batches fail quality standards. I wanted to predict strength earlier in the process using mix composition data. Task: Build machine learning model to predict concrete compressive strength based on ingredient proportions, reducing reliance on time-consuming physical testing. Action: Worked with dataset containing concrete mix compositions (cement, water, aggregates, age). Handled messy data with mixed formats—built custom parsing functions to extract numerical values from strings with multiple delimiters. Applied data cleaning pipeline including outlier treatment, missing value handling, and StandardScaler normalization. Split data
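Extracting numerical values from strings with mixed delimiters can be done with a single regular expression. The sketch below is one possible approach, not the project's parsing functions: the sample string and the name `extract_numbers` are hypothetical, and it assumes all values are non-negative and written without thousands separators.

```python
import re

# Matches integers and decimals such as "162" or "2.5".
# Assumption: values are non-negative and have no thousands separators,
# so "-" and "," are always treated as delimiters.
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def extract_numbers(raw):
    """Return every number found in a messy, delimiter-separated string."""
    return [float(match) for match in NUMBER_RE.findall(raw)]


print(extract_numbers("540.0; 162 | 2.5,1040"))
```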

Intelligent RAG Chatbot for Automated Document Analysis

Jan 2025

Situation: I noticed document-heavy workflows—legal teams, researchers, technical professionals—spend hours manually searching through PDF repositories. I wanted to build a solution that could answer questions across multiple documents with source citations. Task: Build an intelligent document analysis system that could extract information from PDFs, answer natural language questions, and provide transparent source attribution—all while being actually deployable, not just a Jupyter notebook. Action: Built an end-to-end RAG pipeline using PyPDFLoader for PDF extraction and RecursiveCharacterTextSplitter for text chunking (1000 characters, 10-character overlap). Integrated HuggingFace all-MiniLM-L12-v2 embeddings with FAISS vector store for s
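The chunking parameters above (1000 characters with a 10-character overlap) can be illustrated with a fixed-size sliding window. Note that this is a deliberate simplification: RecursiveCharacterTextSplitter prefers to split on separators such as paragraph and line breaks, whereas this hypothetical `chunk_text` helper cuts purely by character count.

```python
def chunk_text(text, chunk_size=1000, overlap=10):
    """Split text into fixed-size chunks where neighbours share `overlap` characters.

    Simplified stand-in for a separator-aware splitter; it ignores
    sentence and paragraph boundaries.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))  # hypothetical 2,500-character document
pieces = chunk_text(sample)
print(len(pieces), [len(piece) for piece in pieces])
```

The overlap means the last 10 characters of one chunk reappear at the start of the next, which reduces the chance that a sentence cut at a boundary is lost to retrieval.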

SWIN Transformer Image Classification

Sep 2024 - Oct 2024 (2 months)

Situation: I wanted to learn modern computer vision architectures beyond CNNs. SWIN Transformers were getting attention for image classification tasks, and I wanted hands-on experience implementing one for binary classification. Task: Implement and fine-tune a SWIN Transformer model on a binary image classification dataset to understand its performance characteristics and training requirements compared to traditional architectures. Action: Downloaded a Kaggle binary classification dataset. Implemented SWIN Transformer architecture using PyTorch/Hugging Face with data preprocessing and augmentation (random flips, rotations, color jittering). Configured training with Adam optimizer (learning rate: 0.003, weight decay: 0.3) for 10 epochs. Expe

Subscription Service Churn Prediction

Jun 2024

Situation: Working with a subscription service dataset (likely a tutorial/Kaggle dataset), I wanted to predict customer churn and understand which factors most strongly correlate with customers leaving. Task: Build classification models to identify at-risk customers and analyze which features (tenure, contract type, monthly charges, services used) best predict churn behavior. Action: Performed data cleaning: handled missing values in TotalCharges column, encoded categorical features (gender, contract type, internet service) using label encoding, scaled numerical features using StandardScaler. Conducted EDA with correlation heatmaps and distribution plots. Built three models: Logistic Regression for interpretability, Decision Tree for featur
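Label encoding and standard scaling, the two preprocessing steps named above, reduce to small formulas. The sketch below re-implements them in plain Python to show what they compute: codes assigned in sorted class order, and z-scores using the population standard deviation, which matches how scikit-learn's LabelEncoder and StandardScaler behave. The helper names and sample values are hypothetical.

```python
from statistics import mean, pstdev


def label_encode(values):
    """Map each category to an integer code, assigned in sorted class order."""
    mapping = {category: code for code, category in enumerate(sorted(set(values)))}
    return [mapping[value] for value in values], mapping


def standard_scale(values):
    """Rescale to zero mean and unit variance (population standard deviation)."""
    center = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(value - center) / spread for value in values]


# Hypothetical sample values for illustration only.
contracts = ["Two year", "Month-to-month", "One year", "Month-to-month"]
codes, mapping = label_encode(contracts)
scaled = standard_scale([1.0, 2.0, 3.0])
print(codes, mapping, scaled)
```

Label encoding imposes an arbitrary order on categories, which tree models tolerate but linear models such as Logistic Regression may misread as a ranking; one-hot encoding is a common alternative in that case.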

IMDb Movie Sentiment Analysis

Jun 2024

Situation: I wanted to practice NLP techniques on a real-world text dataset. Movie reviews provide rich sentiment data with varied language, making them good for sentiment classification practice. Task: Build a sentiment classifier for movie reviews and compare multiple algorithms to understand which performs best on this type of text data. Action: Processed IMDb dataset with 14 features including review text, ratings, and movie metadata. Implemented text preprocessing pipeline: lowercasing, tokenization, stop-word removal using NLTK, and lemmatization. Generated sentiment scores using NLTK's VADER analyzer as baseline. Extracted TF-IDF features from review text. Trained four models: Naive Bayes, Logistic Regression, SVM, and Random Forest.

Parkinson's Disease Prediction

Jul 2023 - Sep 2023 (3 months)

Situation: Parkinson's Disease affects millions globally, and early detection improves treatment outcomes. I worked with a medical dataset containing vocal features that could potentially indicate Parkinson's presence. Task: Build classification models to distinguish Parkinson's patients from healthy individuals using vocal biomarkers, and identify which features are most predictive. Action: Analyzed dataset with 22 features including vocal measurements (jitter, shimmer, harmonic-to-noise ratio) from voice recordings. Applied Principal Component Analysis (PCA) to reduce dimensionality and identify key feature patterns. Implemented three models: Multi-Layer Perceptron (neural network with 2 hidden layers), Support Vector Machine with RBF ker

Wildfire Detection Using Deep Learning

Apr 2023 - Sep 2023 (6 months)

Situation: Wildfires cause massive destruction, and early detection can save lives and property. I wanted to explore whether deep learning could analyze satellite imagery or sensor data to identify wildfires earlier than traditional methods. Task: Build a computer vision model to detect wildfire presence and predict fire perimeter expansion using available wildfire datasets. Action: Worked with wildfire dataset containing spatial features (fireline length, perimeter, size, duration, spread speed). Applied data preprocessing: normalized features, handled missing values, split temporal sequences for time-series prediction. Implemented two approaches: MLP for tabular feature analysis and Random Forest for comparison. Trained models to predict
