Arslan Aslam

Intelligent E-Commerce Customer Insights Agent

Apr 2026 - Current (4 months)

⚙️ How it works:
🔹 The data pipeline follows the Medallion Architecture (Bronze → Silver → Gold), orchestrated via Apache Airflow DAGs.
🔹 Customer reviews are embedded using Voyage AI and stored in ChromaDB for semantic search.
🔹 A LangGraph agent routes each question to the appropriate path: SQL for data queries, RAG for customer reviews, or both.
🔹 The final product is a Streamlit chat interface backed by a FastAPI service, fully containerized with Docker Compose.
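The routing step described above can be illustrated with a minimal sketch. This is not the project's LangGraph code: it swaps in a plain keyword check where the real agent presumably makes a model-driven decision, and the hint lists and the `route_question` name are hypothetical.

```python
from typing import Literal

Route = Literal["sql", "rag", "both"]

# Hypothetical keyword hints; the actual agent's routing logic is not shown in the source.
SQL_HINTS = ("how many", "total", "average", "revenue", "orders")
REVIEW_HINTS = ("review", "complain", "feedback", "say about", "sentiment")


def route_question(question: str) -> Route:
    """Pick a path for a question: SQL, RAG over reviews, or both."""
    text = question.lower()
    wants_sql = any(hint in text for hint in SQL_HINTS)
    wants_reviews = any(hint in text for hint in REVIEW_HINTS)
    if wants_sql and wants_reviews:
        return "both"
    if wants_reviews:
        return "rag"
    # Assumption: questions with no review hint fall back to the SQL path.
    return "sql"
```

In a graph-based agent, a function like this would typically serve as the condition that selects the next node.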

Real-Time Smart City End-to-End Pipeline

Jan 2026 - Current (7 months)

⚙️ Workflow Sequence:
🔹 Real-time data is generated and streamed via Apache Kafka; Spark consumes the streams and writes optimized, partitioned data into AWS S3.
🔹 The arrival of new data triggers an AWS Lambda function, which starts a Glue Crawler to automatically create or update tables for querying through Athena.
🔹 Success markers in S3 trigger an Airflow DAG, responsible for:
• Detecting new data
• Validating Glue tables
• Repairing partitions
• Loading curated data into Amazon Redshift
🔹 The entire project is containerized using Docker and version-controlled via GitHub.
🛠️ Tools & Technologies used:
🔹 Apache Zookeeper
🔹 Apache Kafka
🔹 Apache Spark
☁️ Amazon Web Services (IAM Role, S3 Buckets, Lambda, Glue, Athena, Redshift)
⚙️ Ap
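The Lambda step in the workflow sequence above (new S3 data starts a Glue Crawler) might look roughly like the sketch below. The crawler name and the optional `glue_client` parameter are assumptions added for illustration and testing; the project's actual handler is not shown in the source.

```python
from typing import Any, Dict, List, Optional
from urllib.parse import unquote_plus

CRAWLER_NAME = "smart-city-crawler"  # hypothetical crawler name


def new_object_keys(event: Dict[str, Any]) -> List[str]:
    """Collect the object keys from an S3 event notification payload."""
    keys = []
    for record in event.get("Records", []):
        if "s3" in record:
            # Keys arrive URL-encoded in S3 event notifications.
            keys.append(unquote_plus(record["s3"]["object"]["key"]))
    return keys


def lambda_handler(event: Dict[str, Any], context: Any,
                   glue_client: Optional[Any] = None) -> Dict[str, Any]:
    """Start the Glue Crawler when the event reports at least one new object."""
    keys = new_object_keys(event)
    if not keys:
        return {"started": False, "keys": []}
    if glue_client is None:
        import boto3  # imported lazily so the sketch can be exercised without AWS
        glue_client = boto3.client("glue")
    glue_client.start_crawler(Name=CRAWLER_NAME)
    return {"started": True, "keys": keys}
```

A production handler would also need to handle the case where the crawler is already running when the next event arrives.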

Sales ETL Pipeline

Oct 2025 - Current (10 months)

🔹 Objective: To design a modular ETL (Extract, Transform, Load) process for sales data using modern data engineering tools and practices.
🔹 Tools & Technologies Used:
🐳 Docker – to containerize Apache Airflow and PostgreSQL for smooth local orchestration.
⚙ Apache Airflow – to schedule and automate ETL workflows through DAGs with task dependencies and custom logging.
🐘 PostgreSQL (via DBeaver) – to perform data cleaning, filtering, and transformations using modular SQL scripts.
💻 Python (with psycopg2, Pandas, Jupyter Notebook) – for connecting to the database, validating data, and handling transformations programmatically.
🔹 Workflow Sequence:
Extracted raw sales data and loaded it into PostgreSQL.
Performed data cleaning, filtering,

End-to-End AWS ETL Pipeline with Apache Airflow

Oct 2025 - Current (10 months)

⚙️ Workflow Sequence:
🔹 The project began with data cleaning and validation of raw CSV files using an AWS Glue ETL Job Notebook, ensuring the datasets were consistent and analysis-ready.
🔹 The cleaned files were then stored securely in Amazon S3, serving as the central data lake for the pipeline.
🔹 Next, I created an AWS Glue Database and configured Crawlers to automatically infer schema and build data catalogs for the cleaned datasets, making them queryable in Amazon Athena.
🔹 Using Amazon Athena, I wrote multiple SQL queries to extract insights and perform analytical operations such as data filtering, aggregation, and optimization directly on S3 data.
🔹 For workflow automation, I integrated Apache Airflow locally through Docker

ETL Pipeline

Sep 2025 - Current (11 months)

🔹 What this project does:
Extracts raw transaction data into Databricks
Transforms the data with PySpark & SQL (cleaning, aggregation, filtering)
Loads structured results into tables for business insights
🔹 Skills applied: Databricks | Apache Spark | PySpark | SQL | Data Engineering Concepts
This project gave me hands-on experience with building pipelines, handling transformations, and querying structured data.
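The extract, transform, and load steps listed above follow a common shape that can be sketched in a few lines. This sketch does not use Databricks or PySpark: it substitutes Python's built-in `sqlite3` for the target tables so it runs anywhere, and the sample rows, table name, and cleaning rules are all made up for illustration.

```python
import sqlite3

# Made-up sample rows standing in for raw transaction data.
RAW_TRANSACTIONS = [
    {"id": 1, "customer": "a", "amount": "10.50"},
    {"id": 2, "customer": "b", "amount": ""},       # missing amount
    {"id": 3, "customer": "a", "amount": "4.50"},
    {"id": 4, "customer": "b", "amount": "-3.00"},  # non-positive amount
    {"id": 5, "customer": "b", "amount": "7.25"},
]


def extract():
    """Extract: return the raw rows (a real job would read from storage)."""
    return list(RAW_TRANSACTIONS)


def transform(rows):
    """Transform: drop rows with unparseable or non-positive amounts."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        if amount <= 0:
            continue
        cleaned.append((row["id"], row["customer"], amount))
    return cleaned


def load(conn, rows):
    """Load: write cleaned rows into a structured table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transactions "
        "(id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", rows)
    conn.commit()


def totals_by_customer(conn):
    """Aggregate with SQL, as a business-insight style query."""
    return conn.execute(
        "SELECT customer, SUM(amount) FROM transactions "
        "GROUP BY customer ORDER BY customer"
    ).fetchall()
```

Running `load(conn, transform(extract()))` against an in-memory connection and then calling `totals_by_customer(conn)` gives one total per customer.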

Online Voting System

Dec 2024 - Current (1 year 8 months)

A secure, province-based voting simulation implementing OOP and data management.
✔ Key Features:
User Authentication: Login/registration system with password strength validation and unique voter ID checks.
Province-Based Voting: Voters can only vote for candidates from their registered province (Punjab, Sindh, Balochistan, KPK).
Admin Controls: View voter records, candidate vote counts, and election results.
Data Integrity: Input validation to prevent invalid votes or duplicate registrations.
Vote Tracking: Real-time vote counting and result declaration per province.
✔ Technical Highlights:
OOP Concepts: Used struct to manage voter data (ID, province, votes) and modular functions for each operation.
Memory Management: Employed pointers to d
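The province rule and duplicate-registration check from the feature list above can be sketched as follows. This is an illustrative Python version, not the project's own code (which the description suggests is built on structs and pointers); the class and method names are hypothetical, and the one-vote-per-voter rule is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict

PROVINCES = ("Punjab", "Sindh", "Balochistan", "KPK")


@dataclass
class Voter:
    voter_id: str
    province: str
    has_voted: bool = False


@dataclass
class Election:
    voters: Dict[str, Voter] = field(default_factory=dict)
    candidates: Dict[str, str] = field(default_factory=dict)  # name -> province
    tally: Dict[str, int] = field(default_factory=dict)       # name -> votes

    def register(self, voter_id: str, province: str) -> None:
        if province not in PROVINCES:
            raise ValueError("unknown province")
        if voter_id in self.voters:
            raise ValueError("duplicate voter ID")
        self.voters[voter_id] = Voter(voter_id, province)

    def add_candidate(self, name: str, province: str) -> None:
        if province not in PROVINCES:
            raise ValueError("unknown province")
        self.candidates[name] = province
        self.tally[name] = 0

    def vote(self, voter_id: str, candidate: str) -> None:
        voter = self.voters[voter_id]
        if voter.has_voted:
            # Assumption: each voter may cast a single vote.
            raise ValueError("voter has already voted")
        if self.candidates[candidate] != voter.province:
            raise ValueError("candidate is not from the voter's province")
        self.tally[candidate] += 1
        voter.has_voted = True
```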
