Arslan Aslam

About

Detail

Islamabad, Islamabad Capital Territory, Pakistan


Résumé


Jobs
  • Data Engineer Intern
    BALANX-Bio
    Dec 2025 - Current (8 months)
    • Built and maintained end-to-end RAG (Retrieval-Augmented Generation) pipelines using LangChain, enabling semantic retrieval over 50K+ enterprise documents and reducing manual data lookup time by approximately 40%
    • Developed and tested scalable AI/data pipelines with structured unit testing and validation checks, improving workflow reliability and reducing pipeline failures by 30%
    • Optimized ETL processes across 5+ data sources, improving data freshness from daily to near real-time and ensuring consistent, accurate downstream reporting
    • Integrated vector embeddings and AI-driven workflows into existing infrastructure in collaboration with cross-functional teams, laying the groundwork for intelligent search capabilities
Education
Projects (professional or personal)
    • Intelligent E-Commerce Customer Insights Agent
      Apr 2026 - Current (4 months)
      ⚙️ How it works:
      🔹 The data pipeline follows the Medallion Architecture (Bronze → Silver → Gold), orchestrated via Apache Airflow DAGs.
      🔹 Customer reviews are embedded using Voyage AI and stored in ChromaDB for semantic search.
      🔹 A LangGraph agent routes each question to the right path: SQL for data queries, RAG for customer reviews, or both.
      🔹 The final product is a Streamlit chat interface backed by a FastAPI service, fully containerized with Docker Compose.
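      The routing step above can be illustrated with a minimal sketch. It assumes a simple keyword heuristic standing in for the LangGraph agent, which presumably makes this decision with an LLM; the function name `route_question`, the keyword sets, and the route labels are all hypothetical and not taken from the project.

```python
from typing import Literal

Route = Literal["sql", "rag", "both"]

# Hypothetical keyword hints; the real agent likely decides with an LLM call.
SQL_HINTS = {"revenue", "orders", "total", "average", "count", "sales"}
RAG_HINTS = {"review", "reviews", "complain", "complaints", "feedback", "say"}


def route_question(question: str) -> Route:
    """Pick a path for a question: structured data, review text, or both."""
    # Normalize to lowercase words with trailing punctuation stripped.
    words = {w.strip("?.,!").lower() for w in question.split()}
    wants_sql = bool(words & SQL_HINTS)
    wants_rag = bool(words & RAG_HINTS)
    if wants_sql and wants_rag:
        return "both"
    if wants_rag:
        return "rag"
    # Fall back to SQL when nothing suggests review text is needed.
    return "sql"
```

      In a graph-based agent, a function like this would typically act as the condition that selects the next node; here it only returns a label.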
    • Real-Time Smart City End-to-End Pipeline
      Jan 2026 - Current (7 months)
      ⚙️ Workflow Sequence:
      🔹 Real-time data is generated and streamed via Apache Kafka; Spark consumes the streams and writes optimized, partitioned data to AWS S3.
      🔹 The arrival of new data triggers an AWS Lambda function, which activates a Glue Crawler to automatically create or update tables for querying through Athena.
      🔹 Success markers in S3 trigger an Airflow DAG, which is responsible for:
        • Detecting new data
        • Validating Glue tables
        • Repairing partitions
        • Loading curated data into Amazon Redshift
      🔹 The entire project is containerized using Docker and version-controlled via GitHub.
      🛠️ Tools & Technologies used:
      🔹 Apache Zookeeper
      🔹 Apache Kafka
      🔹 Apache Spark
      ☁️ Amazon Web Services (IAM Role, S3 Buckets, Lambda, Glue, Athena, Redshift)
      ⚙️ Ap
    • Sales ETL Pipeline
      Oct 2025 - Current (10 months)
      🔹 Objective: To design a modular ETL (Extract, Transform, Load) process for sales data using modern data engineering tools and practices.
      🔹 Tools & Technologies Used:
      🐳 Docker: to containerize Apache Airflow and PostgreSQL for smooth local orchestration.
      ⚙ Apache Airflow: to schedule and automate ETL workflows through DAGs with task dependencies and custom logging.
      🐘 PostgreSQL (via DBeaver): to perform data cleaning, filtering, and transformations using modular SQL scripts.
      💻 Python (with psycopg2, Pandas, Jupyter Notebook): for connecting to the database, validating data, and handling transformations programmatically.
      🔹 Workflow Sequence:
      Extracted raw sales data and loaded it into PostgreSQL. Performed data cleaning, filtering,
    • End-to-End AWS ETL Pipeline with Apache Airflow
      Oct 2025 - Current (10 months)
      ⚙️ Workflow Sequence:
      🔹 The project began with data cleaning and validation of raw CSV files using an AWS Glue ETL Job Notebook, ensuring the datasets were consistent and analysis-ready.
      🔹 The cleaned files were then stored securely in Amazon S3, which served as the central data lake for the pipeline.
      🔹 Next, I created an AWS Glue Database and configured Crawlers to automatically infer the schema and build data catalogs for the cleaned datasets, making them queryable in Amazon Athena.
      🔹 Using Amazon Athena, I wrote multiple SQL queries to extract insights and perform analytical operations such as data filtering, aggregation, and optimization directly on S3 data.
      🔹 For workflow automation, I integrated Apache Airflow locally through Docker
    • ETL Pipeline
      Sep 2025 - Current (11 months)
      🔹 What this project does:
        • Extracts raw transaction data into Databricks
        • Transforms the data with PySpark & SQL (cleaning, aggregation, filtering)
        • Loads structured results into tables for business insights
      🔹 Skills applied: Databricks | Apache Spark | PySpark | SQL | Data Engineering Concepts
      This project gave me hands-on experience with building pipelines, handling transformations, and querying structured data.
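      The extract, transform, and load steps above can be sketched in miniature. This is an illustration only: it uses the Python standard library (`csv`, `sqlite3`) as a stand-in for Databricks and PySpark, and the sample data, column names, and the `customer_totals` table are hypothetical.

```python
import csv
import io
import sqlite3
from collections import defaultdict

# Hypothetical sample input; the actual project loads transaction data into Databricks.
RAW_CSV = """customer,amount
alice,10.50
bob,
alice,4.50
carol,-3.00
bob,20.00
"""


def extract(text):
    """Extract: parse raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean, filter, and aggregate amounts per customer."""
    totals = defaultdict(float)
    for row in rows:
        raw = (row.get("amount") or "").strip()
        if not raw:
            continue  # cleaning: skip rows with a missing amount
        amount = float(raw)
        if amount <= 0:
            continue  # filtering: skip non-positive amounts
        totals[row["customer"]] += amount  # aggregation
    return dict(totals)


def load(totals, conn):
    """Load: write the aggregated results into a structured table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_totals "
        "(customer TEXT PRIMARY KEY, total REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO customer_totals VALUES (?, ?)",
        sorted(totals.items()),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
result = conn.execute(
    "SELECT customer, total FROM customer_totals ORDER BY customer"
).fetchall()
```

      Keeping each stage as its own function mirrors the modular structure the project describes; in PySpark the same stages would operate on DataFrames rather than Python lists.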
    • Online Voting System
      Dec 2024 - Current (1 year 8 months)
      A secure, province-based voting simulation implementing OOP and data management.
      ✔ Key Features:
        • User Authentication: Login/registration system with password strength validation and unique voter ID checks.
        • Province-Based Voting: Voters can only vote for candidates from their registered province (Punjab, Sindh, Balochistan, KPK).
        • Admin Controls: View voter records, candidate vote counts, and election results.
        • Data Integrity: Input validation to prevent invalid votes or duplicate registrations.
        • Vote Tracking: Real-time vote counting and result declaration per province.
      ✔ Technical Highlights:
        • OOP Concepts: Used struct to manage voter data (ID, province, votes) and modular functions for each operation.
        • Memory Management: Employed pointers to d