Machine Learning Engineer
Meta
Jun 2020 - Jul 2023 (3 years 2 months)
• Led distributed training of large-scale transformers on 2,048-GPU clusters with DeepSpeed ZeRO-3, NCCL, and PyTorch, achieving 5.7 TFLOP/s per GPU and accelerating iteration cycles for 70B-parameter experiments that informed LLaMA and broader LLM research.
• Automated ingestion and preprocessing of 12TB+ of daily conversational logs using Spark, PyArrow, Hive, and Airflow, generating privacy-compliant datasets that fueled RLHF pipelines and safety-aligned training for next-gen LLMs.
• Architected production-grade AI-assisted pipelines with FBLearner Flow, Hydra, MLflow, Docker, and Kubernetes, enabling reproducible training, experiment tracking, scalable and shadow deployments, and automated testing integrated into Meta's CI/CD workflows.