Machine Learning Engineer — Training Optimization at Featherless AI | Torre
Machine Learning Engineer — Training Optimization

You'll optimize large-scale model training, directly shaping the future of AI iteration and deployment.
Full-time

Legal agreement: Employment

Provide your expected compensation while applying
Remote (anywhere)
Shared by
Emma of Torre.ai
about 1 month ago

Requirements and responsibilities


About the Role

We’re looking for an ML Engineer focused on training optimization to help us scale and improve large-scale model training. You’ll work at the intersection of research and production, optimizing training pipelines for speed, stability, and cost, while collaborating closely with researchers pushing model architecture and capability forward.

This is a high-impact role with real ownership: your work directly affects how fast we can iterate, how large we can scale, and how efficiently we deploy new models.

What You’ll Do

- Optimize large-scale model training pipelines (throughput, convergence, stability, and cost)
- Improve distributed training strategies (data, model, and pipeline parallelism)
- Tune optimizers, schedulers, batch sizing, and precision (bf16 / fp16 / fp8)
- Reduce training time and compute cost via profiling, bottleneck analysis, and systems-level improvements
- Collaborate with researchers on architecture-aware training strategies
- Build and maintain robust training infrastructure (checkpointing, fault tolerance, reproducibility)
- Evaluate and integrate new training techniques (e.g. gradient checkpointing, ZeRO, FSDP, custom kernels)
- Own training performance metrics and continuously push them forward

What We’re Looking For

- Strong experience training large neural networks (LLMs or similarly large models)
- Hands-on experience with training optimization (not just model usage)
- Solid understanding of:
  - Backpropagation, optimization algorithms, and training dynamics
  - Distributed systems for ML training
- Experience with PyTorch (required)
- Comfort working close to hardware (GPUs, memory, networking constraints)
- Ability to move fluidly between research ideas and production-ready code

Nice to Have

- Experience with large-scale distributed training (multi-node, multi-GPU)
- Familiarity with DeepSpeed, FSDP, Megatron, or custom training stacks
- Experience optimizing training on AMD or NVIDIA GPUs
- Contributions to open-source ML infrastructure or research codebases
- Exposure to non-Transformer architectures (RNNs, hybrid models, etc.)

Why Join Us

- Real ownership at Series-A stage: your work shapes the company’s trajectory
- Work on cutting-edge models and training systems at scale
- Small, highly technical team with fast feedback loops
- Strong emphasis on engineering quality and research rigor
- Competitive compensation and meaningful equity