AI Researcher — Inference Optimization at Featherless AI | Torre
AI Researcher — Inference Optimization

You'll optimize large-scale AI inference, driving efficiency and deploying cutting-edge systems in production.
Full-time

Legal agreement: Employment

Provide your expected compensation while applying
Remote (anywhere)
Shared by
Emma of Torre.ai
20 days ago

Requirements and responsibilities


Role Overview

We are seeking an AI Researcher with deep experience in inference optimization to design, evaluate, and deploy high-performance inference systems for large-scale machine learning models. You will work at the intersection of model architecture, systems engineering, and hardware-aware optimization, improving latency, throughput, and cost efficiency across real-world production environments.

Key Responsibilities

- Research and develop techniques to optimize inference performance for large neural networks.
- Improve latency, throughput, memory efficiency, and cost per inference.
- Design and evaluate model-level optimizations (quantization, pruning, KV-cache optimization, architecture-aware simplifications).
- Implement systems-level optimizations (dynamic batching, kernel fusion, multi-GPU inference, prefill vs decode optimization).
- Benchmark inference workloads across hardware accelerators.
- Collaborate with engineering teams to deploy optimized inference pipelines.
- Translate research insights into production-ready improvements.

Required Qualifications

- Strong background in machine learning, deep learning, or AI systems.
- Hands-on experience optimizing inference for large-scale models.
- Proficiency in Python and modern ML frameworks (e.g., PyTorch).
- Experience with inference tooling (e.g., Triton, TensorRT, vLLM, ONNX Runtime).
- Ability to design experiments and communicate results clearly.

Preferred / Nice-to-Have Qualifications

- Experience deploying production inference systems at scale.
- Familiarity with distributed and multi-GPU inference.
- Experience contributing to open-source ML or inference frameworks.
- Authorship or co-authorship of peer-reviewed research papers in machine learning, systems, or related fields.
- Experience working close to hardware (CUDA, ROCm, profiling tools).

What Success Looks Like

- Measurable gains in latency, throughput, and cost efficiency.
- Optimized inference systems running reliably in production.
- Research ideas successfully translated into deployable systems.
- Clear benchmarks and documentation that inform product decisions.

Relevant Research Areas (Bonus)

- Long-context inference optimization
- Speculative decoding
- KV-cache compression and paging
- Efficient decoding strategies
- Hardware-aware inference design