Machine Learning Engineer — Inference Optimization at Featherless AI | Torre

Machine Learning Engineer — Inference Optimization

You'll optimize large-scale ML model inference to deliver cutting-edge performance and real user impact.
Full-time

Legal agreement: To be defined

USD75.4K - 100K/year

~COP150M - 200M/year

+ Equity

+ Bonuses

Remote (anywhere)
Shared by
Emma of Torre.ai
20 days ago

Requirements and responsibilities


About the Role

We’re looking for a Machine Learning Engineer to own and push the limits of model inference performance at scale. You’ll work at the intersection of research and production, turning cutting-edge models into fast, reliable, and cost-efficient systems that serve real users.

This role is ideal for someone who enjoys deep technical work, profiling systems down to the kernel/GPU level, and translating research ideas into production-grade performance gains.

What You’ll Do

- Optimize inference latency, throughput, and cost for large-scale ML models in production
- Profile GPU/CPU inference pipelines and identify their bottlenecks (memory, kernels, batching, IO)
- Implement and tune techniques such as:
  - Quantization (fp16, bf16, int8, fp8)
  - KV-cache optimization & reuse
  - Speculative decoding, batching, and streaming
  - Model pruning or architectural simplifications for inference
- Collaborate with research engineers to productionize new model architectures
- Build and maintain inference-serving systems (e.g. Triton, custom runtimes, or bespoke stacks)
- Benchmark performance across hardware (NVIDIA / AMD GPUs, CPUs) and cloud setups
- Improve system reliability, observability, and cost efficiency under real workloads

What We’re Looking For

- Strong experience in ML inference optimization or high-performance ML systems
- Solid understanding of deep learning internals (attention, memory layout, compute graphs)
- Hands-on experience with PyTorch (or similar) and model deployment
- Familiarity with GPU performance tuning (CUDA, ROCm, Triton, or kernel-level optimizations)
- Experience scaling inference for real users (not just research benchmarks)
- Comfortable working in fast-moving startup environments with ownership and ambiguity

Nice to Have

- Experience with LLM or long-context model inference
- Knowledge of inference frameworks (TensorRT, ONNX Runtime, vLLM, Triton)
- Experience optimizing across different hardware vendors
- Open-source contributions in ML systems or inference tooling
- Background in distributed systems or low-latency services

Why Join Us

- Real ownership over performance-critical systems
- Direct impact on product reliability and unit economics
- Close collaboration with research, infra, and product
- Competitive compensation + meaningful equity at Series A
- A team that cares about engineering quality, not hype