Staff ML Infrastructure Engineer at Playlab | Torre

Staff ML Infrastructure Engineer

You'll architect accessible, cost-efficient AI infrastructure to empower educators and students globally.
Full-time

Legal agreement: Employment

Compensation
USD180k - 240k/year
Remote (anywhere)
Shared by
Emma of Torre.ai
2 months ago

Requirements and responsibilities


About Playlab

Playlab is a tech non-profit dedicated to helping educators and students become critical consumers and creators of AI.

We believe that an open-source, community-driven approach is key to harnessing the potential of AI in education. We equip communities with AI tools and hands-on professional development that empowers educators & students to build custom AI apps for their unique context. Over 60,000 educators have published apps on Playlab – and the impact is growing every day.

At Playlab, we believe that AI is a new design material - one that should be shaped by many to bring their ideas about learning to life. If you're passionate about building creative, equitable futures for students and teachers, we hope you’ll join us.

The Role

Playlab seeks a Staff Machine Learning Engineer to join our growing Engineering team. As a Staff ML Infrastructure Engineer, you'll be designing the systems that keep AI accessible as we grow - balancing cutting-edge capabilities with cost efficiency, powering research into what works in educational AI, and building toward a future where sophisticated AI can run anywhere in the world.

Examples of the work

- Build data pipelines that scrub PII, create research datasets, and power the research portal for educational AI studies
- Architect the path toward self-hosted and on-device model deployments for privacy and global accessibility
- Design and implement model orchestration systems that intelligently route requests across multiple AI providers (OpenAI, Anthropic, AWS Bedrock, open-source models)
- Build cost optimization infrastructure - implement conversation compression, prompt caching, and smart model selection to keep AI accessible
- Create comprehensive observability systems for ML operations - track costs, latency, quality, and usage patterns across thousands of applications
- Design and implement infrastructure for fine-tuning and deploying custom models
- Build monitoring and alerting systems that help us maintain reliability as AI interactions scale
- And more…

Expectations

- Design, build, and maintain production ML infrastructure that balances performance, cost, and reliability
- Own data quality and research dataset creation - ensure data is properly scrubbed, documented, and useful for research partners
- Stay on top of ML infrastructure technologies and techniques - from model serving to cost optimization to observability tools
- Work cross-functionally with ML engineers, backend engineers, and product to ensure infrastructure supports real needs
- Balance innovation with operational excellence - experiment with new approaches while maintaining system reliability and data quality
- Mentor engineers on ML operations, cost optimization, and production ML best practices

Qualifications

- 7+ years building production ML/data systems, with experience in ML operations and infrastructure
- Strong experience with model serving, orchestration, and optimization in production environments
- Proficient in Python and data pipeline technologies (Airflow, ETL tools, etc.)
- Experience with cloud infrastructure (AWS preferred) and containerization (Kubernetes, Docker)
- Experience with cost optimization strategies for LLM-based systems
- Thrive in high-agency, high collaboration cultures
- Great communication that makes working remote-first work

Bonus Points For...

- Experience in education or building in edtech
- Experience with educational technology or mission-driven organizations
- Experience with designing creative platforms
- Experience with LiteLLM or similar model routing frameworks
- Background in privacy-preserving ML or PII handling
- Experience building research data infrastructure
- Contributions to open source ML infrastructure projects

Technologies

Python, AWS, Kubernetes, Docker, Airflow, LiteLLM, PostgreSQL, Neo4J, Vector Databases, Terraform, Monitoring tools (New Relic, OpenTelemetry)