DevOps / Site Reliability Engineer at Bespoke Labs | Torre

DevOps / Site Reliability Engineer

You'll own and scale cloud infrastructure, directly powering frontier AI model training and evaluation.
Freelance
Recurrent
Remote (anywhere)
Shared by
Emma of Torre.ai
about 1 month ago

Requirements and responsibilities


About Bespoke Labs

Bespoke Labs is an AI research and data company building the datasets, benchmarks, and evaluation infrastructure that power frontier AI models. We're backed by leading investors, trusted by top AI labs, and have research accepted at venues like ICLR 2026. Our team is small, moves fast, and has an outsized impact on how the next generation of AI is built.

The Role

We're looking for a mid-level DevOps / Site Reliability Engineer to own and scale our cloud infrastructure. You'll work closely with engineering and ML teams to keep our systems reliable, observable, and fast, directly supporting the infrastructure that powers AI data pipelines at scale.

What You'll Do

- Own cloud infrastructure on AWS: EC2, EKS, RDS, S3, IAM, VPC
- Manage Kubernetes clusters and container orchestration end-to-end
- Build and maintain CI/CD pipelines using GitHub Actions or similar
- Implement monitoring, alerting, and observability stacks (Prometheus, Grafana, or DataDog)
- Improve reliability, performance, and security of production systems
- Automate infrastructure with Terraform or similar IaC tools
- Debug and resolve issues across complex, distributed systems
- Participate in design reviews and help raise the infrastructure bar

What We're Looking For

- 3–5 years in DevOps, SRE, or infrastructure engineering
- Strong AWS experience: EKS, EC2, RDS, S3, IAM
- Kubernetes: deployment, scaling, troubleshooting in production
- CI/CD pipelines: GitHub Actions, ArgoCD, or similar
- Infrastructure as Code: Terraform, Pulumi, or CDK
- Python or Go scripting
- Experience working in production environments with real users
- Comfort with ambiguity and ability to operate autonomously

Nice to Have

- Experience supporting ML training workloads or GPU clusters
- Familiarity with distributed computing or large-scale data pipelines
- Prior work at an AI, ML, or data company
- Open-source contributions or published technical writing

What We Offer

- Competitive compensation and meaningful equity
- Direct impact on frontier AI model training and evaluation infrastructure
- Flexible, remote-friendly environment with low bureaucracy
- A small, high-caliber team with deep AI research expertise
- Health, wellness, and learning & development benefits