DevOps / Site Reliability Engineer at Bespoke Labs

About Bespoke Labs

Bespoke Labs is an AI research and data company building the datasets, benchmarks, and evaluation infrastructure that power frontier AI models. We're backed by leading investors, trusted by top AI labs, and have research accepted at venues like ICLR 2026. Our team is small, moves fast, and has an outsized impact on how the next generation of AI is built.

The Role

We're looking for a mid-level DevOps / Site Reliability Engineer to own and scale our cloud infrastructure. You'll work closely with engineering and ML teams to keep our systems reliable, observable, and fast, directly supporting the infrastructure that powers AI data pipelines at scale.

What You'll Do

- Own cloud infrastructure on AWS: EC2, EKS, RDS, S3, IAM, VPC
- Manage Kubernetes clusters and container orchestration end-to-end
- Build and maintain CI/CD pipelines using GitHub Actions or similar
- Implement monitoring, alerting, and observability stacks (Prometheus, Grafana, or DataDog)
- Improve reliability, performance, and security of production systems
- Automate infrastructure with Terraform or similar IaC tools
- Debug and resolve issues across complex, distributed systems
- Participate in design reviews and help raise the infrastructure bar

What We're Looking For

- 3–5 years in DevOps, SRE, or infrastructure engineering
- Strong AWS experience: EKS, EC2, RDS, S3, IAM
- Kubernetes: deployment, scaling, troubleshooting in production
- CI/CD pipelines: GitHub Actions, ArgoCD, or similar
- Infrastructure as Code: Terraform, Pulumi, or CDK
- Python or Go scripting
- Experience working in production environments with real users
- Comfort with ambiguity and ability to operate autonomously

Nice to Have

- Experience supporting ML training workloads or GPU clusters
- Familiarity with distributed computing or large-scale data pipelines
- Prior work at an AI, ML, or data company
- Open-source contributions or published technical writing

What We Offer

- Competitive compensation and meaningful equity
- Direct impact on frontier AI model training and evaluation infrastructure
- Flexible, remote-friendly environment with low bureaucracy
- A small, high-caliber team with deep AI research expertise
- Health, wellness, and learning & development benefits

Andy Tseng

Mahesh Sathiamoorthy

About Bespoke Labs:

bespokelabs.ai/
