Site Reliability Engineer - AI Agents at Kraken | Torre
Site Reliability Engineer - AI Agents

You'll build and scale AI infrastructure, powering open finance and global financial freedom.
Full-time

Legal agreement: Employment

Compensation
USD 96k - 192k/year
Remote (for United States residents)
Shared by
Emma of Torre.ai
15 days ago

Requirements and responsibilities


Overview

Building the Future of Open Finance

Payward - the parent company behind Kraken, NinjaTrader, Breakout, xStocks, Payward Services and CF Benchmarks - has spent the last 15 years building one of the most modern and globally accessible financial infrastructure platforms in the industry, built to advance an open, global financial system.

Before you apply, we encourage you to explore our culture page to understand what drives us and how we work.

The team

Founded in 2011, Kraken is one of the world's longest-standing crypto platforms, trusted by over 10 million individuals and institutions across the globe. It offers spot trading, margin, futures, staking, and OTC services, with products built for both individual investors and institutional clients.

The AI Infrastructure team sits within the Data organization and is responsible for building, operating, and scaling the systems that power AI agents in production, both internal tools and external-facing products. Working closely with the AI and Agent Systems teams, this group ensures that the orchestration, execution, and model-serving layers underpinning agentic workflows are reliable, observable, and built to scale.

This team operates at the intersection of data infrastructure and applied AI, a space that moves fast and demands engineers who can bring production discipline to emerging technology. You'll partner across Data Engineering, ML, and product-facing teams to harden agent infrastructure and keep it running at the standards our users expect.

Importantly, this is a platform engineering team. Beyond operating infrastructure, the team is responsible for building the APIs, SDKs, and platform capabilities that enable AI, Data, and Engineering teams to safely and efficiently consume agent infrastructure as a service. Success in this role requires thinking beyond infrastructure operations and toward developer experience, platform adoption, and long-term scalability.

The opportunity

- Design, build, and operate the infrastructure layer supporting AI agent workflows in production
- Ensure reliability, scalability, and observability of agentic systems across internal and external products
- Design and develop platform services, APIs, SDKs, and self-service capabilities that allow engineering teams to easily consume AI infrastructure and agent platform services
- Manage and maintain the compute, orchestration, and serving infrastructure powering model inference and agent execution
- Implement robust monitoring, alerting, and incident response procedures tailored to AI/ML workloads
- Utilize Infrastructure as Code (IaC) tools such as Terraform to provision and manage cloud (AWS) infrastructure components
- Build and maintain CI/CD pipelines that support rapid, reliable deployment of AI services and agent workflows
- Define and implement guardrails, failure handling, and recovery patterns specific to agentic and LLM-powered systems
- Collaborate with AI and Data Engineering teams to translate experimental agent prototypes into hardened production systems
- Manage containerized workloads using Kubernetes, ensuring efficient deployment, scaling, and orchestration of AI services
- Implement access controls and security best practices across AI infrastructure environments
- Document architecture, runbooks, and best practices to support knowledge sharing across the team

What You Bring

- 5+ years of experience as a Site Reliability Engineer, Infrastructure Engineer, Platform Engineer, or similar role in a production environment
- Hands-on experience supporting ML infrastructure, model serving, or MLOps workflows in production
- Experience building developer platforms, internal tooling, APIs, or SDKs consumed by engineering teams at scale
- Strong understanding of platform engineering principles, including developer experience, self-service infrastructure, and API-driven platform design
- Proficiency with Infrastructure as Code tools, particularly Terraform
- Experience with containerization and orchestration, particularly Kubernetes and Docker
- Solid understanding of cloud infrastructure, preferably AWS
- Strong scripting skills (bash/shell) and proficiency in at least one programming language (Python preferred)
- Experience designing and operating observability, monitoring, and alerting systems
- Experience implementing incident response procedures and participating in on-call rotations
- Strong collaboration skills working across data, AI, and engineering teams
- High ownership mindset in a fast-moving, high-stakes production environment

Nice to haves

- Experience building or operating infrastructure for agent-based or LLM-powered systems
- Familiarity with agent orchestration frameworks (e.g., LangGraph, CrewAI, or similar)
- Background in data infrastructure, including familiarity with Airflow, Kafka, Spark, or data lake tooling
- Experience with CI/CD pipelines and deployment automation for AI/ML workloads
- Exposure to evaluation frameworks and model performance monitoring at scale
- Experience working in fast-moving 0→1 environments or platform-building teams
- Experience building SDKs, developer tooling, or internal platform products with a strong focus on usability and adoption
- Experience with Cloudflare's cloud platform and product ecosystem, including networking, security, performance, and Zero Trust solutions

Unless a specific application deadline is stated in the job posting, applications are accepted on an ongoing basis.

Please note, applicants are permitted to redact or remove information on their resume that identifies age, date of birth, or dates of attendance at or graduation from an educational institution.

We consider qualified applicants with criminal histories for employment on our team, assessing candidates in a manner consistent with the requirements of the San Francisco Fair Chance Ordinance.

Payward is powered by people from around the world and we celebrate the diverse talents, backgrounds, contributions, and unique perspectives that everyone brings to the table. We hire based on merit, seeking out people with the right abilities, knowledge, and skills for the job. We encourage you to apply for roles where you don't fully meet the listed requirements, especially if you're passionate or knowledgeable about crypto.

We may ask candidates to complete job-related skills or work-style assessments as part of our hiring process. These assessments evaluate competencies relevant to the role and are applied consistently across candidates for similar positions. Results are considered alongside experience and interviews, and are not the sole basis for any employment decision.

As an equal opportunity employer, we don't tolerate discrimination or harassment of any kind, whether based on race, ethnicity, age, gender identity, citizenship, religion, sexual orientation, disability, pregnancy, veteran status, or any other protected characteristic as outlined by federal, state, or local laws.

This is the target annual salary range for this role. This range is not inclusive of other additional compensation elements, such as our Bonus program, Equity program, Wellness allowance, and other benefits [US Only] (including medical, dental, vision and 401(k)).

The compensation range provided is influenced by various factors and represents the initial target range. Our salary offerings are dynamic and we strive to ensure that our base salary and total compensation package align with and recognize the top talent we aim to attract and retain. The compensation package of the successful candidate is based on various factors such as their skillset, experience, and job scope.