Order.co is the System of Action for the Office of the CFO, transforming the way businesses purchase and pay into an intuitive, B2C-like shopping experience. Order.co leverages embedded AI agents and embedded financial products to reinvent the way businesses connect with their vendors. End users enjoy a seamless, zero-training buying experience, while finance and procurement leaders gain a single platform to orchestrate how the business “should operate”. The result is an all-in-one solution that serves as a gravitational pull for spend and data, automating and eliminating procurement and finance workflows from requisition to reconciliation along the way.

Order.co is on the cutting edge of B2B Agentic Commerce, poised to be the market leader in creating a more predictive, prescriptive, and personalized experience for users. Founded in 2016 and headquartered in New York City, Order.co oversees nearly half a billion in annualized spend across hundreds of customers like WeWork, SoulCycle, Lume, and [solidcore]. Order.co has raised $75M in funding from industry-leading investors like MIT, Stage 2 Capital, Rally Ventures, 645 Ventures, and more. Order.co has been proudly named a 50 to Watch by Spend Matters and a Best Place to Work by BuiltIn and Inc. Magazine.

The Role

As a Senior Site Reliability Engineer on the Platform team, you will ensure that software systems are reliable, scalable, performant, and operationally efficient. You blend software engineering skills with infrastructure and operations expertise to keep critical systems running smoothly while enabling rapid product development.

Responsibilities

Reliability Engineering & Infrastructure Ownership
- Design, build, and operate highly available, scalable, and fault-tolerant infrastructure and platform services
- Own reliability, availability, latency, and operational excellence for critical production systems and services
- Define and maintain service level objectives (SLOs), service level indicators (SLIs), and error budgets across platform systems
- Lead incident response efforts for complex production outages; drive root-cause analysis and long-term remediation actions
- Build resilient systems that gracefully handle failures, traffic spikes, dependency degradation, and regional outages
- Continuously improve system reliability through automation, observability, performance tuning, and capacity planning

Automation & Platform Engineering
- Develop infrastructure automation and self-service tooling to reduce operational toil and improve engineering velocity
- Build and maintain CI/CD pipelines, deployment automation, and release engineering workflows
- Implement infrastructure as code (IaC) practices using tools such as Terraform, CloudFormation, and container orchestration
- Improve developer experience by building reliable internal platforms, operational tooling, and standardized deployment patterns
- Drive adoption of GitOps, immutable infrastructure, and automated remediation patterns

Observability & Operational Excellence
- Design and maintain comprehensive monitoring, logging, tracing, and alerting systems for distributed services
- Establish actionable alerting standards that reduce noise while improving incident detection and response times
- Analyze production trends, system bottlenecks, and failure patterns to proactively prevent incidents
- Lead operational readiness reviews, disaster recovery planning, and game-day exercises
- Improve mean time to detect (MTTD) and mean time to recovery (MTTR) through tooling, automation, and process refinement

Systems Architecture & Scalability
- Participate actively in architecture and infrastructure design reviews
- Propose scalable and reliable platform designs that account for multi-region deployment, redundancy, failover, and security considerations
- Evaluate trade-offs between reliability, scalability, operational complexity, and engineering velocity
- Identify systemic risks and operational gaps before they become production incidents
- Partner with engineering teams to ensure services are designed with operability, observability, and resilience in mind from day one

Security & Compliance
- Approach infrastructure and operational practices with a strong security mindset
- Implement and maintain secure cloud networking, secrets management, IAM policies, and infrastructure hardening standards
- Partner with Security and Compliance teams to ensure systems meet organizational and regulatory requirements
- Drive operational best practices around vulnerability management, patching, and production access controls

End-to-End Ownership & Collaboration
- Scope and estimate infrastructure and reliability initiatives accurately
- Coordinate production rollouts, maintenance events, and reliability improvements across teams
- Communicate operational risks, dependencies, and incident impacts clearly to technical and non-technical stakeholders
- Collaborate closely with Software Engineering, Security, Product, and Operations teams to improve platform reliability and scalability
- Serve as a trusted escalation point during critical production incidents

Mentorship & Technical Leadership
- Mentor junior and mid-level engineers on reliability engineering principles, operational excellence, and infrastructure best practices
- Raise the operational maturity of the engineering organization through documentation, reviews, and technical guidance
- Drive improvements in team standards around observability, incident management, automation, and infrastructure design
- Influence technical decisions through credibility, operational expertise, and strong engineering judgment

Qualifications
- You are motivated by accountability: you own outcomes, not just tasks
- You are results-oriented and measure success by shipped, working software
- You are motivated by correctness in code that touches money: the consequences of a bug land on real customer balances, and you take that seriously
- You love helping people on your team grow and improve
- Writing tests is an integral part of your development process, not an afterthought
- You know how to design and build software incrementally; you don't need a complete spec to make progress
- Collaborating with the people around you to achieve a goal motivates you
- You are collaborative, open-minded, and actively developing your craft
- You are curious and pragmatic about AI-driven solutions; you apply them where they add real value and stay skeptical where they don't
- Familiarity with AI-assisted development tools: you understand how they work, where they help, and where they fail. Prior hands-on use is a plus; intellectual curiosity and the instinct to evaluate AI output critically are what matter

Technical Skills
- Strong foundation in computer science fundamentals: data structures, algorithms, and system design
- Familiarity with building production-grade applications and services using Ruby and Ruby on Rails
- Deep expertise with Linux systems administration and production troubleshooting
- Strong experience operating cloud infrastructure at scale, particularly within AWS environments
- Experience with Kubernetes, container orchestration, and cloud-native infrastructure patterns
- Proficiency with infrastructure as code tools such as Terraform or CloudFormation
- Expertise designing and operating CI/CD pipelines and deployment automation systems
- Deep understanding of observability tooling including Datadog, OpenTelemetry, or similar platforms
- Strong knowledge of distributed systems reliability patterns including redundancy, failover, autoscaling, rate limiting, and graceful degradation
- Experience building automation and operational tooling using languages such as Python, Go, Bash, or Ruby
- Strong understanding of networking fundamentals including DNS, load balancing, TLS, VPNs, firewalls, and service discovery
- Hands-on experience with incident response, root-cause analysis, and production operations in high-availability environments
- Familiarity with SRE methodologies including SLOs, SLIs, error budgets, capacity planning, and operational maturity modeling
- Experience implementing secure infrastructure and cloud security best practices including IAM, secrets management, and vulnerability remediation
- Proven ability to design scalable, resilient, and maintainable platform systems and APIs
- Experience supporting distributed microservices architectures and event-driven systems
- Strong understanding of operational excellence principles including automation-first engineering and toil reduction
- Experience using AI-assisted engineering tools (e.g., Claude, GitHub Copilot) as force multipliers while applying sound operational and engineering judgment
- Excellent debugging and systems thinking skills across infrastructure, networking, application, and platform layers

What You’ll Receive
- Competitive compensation including base salary, bonus, and equity
- Employer-sponsored 401(k) with match
- Comprehensive medical, dental, and vision coverage
- Flexible time off and hybrid work environment

The anticipated annual salary range for this role is $175,000 - $200,000. Actual compensation and title will be commensurate with experience, qualifications, knowledge, and skills.

About Order.co:

work-order.co/
