Contract (long-term), US (PST, MST). Flexible rate depending on experience.
Role Overview.
Own the reliability and release operations for our integration work. You’ll give developers a smooth path from code to production, keep environments healthy and secure, and make system health visible so issues are found and fixed fast. Over time, you’ll tune cost/performance, harden security, and evolve our standards so we can ship integrations predictably as we scale.
What you’ll do.
* Own the platform lifecycle: maintain and improve our cloud setup (DigitalOcean preferred), databases (Postgres), caches/queues (Redis), and the way environments are created (Terraform/IaC).
* Operate releases: keep CI/CD fast and safe (GitHub Actions), enforce health checks and rollbacks, and make deploys predictable across multiple integration workstreams.
* Make reliability visible: centralize logs/metrics/traces, keep practical alerts in place, and publish clear runbooks so first responders know what to do.
* Strengthen security & compliance basics: secrets handling, least-privilege access, image scanning, patches, and simple evidence for audits when needed.
* Manage capacity, cost, and performance: right-size resources, set autoscaling policies, and keep cloud spend within plan.
* Enable the team: answer “how do we…?” questions, write concise docs, and collaborate closely with the Fractional CTO to unblock delivery.
Success Metrics.
* Deployment success rate — % of deploys that complete without rollback. Target: ≥95%.
* Time to restore — median time to recover from a production incident. Target: ≤30 minutes.
* Operational visibility — core alerts verified monthly; runbooks exercised in a safe test. Target: 100% pass.
* Cost & capacity — stay within agreed monthly cloud budget while meeting performance targets.
Experience Required.
* DevOps/SRE: 8+ years running cloud-hosted applications end-to-end.
* IaC & containers: Terraform, Docker; reproducible environments and change control.
* CI/CD: GitHub Actions (or similar) with build/test/scan/sign, blue/green or rolling deploys, and proven rollback.
* Data & queues: operating managed Postgres and Redis at production scale.
* Observability & ops: logging/metrics/alerts (OpenTelemetry or equivalent), incident triage, basic on-call hygiene.
* Security controls: secrets management, certs, firewalls, IAM; incident response.
* Multi-tenant integration patterns (per-tenant config, fairness and rate limiting).
* Apigee knowledge; OpenTelemetry; experience with multi-tenant SaaS and token-bucket rate limiting.
Mindset.
Pragmatic and service-oriented • automates toil • documents as they go • calm in ambiguity • explains choices in plain English • raises the bar without heavy process.
If you’re interested in applying, please fill out this form: https://1uqiq.share.hsforms.com/2YaQDDyr4QZKaMfyJ44QxKw