Staff Database Reliability Engineer at Scribe | Torre

Staff Database Reliability Engineer

You'll architect and own the end-to-end data tier, scaling infrastructure and driving observability for millions of users.
Full-time

Legal agreement: Employment

Compensation
USD 230k - 280k/year
Remote (for United States residents)
Shared by
Emma of Torre.ai
8 days ago

Requirements and responsibilities


We're hiring a Staff Database Reliability Engineer to own the strategy, architecture, and operational excellence of our data infrastructure. This is an expert-level IC role with deep influence on engineering direction, partnering closely with platform, backend, and DevOps engineers.

Why this role matters

You will own the data tier end-to-end. Design schemas and access patterns that scale, tune Aurora for latency and throughput, and set the standards for how engineers interact with our databases. When a migration script seizes up mid-deploy and writes start queueing behind an ACCESS EXCLUSIVE lock, your runbooks and automation resolve the incident quickly.

Make the Django ORM a strength, not a liability:
- Review migrations for safety at scale: locks, backfills, concurrent index builds, NOT VALID constraints
- Catch N+1 patterns and missing select_related/prefetch_related in review
- Establish conventions for QuerySet usage and physical schema design (indexes, constraints, partitioning)
- Scale review through automation, not heroics: author AGENTS.md files and DNA scaffolding that encode our conventions, configure AI review bots (Claude Code, Cursor, etc.) to flag risky migrations and ORM anti-patterns, and iterate on those configs as new failure modes emerge

Lead major infrastructure initiatives:
- Capacity planning as traffic and engineering throughput grow
- Zero-downtime schema migrations and cutovers
- Multi-AZ resilience within a single region: Aurora writer/reader placement, failover behavior and RTO/RPO, ElastiCache and OpenSearch AZ topology, RabbitMQ survivability across AZs
- Backups, PITR, failover testing, retention

Own the CDC pipeline (Aurora → DMS → S3 Parquet → Snowflake):
- DMS task design and tuning, replication slot hygiene on the Postgres side
- Schema evolution as Django migrations roll through, so a column rename doesn't silently break the warehouse at 6 AM
- Parquet layout and partitioning, reliability of the Snowflake handoff
- Automated checks that flag migrations likely to break downstream consumers

Drive observability across three complementary tools:
- pganalyze: query-level performance, index advisor, schema insights; the go-to for "why is this ORM query slow"
- CloudWatch: infrastructure metrics and alarms for Aurora, OpenSearch, ElastiCache, SQS, DMS
- Honeycomb: high-cardinality tracing that ties slow DB calls back to users, flags, deploys, and flows
- Shape how the three fit together, including Django-side instrumentation and trace attributes on ORM queries

Build tooling and guardrails:
- Migration review automation and CI checks for risky patterns
- Slow query pipelines fed from pganalyze
- Self-service dashboards so teams understand their own query footprint

Support and evolve the rest of the stack:
- OpenSearch: index design, sharding, mapping changes, reindexing strategy, Django-side indexing pipelines
- Redis: caching patterns, eviction, sizing, Django cache framework, Celery/RQ usage, avoiding hot keys and thundering herds
- SQS + RabbitMQ: queue design, DLQs, visibility timeouts, exchange/queue topology, AZ mirroring, consumer backpressure, Celery behavior under load

What makes you a great fit

Core expertise:
- Deep PostgreSQL: EXPLAIN (ANALYZE, BUFFERS), MVCC, bloat, lock contention, vacuum/autovacuum. Aurora Serverless V2 / Limitless experience strongly preferred (storage model, reader/writer split, ACU scaling)
- Strong ORM fluency (Django, SQLAlchemy, ActiveRecord, or similar): predict the SQL a query will generate, spot N+1 problems on sight, and know how to control eager loading (joins vs. batched IN queries), column projection, aggregations, and subqueries
- Single-region multi-AZ design: practical understanding of what it does and doesn't protect against

Data movement and observability:
- Production CDC experience, ideally AWS DMS: comfortable with logical replication, slot hygiene, schema evolution, and Parquet-based data lakes feeding Snowflake (or BigQuery/Redshift)
- Hands-on with pganalyze (or Datadog DBM / Performance Insights / pg_stat_statements pipelines), CloudWatch (custom metrics, composite alarms, log insights), and Honeycomb (or another high-cardinality tracing tool); comfortable with OpenTelemetry and opinionated about what makes a trace useful

AI-assisted workflow:
- Real experience making AI coding and review tools useful for a team: writing AGENTS.md files, configuring review agents, versioning and iterating on prompts and configs

The rest of the stack:
- OpenSearch at scale: sizing, sharding, JVM tuning, rolling upgrades, snapshots
- Production Redis: persistence tradeoffs, cluster mode, hot keys, thundering herds
- At least one production message broker (SQS, RabbitMQ, Kafka): delivery semantics, idempotency, failure modes

Engineering and leadership:
- Strong automation and IaC background: real code (Python, Go, or similar) and Terraform
- Track record leading cross-team initiatives, writing design docs that hold up, influencing without authority
- Comfortable in a high-growth environment where the right answer for 50 engineers isn't the right answer for 100
- Pragmatic outlook during incidents, focused on preventing the next one

Full-Time US Employee Benefits Include
- Some of the nicest and smartest teammates you’ll ever work with
- Competitive salaries
- Comprehensive healthcare benefits
- Exciting and motivating equity
- Flexible PTO
- 401k
- Parental Leave
- Commuter Benefits (SF office employees)
- WFH Stipend

Compensation

We benchmark compensation using trusted market data and apply a tiered geographic framework to ensure competitive pay across locations. The ranges below represent the base salary band for this role by tier. Final offers are determined by experience, scope, internal parity, and location.

$230k-$280k base + equity

We consider several factors when determining compensation, including location, experience, and other job-related factors.

At Scribe, we celebrate our differences and are committed to creating a workplace where all employees feel supported and empowered to do their best work. We believe this benefits not only our employees but our product, customers, and community as well. Scribe is proud to be an Equal Opportunity Employer.