Senior Site Reliability Engineer (SRE) - (GCP) at Devsu | Torre

Senior Site Reliability Engineer (SRE) - (GCP)

You'll elevate system reliability and observability across hybrid cloud environments, driving SRE excellence.
Full-time

Legal agreement: Employment

Provide your expected compensation while applying
Remote (for Colombia residents)
Remote (for Guatemala residents)
Remote (for Honduras residents)
Shared by
Emma of Torre.ai
19 days ago

Requirements and responsibilities


We are seeking a Site Reliability Engineer (SRE) with deep expertise in monitoring, observability, and reliability engineering to support systems running across on-premises infrastructure and Google Cloud Platform (GCP).

This role is primarily responsible for designing, operating, and improving monitoring, alerting, and observability platforms, with a strong focus on Grafana and Kubernetes environments.

As a secondary responsibility, this role provides backup coverage for the Application Support team during periods of resource constraints or major incidents, offering L2/L3 technical support when required.

Responsibilities

Monitoring & Observability (Core Focus)
- Own and operate the monitoring and observability stack across on-prem and GCP environments
- Design, build, and maintain Grafana dashboards for infrastructure, Kubernetes, and applications
- Define, tune, and maintain alerts to ensure high signal-to-noise ratio
- Establish observability standards and best practices across teams
- Improve visibility into system health, performance, and reliability

Site Reliability Engineering
- Apply SRE principles to improve availability, performance, and resilience
- Define and track SLIs, SLOs, and error budgets
- Participate in on-call rotations and SEV incident response
- Lead or contribute to incident investigations and root cause analysis (RCA)
- Drive preventative actions to reduce repeat incidents

Kubernetes & Platform Reliability
- Support and monitor Kubernetes environments (GKE and on-prem clusters)
- Monitor cluster health, capacity, and resource utilization
- Troubleshoot platform-level issues impacting application reliability
- Collaborate with Platform and Engineering teams on reliability improvements

Secondary Responsibilities (Backup Application Support)
These responsibilities are activated as needed, not part of day-to-day operations.
- Provide L2/L3 application support coverage during:
  - Support team resource shortages
  - High-severity incidents (SEVs)
  - Peak support periods or escalations
- Triage and troubleshoot application issues using existing runbooks and dashboards
- Collaborate with Application Support and Engineering teams during incidents
- Ensure all actions, findings, and resolutions are documented in ServiceNow (SNOW)

Requirements
- Strong experience as a Site Reliability Engineer or Reliability Engineer
- Deep hands-on expertise with Grafana (dashboards, alerting, troubleshooting)
- Solid experience with monitoring and observability systems
- Production experience operating Kubernetes environments
- Experience supporting systems in GCP and on-prem environments (mandatory)
- Strong Linux systems and troubleshooting skills
- Fluent English (written and spoken).
- Ability to work in PST time zone.
- Ability to participate in an on-call rotation that includes coverage for one weekend day. Time worked during the weekend is compensated with one day off during the week, in accordance with the established work schedule.

Technology Stack:
- Observability: Grafana, Prometheus, logging platforms
- Containers: Kubernetes (GKE and on-prem)
- Cloud: Google Cloud Platform (GCP)
- Operations: Linux, networking, infrastructure monitoring
- Incident Tools: PagerDuty, ServiceNow, Slack (or equivalents)

Nice to have:
- Experience supporting application teams during SEV incidents
- Knowledge of capacity planning and performance tuning
- Scripting skills (Python, Bash, etc.)
- Experience with hybrid infrastructure environments

Benefits

At Devsu, we believe in creating an environment where you can thrive both personally and professionally. By joining our team, you’ll enjoy:
- A stable, long-term contract with opportunities for career growth
- Private health insurance
- A remote-friendly culture that promotes work-life balance
- Continuous training, mentorship, and learning programs to keep you at the forefront of the industry
- Free access to AI training resources and state-of-the-art AI tools to elevate your daily work
- A flexible Paid Time Off (PTO) policy as well as paid holiday days
- Challenging, world-class software projects for clients in the US and LatAm
- Collaboration with some of the most talented software engineers in Latin America and the US, in a diverse work environment

Join Devsu and discover a workplace that values your growth, supports your well-being, and empowers you to make a global impact.