Site Reliability Engineer at HostPapa | Torre

Site Reliability Engineer

You'll ensure global SaaS platform reliability, scaling cloud commerce for major service providers worldwide.
Full-time

Legal agreement: Employment

Provide your expected compensation while applying
Remote (for Malaysia residents)
Shared by
Emma of Torre.ai
20 days ago

Requirements and responsibilities


Position Summary:

With team members and customers in 39 countries around the globe, HostPapa is currently one of the fastest-growing web hosting companies with a wide range of products available. At its core, we provide individuals and small and medium-sized businesses with access to valuable tools and services critical to their online success, including a Website Builder service for making website creation an ultra-easy task for anyone. Tailored to meet every user's unique needs, our award-winning customer support, email, and cloud-based solutions keep HostPapa at the cutting edge of the web hosting industry and innovation by putting our customers first.

This role focuses on CloudBlue, a HostPapa business that powers cloud commerce for many of the world’s largest service providers, including major Telcos, distributors, and MSPs. CloudBlue enables partners to monetize and manage cloud services and subscriptions at scale, combining the agility of a high-growth business with the backing of a global organization.

As the Site Reliability Engineer, you will help ensure the reliability, scalability, and observability of CloudBlue’s multi-tenant SaaS platforms used by service providers worldwide. You will focus on improving system stability and performance through monitoring, high availability, and incident response, while working closely with DevOps, Platform, and Engineering teams to build and operate resilient production systems.

What you’ll do

- Define and implement SLIs, SLOs, and error budgets for critical CloudBlue services to ensure reliability and performance
- Influence system architecture with a strong focus on reliability, scalability, and operability, designing systems for fault tolerance, graceful degradation, and self-healing
- Reduce operational toil by identifying opportunities for automation and process improvement
- Design and operate CloudBlue’s observability stack across metrics, logs, and traces using tools such as Datadog, Grafana, and Elastic Stack
- Develop actionable alerting strategies and dashboards that provide clear insight into platform and business health
- Design and maintain high-availability architectures, implementing redundancy, failover, and disaster recovery strategies across regions and availability zones
- Conduct capacity planning, load testing, and performance optimization to ensure platform stability and scalability
- Act as a senior responder during production incidents, leading incident coordination, communication, and service restoration
- Own blameless postmortems and drive improvements that reduce incident frequency, MTTR, and customer impact
- Improve reliability of Kubernetes-based platforms through health checks, autoscaling strategies, rollout safety, and resilience testing
- Partner with engineering and DevOps teams to improve deployment safety, rollback strategies, and platform reliability
- Maintain runbooks and operational documentation, and promote SRE best practices across engineering teams
- Support other tasks or projects as assigned to meet team and business needs

About you

- 3+ years of experience as an SRE, DevOps Engineer, or Production Engineer, with strong ownership of production systems
- Proven experience operating highly available, enterprise-grade, multi-tenant SaaS platforms
- Hands-on experience with observability and monitoring tools such as Datadog, Grafana, and Elasticsearch/Kibana
- Solid understanding of Linux, networking, and distributed systems fundamentals
- Experience working with containerized environments such as Docker and Kubernetes
- Strong scripting and automation skills using Python and/or Bash
- Experience participating in on-call rotations and incident response in production environments
- Strong written and spoken English
- Experience defining SLIs/SLOs and managing error budgets at scale will be considered a plus
- Exposure to hyperscale or service-provider-grade platforms is an advantage
- Cloud experience, preferably with Azure; experience with AWS and/or GCP will also be valued
- Experience working with hybrid or on-premises integrations is beneficial
- Familiarity with chaos engineering and resilience testing will be considered an asset

What We Offer:

- This is a remote opportunity. While we welcome applications globally, we are prioritizing candidates based in Malaysia due to team needs and coverage
- A competitive salary that values you and your unique skill sets
- Career advancement & professional development opportunities to help you reach your full potential
- Flexible work arrangements to support work/life balance

About Us:

At HostPapa, we’ve been committed to providing a complete array of enterprise-grade cloud services solutions to every business owner since 2006. These services, traditionally out of reach to smaller businesses, are offered in a one-stop shop, making it quick and easy for customers to select the services they need to grow. We back these offerings with 24/7 award‑winning customer support in four languages.

Our HostPapa team values diversity and inclusion. We have a friendly company culture built on trust and respect. With the acquisition of several companies into our product portfolio, we’re growing at an incredible rate and have ample opportunities for career growth. Come join our talented team of enthusiastic, hard-working, passionate, driven people engaged in meaningful, innovative work. We can’t wait to meet you!

HostPapa is an equal-opportunity employer committed to diversity and inclusion. As a multicultural organization, we encourage individual achievement and recognize the strength of our diverse team.

HostPapa is committed to providing accommodations for people with disabilities. If you require accommodation, please let us know, and we will work with you to meet your needs. Accommodation may be provided in all parts of the hiring process.

It is anticipated that this position will be performed outside of Ontario.