Results-driven Site Reliability Engineer with extensive experience building and operating resilient, scalable, and highly available systems across multi-cloud environments (AWS, GCP, Azure). Specialized in defining and implementing SLIs, SLOs, and error budgets that reflect real user experience and guide informed trade-offs between delivery velocity and system stability. Skilled in designing and implementing observability strategies based on the Golden Signals (latency, traffic, errors, saturation) using Grafana, Datadog, and Dynatrace, with actionable alerts that minimize alert fatigue and prioritize incidents with real user impact. Advocate for a blameless postmortem culture focused on identifying systemic root causes and driving concrete improvements, with experience leading structured incident management processes backed by documented runbooks and effective on-call rotations. Committed to systematic toil reduction through automation with Terraform, Ansible, and Pulumi, and to optimizing CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins, ArgoCD) for secure and reproducible deployments. Hands-on experience deploying and operating AI/ML infrastructure, including LLM-based applications with RAG architectures, vector databases, and model serving pipelines on Google Cloud Vertex AI, ensuring reliability, scalability, and cost efficiency for production AI workloads. Experienced in integrating DevSecOps practices from the design stage, supporting HIPAA-compliant workloads with encryption, RBAC, and audit logging, and configuring edge security with Akamai (CDN, WAF, DDoS protection). Strong track record in FinOps, implementing cloud cost governance, and in leading cross-functional teams while promoting shared ownership of reliability and delivering high-performing systems aligned with both technical and business objectives.