Responsibilities
Operate and improve production and staging infrastructure for a cloud-hosted VoIP/UC platform.
Investigate and resolve issues involving SIP signaling, RTP/media, registrations, call routing, NAT traversal, presence/status indicators, voicemail notifications, provisioning, and customer-impacting call behavior.
Manage AWS-based infrastructure, including compute, networking, DNS, IAM, storage, load balancing, monitoring, automation, and environment separation.
Support high-availability and disaster-recovery procedures across application, media, database, messaging, DNS, and networking layers.
Perform safe production maintenance, capacity planning, instance sizing, patching, upgrades, and controlled rollouts.
Build and maintain automation using Python, Bash, AWS tooling, and infrastructure-as-code practices.
Use logs, metrics, packet captures, call traces, monitoring dashboards, and call detail records (CDRs) or similar records to diagnose issues end-to-end.
Maintain and improve observability, alerting, incident response workflows, and operational dashboards.
Support customer provisioning and migration workflows for IP phones, softphones, web clients, and related voice services.
Collaborate with engineering, support, and leadership to prioritize reliability work, customer-impacting issues, and infrastructure improvements.
Create and maintain clear runbooks, standard operating procedures, and post-incident learnings.
Participate in incident response and planned maintenance windows when needed.
Required Experience
7+ years of experience in infrastructure, systems engineering, network engineering, SRE, DevOps, or telecom operations.
3+ years of experience operating production VoIP, SIP, UCaaS, PBX, carrier, contact-center, or real-time communications infrastructure.
Strong Linux administration skills in production environments.
Strong AWS experience, especially EC2, VPC networking, Route 53, IAM, load balancing, storage, monitoring, and automation.
Strong understanding of SIP, RTP, DNS/SRV, NAT traversal, TLS, WebRTC concepts, packet captures, and call-flow troubleshooting.
Experience with distributed systems, clustered services, message queues, replicated databases, and failover patterns.
Strong scripting ability with Python and Bash.
Experience with monitoring/logging systems and operational debugging using logs, metrics, traces, and packet captures.
Comfortable making production changes with careful planning, validation, rollback plans, and written procedures.
Strong written English and clear asynchronous communication.