Senior Site Reliability Engineer
Software Engineering
Manila, Philippines
8x8 connects our customers and teams globally, empowering CX leaders with performance and insights to make smarter decisions, delight customers, and drive lasting business impact.
About 8x8 UC Operations
The UC Operations team manages the production infrastructure behind 8x8's Unified Communications platform — voice, fax, messaging, and collaboration services used by enterprise customers globally. The team oversees dozens of applications running across more than two thousand service instances worldwide, spanning VoIP infrastructure, messaging brokers, storage systems, and cloud workloads across Oracle Cloud Infrastructure and physical datacenters.
UC Ops sits at the operational center of 8x8 — taking escalations from the NOC, coordinating with Engineering, and working alongside Support, Sales, and Professional Services. The work is complex, the systems are live, and the stakes are real. We are actively moving from reactive operations to a proactive, automation-first SRE model — and we are looking for a senior engineer who will help drive that transition: setting technical direction, leading initiatives across the subsystem, and raising the bar for the engineers around them.
What You'll Do
Production Operations & Incident Response
- Own platform reliability across global UC infrastructure, driving incident response and the overall reliability strategy for your subsystem rather than resolving issues in isolation.
- Triage and resolve the hardest issues — service restarts, hung processes, infrastructure failures — and act as the senior escalation point for the NOC and for other engineers when frontline teams hit their limit.
- Execute and improve the unglamorous but essential work (scheduled maintenance, certificate renewals, log rotation), and redesign these processes so that failures are prevented systemically rather than handled case by case.
- Lead blameless post-mortems that produce real follow-through, and sign off on the corrective actions that come out of them.
Cross-Team Collaboration
- Work directly with Support, Sales, Sales Engineering, NOC, Professional Services, and Engineering teams across 8x8 — this team sits at the operational center of the company.
- Translate production events into clear, business-readable communication under pressure; stakeholders across the org depend on your judgment during incidents.
- Feed operational insight back into engineering — turning recurring failures and patterns into actionable bug reports, platform improvements, and influence over the architectural roadmap.
- Work closely with technical leads to align reliability and automation work with broader engineering goals, and help focus discussion on what matters most.
Reliability Engineering & Automation
- Identify recurring manual work and build automation to eliminate it — we treat toil as a bug, not a requirement.
- Drive design for the tooling and automation in your domain; anticipate how a change in one component impacts others and account for adjacent domains in your designs.
- Understand the limits of our existing tools — and recognize when a problem exceeds those limits and deserves the effort of building a new one.
- Take on large-scale technical debt and refactoring across the subsystem, and contribute to the team's coding methodologies and best practices.
- Participate in 2-week sprint cycles to deliver automation, tooling improvements, runbook development, and infrastructure initiatives from a structured backlog. Own the functional specifications for large features and sign off on test plans.
- Address security issues as they arise — CVEs, misconfigurations, access control gaps — treated as first-class work alongside incident response.
- Define and track SLIs, SLOs, and SLAs to drive honest, data-driven conversations about where reliability investment is needed.
- Build and maintain dashboards (Grafana, OCI Log Analytics) that give the team genuine signal; tune alerting to eliminate noise — a high-noise on-call is itself a reliability failure.
- Leverage AI-powered tooling to accelerate diagnostics and reduce cognitive load at scale.
Technical Leadership & Mentorship
- Provide technical leadership for projects involving 1–2 other engineers.
- Consistently mentor more junior engineers; be the person other developers seek out for constructive, insightful feedback.
- Frequently and actively share knowledge — of your own work, of areas you've worked in, and of obscure corners outside your immediate context — and encourage others to do the same.
- Run workshops, contribute to how-to guides, present at demos, and contribute to the team's presentation portfolio.
On-Call & Coverage
- Shared on-call rotation, approximately 1 week per month — same expectation for every engineer on the team.
- Escalation is always an option and is encouraged; you are expected to drive the response, set the pace for others, and know when to pull people in — not to hero it alone.
- Tooling: PagerDuty for alerting, Jira for tracking, OCI Log Analytics and Grafana for diagnostics.
What We're Looking For
Required
- 6+ years in a site reliability, platform operations, or infrastructure engineering role — you have run production systems at scale and have a track record of driving, not just maintaining.
- Mastery of Linux systems administration: multi-service distributed systems, log reading, systemctl, network diagnostics, no GUI required.
- Deep hands-on experience with at least one major cloud provider (OCI, AWS, GCP, or Azure) — compute, storage, IAM, networking fundamentals.
- Strong on-call experience: calm under pressure, fast triage, clear communication during an incident, and the judgment to lead a multi-party response.
- Scripting in Python or Bash — enough to automate a task, parse logs, hit an API, and refactor someone else's tooling without breaking it.
- Strong incident response discipline: structured thinking, stakeholder communication, post-mortems that actually say something and drive follow-through.
- Solid command of SRE concepts: SLIs, SLOs, error budgets, toil measurement — and the ability to use them to steer investment decisions.
- Demonstrated technical leadership: driving designs, owning large features end to end, and mentoring other engineers.
- AI-forward mindset — you use AI tools as a core part of how you work, not as a novelty.
Preferred
- Experience with Oracle Cloud Infrastructure (OCI) — compute, networking, Log Analytics, Object Storage.
- Familiarity with VoIP and SIP infrastructure — registration, trunking, call signaling; this is a UC platform and that knowledge matters.
- Knowledge of observability tooling: Prometheus, Grafana, PagerDuty, OCI Log Analytics.
- Experience with Ansible for configuration management and deployment automation.
- Exposure to infrastructure migrations at scale in multi-tenant SaaS environments.
8x8 is proud to provide equal employment opportunities (EEO) to all employees and applicants for employment without regard to race, color, religion, sex, national origin, age, disability or genetics.
For 8x8 jobs located in the US: 8x8 participates in the E-Verify program.
We also provide reasonable accommodation to individuals with disabilities in accordance with applicable laws. For more information, email us at careers@8x8.com and include “Reasonable Accommodation” in the subject line.
Our Job Applicant Privacy Notice can be found here.
Learn more on our company website at www.8x8.com, and follow our pages on LinkedIn, Twitter, Facebook, and Instagram.