Senior DevOps Infrastructure Engineer (US)
What is the role?
NMI is seeking a Senior DevOps Engineer with deep Linux virtualization experience who is passionate about running applications in an exceedingly high-availability environment within our SRE organization. You will work with similarly skilled professionals in a rapidly growing environment, with opportunities to level up your observability and automation skills while maintaining a mission-critical, 4-nines availability platform and participating in environment modernization.
The SRE team is responsible for operating all hardware and software in the production and SDLC environments. This includes a global network connecting numerous sites, which must be highly available 24x7 with a minimum availability target of 99.99%. As a Senior DevOps Engineer, the successful applicant will be a core member of the SRE team, with the opportunity to work with experts in infrastructure, networking, and DevOps.
The Ideal Candidate:
- Will have a track record of implementing low-toil solutions to traditionally high-touch operational or administrative tasks.
- Has a deep technical background and can engage with engineers on the nuances of complex systems, while also being able to zoom out and see the bigger picture.
- Enjoys being challenged to find creative solutions using both legacy and cutting-edge technology. That is code for: we have a legacy system that must be maintained and improved while we also evaluate new technology and tools to improve resiliency, performance, ease of administration, and observability. It’s not all “the fun stuff”.
- Wants to work with a globally distributed team of similarly skilled professionals, and is comfortable building relationships with teammates up to thousands of miles away.
- Is as comfortable in a shell or Vim as an accountant is in QuickBooks.
- Refuses to believe a service or appliance is production ready until they have the metrics and alerts to prove it.
Responsibilities:
- Administration - Participate in the maintenance and operation of our production environment, including patching, deployment, server administration, and troubleshooting, either with configuration-as-code tooling or manually.
- Reliability & Performance - Ensure the reliability, availability, and performance of services. Respond to incidents and resolve them before they become customer-impacting.
- Collaboration - Work closely with teammates and with software and security teams to rapidly meet customer, business, and compliance needs.
- Automation - Drive the automation of operational tasks, and ensure our infrastructure is more like cattle than pets.
- Observability - Develop and maintain internal, commercial, or open-source tools to improve system health, performance, and deployment.
- Continuous Improvement - Drive never-ending improvement in SRE processes, tools, and methodologies. Take a leading role in blameless post-mortems to avoid repeat issues or mistakes and clearly document all lessons learned for others. If you love writing actionable documentation, we’d love to set up an interview.
- On-Call - Participate in a rotating 24x7 on-call schedule with your team to ensure availability of services across the production environment.
This is a fully remote role (work from anywhere in the US); however, if you live within reasonable commuting distance, we’d love to see you in the office from time to time! Periodic travel to company colocation facilities (typically 1-4 times a year) will be required, at company expense.