SmarterDx

Staff Site Reliability Engineer

Posted 22 Hours Ago

Remote

Hiring Remotely in United States

230K-250K Annually

Expert/Leader

The Staff Site Reliability Engineer will lead the reliability of production systems by defining SRE practices, improving observability, and ensuring fault-tolerance in cloud environments.

The summary above was generated by AI

SmarterDx, a Smarter Technologies company, builds clinical AI that is transforming how hospitals translate care into payment. Founded by physicians in 2020, our platform connects clinical context with revenue intelligence, helping health systems recover millions in missed revenue, improve quality scores, and appeal every denial. Become a Smartian and help optimize the way the healthcare system works for everyone. Learn more at smarterdx.com/careers.

Role

We are seeking a Staff Site Reliability Engineer (SRE) to lead the reliability, scalability, and operational excellence of our production systems. This role is responsible for defining and driving SRE practices across the organization, including SLIs/SLOs, incident management, capacity planning, and resilience engineering. You will design and implement automation that reduces toil, improve observability and performance across our Kubernetes and AWS environments, and ensure our systems are highly available and fault-tolerant.

The ideal candidate is a deeply technical engineer with strong distributed systems expertise, a passion for operational rigor, and a track record of improving reliability through thoughtful engineering, automation, and data-driven decision-making.

**This role is fully remote within the US**

What You’ll Do

Define and evolve reliability standards for the SmarterDx platform, including SLIs, SLOs, and error budgets that align engineering work with customer impact.
Implement a “reliability” platform using Terraform and infrastructure-as-code best practices.
Enhance observability systems (metrics, logs, traces, alerting) to provide actionable insights and reduce mean time to detect (MTTD) and mean time to resolve (MTTR).
Lead incident response, drive blameless postmortems, and implement systemic improvements to prevent recurrence.
Reduce operational toil through automation, self-healing systems, and improved deployment and rollback mechanisms.
Provide production support for the SmarterDx platform, applying SRE principles to ensure availability, performance, and data durability.
Research, prototype, and advocate for new reliability practices, tooling, and architectural improvements across the engineering organization.

What You Bring

10+ years of software and software reliability engineering experience, with significant time spent operating and scaling distributed systems in production environments.
3+ years of hands-on experience running cloud-native infrastructure in AWS, including deep familiarity with containers, Kubernetes, monitoring, and alerting in live production systems.
Proven experience defining and managing SLIs/SLOs, leading incident response, and driving postmortems and systemic reliability improvements.
Strong expertise with Terraform and infrastructure-as-code practices for managing production infrastructure safely and reproducibly.
Deep experience with Kubernetes architecture and operations, including workload reliability, cluster scaling, networking, and failure modes.
Experience working in security-conscious, compliance-oriented environments where reliability and data protection are first-class concerns.
Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field — or equivalent practical experience operating large-scale systems.

Nice To Haves

Reliability engineering experience with production database systems (e.g. Postgres)

Our Tech Stack

AWS
Terraform
Kubernetes
Go, Python, TypeScript
Postgres

Compensation

$230K to $250K base salary

Benefits

Medical, Dental & Vision – Comprehensive plans with leading insurance providers, covering 75% of your premiums, depending on the plan.
Paid Parental Leave – Up to 12 weeks of paid leave for parents, supporting families through birth or adoption.
Remote-First Team – Work from anywhere in the U.S.
Unlimited PTO & 10 Holidays – So you can relax and recharge.
401(k) with Traditional & Roth Options – Tax-advantaged retirement savings through Fidelity with a 4% match.
Minimal Bureaucracy – A fast-moving, high-impact environment where you can focus on what matters.
Incredible Teammates! – Work alongside smart, supportive, and mission-driven colleagues.
