Subzero Labs

Site Reliability Engineer

Posted 9 Days Ago

Remote

Hiring Remotely in USA

Expert/Leader

As a Site Reliability Engineer, you'll ensure scalability, performance, and reliability of blockchain applications, tackling operational challenges through automated solutions and proactive system designs.

About Subzero Labs

Subzero Labs is building the next generation of decentralized infrastructure.

Site Reliability Engineer

We're building the infrastructure behind a next-generation decentralized programmable network with reliability, observability, and confidentiality baked in from the ground up. As a Site Reliability Engineer, you'll ensure the scalability, performance, and reliability of our large-scale blockchain applications and infrastructure.

Position Overview

Combining software engineering and systems administration expertise, you'll take a proactive, software-centric approach to operational challenges. Your responsibilities include detecting issues, automating failure handling, devising disaster recovery plans, maintaining system uptime, and repairing broken systems in ways that prevent future disruptions. You'll use coding, automation, and engineering principles to build resilient, self-healing systems that scale to meet growing demands.

What You'll Do

Network Infrastructure Reliability: Design fault-tolerant systems to run validators, nodes, and indexers across cloud and bare-metal environments. Build self-healing mechanisms that recover automatically from faults, crashes, and partitions.

Infrastructure as Code: Define production systems using Terraform, Helm, Kubernetes, or Pulumi—supporting reproducible deployments, rapid scaling, and multi-region HA clusters.

TEE-Backed Secure Computation: Deploy and manage trusted execution environments (TEEs) such as Intel TDX, AMD SEV-SNP, or Azure Confidential VMs for secure blockchain operations.

Observability & Alerting: Build comprehensive Grafana dashboards and Alertmanager alerts to monitor chain liveness, network performance, and quality metrics. Instrument services with tracing, metrics, and logs down to the hardware level.

Performance & Resource Tuning: Profile and tune workloads under sustained high throughput, reducing CPU, memory, and disk I/O pressure. Build tools to detect degraded validators or slow block propagation in real time.

Security Hardening & Key Management: Engineer hardened signing pipelines integrating TEEs, HSMs, or cloud-native KMS systems. Manage key lifecycle (rotation, expiration, revocation) with zero downtime while reducing attack surface area.

CI/CD & Safe Rollouts: Build GitHub Actions workflows that test and ship changes across multiple environments. Own release engineering across devnet, testnet, and mainnet, ensuring protocol compatibility and seamless validator upgrades.

Incident Response & Chaos Engineering: Run fire drills, simulate node failures and partitions, and lead incident postmortems. Design for failure and validate assumptions under pressure.

Cross-Functional Communication: Work closely with engineers, product managers, node operators, and partners to support deployments, debug edge cases, and share best practices as a key interface between core protocol teams and the network operator ecosystem.

Requirements

10+ years in DevOps or SRE roles with a focus on tooling, automation, and infrastructure
Proficiency in systems and scripting languages: Rust, Go, Python, and shell scripting
Experience writing and reviewing code, developing documentation and disaster recovery plans, and debugging complex problems on live blockchain systems
Advanced knowledge of cloud infrastructure: networking, orchestration tools, containerization, compute, and storage systems
Proven ability to design, develop, and deploy systems that improve throughput, latency, reliability, availability, and security
Clear communication skills: ability to explain technical concepts simply
Self-starter mindset: continuous learning and critical thinking under pressure

Preferred Qualifications

Background in distributed systems and consensus protocols
Experience with monitoring and observability platforms
Knowledge of security best practices for distributed systems

Top Skills

Grafana

Helm

Kubernetes

Pulumi

Python

Rust

Shell Scripting

Terraform
