iSpot.tv

Principal Site Reliability Engineer

Posted 4 Days Ago

Remote

Hiring Remotely in USA

164K-213K

Expert/Leader

The role involves leading the SRE team, enhancing developer experience through tools and processes, optimizing workflows, and delivering innovative solutions using AWS and Kubernetes.

The summary above was generated by AI

Immigration / Work Authorization Notice: At this time, iSpot does not provide visa sponsorship or immigration support for this role. Applicants must already be authorized to work in the United States on a full-time, permanent basis without the need for current or future sponsorship.

iSpot competes for the best talent. Our compensation packages consist of salary and equity in one of Seattle’s hottest start-ups, as well as other standard benefits. Most importantly, we provide a really interesting working experience, and the chance to contribute to the success of something great.

What You’ll Be Part Of:

iSpot.tv is changing how brands, agencies, and networks measure and assess the impact of TV advertising. We deal with BIG data, operating mainly in AWS with multiple Kubernetes clusters and thousands of servers. We are looking for an experienced SRE leader with the skills and passion to make a significant impact on our ecosystem. You will have a wide array of projects to tackle, with ample opportunities for growth.

You will be a key member of our SRE leadership team, focused on empowering developers to build, test, and deploy applications faster and more efficiently. You will both lead the team and remain hands-on in designing, building, and maintaining the tools, platforms, and processes that improve our engineering teams' productivity and streamline the software development lifecycle. Your work will directly impact developer happiness and the speed at which we can deliver innovative features to our customers.

Responsibilities:

We are seeking a seasoned and strategic Lead/Principal Site Reliability Engineer to drive the reliability, scalability, and performance of our core production systems while significantly enhancing the internal developer experience. This role sits at the intersection of operations and development, requiring deep technical expertise, strong leadership, and a passion for optimizing the entire software development lifecycle (SDLC).

Our team consists of senior engineers who work together with minimal supervision to attain those goals. Candidates must possess deep operational experience with AWS and Kubernetes to support teams utilizing these systems. You will lead the technical direction of the team while remaining a key individual contributor. You will be responsible for creating a culture of engineering excellence, designing self-service platforms, and fostering alignment across all engineering teams to accelerate product delivery and maintain world-class service stability.

The key responsibilities are:

System Reliability and Operations (SRE Focus)

Platform Design and Management: Architect, build, and maintain scalable, highly available, and reliable cloud infrastructure in AWS leveraging modern container orchestration technologies.
Data Pipeline Reliability: Serve as the reliability and cost optimization expert for high-volume, data-intensive workloads. Focus on optimizing and ensuring the stability of distributed data processing engines, specifically Apache Spark and related ecosystems (e.g., EMR, Databricks, Glue).
Observability and Monitoring: Establish comprehensive observability practices by defining SLIs/SLOs and implementing advanced monitoring, alerting, and logging solutions to quickly identify and resolve system anomalies.
Automation: Drive automation across all operational aspects, including infrastructure provisioning (Terraform), scaling, deployment, and incident response, minimizing toil and manual effort.
Incident Management: Lead and participate in the incident response lifecycle, performing thorough post-mortems to derive actionable insights and implement preventative measures to improve system resilience.

Developer Experience and Productivity (DevEx Focus)

Platform Strategy: Design, implement, and champion self-service tools, internal developer portals, and services that empower engineering teams to manage their infrastructure and deployments independently and efficiently.
CI/CD Optimization: Own and continuously improve the CI/CD pipelines, reducing build times, streamlining deployment workflows, and integrating best practices for testing, security (Shift Left), and code quality. Maintain and improve our container orchestration and deployment tools, leveraging Kubernetes, Helm, and ArgoCD to create seamless developer workflows.
KPIs: Develop, implement, and maintain a set of key performance indicators (KPIs) to measure and improve the developer experience across all of Engineering.
Mentorship and Documentation: Guide and mentor senior engineers, promoting SRE/DevEx principles. Develop clear, comprehensive documentation and tutorials to ensure seamless adoption of new tools and platforms.
Cost and Efficiency: Strategically identify and implement opportunities for cloud cost optimization and resource efficiency without compromising reliability or performance.

Strategic Leadership and Cross-Team Alignment

Architecting the Roadmap: Define, champion, and communicate the long-term technical roadmap for the SRE and DevEx platforms, balancing immediate operational needs with strategic, future-state goals.
Driving Cross-Team Alignment: Act as a critical liaison between infrastructure, security, and product development teams. Proactively drive cross-team alignment on architectural standards, tooling choices, and development workflows to ensure consistency and shared accountability for system health.
Bottleneck Identification and Mitigation: Systematically identify engineering bottlenecks, friction points, and points of organizational toil within the SDLC. Implement targeted solutions (technical, process-based, or organizational) to mitigate these constraints and enhance overall engineering velocity.
Planning and Execution: Collaborate with engineering leadership to transform the strategic roadmap into actionable, prioritized plans, securing cross-functional buy-in and resources for successful execution.

Qualifications and Education Requirements:

Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.
10+ years of relevant experience in software engineering, cloud architecture, and/or Site Reliability Engineering, with at least 3 years in a leadership or lead contributor role.
Deep expertise in AWS, including EKS, ECR, RDS, SQS/SNS, VPC, MWAA, and S3.
Strong proficiency in Infrastructure as Code (IaC) tools (e.g., Terraform, CloudFormation).
Specialized experience in optimizing large-scale data platforms, specifically with Apache Spark. Proven ability to profile, troubleshoot, and tune Spark jobs for performance, cost, and reliability.
5+ years of experience with Kubernetes and containerization in general, including associated tools (kubectl, Helm, ArgoCD).
Strong knowledge of AWS cost optimization.
Solid understanding of TCP/IP networking, including routing and AWS security groups.
Excellent knowledge of CI/CD concepts and experience developing associated pipelines in CircleCI.
Proficient in high-level scripting languages, including shell scripting, Python, and/or JavaScript.
Experience with OpenTelemetry (OTel) and monitoring tools such as Splunk or Datadog. Experience with native AI observability tools is a plus.
Experience with evaluating and rolling out GenAI tools for improving developer efficiency.
Excellent communication, collaboration, and stakeholder management skills, with proven experience driving technical initiatives across multiple teams.
Experience with researching and selecting new/modern developer toolsets and assisting teams in adopting them, including vendor assessments, security assessments, and the procurement process.
Experience in an Ad-Tech or “BIG Data” processing organization is highly preferred.

Target cash compensation range: $163,620 - $212,710 USD Annually

We are committed to providing competitive, market-informed compensation. The cash compensation above includes base salary, variable commission for employees in eligible roles, and annual bonus targets for eligible roles. In addition to cash compensation, all full-time iSpotters are eligible to participate in iSpot’s equity plan to receive stock options. Non-exempt roles will also be eligible for (pre-approved) overtime pay. Individual compensation packages are influenced by different factors unique to each candidate, including their skills, experience, qualifications, and other job-related factors.

For more information on total rewards package, go HERE

Hybrid & Flexible Workplace Policy

iSpot supports a hybrid and flexible workplace. Depending on location and work responsibilities, employees may be designated as full-time office-based, part-time office-based, or fully remote. A hybrid work schedule means that you work in the office some days and work from home other days. The best hybrid workplaces allow for flexibility while also encouraging consistency.

Those living in or near the area of one of our offices (Bellevue, WA; El Segundo, CA; New York, NY) will work a hybrid schedule, coming into their local office 1-3 days a week. Those in a role that is not office-based and who are located farther away from our offices will work a fully remote schedule. If you have questions regarding the exact details of our hybrid & flexible workplace policy, please let your recruiter know and they will discuss them with you further.

#LI-Remote

If you don't feel you meet every single requirement for the role, don't rule yourself out. Please apply anyway!

iSpot is an equal opportunity employer. All applicants will receive consideration for employment without regard to race, ethnicity, gender, gender identity, sexual orientation, protected veteran status, disability, age, or other legally protected status. If you need assistance and/or a reasonable accommodation due to a disability during the application or the recruiting process, please contact our HR team.

California residents applying for positions at iSpot can access our California Consumer Privacy Act notice here.

Top Skills

ArgoCD

AWS

CI/CD

Datadog

EC2

EKS

Helm

JavaScript

Kubernetes

Python

Redshift

Shell Scripting

Splunk

Terraform
