Akamai Technologies

Senior Site Reliability Engineer

Posted Yesterday
In-Office or Remote
Hiring Remotely in United States
121K-219K Annually
Senior level
Lead reliability, automation, and observability for high-density AI hardware infrastructure. Build Python-based IaC tooling, telemetry pipelines, Prometheus/Grafana dashboards, and AI-assisted tooling. Run 24x7 incident response, coordinate vendors and field technicians, define operational readiness, and drive post-mortems to improve uptime and performance.

Do you enjoy collaborating with teams to solve complex challenges?

Do you enjoy solving large-scale distributed content delivery challenges?

Join our critical AI Hardware SRE Team!

The AI Hardware SRE team is responsible for overseeing, scaling, and optimizing our next-generation dedicated AI hardware infrastructure. You will be responsible for ensuring best-in-class uptime and reliability of our AI hardware infrastructure offerings.

Partner with the best

In this role, you'll play a part in pioneering the reliability of an elite, high-density hardware and software infrastructure spanning the globe. You'll collaborate with product teams from the earliest stages of development to ensure the reliability, scalability, and performance of our systems. You'll define key performance indicators and defend them when they are breached.

As a Senior Site Reliability Engineer, you will be responsible for:

  • Developing and scaling robust programmatic tooling and infrastructure-as-code utilities in Python to eliminate operational toil and automate fleet-wide provisioning.
  • Integrating automated workflows across disconnected corporate ticketing systems to optimize time-to-mitigate metrics for hardware and network break-fix events.
  • Leveraging advanced AI utilities and LLM-assisted development paradigms where appropriate to accelerate technical execution, script authorship, and system analysis.
  • Working on cutting-edge private cloud and compute technologies to improve the availability, latency, and overall systemic health of high-density hardware environments.
  • Designing and implementing telemetry pipelines, custom Prometheus/Grafana monitoring dashboards, and AI-based anomaly detection tailored for bare-metal and virtualized environments.
  • Participating in 24x7x365 on-call rotations, spearheading real-time incident management, and managing high-severity service disruption protocols via automated PagerDuty and Slack workflows.
  • Partnering directly with third-party infrastructure vendors and coordinating on-site field technicians to facilitate uptime activities.

Do what you love

To be successful in this role you will:

  • Have 5 years of relevant experience and a Bachelor's degree in Computer Engineering, Computer Science, or equivalent.
  • Possess tooling and coding ability in languages like Python to construct scalable operational tools, API integrations, and automation frameworks.
  • Show hands-on experience with modern observability stacks and time-series engines, like Prometheus, Grafana, OpenTelemetry, and Loki.
  • Possess a working understanding of advanced networking topologies, high-bandwidth routing/switching infrastructure, BGP, and dual-stack IPv4/IPv6 networks.
  • Have experience acting as a key designer for new service rollouts, including establishing operational readiness criteria, telemetry baselines, and alerting thresholds.
  • Demonstrate extensive experience building technical runbooks, leading complex incident response bridges, and driving comprehensive, blameless post-mortems.
  • Display a proven ability to take absolute ownership of ambiguous technical problems, coordinate cross-functional teams, and drive for production-grade solutions.

About us

At Akamai, we make life better for billions of people, trillions of times a day.
Whether you're streaming live events, scrolling social media, watching your favorite series, or managing your savings, we're the engine behind the scenes. We provide the world's most distributed platform from Cloud to Edge to help the giants of the digital world work faster and stay more secure, making the internet a better experience for everyone.
Our focus is simple:
Cloud and Edge: Running apps closer to users for instant performance.
Security: Neutralizing threats before they ever reach your data.
Content Delivery: Scaling the world's biggest moments without a glitch.
AI: Enabling our customers to build, secure, and scale AI apps on the world's most distributed cloud platform.
At Akamai, we don't just support the internet; we power and protect it, because behind every great digital experience is a massive hidden challenge. And we're the ones who solve it. When millions of people hit play or pay, Akamai ensures it just works.

Benefits at Akamai: We support your health, well-being, finances, and life beyond work. See our benefits.

FlexBase adapts to your job's needs

Akamai's FlexBase program is yet another way we show our commitment to providing employees with an exceptional workplace experience. It's not about telling employees where to work; it's about supporting employees to do their best work.
We trust our incredible employees to work in ways that suit them best: at home, in an office, or a combination of both.

Connect with us on social and see what life at Akamai is like!

Compensation

Akamai is committed to fair and equitable compensation practices. For US-based candidates only, the base salary for this position ranges from $121,400 to $218,600 per year; a candidate’s salary is determined by various factors including, but not limited to, relevant work experience, skills, certifications, and location. Compensation for candidates outside the US will vary. The compensation package may also include incentive compensation opportunities in the form of annual bonus or incentives, equity awards, and an Employee Stock Purchase Plan (ESPP). Akamai provides industry-leading benefits including healthcare, a 401(k) savings plan, company holidays, vacation (in the form of PTO), sick time, family-friendly benefits including parental leave, and an employee assistance program including a focus on mental and financial wellness. Eligibility requirements apply.
