Lambda

Super Intelligence HPC Support Engineer

Posted 15 Days Ago
Remote
Hiring Remotely in USA
160K-206K Annually
Senior level
As a Super Intelligence HPC Support Engineer, you'll manage incidents for hyperscale GPU clusters, ensuring reliability and performance, and collaborating with engineering teams.
The summary above was generated by AI

Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. Lambda’s mission is to make compute as ubiquitous as electricity and give every person access to artificial intelligence. One person, one GPU.


If you'd like to build the world's best deep learning cloud, join us. 


About this role:

As a Super Intelligence HPC Support Engineer, you’ll be part of a specialized team dedicated to Lambda’s most strategic and complex customers — organizations operating hyperscale GPU clusters and pushing the boundaries of AI/ML at unprecedented scale.

You’ll serve as a technical expert and trusted partner, ensuring their environments remain reliable, performant, and ready for mission-critical workloads. This role requires deep expertise in HPC and cluster orchestration, the ability to navigate complex incidents with precision, and the judgment to know when and how to engage engineering, data center, and product teams.

This is a customer-facing engineering role where the stakes are high: downtime has real business impact, and your expertise directly influences trust with some of the largest AI companies in the world.

What You’ll Do

  • Act as the primary technical point of escalation for Super Intelligence customers running hyperscale GPU clusters.

  • Lead incident response for complex issues, ensuring rapid triage, clear communication, and timely resolution.

  • Proactively identify risks in large environments (firmware, performance bottlenecks, orchestration issues) and drive preventative improvements.

  • Partner closely with Lambda Engineering and Product teams to influence roadmap decisions based on real customer needs.

  • Contribute to runbooks, best practices, and operational guides tailored for hyperscale environments.

  • Train and mentor other support engineers, raising the bar across Lambda’s support organization.

  • Participate in a rotating on-call schedule, owning critical incidents and high-priority alerts for SI customers.

You

  • 7+ years of experience in HPC or cloud support engineering, with customer-facing responsibilities.

  • Proven experience managing large-scale Linux clusters and distributed HPC/AI workloads.

  • Deep expertise in orchestration tools such as Kubernetes and/or Slurm.

  • Strong knowledge of GPU technologies (CUDA, NCCL, MIG, NVLink, GPUDirect RDMA).

  • Skilled in high-throughput networking (InfiniBand, RoCE) and cluster storage solutions.

  • Familiarity with monitoring/logging platforms (Prometheus, Grafana, Datadog).

  • Experience leading incident management and communicating directly with enterprise or hyperscale customers.

  • Ability to balance deep technical troubleshooting with clear, concise communication to executives and stakeholders.

Nice to Have

  • Python automation experience (venv, conda, pyenv).

  • Certifications in NVIDIA or InfiniBand technologies.

  • Familiarity with infrastructure-as-code tools (Terraform, Ansible, Puppet, Chef).

  • Hands-on experience with storage providers and technologies (VAST, Ceph, Lustre, Weka, DDN).

  • Experience operating in high-availability, 24/7 environments.

Salary Range Information

This is a salaried non-exempt role, eligible for overtime. The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda

  • Founded in 2012, ~400 employees (2025) and growing fast

  • We offer generous cash & equity compensation

  • Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove.

  • We are experiencing extremely high demand for our systems, with quarter-over-quarter and year-over-year profitability

  • Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG

  • Health, dental, and vision coverage for you and your dependents

  • Wellness and Commuter stipends for select roles

  • 401k Plan with 2% company match (USA employees)

  • Flexible Paid Time Off Plan that we all actually use

A Final Note:

You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer

Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.

Top Skills

Ansible
Ceph
Chef
CUDA
Datadog
DDN
GPU
GPUDirect RDMA
Grafana
HPC
InfiniBand
Kubernetes
Lustre
MIG
NCCL
NVLink
Prometheus
Puppet
Python
RoCE
Slurm
Terraform
VAST
Weka
