NVIDIA

Director, Software Engineering - DGX Cloud Infrastructure

Posted 7 Days Ago
In-Office or Remote
5 Locations
284K-426K
Expert/Leader
Lead the engineering organization focused on GPU-accelerated cloud infrastructure, ensuring automation and operational excellence while collaborating with internal and external partners.
The summary above was generated by AI

NVIDIA is seeking a strategic and technically grounded Director of Engineering to lead a high-impact organization at the core of compute cloud infrastructure for AI factories. This organization is a key pillar of NVIDIA’s DGX Cloud ecosystem, building shared automation and reliability tooling that supports a sizable portion of our GPU-accelerated compute fleet.

You will further develop and scale an organization of engineers focused on running production software for large-scale GPU-accelerated infrastructure. This organization partners closely with storage, networking, and several other teams across NVIDIA. You will be the engineering leader responsible for interfacing with some of our NVIDIA Cloud Partners to continuously meet our production excellence goals.

What You’ll Be Doing:

  • Build and grow a team of software engineers and leaders focused on automating day 0, day 1, and day 2 operations for large-scale GPU clusters running on bare metal and public clouds with varying service levels.

  • Lead the design and continuous delivery of shared automation frameworks aligned with SLOs and error budgets.

  • Liaise with some of our NVIDIA Cloud Partners to ensure aligned priorities and sustained production excellence.

  • Drive clarity and execution through high ambiguity, translating broad and ever-evolving objectives into iterative delivery milestones.

  • Enable internal teams by reducing operational friction and improving automation coverage across the stack.

What We Need To See:

  • Proven experience leading software engineering teams (including SRE and/or DevOps) responsible for infrastructure automation and distributed systems.

  • Demonstrated ability to build software engineering organizations, drive continuous incremental execution across teams, and operate effectively in highly ambiguous environments with ever-evolving objectives.

  • Hands-on experience designing, running, or automating cloud infrastructure atop bare metal platforms and/or VMs.

  • Experience deploying cloud-native services on public clouds.

  • Track record of representing your company or division in external partnerships with public clouds and infrastructure vendors, and to internal partner teams.

  • Strong foundation in incremental delivery and technical program execution.

  • Excellent written and verbal communication skills, with the ability to influence across levels and disciplines.

  • Bachelor of Science (or equivalent experience) or Master of Science degree in Computer Science or a related field, with 10+ years of overall experience developing and leading cloud infrastructure teams and 5+ years of management experience.

Ways to stand out from the crowd:

  • Relevant experience developing organizations at public cloud companies. Background leading teams running large-scale GPU clusters. Familiarity with technologies like Linux, NVIDIA BCM, Slurm, InfiniBand, Kubernetes, distributed storage, or BlueField DPUs.

  • Experience developing both internal-facing platform teams and customer-facing infrastructure-as-a-service teams.

  • Track record of collaboration with security or compliance teams, including in regulated environments. Familiarity with AI/ML platform workloads and their reliability or performance characteristics.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative, hardworking, and self-motivated, we want to hear from you! NVIDIA leads the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing (HPC), and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. DGX Cloud provides serverless generative AI infrastructure to the world, enabling NVIDIA’s AI supercomputer technologies to be used by anyone.

The base salary range is 284,000 USD - 425,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Top Skills

BlueField DPUs
Distributed Storage
InfiniBand
Kubernetes
Linux
NVIDIA BCM
Slurm
