NVIDIA

Principal Cloud Services Software Engineer

Posted 2 Days Ago
In-Office or Remote
Hiring Remotely in Santa Clara, CA
272K-431K Annually
Expert/Leader
As a Principal Cloud Services Software Engineer, you'll enhance AI infrastructure, develop scalable services, and ensure high resiliency for NVIDIA's cloud platforms.
The summary above was generated by AI

Joining NVIDIA's DGX Cloud Team means contributing to the infrastructure that powers our innovative AI research. This team focuses on optimizing efficiency and resiliency of ML workloads, as well as developing scalable AI infrastructure tools and services. Our objective is to deliver a stable, scalable environment for AI researchers, providing them with the vital resources and scale to champion innovation.

We are seeking a distributed software engineer to join our team! As a Principal Engineer, you'll be instrumental in developing and optimizing AI infrastructure services to bring peak AI performance and high resiliency to DGX Cloud. Your expertise in cloud services software architecture will drive the full resilience stack and help deliver an AI platform that redefines what is possible. This is an exceptional opportunity to push the boundaries of technology, shape the future of AI and the Cloud, and work with a world-class team of like-minded engineers.

What You Will Be Doing:

As a software engineer specializing in backend development, you'll work in a dedicated team to enhance the infrastructure and products that underpin NVIDIA's AI platforms. Your work will be essential in enabling innovative AI research, focusing on:

  • Developing solutions at the intersection of machine learning, distributed systems, and high-performance computing, contributing to the advancement of AI technologies.

  • Designing, developing, and optimizing (micro-)services orchestrated by Kubernetes to provide large-scale AI training workflows on AI training supercomputers located at major CSPs, with resiliency and efficiency.

  • Co-designing and implementing the APIs that allow these services to integrate vertically with NVIDIA's resiliency stacks, ranging from tier-0 telemetry services to break/fix automation services to checkpoint and execution systems.

  • Crafting a submission abstraction that enables model engineers and training platforms/frameworks to seamlessly submit long-running training jobs while hiding the complexity of handling infrastructure failures, running job lifecycles with auto-restarts on failure, ensuring full efficiency, and promptly notifying users.

  • Crafting these services to be modular, enabling them to be coordinated with and deployed onto on-premises AI clusters that use NVIDIA hardware and cloud services.

What We Need To See:

  • Bachelor's degree or higher in Computer Science or a related technical field (or equivalent experience).

  • 15+ years of hands-on experience in backend development, preferably with Python, Go, C/C++, or similar high-performance languages.

  • Consistent track record of building and scaling large-scale distributed systems.

  • Experience with cloud computing platforms such as AWS, Azure, and GCP, as well as container technologies like Docker and Kubernetes, and HPC/AI platforms such as Slurm.

Ways to stand out from the crowd:

  • Real-world experience with DL frameworks and orchestrators such as PyTorch, TensorFlow, JAX, and Ray.

  • Experience in developing a framework plugin architecture that allows the framework to be integrated with the cluster scheduler visibly to the users

  • Strong understanding of NVIDIA GPUs, network technologies, and their failure patterns.

  • Experience with AI models and AI based tools.

  • Provide references to your code contributions.

NVIDIA leads the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars. NVIDIA is looking for exceptional people like you to help us accelerate the next wave of artificial intelligence!

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 272,000 USD - 431,250 USD.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until May 1, 2026.

This posting is for an existing vacancy. 

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
