NVIDIA

Senior Site Reliability Engineer, HPC and LSF

Posted 2 Days Ago
In-Office
3 Locations
184K-288K
Expert/Leader
As a Senior Site Reliability Engineer, you will manage HPC workload schedulers, automate processes, troubleshoot systems, and collaborate to improve infrastructure for silicon development.
The summary above was generated by AI

NVIDIA is the leader in AI, machine learning and datacenter acceleration. NVIDIA is expanding that leadership into datacenter networking with Ethernet switches, NICs and DPUs. NVIDIA has continuously reinvented itself over two decades. Our invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI, the next era of computing. NVIDIA is a “learning machine” that constantly evolves by adapting to new opportunities that are hard to solve, that only we can tackle, and that matter to the world. This is our life’s work, to amplify human imagination and intelligence. Make the choice and join our diverse team today!
 

As a member of the Hardware Infrastructure Farm team, you will provide leadership in the design and implementation of groundbreaking compute clusters that power all silicon development across NVIDIA. We seek an expert to build and operate these clusters at high reliability, efficiency, and performance, and to drive foundational improvements and automation that improve engineers' productivity. As a Site Reliability Engineer, you are responsible for the big picture of how our systems relate to each other, and we use a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems, and proactive identification of potential outages factor into the iterative improvement that is key to both product quality and interesting, dynamic day-to-day work. SRE's culture of diversity, intellectual curiosity, problem solving, and openness is important to our success. Our organization brings together people with a wide variety of backgrounds, experiences, and perspectives. We encourage them to collaborate, think big, and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow.

What you’ll be doing:

  • Manage and support workload and resource schedulers in a large-scale HPC environment.

  • Automate Everything: Develop scripts to automate deployment, configuration management, and operational monitoring.

  • Develop solutions for complex computing resource management requirements.

  • Extract and leverage grid performance metrics for troubleshooting and performance optimization.

  • Troubleshoot Complex Issues: Perform comprehensive troubleshooting from bare metal to application level, ensuring system reliability and efficiency.

  • Develop, define and document standard methodologies to share with internal teams.

  • Collaborate with domain experts to improve how our chip development process utilizes our infrastructure.

  • Directly contribute to the overall quality of our next-generation chips and improve their time to market.

What we need to see:

  • Extensive knowledge of job scheduler administration (e.g., IBM Spectrum LSF or Slurm).

  • Proficiency in administering CentOS/RHEL Linux distributions.

  • In-depth understanding of container technologies like Docker.

  • Proficiency in UNIX scripting languages and Python.

  • Excellent problem-solving skills, with the ability to analyze complex systems, identify bottlenecks, and implement scalable solutions.

  • Excellent communication and teamwork skills, with the ability to work effectively with diverse teams and individuals.

  • 10+ years of experience in a large, distributed Linux environment.

  • BS in Computer Science, similar degree or equivalent experience.

Ways to stand out from the crowd:

  • Experience analyzing and tuning performance for a variety of HPC or EDA workloads.

  • Solid understanding of cluster configuration management tools such as Ansible.

  • Proficiency in Perl for maintaining legacy automation scripts.

  • Deep understanding of distributed system principles.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until July 29, 2025.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Top Skills

Ansible
CentOS
Docker
HPC
LSF
Perl
Python
RHEL
Unix Scripting
