
Luma AI

Senior Machine Learning Engineer - Hardware Abstractions & Performance Optimization

Sorry, this job was removed at 08:04 p.m. (EST) on Thursday, Jul 24, 2025
Remote
Hiring Remotely in United States
220K-300K Annually


Luma’s mission is to build multimodal AI to expand human imagination and capabilities. We believe that multimodality is critical for intelligence. To go beyond language models and build more aware, capable and useful systems, the next step function change will come from vision. So, we are working on training and scaling up multimodal foundation models for systems that can see and understand, show and explain, and eventually interact with our world to effect change.

We are looking for engineers with significant experience designing and maintaining highly efficient systems and code that can be optimized to run on multiple hardware platforms, bringing our state-of-the-art models to as many people as possible at the best performance per dollar.

Responsibilities
  • Ensure efficient implementation of models & systems with a focus on designing, maintaining, and writing abstractions that scale beyond NVIDIA/CUDA hardware.

  • Identify and remedy efficiency bottlenecks (memory, speed, utilization, communication) by profiling and implementing high-performance PyTorch code, deferring to Triton or similar kernel-level languages as necessary.

  • Benchmark our products across a variety of hardware and software to help the product team understand the optimal tradeoffs between latency, throughput, and cost at various degrees of parallelism.

  • Work with our partners to help them identify bottlenecks and push forward new iterations of hardware and software.

  • Work closely with the rest of the research team to ensure systems are designed to be as efficient as possible from start to finish, and raise potential hardware-integration issues early.

Must-have experience
  • Experience optimizing for memory, latency, and throughput in PyTorch.

    • Bonus: experience with non-NVIDIA systems

  • Experience using torch.compile / torch_xla (PyTorch/XLA).

  • Experience benchmarking and profiling GPU & CPU code in PyTorch for optimal device utilization (examples: torch profiler, memory profilers, trace viewers, custom tooling).

  • Experience building tools & abstractions to ensure models run optimally on different hardware and software stacks.

  • Experience working with transformer models and attention implementations.

  • Experience with parallel inference, particularly tensor parallelism and pipeline parallelism.

Good-to-have experience
  • Experience with high-performance Triton/CUDA and writing custom PyTorch kernels and ops. Top candidates will be able to write fused kernels for common hot paths, understand when to use lower-level features like tensor cores or warp intrinsics, and know where these tools can be most impactful.

  • Experience writing high-performance parallel C++. Bonus if done within an ML context with PyTorch, such as for data loading, data processing, or inference code.

  • Experience building inference / demo prototype code (including Gradio, Docker, etc.).

