
BentoML

Inference Optimization Engineer

Remote
3 Locations
Mid level
About BentoML

BentoML is a leading inference platform provider that helps AI teams run large language models and other generative AI workloads at scale. With support from investors such as DCM, enterprises around the world rely on us for consistent scalability and performance in production. Our portfolio includes both open source and commercial products, and our goal is to help each team build its own competitive advantage through AI.

Role

As an Inference Optimization Engineer, you will improve the speed and efficiency of large language models at the GPU kernel level, through the inference engine, and across distributed architectures. You will profile real workloads, remove bottlenecks, and lift each layer of the stack to new performance ceilings. Every gain you unlock will flow straight into open source code and power fleets of production models, cutting GPU costs for teams around the world. By publishing blog posts and giving conference talks, you will become a trusted voice on efficient LLM inference at large scale.

Example projects:

  • https://bentoml.com/blog/structured-decoding-in-vllm-a-gentle-introduction

  • https://bentoml.com/blog/benchmarking-llm-inference-backends

  • https://bentoml.com/blog/25x-faster-cold-starts-for-llms-on-kubernetes

Responsibilities
  • Latency & throughput - Identify bottlenecks and optimize inference efficiency in single-GPU, multi-GPU, and multi-node serving setups.

  • Benchmarking - Build repeatable tests that model production traffic; track and report vLLM, SGLang, TRT-LLM, and future runtimes (a minimal harness sketch follows this list).

  • Resource efficiency - Reduce memory use and compute cost with mixed precision, better KV-cache handling, quantization, and speculative decoding.

  • Serving features - Improve batching, caching, load balancing, and model-parallel execution.

  • Knowledge sharing - Write technical posts, contribute code, and present findings to the open-source community.
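
To make the benchmarking bullet concrete, here is a minimal sketch of a load test against an OpenAI-compatible completions endpoint such as the one vLLM serves. The endpoint URL, model name, prompt set, and concurrency level are placeholder assumptions; a production harness would replay recorded traffic and compare runtimes side by side.

    # Minimal latency/throughput sketch against an OpenAI-compatible endpoint.
    # ENDPOINT and MODEL are placeholders; point them at your own deployment.
    import time
    from concurrent.futures import ThreadPoolExecutor

    import requests

    ENDPOINT = "http://localhost:8000/v1/completions"  # e.g. a local vLLM server
    MODEL = "my-model"                                  # hypothetical model name
    PROMPTS = ["Summarize the benefits of KV-cache reuse."] * 64
    CONCURRENCY = 8

    def one_request(prompt: str) -> float:
        """Send a single completion request and return its wall-clock latency."""
        start = time.perf_counter()
        resp = requests.post(
            ENDPOINT,
            json={"model": MODEL, "prompt": prompt, "max_tokens": 128},
            timeout=120,
        )
        resp.raise_for_status()
        return time.perf_counter() - start

    def main() -> None:
        t0 = time.perf_counter()
        with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
            latencies = sorted(pool.map(one_request, PROMPTS))
        elapsed = time.perf_counter() - t0
        p50 = latencies[len(latencies) // 2]
        p95 = latencies[int(len(latencies) * 0.95)]
        print(f"requests/s: {len(PROMPTS) / elapsed:.2f}")
        print(f"p50 latency: {p50:.3f}s  p95 latency: {p95:.3f}s")

    if __name__ == "__main__":
        main()

Reporting percentile latencies alongside aggregate throughput is what lets runs against vLLM, SGLang, or TRT-LLM be compared on an equal footing.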

Qualifications
  • Deep understanding of transformer architecture and inference engine internals.

  • Hands-on experience speeding up model serving through batching, caching, and load balancing.

  • Experienced with inference engines such as vLLM, SGLang, or TRT-LLM (upstream contributions are a plus).

  • Experienced with inference optimization techniques: quantization, distillation, speculative decoding, or similar.

  • Proficiency in CUDA and use of profiling tools like Nsight, nvprof, or CUPTI (a lightweight timing sketch follows this list). Proficiency in Triton and ROCm is a bonus.

  • Track record of blog posts, conference talks, or open-source projects in ML systems is a bonus.
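
On the profiling side, full traces come from tools like Nsight Systems, but a quick first pass can be as simple as CUDA event timing. Below is a small sketch using CUDA events through PyTorch; a CUDA-capable device is assumed, and the matrix multiply is only a stand-in for a real inference kernel.

    # Sketch: timing a GPU operation with CUDA events via PyTorch.
    # Assumes a CUDA device is available; the matmul is a stand-in workload.
    import torch

    def time_gpu_op(n: int = 4096, iters: int = 20) -> float:
        """Return the mean kernel time in milliseconds for an n x n matmul."""
        a = torch.randn(n, n, device="cuda", dtype=torch.float16)
        b = torch.randn(n, n, device="cuda", dtype=torch.float16)

        # Warm-up so one-time initialization costs are excluded from the measurement.
        for _ in range(3):
            torch.matmul(a, b)
        torch.cuda.synchronize()

        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            torch.matmul(a, b)
        end.record()
        torch.cuda.synchronize()  # wait until the recorded events have completed

        return start.elapsed_time(end) / iters  # elapsed_time() is in milliseconds

    if __name__ == "__main__":
        print(f"mean matmul time: {time_gpu_op():.3f} ms")

Event timing like this is useful for quick regression checks; attributing time to specific kernels still calls for Nsight or CUPTI-based tooling.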

Why join us
  • Direct impact – your optimizations ship straight into open source and production deployments, cutting real GPU costs.

  • Technical scope – operate distributed LLM inference and large GPU clusters worldwide.

  • Customer reach – support organizations around the globe that rely on BentoML.

  • Influence – mentor teammates, guide open-source contributors, and become a go-to voice on efficient inference in the community.

  • Remote work – work from where you are most productive and collaborate with teammates in North America and Asia.

  • Compensation – competitive salary, equity, learning budget, and paid conference travel.

Top Skills

CUDA
ROCm
SGLang
Triton
TRT-LLM
vLLM

