Running GPT-OSS-120B at 500 tokens per second on NVIDIA GPUs

Day-zero model performance optimization work is a mix of experimentation, bug fixing, and benchmarking guided by intuition and experience. This writeup outlines the process we followed to achieve SOTA latency and throughput for GPT OSS 120B on NVIDIA GPUs at launch with the Baseten Inference Stack.
The day an open-source model like OpenAI’s new gpt-oss-120b is released, we race to make it as performant as possible for our customers. As a launch partner for OpenAI’s first open-source LLM since 2019, we wanted to give developers a great experience with the new LLMs.
By the end of launch day, we were the clear leader for both latency and throughput among providers running on NVIDIA GPUs, according to public data from real-world usage on OpenRouter.

Optimizing performance on a new model is a substantial engineering challenge. Thanks to our flexible inference stack and the collective expertise of our model performance engineering team, we are able to roll out performance improvements by the hour on new models.
In fact, in the time it took to write this blog post, we added another 100 tokens per second while maintaining 100% uptime.

Model performance efforts included:
Testing and benchmarking across inference frameworks (TensorRT-LLM, vLLM, and SGLang)
Ensuring compatibility with Hopper and Blackwell GPU architectures
Integrating with key pieces of our inference stack, including NVIDIA Dynamo
Layering in our favorite performance optimizations, like KV cache-aware routing and speculative decoding with Eagle (a conceptual sketch of cache-aware routing follows this list)
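To make the routing idea concrete, here is a minimal, purely illustrative sketch of KV cache-aware routing: a request goes to whichever replica already holds the longest matching prompt prefix in its KV cache, with current load as the tie-breaker. This is a toy model of the concept, not how Baseten's router or NVIDIA Dynamo implements it; the `Replica` class, block size, and scoring are invented for this sketch.

```python
from dataclasses import dataclass, field

BLOCK = 64  # toy cache-block size in tokens, chosen only for illustration

@dataclass
class Replica:
    name: str
    cached_blocks: set = field(default_factory=set)  # hashes of cached prefix blocks
    active_requests: int = 0

def prefix_block_hashes(tokens: list[int]) -> list[int]:
    """Hash each successive BLOCK-sized prefix, roughly how paged prefix caching keys blocks."""
    hashes, prefix = [], ()
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        prefix = prefix + tuple(tokens[i:i + BLOCK])
        hashes.append(hash(prefix))
    return hashes

def route(tokens: list[int], replicas: list[Replica]) -> Replica:
    """Send the request to the replica with the longest contiguous cached prefix,
    breaking ties by current load."""
    def score(r: Replica) -> tuple[int, int]:
        hits = 0
        for h in prefix_block_hashes(tokens):
            if h not in r.cached_blocks:
                break
            hits += 1
        return (hits, -r.active_requests)

    best = max(replicas, key=score)
    best.active_requests += 1
    best.cached_blocks.update(prefix_block_hashes(tokens))  # simplification: cache fills instantly
    return best
```

Production routers work from real cache-residency information reported by the inference engines; the sketch only captures the prefix-match-then-load-balance idea.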
Below are the steps we took to achieve our goal of SOTA performance with full context window support.
Step 1: Running first inference
The first step is running baseline inference however possible. Running inference on a model requires support at the inference framework, hardware architecture, and model server level.
Inspired by GPUs, we parallelized this effort across multiple engineers: one tried vLLM, another SGLang, and a third TensorRT-LLM. We quickly got TensorRT-LLM working, which was fortunate, as it is usually the most performant inference framework for LLMs.
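For reference, a baseline "first inference" with one of these frameworks can be as simple as the sketch below, here using vLLM's offline API. It assumes a vLLM build with gpt-oss support and four visible GPUs; it is a starting point for sanity checks, not our production setup.

```python
from vllm import LLM, SamplingParams

# Baseline sanity check: load gpt-oss-120b and generate one short completion.
# Assumes a vLLM version with gpt-oss support and enough GPU memory for TP=4.
llm = LLM(model="openai/gpt-oss-120b", tensor_parallel_size=4)

params = SamplingParams(temperature=0.6, max_tokens=128)
outputs = llm.generate(["Explain speculative decoding in one paragraph."], params)

print(outputs[0].outputs[0].text)
```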

With TensorRT-LLM, it was important to serve the model on both Hopper and Blackwell architectures to support widely available H100 GPUs and to access the speed of B200 GPUs for our public model APIs.
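When targeting both architectures, it helps to detect which one a deployment landed on before choosing a configuration. Below is a hedged sketch using PyTorch's compute-capability query (Hopper parts report major version 9, data-center Blackwell parts report 10); the per-architecture settings in the dict are placeholders, not our production configuration.

```python
import torch

def gpu_generation() -> str:
    """Rough architecture check via CUDA compute capability:
    Hopper (H100/H200) reports major version 9; data-center Blackwell (B200) reports 10."""
    major, _ = torch.cuda.get_device_capability(0)
    if major >= 10:
        return "blackwell"
    if major == 9:
        return "hopper"
    return "other"

# Placeholder per-architecture settings -- illustrative keys and values only.
ENGINE_CONFIG = {
    "hopper": {"tensor_parallel_size": 4, "moe_backend": "TRITON"},
    "blackwell": {"tensor_parallel_size": 4, "moe_backend": "TRTLLM"},
}.get(gpu_generation(), {})
```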
One key tenet of the Baseten Inference Runtime is flexibility. This is especially useful when serving new models with novel architectures. Navigating, and when necessary updating, the support matrix across the entire stack requires the ability to swap between tools quickly.
Step 2: Fixing compatibility bugs
Whenever a new model architecture is released, there will be subtle bugs and issues when integrating it into existing frameworks. The GPT OSS release introduced multiple new technologies, including Harmony, a new response format.

A large part of our engineering work was iteratively fixing bugs and testing the models for both speed and correctness. Where we could, we contributed the fixes that worked for us back to the open-source projects.
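Much of that testing boils down to repeatedly hitting the server and checking both output quality and basic latency. Here is a hedged sketch of that kind of smoke test against any OpenAI-compatible endpoint (vLLM, SGLang, and TensorRT-LLM all expose one); the URL, API key, and prompts are placeholders.

```python
import time
from openai import OpenAI

# Placeholder endpoint and key for an OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

checks = [
    ("What is 17 * 24?", "408"),               # arithmetic sanity check
    ("Name the capital of France.", "Paris"),  # factual sanity check
]

for prompt, expected in checks:
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,
    )
    text = resp.choices[0].message.content or ""
    latency = time.perf_counter() - start
    status = "ok" if expected in text else "CHECK OUTPUT"
    print(f"{latency:.2f}s  {status}  {prompt!r} -> {text[:60]!r}")
```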
Thanks to the hard work of open source maintainers worldwide, there are multiple excellent options for running GPT OSS, and bugs are getting identified and fixed quickly.
Step 3: Optimizing model configuration
While OpenAI advertises that GPT OSS 120B can be run on a single H100 GPU, optimized deployments parallelize the model across 4 or 8 GPUs for improved performance and throughput. There are two parallelism approaches worth considering for this model: Tensor Parallelism and Expert Parallelism.

We found that Tensor Parallelism offered better latency, while Expert Parallelism offered better system throughput. As we are prioritizing latency, we selected Tensor Parallelism.
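That call came down to measuring both deployments under identical load. A rough version of the comparison, written against two OpenAI-compatible endpoints, might look like the sketch below; the URLs, concurrency, and request counts are placeholders, and a real benchmark would also sweep sequence lengths and load levels.

```python
import asyncio
import time
from openai import AsyncOpenAI

PROMPT = "Summarize the benefits of tensor parallelism in two sentences."

async def bench(base_url: str, concurrency: int = 32, requests: int = 128) -> None:
    """Fire `requests` chat completions with `concurrency` in flight; report
    mean per-request latency and aggregate completion-token throughput."""
    client = AsyncOpenAI(base_url=base_url, api_key="not-needed")
    sem = asyncio.Semaphore(concurrency)
    latencies, tokens = [], 0

    async def one() -> None:
        nonlocal tokens
        async with sem:
            t0 = time.perf_counter()
            resp = await client.chat.completions.create(
                model="openai/gpt-oss-120b",
                messages=[{"role": "user", "content": PROMPT}],
                max_tokens=256,
            )
            latencies.append(time.perf_counter() - t0)
            tokens += resp.usage.completion_tokens  # assumes the server reports usage

    wall = time.perf_counter()
    await asyncio.gather(*(one() for _ in range(requests)))
    wall = time.perf_counter() - wall
    print(f"{base_url}: {sum(latencies) / len(latencies):.2f}s mean latency, "
          f"{tokens / wall:.0f} tok/s aggregate")

# Placeholder URLs for a tensor-parallel and an expert-parallel deployment.
asyncio.run(bench("http://tp-deployment:8000/v1"))
asyncio.run(bench("http://ep-deployment:8000/v1"))
```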
Additionally, we adopted the TensorRT-LLM MoE Backend, which is supported on Blackwell but not Hopper. This backend adds improved CUDA kernels that outperform the previous Triton backend. For more details on server configuration, see NVIDIA’s TensorRT-LLM documentation for GPT OSS optimization.
We packaged our preferred Hopper configurations for the 120B and 20B models as dedicated deployments in our model library, and we used Blackwell for our Model API.

Next steps in performance optimization
These first-pass performance improvements achieved SOTA latency and throughput, but there is a lot more headroom to improve performance on GPT OSS 120B.
One exciting update we’re working on is adding speculative decoding. Speculative decoding uses a smaller “draft” model to guess at future tokens, which are then validated by the target model. We’re big fans of Eagle 3 for speculation, but our inference stack supports 10+ algorithms to ensure that we can pick the right fit for each model and workload.
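As a refresher on why this helps, here is a toy, framework-agnostic sketch of greedy speculative decoding: the draft model cheaply proposes a few tokens, the target model checks them, and every proposal that matches the target's own greedy choice is accepted essentially for free. The `draft_next` and `target_greedy` functions are stand-ins for real model calls.

```python
def speculative_step(prefix, draft_next, target_greedy, k=4):
    """One round of greedy speculative decoding.

    draft_next(tokens)    -> the draft model's next token (cheap)
    target_greedy(tokens) -> the target model's next token (expensive; a real
                             engine scores all k+1 positions in one batched pass)
    Returns the tokens accepted this round.
    """
    # 1. Draft model proposes k tokens autoregressively.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    # 2. Target model verifies: keep proposals while they match its own greedy
    #    choice; on the first mismatch, substitute the target's token instead.
    accepted, ctx = [], list(prefix)
    for t in proposal:
        expected = target_greedy(ctx)
        if t != expected:
            accepted.append(expected)  # correction token, round ends early
            break
        accepted.append(t)
        ctx.append(t)
    else:
        accepted.append(target_greedy(ctx))  # bonus token when all k proposals match
    return accepted
```

Eagle 3 improves on the plain draft-model setup by training a lightweight draft head on the target model's hidden states, which raises acceptance rates, but the propose-verify-accept loop is the same idea.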

If this kind of performance optimization work sounds exciting to you, we’re actively hiring model performance engineers. But for most AI engineering teams, performance tuning shouldn’t stand in the way of testing new models in your product. Whether you’re looking to run GPT OSS 120B or any other open-source or custom model, get in touch with us for help optimizing your latency and throughput!