Optimize AI workload performance on NVIDIA AI infrastructure.
Overview
NVIDIA Performance Benchmarking is a suite of tools, recipes, and services that take the guesswork out of measuring performance of AI workloads and infrastructure. NVIDIA Performance Benchmarking provides a standardized and objective means of gauging performance across platforms, essential to optimizing AI workloads and speeding outcomes.
Using Performance Explorer, users can identify the GPU count that best balances total training time and cost. The objective is to find, for a given workload, the number of GPUs that maximizes throughput while minimizing expense, across projects and teams.
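The tradeoff above can be sketched with a toy cost model. This is an illustrative assumption, not Performance Explorer's actual methodology: the scaling exponent, base training time, candidate GPU counts, and hourly rate are all hypothetical.

```python
def training_time_hours(n_gpus, base_hours=1000.0, alpha=0.9):
    # Hypothetical sub-linear scaling model: speedup = n_gpus ** alpha,
    # with alpha < 1 standing in for communication overhead at scale.
    return base_hours / (n_gpus ** alpha)

def training_cost_usd(n_gpus, hourly_rate_per_gpu=3.0):
    # Total cost = wall-clock hours * GPUs * hypothetical $/GPU-hour.
    return training_time_hours(n_gpus) * n_gpus * hourly_rate_per_gpu

def best_gpu_count(max_hours, candidates=(8, 16, 32, 64, 128, 256)):
    # Cheapest GPU count that still meets the training-time budget.
    feasible = [n for n in candidates if training_time_hours(n) <= max_hours]
    return min(feasible, key=training_cost_usd) if feasible else None
```

With sub-linear scaling, total cost rises as GPUs are added, so the cheapest feasible configuration is the smallest GPU count that still meets the deadline; for example, under these assumed numbers a 100-hour budget selects 16 GPUs.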
Get the most out of your AI workload environments and unlock the full potential of your AI infrastructure with NVIDIA Performance Benchmarking.
Determine which platform delivers the fastest time to train, or the desired GPU scale, and at what cost, using real-time, end-to-end performance data.
Tune and optimize your AI workloads according to end-to-end metrics tailored to the performance of modern generative AI applications.
Evaluate beyond the GPUs, including infrastructure software, cloud platforms, and application configurations, to gain a holistic view of workload performance.
Get a standardized and objective means of gauging platform performance, and understand the expected performance for given workloads or use cases.
In MLPerf Inference v6.0 (April 2026), systems powered by NVIDIA Blackwell Ultra GPUs (GB300 NVL72) delivered the highest throughput across the widest range of models and scenarios. On DeepSeek-R1, GB300 NVL72 delivered 2.5 million tokens per second, up to 2.7x higher token throughput than its debut submissions just six months prior, as a result of TensorRT-LLM software updates.
When measuring AI inference cost-effectiveness, look beyond compute pricing or FLOPs per dollar; those metrics give an incomplete view. The metric that matters is cost per token, the price-performance actually delivered, especially on mixture-of-experts (MoE) and reasoning models. NVIDIA GB300 NVL72 delivers AI inference at $0.123 per million tokens at 116 TPS/user interactivity using NVIDIA Dynamo and TensorRT™-LLM—the lowest cost per token among major platforms, according to SemiAnalysis InferenceX benchmarks as of April 2026.
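As a concrete illustration, cost per token falls directly out of hourly cluster cost and sustained token throughput. The rate and throughput below are hypothetical and are not figures from the benchmarks cited in this document.

```python
def cost_per_million_tokens(cluster_cost_per_hour_usd, tokens_per_second):
    """Dollars per million tokens served: the price-performance metric
    that matters for inference, rather than raw FLOPs per dollar."""
    tokens_per_hour = tokens_per_second * 3600
    return cluster_cost_per_hour_usd / tokens_per_hour * 1_000_000

# Hypothetical system: $100/hour of cluster time, 500,000 tokens/s sustained.
print(round(cost_per_million_tokens(100.0, 500_000), 4))  # 0.0556
```

The same formula shows why software optimization alone moves the metric: holding the hourly price fixed, doubling sustained throughput halves the cost per token.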
NVIDIA Blackwell B200 achieves $0.02 per million tokens on GPT-OSS-120B using TensorRT-LLM, according to SemiAnalysis InferenceX benchmarks as of April 2026—a 5x improvement from launch-day costs of $0.11/M tokens achieved through software optimization alone.
NVIDIA B300 (Blackwell Ultra) was designed to meet the increased compute and memory-capacity demands of long-context and reasoning AI inference. With a 1.5x increase in dense FP4 performance, 2x the attention performance, and 1.5x more HBM memory than NVIDIA B200, B300 boosts AI reasoning throughput at the largest context lengths.
Several independent third-party AI inference benchmarks are widely used across the industry today. MLPerf Inference is the industry-standard benchmark from MLCommons, measuring throughput and latency across standardized workloads. InferenceX, by SemiAnalysis, is the first independent benchmark for measuring total cost of compute across diverse models and real-world scenarios. InferenceX v2 extends this to benchmark the full Pareto frontier. As of April 2026, NVIDIA Blackwell Ultra (GB300 NVL72) leads across all three benchmark suites.
Partner with NVIDIA to achieve optimal AI workload performance per total cost of ownership (TCO), backed by data-driven, validated benchmarks.
Access technical documentation about NVIDIA Cloud Accelerator.