This repository provides instructions to reproduce inference performance data from the NVIDIA Deep Learning Performance - AI Inference page.
Before configuring the orchestrator, ensure you have downloaded the required NVFP4 model weights from Hugging Face:
- DeepSeek-R1 (DSR1): DeepSeek-R1-0528-NVFP4-v2
- Qwen3.5-397B: Qwen/Qwen3.5-397B-A17B
- Kimi-K2.5: Kimi-K2.5-NVFP4
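If your cluster has outbound internet access, the weights can be fetched programmatically with `huggingface_hub`. The sketch below is a convenience helper, not part of this repository; the `nvidia/` and `moonshotai/` org prefixes are assumptions, so verify the exact repo IDs on huggingface.co before downloading:

```python
import sys

# Short model names mapped to Hugging Face repo IDs.
# NOTE: the "nvidia/" and "moonshotai/" org prefixes are assumptions;
# confirm the actual repo IDs on huggingface.co before use.
MODELS = {
    "dsr1": "nvidia/DeepSeek-R1-0528-NVFP4-v2",
    "qwen3.5-397b": "Qwen/Qwen3.5-397B-A17B",
    "kimi-k2.5": "moonshotai/Kimi-K2.5-NVFP4",
}


def download_weights(name: str, local_dir: str) -> str:
    """Download one model's weights into local_dir and return the local path."""
    # Imported lazily so the script can be read/tested without huggingface_hub.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=MODELS[name], local_dir=local_dir)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python download_weights.py <model-name> <local-dir>
    print(download_weights(sys.argv[1], sys.argv[2]))
```

Run it from an NFS-mounted directory so all nodes can see the downloaded weights.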
Benchmarking is orchestrated using srt-slurm, a command-line tool for distributed LLM inference benchmarks on SLURM clusters. (Support for benchmarking on Kubernetes clusters is coming soon.)
- Clone and Install:

```bash
# Enter a directory on NFS, accessible by all nodes of your cluster.
git clone https://github.com/ishandhanani/srt-slurm.git
cd srt-slurm
git checkout recipes/moe

# Initialize the virtual environment and install dependencies.
uv venv
uv pip install -e .
```

- Initialize SLURM Workspace: Execute the setup command below. You will be prompted to specify your SLURM account and partition.

```bash
# One-time setup (downloads NATS/ETCD, creates srtslurm.yaml)
make setup ARCH=aarch64  # or ARCH=x86_64
```

- Configure Model Paths: The setup script generates an `srtslurm.yaml` file. Edit this file to append your local model paths:
```yaml
model_path:
  dsr1: /path/to/local/dsr1
  qwen3.5-397b: /path/to/local/qwen3.5-397b
  kimi-k2.5: /path/to/local/kimi-k2.5
```

Depending on your cluster configuration, you may need to specify additional arguments in `srtslurm.yaml`. See https://github.com/ishandhanani/srt-slurm/blob/main/srtslurm.yaml.example for details.
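A mistyped model path typically surfaces only after a SLURM job has been queued and launched. The sketch below (not part of srt-slurm) checks the `model_path` entries before submission; it uses a minimal hand-rolled parser for this one flat mapping to stay dependency-free, though a real tool should use PyYAML:

```python
from pathlib import Path


def parse_model_paths(text: str) -> dict[str, str]:
    """Extract the name -> path entries under the top-level model_path: key."""
    paths, in_section = {}, False
    for line in text.splitlines():
        if line.strip() == "model_path:":
            in_section = True
            continue
        if in_section:
            if line.startswith((" ", "\t")) and ":" in line:
                name, _, path = line.strip().partition(":")
                paths[name.strip()] = path.strip()
            elif line.strip():
                break  # next top-level key ends the section
    return paths


def missing_paths(text: str) -> list[str]:
    """Return configured model paths that do not exist on this filesystem."""
    return [p for p in parse_model_paths(text).values() if not Path(p).is_dir()]
```

Before submitting a job, `missing_paths(Path("srtslurm.yaml").read_text())` should return an empty list on every node that will run the benchmark.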
To execute a benchmark, apply the target configuration file using the srtctl CLI:
```bash
srtctl apply -f <path-to-config-file>
```

Available benchmarking configurations for published performance data are mapped below. Select the recipe that matches your target performance profile.
| Model | 1K/1K | 8K/1K | 1K/8K | 128K/8K |
|---|---|---|---|---|
| DSR1 | GB300 | GB300 | GB300 | |
| gpt-oss-120b | B200, H200 | B200, H200 | | |
| Kimi-K2.5 | B200 | B200 | B200 | |
| Qwen3.5-397B | B200 | B200 | B200 | |
Terminology used in recipes is explained in the Appendix.
For questions or feedback, please contact LLMBenchmarks@nvidia.com.