# NVIDIA Deep Learning Inference Performance Reproduction Guide

This repository provides instructions to reproduce inference performance data from the [NVIDIA Deep Learning Performance - AI Inference](https://developer.nvidia.com/deep-learning-performance-training-inference/ai-inference) page.

## Prerequisites

Before configuring the orchestrator, ensure you have downloaded the required NVFP4 model weights from Hugging Face:

- **DeepSeek-R1 (DSR1):** [DeepSeek-R1-0528-NVFP4-v2](https://huggingface.co/nvidia/DeepSeek-R1-0528-NVFP4-v2)
- **Qwen3.5-397B:** [Qwen/Qwen3.5-397B-A17B](https://huggingface.co/Qwen/Qwen3.5-397B-A17B)
- **Kimi-K2.5:** [Kimi-K2.5-NVFP4](https://huggingface.co/nvidia/Kimi-K2.5-NVFP4)

## Environment Setup

Benchmarking is orchestrated using [srt-slurm](https://github.com/ishandhanani/srt-slurm), a command-line tool for distributed LLM inference benchmarks on SLURM clusters. (Support for benchmarking on Kubernetes clusters is coming soon.)

1. **Clone and install:**

   ```bash
   # Enter a directory on NFS, accessible by all nodes of your cluster.
   git clone https://github.com/ishandhanani/srt-slurm.git
   cd srt-slurm
   git checkout recipes/moe

   # Initialize the virtual environment and install dependencies
   uv venv
   uv pip install -e .
   ```

2. **Initialize the SLURM workspace:**

   Execute the setup command below. You will be prompted to specify your SLURM account and partition.

   ```bash
   # One-time setup (downloads NATS/ETCD, creates srtslurm.yaml)
   make setup ARCH=aarch64  # or ARCH=x86_64
   ```

3. **Configure model paths:**

   The setup script generates an `srtslurm.yaml` file. Edit this file to append your local model paths:

   ```yaml
   model_path:
     dsr1: /path/to/local/dsr1
     qwen3.5-397b: /path/to/local/qwen3.5-397b
     kimi-k2.5: /path/to/local/kimi-k2.5
   ```

   Depending on your cluster configuration, you may need to specify additional arguments in `srtslurm.yaml`. See [srtslurm.yaml.example](https://github.com/ishandhanani/srt-slurm/blob/main/srtslurm.yaml.example) for details.

## Running the Benchmarks

To execute a benchmark, apply the target recipe configuration file with the `srtctl` CLI:

```bash
srtctl apply -f <recipe.yaml>
```
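As a concrete illustration, the command below applies one of the DSR1 recipes linked in the table that follows. The recipe path mirrors the repository link for the GB300 1K/1K configuration and may differ on your branch or checkout; substitute the exact recipe file you intend to run.

```bash
# Illustrative example: apply the DSR1 1K/1K max-throughput recipe for GB300.
# Path taken from the recipe link in the table below; adjust it to match your checkout.
srtctl apply -f recipes/gb300-fp4/1k1k/max_tpt.yaml
```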
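Once a recipe is applied, the benchmark runs as jobs on your SLURM cluster. The commands below are standard SLURM utilities, not part of `srtctl`, and are shown only as a generic way to monitor progress; they assume your site exposes the usual `squeue`/`sacct` tooling.

```bash
# Standard SLURM monitoring commands (not specific to srt-slurm):
squeue -u $USER                                         # list your queued and running jobs
sacct -j <jobid> --format=JobID,JobName,State,Elapsed   # summarize a job after it finishes
```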
Available benchmarking configurations for published performance data are mapped in the table below. Select the recipe that matches your target performance profile.

| Model            | 1K/1K | 8K/1K | 1K/8K | 128K/8K |
| :--------------- | :---- | :---- | :---- | :------ |
| **DSR1**         | [GB300](https://github.com/ishandhanani/srt-slurm/blob/main/recipes/gb300-fp4/1k1k/max_tpt.yaml) | [GB300](https://github.com/ishandhanani/srt-slurm/tree/main/recipes/gb300-fp4/8k1k) | [GB300](https://github.com/ishandhanani/srt-slurm/blob/main/recipes/gb300-fp4/1k8k/max-tpt.yaml) | |
| **gpt-oss-120b** | [B200](https://github.com/ishandhanani/srt-slurm/tree/main/recipes/trtllm/b200-fp4/1k1k/mtp), [H200](https://github.com/ishandhanani/srt-slurm/tree/main/recipes/trtllm/h200/1k1k) | [B200](https://github.com/ishandhanani/srt-slurm/tree/main/recipes/trtllm/b200-fp4/8k1k/mtp), [H200](https://github.com/ishandhanani/srt-slurm/tree/main/recipes/trtllm/h200/8k1k/mtp) | | |
| **Kimi-K2.5**    | [B200](https://github.com/ishandhanani/srt-slurm/tree/recipes/moe/recipes/kimi-k2.5/b200/1k1k) | [B200](https://github.com/ishandhanani/srt-slurm/tree/recipes/moe/recipes/kimi-k2.5/b200/8k1k) | [B200](https://github.com/ishandhanani/srt-slurm/tree/recipes/moe/recipes/kimi-k2.5/b200/1k8k) | |
| **Qwen3.5-397B** | [B200](https://github.com/ishandhanani/srt-slurm/tree/recipes/moe/recipes/qwen3.5-397b/b200/1k1k) | [B200](https://github.com/ishandhanani/srt-slurm/tree/recipes/moe/recipes/qwen3.5-397b/b200/8k1k) | [B200](https://github.com/ishandhanani/srt-slurm/tree/recipes/moe/recipes/qwen3.5-397b/b200/1k8k) | |

## Support

Terminology used in recipes is explained in the [Appendix](APPENDIX.md).

For questions or to provide feedback, please contact LLMBenchmarks@nvidia.com.