
NVIDIA Deep Learning Inference Performance Reproduction Guide

This repository provides instructions to reproduce the inference performance data from the NVIDIA Deep Learning Performance - AI Inference page.

Prerequisites

Before configuring the orchestrator, ensure you have downloaded the required NVFP4 model weights from Hugging Face.

Environment Setup

Benchmarking is orchestrated using srt-slurm, a command-line tool for distributed LLM inference benchmarks on SLURM clusters. (Support for benchmarking on Kubernetes clusters is coming soon.)
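Before starting the steps below, it can help to confirm the basic tools they rely on are present. The snippet below is a hypothetical pre-flight check (not part of srt-slurm); `uv` is also required, but may live outside the default `PATH` on fresh nodes.

```shell
# Hypothetical pre-flight check: confirm the tools used in the setup steps
# are installed before cloning and building.
set -e
for tool in git make; do
  command -v "$tool" >/dev/null 2>&1 || { echo "missing: $tool" >&2; exit 1; }
done
echo "prerequisite tools found"
```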

1. **Clone and Install:**

   ```shell
   # Enter a directory on NFS, accessible by all nodes of your cluster.
   git clone https://github.com/ishandhanani/srt-slurm.git
   cd srt-slurm
   git checkout recipes/moe

   # Initialize the virtual environment and install dependencies.
   uv venv
   uv pip install -e .
   ```
2. **Initialize SLURM Workspace:** Execute the setup command below. You will be prompted to specify your SLURM account and partition.

   ```shell
   # One-time setup (downloads NATS/ETCD, creates srtslurm.yaml)
   make setup ARCH=aarch64  # or ARCH=x86_64
   ```
3. **Configure Model Paths:** The setup script generates an srtslurm.yaml file. Edit this file to append your local model paths:

   ```yaml
   model_path:
     dsr1: /path/to/local/dsr1
     qwen3.5-397b: /path/to/local/qwen3.5-397b
     kimi-k2.5: /path/to/local/kimi-k2.5
   ```

Depending on your cluster configuration, you may need to specify additional arguments in srtslurm.yaml. See https://github.com/ishandhanani/srt-slurm/blob/main/srtslurm.yaml.example for details.
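Because the model weights must be reachable from every node, a quick sanity check of the configured paths before launching a benchmark can save a failed SLURM job. The helper below is illustrative (not part of srt-slurm); it assumes the simple two-level `model_path` layout shown above and uses only the Python standard library rather than a full YAML parser.

```python
"""Check that every path under `model_path` in srtslurm.yaml exists locally.

Hypothetical convenience script, not part of the srt-slurm tooling.
"""
import os
import sys


def read_model_paths(config_path):
    """Parse the `model_path:` mapping from a flat srtslurm.yaml-style file."""
    paths = {}
    in_section = False
    with open(config_path) as f:
        for raw in f:
            line = raw.rstrip("\n")
            if line.strip() == "model_path:":
                in_section = True
                continue
            if in_section:
                # The section ends at the first non-indented, non-empty line.
                if line and not line.startswith((" ", "\t")):
                    break
                if ":" in line:
                    name, _, path = line.strip().partition(":")
                    paths[name.strip()] = path.strip()
    return paths


if __name__ == "__main__" and os.path.exists("srtslurm.yaml"):
    missing = {name: path
               for name, path in read_model_paths("srtslurm.yaml").items()
               if not os.path.isdir(path)}
    for name, path in missing.items():
        print(f"MISSING {name}: {path}")
    if missing:
        sys.exit(1)
    print("all model paths found")
```

Run it from the srt-slurm checkout after editing srtslurm.yaml; a nonzero exit code means at least one configured path is not a directory on the current node.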

Running the Benchmarks

To execute a benchmark, apply the target configuration file using the srtctl CLI:

```shell
srtctl apply -f <path-to-config-file>
```

Available benchmarking configurations for published performance data are mapped below. Select the recipe that matches your target performance profile.

| Model | 1K/1K | 8K/1K | 1K/8K | 128K/8K |
| --- | --- | --- | --- | --- |
| DSR1 | GB300 | GB300 | GB300 | |
| gpt-oss-120b | B200, H200 | B200, H200 | | |
| Kimi-K2.5 | B200 | B200 | B200 | |
| Qwen3.5-397B | B200 | B200 | B200 | |

Support

Terminology used in recipes is explained in the Appendix.

For questions or to provide feedback, please contact LLMBenchmarks@nvidia.com.