[Kernel] Porting the TRTLLM minimax_allreduce_rms kernels#37045
youkaichao merged 50 commits into vllm-project:main
Conversation
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Code Review
This pull request introduces a new CUDA kernel for minimax_reduce_rms operations, including a float4 variant, and integrates it into the vLLM framework. Key changes include adding the kernel to the build system, defining its parameters and structures, registering the operations with PyTorch, and implementing a LamportWorkspace for managing CUDA IPC memory. The MambaMixer in minimax_m2.py is updated to utilize a new MiniMaxText01RMSNormAR class for fused Q+K RMS normalization. Review comments highlight the need to address a FIXME comment and a TODO regarding potentially incorrect indexing logic in minimax_reduce_rms_kernel.cu, as well as a TODO for a performance optimization in the local reduction step. Additionally, the max_tokens parameter in linear_attn.py is hardcoded and should be made configurable to prevent memory issues, and a large block of commented-out code needs to be removed for clarity.
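To make the fused operation concrete: the kernel described above combines the cross-rank allreduce with RMS normalization of the fused Q and K halves of each token. A minimal NumPy reference of that math (function name, shapes, and `eps` are illustrative, not the kernel's actual API):

```python
import numpy as np

def allreduce_rms_qk(shards, q_weight, k_weight, head_dim, eps=1e-6):
    """Sum per-rank shards (the allreduce), then RMS-normalize the
    Q and K halves of each token independently."""
    x = np.sum(shards, axis=0)                # allreduce: elementwise sum over ranks
    q, k = x[:, :head_dim], x[:, head_dim:]   # split the fused Q+K hidden dim

    def rms_norm(t, w):
        var = np.mean(t * t, axis=-1, keepdims=True)
        return t / np.sqrt(var + eps) * w     # input * rsqrt(variance + eps) * weight

    return np.concatenate([rms_norm(q, q_weight), rms_norm(k, k_weight)], axis=-1)

rng = np.random.default_rng(0)
shards = rng.standard_normal((2, 4, 8))       # 2 ranks, 4 tokens, head_dim=4 each for Q and K
out = allreduce_rms_qk(shards, np.ones(4), np.ones(4), head_dim=4)
print(out.shape)  # (4, 8)
```

The fused CUDA kernel performs all of this in a single launch; the sketch only fixes the expected numerics.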
// step 3: calculate the rms norm (input * rsqrt(variance + eps))
// load norm weight
// TODO: correct the access_id_in_token
// TODO: we can do local reduce only within q threads and k threads
// respectively
The TODO on line 377 suggests a potential performance optimization for the local reduction step. The current implementation uses all threads for reducing both Q and K variances, while it might be more efficient to perform these reductions within their respective thread groups (Q-threads and K-threads). Implementing this could improve the kernel's performance.
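To illustrate the proposed split: the result of the reduction is unchanged, only which threads participate in each pass changes. A small Python model of the two schemes (the 8-thread layout and partial values are hypothetical, not the kernel's actual configuration):

```python
# Hypothetical thread model: 8 threads hold partial variance sums;
# threads 0-3 own Q partials, threads 4-7 own K partials.
import numpy as np

partials = np.array([1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0])
Q, K = slice(0, 4), slice(4, 8)

# Current scheme: one block-wide reduction pass per quantity; every
# thread participates first in the Q reduction, then in the K reduction.
q_var_all = partials[Q].sum()   # all 8 threads busy for this pass
k_var_all = partials[K].sum()   # ...and again for this one

# Proposed scheme: Q-threads reduce within their own group while
# K-threads reduce within theirs, so both passes run concurrently.
q_var_grp = partials[Q].sum()
k_var_grp = partials[K].sum()

# Same result either way; the win is fewer serialized reduction steps.
assert (q_var_all, k_var_all) == (q_var_grp, k_var_grp)
print(q_var_grp, k_var_grp)  # 10.0 100.0
```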
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Hi @jeejeelee, the pre-commit checks have failed. Please run:
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Force-pushed from 0b6b533 to 2e63a80.
@@ -0,0 +1,152 @@
# SPDX-License-Identifier: Apache-2.0
We need to plug the test into CI.
This pull request has merge conflicts that must be resolved before it can be merged.
…ct#37045) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> (cherry picked from commit ecd1ea1) Signed-off-by: khluu <khluu000@gmail.com>
Cherry-picks the MiniMax QK norm allreduce+RMSNorm Lamport fusion pass from vllm-project#37045 into our 0.17.1-based branch. This replaces the per-layer QK norm allreduce + variance computation + RMSNorm with a single fused Lamport-based CUDA kernel, eliminating multiple kernel launches per layer (62 layers x ~3 kernels saved).

Changes:
- New CUDA kernel: csrc/minimax_reduce_rms_kernel.{cu,h}
- New Lamport IPC workspace: vllm/model_executor/layers/mamba/lamport_workspace.py
- New compilation fusion pass: vllm/compilation/passes/fusion/minimax_qk_norm_fusion.py
- Config: add fuse_minimax_qk_norm to PassConfig with None default
- Pass manager: register MiniMaxQKNormPass after AllReduceFusionPass
- Compile ranges: add split point so fusion only applies to decode-size batches
- Bindings: register minimax_allreduce_rms and minimax_allreduce_rms_qk ops
- Test: tests/kernels/core/test_minimax_reduce_rms.py

Adapted compile_ranges_endpoints (upstream API) to compile_ranges_split_points (our 0.17.1 API). Model change (.contiguous() removal) was already done in commit 2cdf163.

Enable with: --compilation-config '{"pass_config":{"fuse_minimax_qk_norm":true}}'

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
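The Lamport workspace this commit introduces rests on a simple idea: buffers are pre-filled with a sentinel value that real data never takes, so a reader can detect the arrival of a peer's payload by polling the data itself, with no separate flag exchange. A minimal single-process sketch of that protocol (sentinel choice and function names are illustrative, not the LamportWorkspace API):

```python
import numpy as np

SENTINEL = np.float32(np.inf)  # a value real activations are assumed never to take

# "Workspace" buffer shared between a writer (peer rank) and a reader.
buf = np.full(8, SENTINEL, dtype=np.float32)

def lamport_write(buf, data):
    # The peer writes its payload directly; the data itself doubles as the flag.
    buf[:] = data

def lamport_read(buf):
    # The reader spins until no sentinel remains, i.e. the full payload landed.
    while np.any(buf == SENTINEL):
        pass  # on the GPU this is a polling loop over volatile loads
    return buf.copy()

lamport_write(buf, np.arange(8, dtype=np.float32))
out = lamport_read(buf)
print(out.tolist())
```

In the real kernel the writer and reader are different ranks over CUDA IPC memory, and the workspace must be re-filled with the sentinel between uses; the sketch only shows why no extra synchronization flag is needed.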
…ct#37045) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>


Purpose
See: NVIDIA/TensorRT-LLM#12163
Plan
Test Plan
Accuracy Verification (69f231c)
Performance
20260408 Accuracy Update
Test Result