A backend-agnostic standalone KV-aware router service for Dynamo deployments. For details on how KV-aware routing works, see the Router Guide.
The standalone router provides configurable KV-aware routing for any set of workers in a Dynamo deployment. It can be used for disaggregated serving (e.g., routing to prefill workers), multi-tier architectures, or any scenario requiring intelligent KV cache-aware routing decisions.
This component is fully configurable and works with any Dynamo backend (vLLM, TensorRT-LLM, SGLang, etc.) and any worker endpoint.
python -m dynamo.router \
--endpoint dynamo.prefill.generate \
--router-block-size 64 \
--router-reset-states \
--no-router-track-active-blocksRequired:
--endpoint: Full endpoint path for workers in the formatnamespace.component.endpoint(e.g.,dynamo.prefill.generate)
Router Configuration:
All router options use the --router-* prefix (e.g., --router-block-size, --router-kv-overlap-score-weight, --router-temperature, --router-kv-events / --no-router-kv-events, --router-replica-sync, --router-snapshot-threshold, --router-reset-states, --router-track-active-blocks / --no-router-track-active-blocks). Legacy names without the prefix (e.g., --block-size, --kv-events) are still accepted but deprecated. For detailed descriptions, see the Router Guide.
The standalone router exposes two endpoints via the Dynamo runtime:
generate: Routes requests to the best worker and streams back generation results (KV-aware routing).best_worker_id: Given token IDs, returns the best worker ID for the request without routing; useful for debugging or custom routing logic.
Clients call the generate endpoint to stream completions, or call best_worker_id to decide which worker to use and then contact that worker directly.
Note
This is an alternative advanced setup. The recommended approach for disaggregated serving is to use the frontend's automatic prefill routing, which activates when you register workers with ModelType.Prefill. See the Router Guide for the default setup.
Use this manual setup if you need explicit control over prefill routing configuration or want to manage prefill and decode routers separately.
See examples/backends/vllm/launch/disagg_router.sh for a complete example.
# Start frontend router for decode workers
python -m dynamo.frontend \
--router-mode kv \
--http-port 8000 \
--kv-overlap-score-weight 0 # Pure load balancing for decode
# Start standalone router for prefill workers
python -m dynamo.router \
--endpoint dynamo.prefill.generate \
--router-block-size 64 \
--router-reset-states \
--no-router-track-active-blocks
# Start decode workers
python -m dynamo.vllm --model MODEL_NAME --block-size 64 &
# Start prefill workers
python -m dynamo.vllm --model MODEL_NAME --block-size 64 --disaggregation-mode prefill &Note
Why --no-router-track-active-blocks for prefill routing?
Active block tracking is used for load balancing across decode (generation) phases. For prefill-only routing, decode load is not relevant, so disabling this reduces overhead and simplifies the router state.
Why --router-block-size is required for standalone routers:
Unlike the frontend router which can infer block size from the ModelDeploymentCard (MDC) during worker registration, standalone routers cannot access the MDC and must have the block size explicitly specified. This is a work in progress to enable automatic inference.
Note
Block Size Matching: The block size must match across:
- Standalone router (
--router-block-size) - All worker instances (backend-specific, e.g.
--block-sizefor vLLM)
Endpoint Matching:
The --endpoint argument must match where your target workers register. For example:
- vLLM prefill workers:
dynamo.prefill.generate - vLLM decode workers:
dynamo.backend.generate - Custom workers:
<your_namespace>.<your_component>.<your_endpoint>
To integrate the standalone router with a backend:
- Workers should register at the endpoint specified by the
--endpointargument - Clients call the
router.generateendpoint to stream completions (router selects the best worker), or callrouter.best_worker_idto get the best worker ID and then send requests to that worker - Router state is updated automatically as requests are routed; no separate "free" call is required
See components/src/dynamo/vllm/handlers.py for a reference implementation (search for prefill_router_client).
- Router Guide - Configuration and tuning for KV-aware routing
- Router Design - Architecture details and event transport modes
- Frontend Router - Main HTTP frontend with integrated routing
- Router Benchmarking - Performance testing and tuning