Omni Router Usage#
The SGLang-Omni Router is an external HTTP router for Omni V1 deployments. It fronts multiple complete Omni V1 API servers and exposes one OpenAI-compatible endpoint to clients.
Use the router when you launch more than one sgl-omni serve process and want
one stable endpoint for request distribution, health tracking, and worker-pool
control.
Router Topology#
The router is an external HTTP process:
client
|
v
sgl-omni-router
|
+-- sgl-omni serve worker A
+-- sgl-omni serve worker B
Each worker is a complete Omni V1 HTTP server. The router does not load model weights or split a single request across workers. It selects one routable worker for each request, forwards the original request bytes, and returns the worker response with router diagnostic headers.
Launch Workers and Router From YAML#
For a local homogeneous pool, sgl-omni-router can start the worker replicas
and then start the router after all managed workers pass /health:
sgl-omni-router \
--host 0.0.0.0 \
--port 8008 \
--launcher-config examples/configs/qwen3_omni_router.yaml \
--policy round_robin \
--health-failure-threshold 2 \
--health-success-threshold 1 \
--health-check-interval-secs 10 \
--log-level info
Example launcher config:
launcher:
backend: local
model_path: Qwen/Qwen3-Omni-30B-A3B-Instruct
model_name: qwen3-omni
num_workers: 2
num_gpus_per_worker: 1
worker_host: 127.0.0.1
worker_base_port: 8011
worker_extra_args: "--config examples/configs/qwen3_omni_colocated_h20.yaml --colocate"
wait_timeout: 600
backend: local means the router process starts and manages worker
subprocesses on the same machine. The launched workers are complete Omni V1
servers started with sgl-omni serve; they are not partial
pipeline stages. The router waits for every managed worker to pass /health
before it starts accepting client traffic, and it stops those managed workers
when the router exits.
num_gpus_per_worker controls automatic GPU grouping. The default Qwen3-Omni
router example uses colocated workers: each complete speech worker runs on one
GPU through examples/configs/qwen3_omni_colocated_h20.yaml. With
num_workers: 2 and num_gpus_per_worker: 1, the launcher assigns GPU 0 to
the first worker and GPU 1 to the second worker when two CUDA devices are
visible.
Use examples/configs/qwen3_omni_colocated_h200.yaml instead for single-H200
workers.
Set worker_gpu_ids only when you need explicit placement. Each entry maps one
CUDA_VISIBLE_DEVICES value to one worker, for example
worker_gpu_ids: ["0", "1"] for two one-GPU colocated Qwen3-Omni workers. Use
worker_extra_args: "--text-only" only if you intentionally want text-output
workers instead of speech-output workers.
Use worker_extra_args for public Omni V1 serve options that are specific to
the worker process, such as --mem-fraction-static, --thinker-tp-size, or
--text-only. These arguments are passed to sgl-omni serve
after the launcher-owned flags. When no memory flags are provided, Omni V1 uses
its normal auto-sizing path.
Use worker_capabilities when managed workers intentionally expose only part
of the Omni API surface. For example, text-only workers should not advertise
speech or audio-output support:
launcher:
backend: local
model_path: Qwen/Qwen3-Omni-30B-A3B-Instruct
model_name: qwen3-omni
num_workers: 2
num_gpus_per_worker: 1
worker_extra_args: "--text-only"
worker_capabilities:
- chat
- streaming
- image_input
- audio_input
- video_input
If worker_capabilities is omitted and worker_extra_args contains
--text-only, the router registers the managed workers with the same text-only
capability set shown above.
For short audio-input / text-output MMSU-style workloads, use the fused text-path Qwen3-Omni config instead of the default speech-colocated worker:
launcher:
backend: local
model_path: Qwen/Qwen3-Omni-30B-A3B-Instruct
model_name: qwen3-omni
num_workers: 2
num_gpus_per_worker: 1
worker_extra_args: "--config examples/configs/qwen3_omni_mmsu.yaml --text-only"
This keeps preprocessing, encoders, aggregation, thinker, and decode in one worker process while leaving the general speech-colocated topology unchanged.
Launch Worker Servers Manually#
Start each Omni V1 worker separately. The example below launches two colocated Qwen3-Omni speech workers on different GPUs and ports:
CUDA_VISIBLE_DEVICES=0 sgl-omni serve \
--model-path Qwen/Qwen3-Omni-30B-A3B-Instruct \
--model-name qwen3-omni \
--config examples/configs/qwen3_omni_colocated_h20.yaml \
--colocate \
--host 0.0.0.0 \
--port 8011
CUDA_VISIBLE_DEVICES=1 sgl-omni serve \
--model-path Qwen/Qwen3-Omni-30B-A3B-Instruct \
--model-name qwen3-omni \
--config examples/configs/qwen3_omni_colocated_h20.yaml \
--colocate \
--host 0.0.0.0 \
--port 8012
Worker URLs passed to the router must be base URLs such as
http://127.0.0.1:8011. Do not include endpoint paths, query strings, or
fragments.
Launch the Router#
Start the router with the worker URLs:
sgl-omni-router \
--host 0.0.0.0 \
--port 8008 \
--worker-urls http://127.0.0.1:8011 http://127.0.0.1:8012 \
--policy round_robin \
--health-failure-threshold 2 \
--health-success-threshold 1 \
--health-check-interval-secs 10 \
--log-level info
Router Arguments#
The table below lists the router command-line arguments.
Argument |
Default |
Description |
|---|---|---|
|
|
Host interface for the router HTTP server. |
|
|
Port for the router HTTP server. |
|
not set |
Space-separated Omni V1 worker base URLs for a homogeneous worker pool. |
|
not set |
JSON file that defines workers and optional per-worker model/capability metadata. |
|
not set |
YAML file for a managed local worker pool. Do not use with |
|
|
Routing policy: |
|
not set |
Model name assigned to every worker when using |
|
|
Timeout for proxied worker requests. |
|
|
Maximum request body size accepted by the router, in bytes. |
|
|
Maximum number of HTTP connections used by the router’s upstream worker client. |
|
|
Consecutive failed health checks or routed request failures before a worker becomes unhealthy. |
|
|
Consecutive successful health checks before an unhealthy or unknown worker becomes healthy. |
|
|
Timeout for one worker health-check request. |
|
|
Interval between background worker health checks. |
|
|
Worker endpoint used by background health checks. |
|
|
Router and Uvicorn log level. |
Routing policies:
round_robin: rotates through routable workers in order.least_request: selects a routable worker with the fewest active data-plane requests, then round-robins among ties.random: selects a random routable worker.
Pass exactly one of --launcher-config, --worker-urls, or
--worker-config. Use --worker-config when workers serve different models or
only a subset of Omni capabilities:
{
"workers": [
{
"url": "http://127.0.0.1:8011",
"model": "qwen3-omni",
"capabilities": ["chat", "image_input", "video_input"]
},
{
"url": "http://127.0.0.1:8012",
"model": "qwen3-omni",
"capabilities": ["chat", "audio_input", "audio_output", "speech"]
}
]
}
Then launch with:
sgl-omni-router \
--host 0.0.0.0 \
--port 8008 \
--worker-config workers.json \
--policy least_request
Check Router and Worker State#
The router exposes separate process and worker-pool health surfaces:
curl -i http://127.0.0.1:8008/live
curl -i http://127.0.0.1:8008/ready
curl -i http://127.0.0.1:8008/health
curl -s http://127.0.0.1:8008/workers
curl -s http://127.0.0.1:8008/v1/models
The endpoints have different meanings:
GET /live: the router process is running. This does not wait for workers to become healthy.GET /ready: at least one worker is routable. This returns503when all workers are unhealthy, dead, disabled, or still unknown.GET /health: worker-pool health summary. This returns503when no worker is routable.GET /workers: detailed worker state, includinghealth_state,disabled,routable,active_requests, failure counters, and last error.GET /v1/models: merged model list from routable workers.
Send Requests Through the Router#
Point clients at the router port instead of the worker ports. The request schema is the same OpenAI-compatible schema used by each worker server.
Image input with text output:
curl -i http://127.0.0.1:8008/v1/chat/completions \
-H "Content-Type: application/json" \
-H "x-request-id: router-image-1" \
-d '{
"model": "qwen3-omni",
"messages": [
{"role": "user", "content": "How many cars are there in the image? Answer briefly."}
],
"images": ["tests/data/cars.jpg"],
"modalities": ["text"],
"max_tokens": 16
}'
Streaming text:
curl -N http://127.0.0.1:8008/v1/chat/completions \
-H "Content-Type: application/json" \
-H "x-request-id: router-stream-1" \
-d '{
"model": "qwen3-omni",
"messages": [{"role": "user", "content": "Say hello briefly."}],
"stream": true,
"max_tokens": 16
}'
The router preserves the original request body. For ordinary JSON requests, it parses a bounded amount of request metadata for worker selection and forwards the original bytes to the selected worker.
Manage Workers#
Add a worker at runtime:
curl -s http://127.0.0.1:8008/workers \
-H "Content-Type: application/json" \
-d '{"url":"http://127.0.0.1:8013","model":"qwen3-omni"}'
Disable a worker without deleting it:
curl -s -X PUT http://127.0.0.1:8008/workers/http%3A%2F%2F127.0.0.1%3A8013 \
-H "Content-Type: application/json" \
-d '{"disabled":true}'
Mark a worker dead for manual quarantine:
curl -s -X PUT http://127.0.0.1:8008/workers/http%3A%2F%2F127.0.0.1%3A8013 \
-H "Content-Type: application/json" \
-d '{"is_dead":true}'
Recover a manually dead worker:
curl -s -X PUT http://127.0.0.1:8008/workers/http%3A%2F%2F127.0.0.1%3A8013 \
-H "Content-Type: application/json" \
-d '{"is_dead":false}'
Delete a worker:
curl -s -X DELETE http://127.0.0.1:8008/workers/http%3A%2F%2F127.0.0.1%3A8013
Worker update requests are atomic. If an update returns 400, the live worker
state is not partially changed.
Routing Behavior#
The router only selects workers that are healthy, not disabled, and capable of serving the request.
The default worker capability set represents a complete Omni V1 replica:
chatspeechstreamingimage_inputaudio_inputvideo_inputaudio_output
The router infers required capabilities from each request:
/v1/chat/completionsrequireschatstream: truerequiresstreamingimages,image, or image message parts requireimage_inputaudios,audio_inputs, or audio message parts requireaudio_inputvideos,video, or video message parts requirevideo_inputmodalities: ["audio"]oraudiooutput fields requireaudio_output/v1/audio/speechrequiresspeech, plusstreamingfor streamed speech
Register narrower worker capabilities only when a worker cannot serve one of those request classes.
Large JSON requests are not fully parsed by the router. With a homogeneous pool of complete Omni V1 replicas, no extra headers are needed. With mixed models, provide a model hint. With mixed worker capabilities, provide a capability hint when the router cannot infer a single safe worker set:
X-SGLang-Omni-Route-Model: requested model for mixed-model poolsX-SGLang-Omni-Route-Capabilities: comma-separated capabilities such asimage_input,audio_input,video_input,audio_output, orstreamingX-SGLang-Omni-Route-Stream:trueorfalsefor large streaming requests
These headers are router-only hints and are not forwarded to workers.
Request Diagnostics#
Routed responses include:
X-SGLang-Omni-Worker: selected worker IDX-SGLang-Omni-Request-ID: request ID from the request headers or body, or a router-generated IDX-SGLang-Omni-Route-Attempt: currently1
Router logs include a route-completion record for buffered and streaming requests. Each record contains the request ID, selected worker, path, stream flag, inferred capabilities, status code, duration, and terminal outcome.
Failure Handling#
If a worker health check or routed request fails repeatedly, the worker becomes unhealthy and leaves the routable pool. It can return to healthy after the configured number of successful health checks.
To inspect failover behavior:
Stop one worker.
Call
GET /workersand check itsconsecutive_failures,health_state, androutablefields.Send another request through the router and verify that it uses a remaining routable worker.
Restart the stopped worker and wait for it to become healthy again.
For a source checkout without installed console scripts, verify the module entry point with:
python -m sglang_omni_router.serve --help