Qwen3-Omni#
Qwen3-Omni is a multi-modal model that accepts text, image, audio, and video input and can produce text-only or text + audio output. This page covers every supported server configuration β use the generator to get the exact launch command for your hardware, then check the tables to confirm your combination is supported.
Prerequisites#
docker pull frankleeeee/sglang-omni:dev
docker run -it --shm-size 32g --gpus all frankleeeee/sglang-omni:dev /bin/zsh
git clone https://github.com/sgl-project/sglang-omni.git
cd sglang-omni
uv venv .venv -p 3.12 && source .venv/bin/activate
uv pip install -v .
Server Configuration#
Use the selector below to generate the exact launch command for your configuration.
Compatibility Matrix#
Colocated topology requires --config examples/configs/qwen3_omni_colocated_h20.yaml
(or qwen3_omni_colocated_h200.yaml on H200) to set per-stage GPU memory budgets.
Mode |
Topology |
Thinker TP |
Precision |
Status |
|---|---|---|---|---|
Thinker-only |
β |
β |
BF16 |
β |
Thinker-only |
β |
β |
FP8 |
β |
Thinker-Talker |
Disaggregated |
TP=1 |
BF16 |
β |
Thinker-Talker |
Disaggregated |
TP=1 |
FP8 |
β |
Thinker-Talker |
Disaggregated |
TP=2 |
BF16 |
β |
Thinker-Talker |
Disaggregated |
TP=2 |
FP8 |
β |
Thinker-Talker |
Colocated |
TP=1 |
BF16 |
β |
Thinker-Talker |
Colocated |
TP=1 |
FP8 |
β |
Input / Output Modalities#
All input modality combinations work with both text-only and speech servers.
modalities: ["text", "audio"] requires a speech-mode server (omit --text-only).
Input |
Output |
Speech server |
Minimal request body |
Notes |
|---|---|---|---|---|
Text |
Text |
No |
|
β |
Image + text |
Text |
No |
|
β |
Audio |
Text |
No |
|
content must be ββ when the query is spoken |
Image + audio |
Text |
No |
|
content must be ββ when the query is spoken |
Image |
Text |
No |
|
content must be ββ when query comes from image |
Video + text |
Text |
No |
|
β |
Video + audio |
Text |
No |
|
content must be ββ when the query is spoken |
Video |
Text |
No |
|
content must be ββ when query comes from video |
Text |
Text + Audio |
Yes |
|
β |
Image + text |
Text + Audio |
Yes |
|
β |
Audio |
Text + Audio |
Yes |
|
content must be ββ when the query is spoken |
Image + audio |
Text + Audio |
Yes |
|
content must be ββ when the query is spoken |
Image |
Text + Audio |
Yes |
|
content must be ββ when query comes from image |
Video + text |
Text + Audio |
Yes |
|
β |
Video + audio |
Text + Audio |
Yes |
|
content must be ββ when the query is spoken |
Video |
Text + Audio |
Yes |
|
content must be ββ when query comes from video |
Sampling Parameters#
Standard sampling parameters apply to the thinker stage. When modalities includes "audio", the additional talker-specific parameters below control the speech generation independently.
Parameter |
Type |
Default |
Applies to |
|---|---|---|---|
|
float |
|
Thinker |
|
float |
|
Thinker |
|
int |
|
Thinker |
|
float |
|
Thinker |
|
float |
|
Thinker |
|
int |
|
Thinker |
|
str | list |
|
Thinker |
|
int |
|
Thinker |
|
bool |
|
Both |
|
dict |
|
Talker (speech output only) β format config, e.g. |
|
float |
|
Talker (audio output only) |
|
float |
|
Talker (audio output only) |
|
int |
|
Talker (audio output only) |
|
float |
|
Talker (audio output only) |
|
int |
|
Talker (audio output only) |
|
dict |
|
Per-stage sampling override |
|
dict |
|
Per-stage non-sampling params |
|
float |
|
Frame sampling rate for video input (uses server default if unset) |
|
int |
|
Maximum number of frames sampled from a video |
|
int |
|
Minimum pixels per video frame |
|
int |
|
Maximum pixels per video frame |
|
int |
|
Total pixel budget across all video frames |
Known Limitations#
modalities: ["text", "audio"]has no effect on a text-only server. No error is raised β the response simply contains no audio. Use a speech-mode server (without--text-only) to get audio output.contentmust be""when the query is entirely inaudios,videos, orimages. Leaving a text query incontentalongside audio causes the model to process both, which is usually not what you want.Colocated topology does not support
--thinker-tp-size 2. The server raises aValueErrorat startup (βQwen Phase 1 colocation does not support thinker TPβ). Use disaggregated topology for TP=2.Requests that exceed the modelβs context length are rejected with an error. The preprocessor raises a
ValueErrorwhen the prompt token count alone meets or exceedsmax_seq_len, or whenprompt tokens + max_new_tokens β₯ max_seq_len. Reduce input length or lowermax_tokensto stay within the limit.