Omni Model Usage#
This guide uses Qwen3-Omni as an example omni model with SGLang-Omni and the OpenAI-compatible API. Qwen3-Omni supports multi-modal input (text, image, audio) and can produce text-only or text + audio output depending on the mode.
Prerequisites#
Install sglang-omni by following the Installation guide.
Text-Only Mode#
Text-only mode runs the thinker pipeline on a single GPU. It accepts multi-modal input (text, image, audio) and produces text output only.
Launch the Server#
sgl-omni serve \
--model-path Qwen/Qwen3-Omni-30B-A3B-Instruct \
--text-only \
--port 8008
For MMSU-style audio-input / text-output benchmarks with short requests, use the fused text-path config so the full text path stays inside one worker process:
sgl-omni serve \
--model-path Qwen/Qwen3-Omni-30B-A3B-Instruct \
--config examples/configs/qwen3_omni_mmsu.yaml \
--text-only \
--port 8008
Image and Text Input#
Send an image with a text question to get a text response.
cURL
curl -X POST http://localhost:8008/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-omni",
"messages": [{"role": "user", "content": "How many cars are there in the picture?"}],
"images": ["tests/data/cars.jpg"],
"modalities": ["text"],
"max_tokens": 16
}'
Python
import requests

# Image + text question, text-only answer.
resp = requests.post(
    "http://localhost:8008/v1/chat/completions",
    json={
        "model": "qwen3-omni",
        "messages": [{"role": "user", "content": "How many cars are there in the picture?"}],
        "images": ["tests/data/cars.jpg"],
        "modalities": ["text"],
        "max_tokens": 16,
    },
)
resp.raise_for_status()
result = resp.json()
print(result["choices"][0]["message"]["content"])
Audio and Image Input#
Send an audio file together with an image. The audio contains the spoken question (“How many cars are there in the picture?”) and the model answers based on both inputs.
Note: Set "content": "" (empty string) on the user message when all semantic content comes from audio, video, or images rather than text.
cURL
curl -X POST http://localhost:8008/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-omni",
"messages": [{"role": "user", "content": ""}],
"images": ["tests/data/cars.jpg"],
"audios": ["tests/data/query_to_cars.wav"],
"modalities": ["text"],
"max_tokens": 16
}'
Python
import requests

# The question is spoken in the audio file, so the text content stays empty.
resp = requests.post(
    "http://localhost:8008/v1/chat/completions",
    json={
        "model": "qwen3-omni",
        "messages": [{"role": "user", "content": ""}],
        "images": ["tests/data/cars.jpg"],
        "audios": ["tests/data/query_to_cars.wav"],
        "modalities": ["text"],
        "max_tokens": 16,
    },
)
resp.raise_for_status()
result = resp.json()
print(result["choices"][0]["message"]["content"])
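The empty-content rule from the note above can be wrapped in a small helper. This is a minimal sketch, not part of SGLang-Omni: build_payload is a hypothetical name, and it only assembles the request fields already used in the examples on this page.

```python
from typing import Optional, Sequence


def build_payload(
    text: Optional[str] = None,
    images: Sequence[str] = (),
    audios: Sequence[str] = (),
    videos: Sequence[str] = (),
    modalities: Sequence[str] = ("text",),
    max_tokens: int = 16,
    model: str = "qwen3-omni",
) -> dict:
    """Assemble a /v1/chat/completions request body (hypothetical helper)."""
    if text is None and not (images or audios or videos):
        raise ValueError("provide text or at least one media input")
    payload = {
        "model": model,
        # Use an empty string when the question is carried by the media inputs.
        "messages": [{"role": "user", "content": text or ""}],
        "modalities": list(modalities),
        "max_tokens": max_tokens,
    }
    # Only include media keys that are actually populated.
    for key, paths in (("images", images), ("audios", audios), ("videos", videos)):
        if paths:
            payload[key] = list(paths)
    return payload


payload = build_payload(
    images=["tests/data/cars.jpg"],
    audios=["tests/data/query_to_cars.wav"],
)
```

The resulting dict can be passed as json=payload to requests.post, exactly as in the examples above.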
Video and Audio Input#
Send a video with a spoken audio question. The model watches the video, hears the question, and responds with text.
The Video-AMME CI benchmark uses this same modality combination: video input plus a spoken question/options WAV, with only routing and answer-format instructions in the text message.
cURL
curl -X POST http://localhost:8008/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-omni",
"messages": [{"role": "user", "content": ""}],
"videos": ["tests/data/draw.mp4"],
"audios": ["tests/data/query_to_draw.wav"],
"modalities": ["text"],
"max_tokens": 16
}'
Python
import requests

# Video input plus a spoken question, text-only answer.
resp = requests.post(
    "http://localhost:8008/v1/chat/completions",
    json={
        "model": "qwen3-omni",
        "messages": [{"role": "user", "content": ""}],
        "videos": ["tests/data/draw.mp4"],
        "audios": ["tests/data/query_to_draw.wav"],
        "modalities": ["text"],
        "max_tokens": 16,
    },
)
resp.raise_for_status()
result = resp.json()
print(result["choices"][0]["message"]["content"])
Speech Mode#
Speech mode runs the full 9-stage pipeline, typically across multiple GPUs; a colocated single-GPU launch is also available. It produces both text (from the thinker) and audio (from the talker) output.
Launch the Server#
Speech mode can run as a colocated one-GPU worker using the colocated config:
sgl-omni serve \
--model-path Qwen/Qwen3-Omni-30B-A3B-Instruct \
--config examples/configs/qwen3_omni_colocated_h20.yaml \
--colocate \
--port 8008
Use examples/configs/qwen3_omni_colocated_h200.yaml on single-H200 workers.
For manual multi-GPU placement, use the example script:
python examples/run_qwen3_omni_speech_server.py \
--model-path Qwen/Qwen3-Omni-30B-A3B-Instruct \
--gpu-thinker 0 \
--gpu-talker 1 \
--gpu-code-predictor 1 \
--gpu-code2wav 1 \
--port 8008
Or use the CLI without --text-only for the standard speech pipeline:
sgl-omni serve \
--model-path Qwen/Qwen3-Omni-30B-A3B-Instruct \
--port 8008
By default, leave mem_fraction_static unset and let SGLang-Omni auto-size the
memory budget of the SGLang autoregressive (AR) stages. If a specific machine
needs manual tuning, you can pin the value globally or per AR stage:
sgl-omni serve \
--model-path Qwen/Qwen3-Omni-30B-A3B-Instruct \
--port 8008 \
--mem-fraction-static 0.88
Use per-stage flags when the thinker and talker need different budgets:
sgl-omni serve \
--model-path Qwen/Qwen3-Omni-30B-A3B-Instruct \
--port 8008 \
--thinker-mem-fraction-static 0.88 \
--talker-mem-fraction-static 0.88
The speech server launcher exposes the same per-stage controls:
python examples/run_qwen3_omni_speech_server.py \
--model-path Qwen/Qwen3-Omni-30B-A3B-Instruct \
--gpu-thinker 0 \
--gpu-talker 1 \
--gpu-code-predictor 1 \
--gpu-code2wav 1 \
--port 8008 \
--thinker-mem-fraction-static 0.88 \
--talker-mem-fraction-static 0.88
--mem-fraction-static applies to both Qwen AR stages (thinker and talker).
Per-stage flags override the global value for that stage. Values must be
greater than 0 and less than 1.
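The precedence and range rules can be summarized in a few lines of Python. This is an illustrative sketch only, not SGLang-Omni's actual implementation; resolve_mem_fractions and its argument names are hypothetical. None stands for "unset", in which case the server auto-sizes the budget.

```python
from typing import Dict, Optional


def resolve_mem_fractions(
    global_value: Optional[float] = None,
    thinker: Optional[float] = None,
    talker: Optional[float] = None,
) -> Dict[str, Optional[float]]:
    """Return the effective value per AR stage; None means auto-size."""
    resolved = {}
    for stage, override in (("thinker", thinker), ("talker", talker)):
        # A per-stage value wins over the global one for that stage.
        value = override if override is not None else global_value
        if value is not None and not (0.0 < value < 1.0):
            raise ValueError(f"{stage}: value must be > 0 and < 1, got {value}")
        resolved[stage] = value
    return resolved


# Global 0.88 with a talker-specific override.
print(resolve_mem_fractions(global_value=0.88, talker=0.80))
# Nothing set: both stages are left to auto-sizing.
print(resolve_mem_fractions())
```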
Single-GPU FP8 on H100/H20#
SGLang-Omni can also serve native FP8 Qwen3-Omni checkpoints. Native FP8 uses the checkpoint quantization config when loading the thinker and talker AR stages, while keeping the same Qwen3-Omni request format shown below.
For one-GPU H100/H20 colocated launch, use the FP8 colocated config:
sgl-omni serve \
--config examples/configs/qwen3_omni_fp8_colocated.yaml \
--colocate \
--model-name qwen3-omni \
--port 8008
The config file contains the FP8 checkpoint path:
marksverdhei/Qwen3-Omni-30B-A3B-FP8. You can still pass --model-path to
override the config value.
The FP8 path leaves the dense FP8 GEMM backend on SGLang's automatic selection
and defaults native FP8 MoE to CUTLASS when supported. For Qwen3-Omni pipeline
launches, SGLANG_JIT_DEEPGEMM_PRECOMPILE=0 is applied as a default unless the
operator has already set that environment variable. This disables SGLang’s
all-M DeepGEMM precompile session while keeping DeepGEMM available for dense
FP8 GEMMs.
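The "default unless already set" behavior described above matches the semantics of os.environ.setdefault. The sketch below illustrates those semantics only; where and how SGLang-Omni actually applies the default is not shown in this guide, and apply_deepgemm_default is a hypothetical name.

```python
import os

ENV_VAR = "SGLANG_JIT_DEEPGEMM_PRECOMPILE"


def apply_deepgemm_default(env=os.environ) -> str:
    """Set the variable to "0" only if the operator has not set it already."""
    return env.setdefault(ENV_VAR, "0")


# Operator did not set the variable: the default "0" is applied.
unset_env = {}
print(apply_deepgemm_default(unset_env))  # prints 0

# Operator exported the variable as "1": the explicit value is kept.
preset_env = {ENV_VAR: "1"}
print(apply_deepgemm_default(preset_env))  # prints 1
```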
To opt back into SGLang’s all-M DeepGEMM precompile behavior:
SGLANG_JIT_DEEPGEMM_PRECOMPILE=1 sgl-omni serve \
--config examples/configs/qwen3_omni_fp8_colocated.yaml \
--colocate \
--model-name qwen3-omni \
--port 8008
Image and Text Input#
Send an image with a text question to get both text and audio responses. Set "modalities": ["text", "audio"] to enable audio output.
cURL
curl -X POST http://localhost:8008/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-omni",
"messages": [{"role": "user", "content": "How many cars are there in the picture?"}],
"images": ["tests/data/cars.jpg"],
"modalities": ["text", "audio"],
"max_tokens": 16
}'
Python
import base64
import requests

resp = requests.post(
    "http://localhost:8008/v1/chat/completions",
    json={
        "model": "qwen3-omni",
        "messages": [{"role": "user", "content": "How many cars are there in the picture?"}],
        "images": ["tests/data/cars.jpg"],
        "modalities": ["text", "audio"],
        "max_tokens": 16,
    },
)
resp.raise_for_status()
result = resp.json()
choice = result["choices"][0]["message"]
print(choice["content"])

# The audio is returned base64-encoded; decode it before writing to disk.
audio_data = base64.b64decode(choice["audio"]["data"])
with open("output.wav", "wb") as f:
    f.write(audio_data)
Audio and Image Input#
Send an audio file with an image. The model hears the spoken question and sees the image, then responds with both text and audio.
cURL
curl -X POST http://localhost:8008/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-omni",
"messages": [{"role": "user", "content": ""}],
"images": ["tests/data/cars.jpg"],
"audios": ["tests/data/query_to_cars.wav"],
"modalities": ["text", "audio"],
"max_tokens": 16
}'
Python
import base64
import requests

resp = requests.post(
    "http://localhost:8008/v1/chat/completions",
    json={
        "model": "qwen3-omni",
        "messages": [{"role": "user", "content": ""}],
        "images": ["tests/data/cars.jpg"],
        "audios": ["tests/data/query_to_cars.wav"],
        "modalities": ["text", "audio"],
        "max_tokens": 16,
    },
)
resp.raise_for_status()
result = resp.json()
choice = result["choices"][0]["message"]
print(choice["content"])

# The audio is returned base64-encoded; decode it before writing to disk.
audio_data = base64.b64decode(choice["audio"]["data"])
with open("output.wav", "wb") as f:
    f.write(audio_data)
Video and Audio Input#
Send a video with a spoken audio question. The model watches the video, hears the question, and responds with both text and audio.
cURL
curl -X POST http://localhost:8008/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-omni",
"messages": [{"role": "user", "content": ""}],
"videos": ["tests/data/draw.mp4"],
"audios": ["tests/data/query_to_draw.wav"],
"modalities": ["text", "audio"],
"max_tokens": 16
}'
Python
import base64
import requests

resp = requests.post(
    "http://localhost:8008/v1/chat/completions",
    json={
        "model": "qwen3-omni",
        "messages": [{"role": "user", "content": ""}],
        "videos": ["tests/data/draw.mp4"],
        "audios": ["tests/data/query_to_draw.wav"],
        "modalities": ["text", "audio"],
        "max_tokens": 16,
    },
)
resp.raise_for_status()
result = resp.json()
choice = result["choices"][0]["message"]
print(choice["content"])

# The audio is returned base64-encoded; decode it before writing to disk.
audio_data = base64.b64decode(choice["audio"]["data"])
with open("output.wav", "wb") as f:
    f.write(audio_data)
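The speech-mode examples all decode the audio field of the response message the same way, so that step can be factored into a helper. This is a minimal sketch, assuming the decoded bytes are a PCM WAV file (the examples write them to output.wav, but the exact encoding is not stated here); save_audio is a hypothetical name. The demo runs against a synthetic message, so no server is needed.

```python
import base64
import io
import os
import tempfile
import wave
from typing import Optional


def save_audio(message: dict, path: str = "output.wav") -> Optional[float]:
    """Write the base64 audio from a response message; return duration in seconds.

    Returns None when the message carries no audio (e.g. text-only mode).
    """
    audio = message.get("audio")
    if not audio or not audio.get("data"):
        return None
    raw = base64.b64decode(audio["data"])
    with open(path, "wb") as f:
        f.write(raw)
    # Assumption: the payload is a WAV container that the stdlib can parse.
    with wave.open(io.BytesIO(raw), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


# Build one second of 16-bit mono silence to stand in for a server response.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(24000)
    w.writeframes(b"\x00\x00" * 24000)
fake_message = {
    "content": "Two cars.",
    "audio": {"data": base64.b64encode(buf.getvalue()).decode("ascii")},
}

demo_path = os.path.join(tempfile.gettempdir(), "demo_output.wav")
duration = save_audio(fake_message, demo_path)
```

With a live server, the call would be save_audio(result["choices"][0]["message"]), which also copes with text-only responses by returning None.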
Request Parameters#
The table below lists all parameters accepted by the /v1/chat/completions endpoint for Qwen3-Omni.
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | string | | Model identifier |
| messages | list | (required) | List of chat messages, each with role and content |
| modalities | list | | Output modalities: ["text"] or ["text", "audio"] |
| images | list | | List of image file paths (local paths or URLs) |
| audios | list | | List of audio file paths (local paths or URLs) |
| videos | list | | List of video file paths (local paths or URLs) |
| max_tokens | int | | Maximum number of tokens to generate |
| | float | | Sampling temperature |
| | float | | Top-p sampling |
| | int | | Top-k sampling |
| | float | | Repetition penalty |
| | int | | Random seed for reproducibility |
| | bool | | Enable streaming via SSE |
| | dict | | Audio output configuration |
| | dict | | Per-stage sampling overrides |
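
With streaming enabled, the response arrives as server-sent events (SSE). The sketch below assumes the stream follows the usual OpenAI-compatible convention (data: <json> lines carrying choices[0].delta.content, terminated by data: [DONE]); this guide does not spell out the chunk format, so verify it against your server. iter_text_deltas is a hypothetical helper, and the sample lines are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_text_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines (assumed format)."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        delta = choices[0].get("delta") or {}
        text = delta.get("content")
        if text:
            yield text


# Made-up sample stream; a real one would come from the server.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Two"}}]}',
    'data: {"choices": [{"delta": {"content": " cars."}}]}',
    "data: [DONE]",
]
print("".join(iter_text_deltas(sample)))  # prints Two cars.
```

Against a live server, the same function could consume resp.iter_lines(decode_unicode=True) from a requests.post(..., stream=True) call whose JSON body enables streaming (assuming the flag is named stream, as in the OpenAI API).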