Qwen3 TTS#
Qwen3-TTS-12Hz-Base is a discrete
multi-codebook text-to-speech model from the Qwen team. It performs fast voice cloning from a
short reference clip, supports 10 languages, and streams 24 kHz speech with low latency. The
12Hz in the name refers to the codec frame rate (12 acoustic frames per second), not the
playback sample rate. SGLang-Omni serves two checkpoints, 0.6B and 1.7B, through the same
preprocessing → tts_engine → vocoder pipeline and the OpenAI-compatible /v1/audio/speech
endpoint.
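The difference between the 12 Hz codec frame rate and the 24 kHz sample rate can be made concrete with a little arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and it assumes every codec frame covers a fixed number of output samples.

```python
import math

CODEC_FRAME_RATE_HZ = 12   # acoustic frames per second (the "12Hz" in the name)
SAMPLE_RATE_HZ = 24_000    # playback sample rate of the generated speech


def samples_per_frame(sample_rate=SAMPLE_RATE_HZ, frame_rate=CODEC_FRAME_RATE_HZ):
    """Number of PCM samples covered by one codec frame."""
    return sample_rate // frame_rate


def frames_for_duration(seconds, frame_rate=CODEC_FRAME_RATE_HZ):
    """Codec frames needed to cover `seconds` of audio, rounded up."""
    return math.ceil(seconds * frame_rate)


print(samples_per_frame())     # 2000 samples per codec frame
print(frames_for_duration(5))  # 60 frames for five seconds of speech
```

Under these assumptions, one second of speech corresponds to 12 codec frames but 24,000 PCM samples.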
Prerequisites#
Install sglang-omni by following Installation.
Qwen3-TTS Base uses the upstream qwen-tts package. Install it without
dependencies so the SGLang-Omni Transformers 5.6 / SGLang 0.5.12.post1 stack remains
in place:
apt-get update && apt-get install -y sox
uv pip install sox einops onnxruntime
uv pip install --no-deps qwen-tts==0.1.1
Do not install qwen-tts with dependencies here. Its declared dependency set can pull in a different Transformers/Torch stack than the SGLang-Omni runtime.
The Python sox package shells out to the system sox binary on some paths, so install both.
Download a checkpoint (both repositories are public, no token required):
hf download Qwen/Qwen3-TTS-12Hz-0.6B-Base
hf download Qwen/Qwen3-TTS-12Hz-1.7B-Base
Server Configuration#
The pipeline is preprocessing → tts_engine → vocoder. First startup can take several minutes
while the tts_engine captures CUDA graphs.
# 0.6B
sgl-omni serve \
--model-path Qwen/Qwen3-TTS-12Hz-0.6B-Base \
--config examples/configs/qwen3_tts_0_6b.yaml \
--port 8000
# 1.7B
sgl-omni serve \
--model-path Qwen/Qwen3-TTS-12Hz-1.7B-Base \
--config examples/configs/qwen3_tts_1_7b.yaml \
--port 8000
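Because first startup can take several minutes, a client script may want to wait before sending requests. The following is a minimal sketch that polls the TCP port with the standard library; wait_for_port is a hypothetical helper, not part of SGLang-Omni. An open port only shows that the server process is listening, which is assumed here to be a reasonable proxy for readiness.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=600.0, interval_s=2.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Refused or timed out: wait and retry until the deadline.
            time.sleep(interval_s)
    return False


if __name__ == "__main__":
    if wait_for_port("localhost", 8000):
        print("port 8000 is accepting connections")
    else:
        print("timed out waiting for the server")
```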
Synthesizing Speech#
Text-only Requests#
Qwen3-TTS Base checkpoints require a reference clip. Text-only requests are supported by CustomVoice and VoiceDesign checkpoints; see TTS Model Usage for those launch commands.
Voice Cloning#
The references field accepts audio_path (a local path or HTTP URL) and text (the
transcript of that clip). Supplying the transcript enables in-context-learning (ICL) mode and
materially improves cloning quality; omitting it falls back to speaker-embedding (x-vector)
mode.
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"input": "SGLang-Omni is a great project!",
"references": [{
"audio_path": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
"text": "We asked over twenty different people, and they all said it was his."
}]
}' \
--output output.wav
ref_audio and ref_text are accepted as shorthand for references[0].audio_path and
references[0].text.
Python#
import requests

# Voice-cloning request: a reference clip plus its transcript (ICL mode).
resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "input": "Get the trust fund to the bank early.",
        "references": [{
            "audio_path": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
            "text": "We asked over twenty different people, and they all said it was his.",
        }],
    },
)
resp.raise_for_status()

# The response body is the synthesized audio.
with open("output.wav", "wb") as f:
    f.write(resp.content)
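The two request shapes described in this section (the references list and the ref_audio / ref_text shorthand) can be produced by one small builder. This is a sketch: build_clone_payload is a hypothetical helper, and it assumes the server treats a missing transcript as a request for x-vector mode, as described above.

```python
def build_clone_payload(text, ref_audio, ref_text=None, shorthand=False):
    """Build a /v1/audio/speech JSON body for voice cloning.

    With ref_text the request uses ICL mode; without it, x-vector mode.
    """
    payload = {"input": text}
    if shorthand:
        # Shorthand for references[0].audio_path and references[0].text.
        payload["ref_audio"] = ref_audio
        if ref_text is not None:
            payload["ref_text"] = ref_text
    else:
        reference = {"audio_path": ref_audio}
        if ref_text is not None:
            reference["text"] = ref_text
        payload["references"] = [reference]
    return payload


print(build_clone_payload("Hello there.", "clip.wav", "Reference transcript."))
print(build_clone_payload("Hello there.", "clip.wav", shorthand=True))
```

The returned dictionary can be passed as the json argument of requests.post.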
Language Hint#
language biases the model toward a target language. It defaults to auto (let the model
detect). Supported languages are Chinese, English, Japanese, Korean, German, French, Russian,
Portuguese, Spanish, and Italian.
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"input": "今天天气不错，就该出去晒晒太阳。",
"references": [{
"audio_path": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
"text": "We asked over twenty different people, and they all said it was his."
}],
"language": "Chinese"
}' \
--output output.wav
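A client can check the language value before sending a request. The sketch below uses the ten languages listed above; normalize_language is a hypothetical client-side helper, and it assumes the server expects the capitalized English names used in the example ("Chinese") and the literal auto for detection.

```python
SUPPORTED_LANGUAGES = (
    "Chinese", "English", "Japanese", "Korean", "German",
    "French", "Russian", "Portuguese", "Spanish", "Italian",
)


def normalize_language(language=None):
    """Return a canonical language name, or "auto" when none is given."""
    if language is None or language.strip().lower() == "auto":
        return "auto"
    wanted = language.strip().lower()
    for name in SUPPORTED_LANGUAGES:
        if name.lower() == wanted:
            return name
    raise ValueError(f"unsupported language: {language!r}")


print(normalize_language("chinese"))  # Chinese
print(normalize_language())           # auto
```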
Streaming#
Set "stream": true and "response_format": "pcm" to receive raw PCM audio
chunks in real time:
curl -N -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"input": "Get the trust fund to the bank early.",
"references": [{
"audio_path": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
"text": "We asked over twenty different people, and they all said it was his."
}],
"stream": true,
"response_format": "pcm"
}' \
--output output.pcm
Streaming returns audio/pcm 16-bit mono PCM bytes with sample-rate metadata in
the response headers. See the Higgs TTS cookbook
for a full Python raw PCM consumer.
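Raw PCM has no header, so most players need it wrapped in a container first. Here is a minimal sketch that collects the streamed chunks and writes a WAV file. It assumes a 24 kHz sample rate (taken from the model description) and little-endian samples; the authoritative sample rate is the one reported in the response headers. Both function names are hypothetical.

```python
import io
import wave


def pcm16_mono_to_wav(pcm_bytes, sample_rate=24_000):
    """Wrap raw 16-bit mono PCM bytes in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)          # mono
        wav.setsampwidth(2)          # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)
    return buf.getvalue()


def stream_to_wav(url, payload, out_path, sample_rate=24_000):
    """POST a streaming request and save the received PCM as a WAV file."""
    import requests  # third-party; imported here so the helper above stays stdlib-only

    chunks = []
    with requests.post(url, json=payload, stream=True, timeout=300) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=4096):
            if chunk:
                chunks.append(chunk)
    with open(out_path, "wb") as f:
        f.write(pcm16_mono_to_wav(b"".join(chunks), sample_rate))
```

The payload passed to stream_to_wav should set "stream": true and "response_format": "pcm", as in the curl example above.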
Generation Parameters#
| Parameter | Default | Notes |
|---|---|---|
| input | (required) | Text to synthesize |
| references | | Reference clip for cloning; each item has audio_path and text |
| ref_audio / ref_text | | Shorthand for references[0].audio_path / references[0].text |
| language | auto | Target-language hint (see list above) |
| | | Sampling temperature |
| | | Top-p sampling |
| | | Top-k sampling |
| repetition_penalty | 1.05 | Repetition penalty |
| max_new_tokens | | Maximum number of generated codec tokens |
| | | Random seed for reproducibility |
| stream | | Stream raw PCM audio chunks |
Model Variants#
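| Checkpoint | Parameters | Config |
|---|---|---|
| Qwen/Qwen3-TTS-12Hz-0.6B-Base | 0.6B | examples/configs/qwen3_tts_0_6b.yaml |
| Qwen/Qwen3-TTS-12Hz-1.7B-Base | 1.7B | examples/configs/qwen3_tts_1_7b.yaml |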
Both expose an identical request API. The 1.7B model has higher capacity (typically better quality) at a larger memory and latency cost; the 0.6B model is lighter and faster.
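Since the two variants differ only in checkpoint and config, a launcher script can select between them with a lookup table. The sketch below rebuilds the serve commands from the Server Configuration section; VARIANTS and serve_argv are hypothetical names used for illustration.

```python
VARIANTS = {
    "0.6B": ("Qwen/Qwen3-TTS-12Hz-0.6B-Base", "examples/configs/qwen3_tts_0_6b.yaml"),
    "1.7B": ("Qwen/Qwen3-TTS-12Hz-1.7B-Base", "examples/configs/qwen3_tts_1_7b.yaml"),
}


def serve_argv(variant, port=8000):
    """Return the sgl-omni serve command for a variant as an argument list."""
    model_path, config = VARIANTS[variant]
    return [
        "sgl-omni", "serve",
        "--model-path", model_path,
        "--config", config,
        "--port", str(port),
    ]


print(" ".join(serve_argv("1.7B")))
```

The list can be passed directly to subprocess.run(serve_argv("0.6B"), check=True).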
Benchmark Results#
Qwen3-TTS-12Hz-0.6B-Base on Seed-TTS EN (1088 utterances, reference voice cloning from each prompt), concurrency 16, WER scored with HF Whisper-large-v3. Hardware: 1× H200 SXM.
| Metric | Value |
|---|---|
| WER (corpus, excl. runaway outliers) | 1.07% |
| WER (per-sample median / p95) | 0.00% / 9.09% |
| WER (corpus micro-avg, raw) | 18.29% |
| Runaway samples (>50% WER) | 2 / 1088 (0.2%) |
| Latency mean / median (s) | 6.61 / 6.24 |
| RTF mean / median | 1.51 / 1.48 |
| Output throughput (tok/s) | 115.4 |
| Completed / failed requests | 1088 / 0 |
Typical output is clean (0.00% median WER, 9.09% p95). Two utterances (0.2%) ran away into a
repetition loop and generated ~164 s of looping audio up to max_new_tokens, which alone lifts
the raw micro-average to 18.29%; excluding those, corpus WER is 1.07%. RTF > 1 reflects the
0.6B codec pipeline at concurrency 16, not single-stream latency. The 1.7B checkpoint trades
latency for quality.
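The gap between the raw micro-average and the median follows from how corpus WER is computed: errors are summed over all utterances before dividing, so a few outputs with many inserted words dominate the total. The sketch below illustrates this with made-up numbers (100 hypothetical utterances, not the benchmark data); the function names are illustrative.

```python
from statistics import median


def corpus_wer(samples):
    """Micro-averaged WER: total word errors over total reference words."""
    errors = sum(e for e, _ in samples)
    words = sum(n for _, n in samples)
    return errors / words


def per_sample_wers(samples):
    """WER of each utterance; can exceed 1.0 because insertions count as errors."""
    return [e / n for e, n in samples]


# Hypothetical data as (word_errors, reference_words) pairs.
clean = [(0, 10)] * 98      # 98 perfectly transcribed utterances
runaway = [(150, 10)] * 2   # 2 looping outputs with many inserted words
all_samples = clean + runaway

print(corpus_wer(all_samples))               # 0.3
print(median(per_sample_wers(all_samples)))  # 0.0
kept = [s for s in all_samples if s[0] / s[1] <= 0.5]
print(corpus_wer(kept))                      # 0.0
```

Two outliers in a hundred push the micro-average to 30% here while the median stays at zero, the same pattern as in the table above.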
Known Limitations#
Reference audio recommended. As a cloning model, Qwen3-TTS Base produces robotic speech without a reference clip.
Transcript improves cloning. Providing text in references (ICL mode) yields better speaker similarity than speaker-embedding-only (x-vector) mode.
Language detection. language: auto may misdetect short or code-switched inputs; set language explicitly when you know the target language.
Rare runaway generation. Roughly 0.2% of utterances (observed on the 0.6B checkpoint) can fall into a repetition loop and keep generating up to max_new_tokens. Raising repetition_penalty (default 1.05) or lowering max_new_tokens mitigates it; the 1.7B checkpoint is less prone.
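One way to apply the runaway mitigation is to cap max_new_tokens relative to the expected length of the utterance. The sketch below assumes one generated codec token per 12 Hz frame, which may not match how the server counts tokens for a multi-codebook model. The helper names, the 1.1 penalty, and the 2x headroom are illustrative choices, not recommended values.

```python
import math

CODEC_FRAME_RATE_HZ = 12  # codec frames per second


def max_audio_seconds(max_new_tokens, frame_rate=CODEC_FRAME_RATE_HZ):
    """Upper bound on audio length, assuming one codec token per frame."""
    return max_new_tokens / frame_rate


def guarded_payload(base_payload, expected_seconds,
                    repetition_penalty=1.1, headroom=2.0,
                    frame_rate=CODEC_FRAME_RATE_HZ):
    """Copy a request body and add a raised penalty and a length cap."""
    payload = dict(base_payload)  # leave the caller's dict untouched
    payload["repetition_penalty"] = repetition_penalty
    payload["max_new_tokens"] = math.ceil(expected_seconds * headroom * frame_rate)
    return payload


request = guarded_payload({"input": "Get the trust fund to the bank early."}, 5)
print(request["max_new_tokens"])                     # 120
print(max_audio_seconds(request["max_new_tokens"]))  # 10.0
```

With this cap, a looping generation stops after roughly ten seconds of audio instead of running to the server default.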