TTS Model Usage
This guide uses Fish Speech S2-Pro as an example TTS (text-to-speech) model with SGLang-Omni and the OpenAI-compatible API. The same /v1/audio/speech endpoint also supports Voxtral TTS and Qwen3-TTS.
Prerequisites
Install sglang-omni by following Installation, then download the model:
hf download fishaudio/s2-pro
Qwen3-TTS uses the upstream qwen-tts package, which currently requires
Transformers 4.57.3. Install it only in environments that serve Qwen3-TTS:
uv pip install --upgrade transformers==4.57.3 accelerate==1.12.0 sox einops
uv pip install --no-deps qwen-tts==0.1.1
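Because these pins apply only to environments that serve Qwen3-TTS, a quick check before launching can catch a mismatched install. The sketch below is one way to do that with the standard library; the `PINS` dictionary and the `find_pin_mismatches` helper are illustrative names and not part of sglang-omni.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the install commands above.
PINS = {
    "transformers": "4.57.3",
    "accelerate": "1.12.0",
    "qwen-tts": "0.1.1",
}


def find_pin_mismatches(pins, get_version=version):
    """Return {package: installed_version_or_None} for every pin that is not met."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != wanted:
            mismatches[name] = installed
    return mismatches
```

Calling `find_pin_mismatches(PINS)` returns an empty dictionary when the environment matches the pins, and otherwise maps each offending package to the version that was actually found.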
Supported TTS Models
| Model family | Example config | Request notes |
|---|---|---|
| Fish Speech S2-Pro | | Supports plain TTS and voice cloning with |
| | | Uses |
| | | Requires reference audio through |
| Qwen3-TTS CustomVoice | | Text-only requests use the checkpoint speaker table; missing |
| Qwen3-TTS VoiceDesign | | Requires |
| | | Voice cloning via |
Launch the Server
sgl-omni serve \
  --model-path fishaudio/s2-pro \
  --config examples/configs/s2pro_tts.yaml \
  --port 8000
For Voxtral:
sgl-omni serve \
  --model-path mistralai/Voxtral-4B-TTS-2603 \
  --config examples/configs/voxtral_tts.yaml \
  --port 8000
For Qwen3-TTS Base:
sgl-omni serve \
  --model-path Qwen/Qwen3-TTS-12Hz-0.6B-Base \
  --config examples/configs/qwen3_tts_0_6b.yaml \
  --port 8000
For Qwen3-TTS CustomVoice:
sgl-omni serve \
  --model-path Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice \
  --config examples/configs/qwen3_tts_0_6b_customvoice.yaml \
  --port 8000
For Qwen3-TTS VoiceDesign:
sgl-omni serve \
  --model-path Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign \
  --config examples/configs/qwen3_tts_1_7b_voicedesign.yaml \
  --port 8000
For MOSS-TTS:
sgl-omni serve \
  --model-path OpenMOSS-Team/MOSS-TTS-v1.5 \
  --config examples/configs/moss_tts.yaml \
  --port 8000
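The launch commands above differ only in the model path and the config file. A small lookup table can keep each pair together; the sketch below builds the same `sgl-omni serve` argument list from a short key. The dictionary keys and the `serve_command` helper are hypothetical names chosen for illustration, while the paths are copied from the commands above.

```python
# (model path, config file) pairs copied from the launch commands above.
LAUNCH_TARGETS = {
    "s2pro": ("fishaudio/s2-pro", "examples/configs/s2pro_tts.yaml"),
    "voxtral": ("mistralai/Voxtral-4B-TTS-2603", "examples/configs/voxtral_tts.yaml"),
    "qwen3-base": (
        "Qwen/Qwen3-TTS-12Hz-0.6B-Base",
        "examples/configs/qwen3_tts_0_6b.yaml",
    ),
    "qwen3-customvoice": (
        "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice",
        "examples/configs/qwen3_tts_0_6b_customvoice.yaml",
    ),
    "qwen3-voicedesign": (
        "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
        "examples/configs/qwen3_tts_1_7b_voicedesign.yaml",
    ),
    "moss": ("OpenMOSS-Team/MOSS-TTS-v1.5", "examples/configs/moss_tts.yaml"),
}


def serve_command(target, port=8000):
    """Return the argv list for launching one of the targets above."""
    model_path, config = LAUNCH_TARGETS[target]
    return [
        "sgl-omni", "serve",
        "--model-path", model_path,
        "--config", config,
        "--port", str(port),
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)`, or rendered as a single shell line with `shlex.join(...)` for copy and paste.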
Use Curl
Generate speech from text without any reference audio. This is valid for Qwen3-TTS CustomVoice, Voxtral, and S2-Pro. It is not valid for Qwen3-TTS Base.
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello, how are you?"}' \
  --output output.wav
Qwen3-TTS Base requires reference audio:
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Get the trust fund to the bank early.",
    "ref_audio": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
    "ref_text": "We asked over twenty different people, and they all said it was his."
  }' \
  --output output.wav
Qwen3-TTS VoiceDesign uses text plus voice instructions:
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, how are you?",
    "task_type": "VoiceDesign",
    "instructions": "A warm, natural young adult voice."
  }' \
  --output output.wav
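The two Qwen3-TTS request shapes above translate directly to Python. The sketch below builds the JSON bodies and posts them using only the standard library; the helper names (`base_clone_payload`, `voice_design_payload`, `post_speech`) are illustrative, and the field names are the ones used in the curl examples.

```python
import json
import urllib.request


def base_clone_payload(text, ref_audio, ref_text):
    """Body for Qwen3-TTS Base, which needs reference audio plus its transcript."""
    return {"input": text, "ref_audio": ref_audio, "ref_text": ref_text}


def voice_design_payload(text, instructions):
    """Body for Qwen3-TTS VoiceDesign: text plus a description of the voice."""
    return {"input": text, "task_type": "VoiceDesign", "instructions": instructions}


def post_speech(payload, base_url="http://localhost:8000", timeout=600):
    """POST a payload to /v1/audio/speech and return the raw audio bytes.

    urllib raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    req = urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()
```

With a VoiceDesign server running on port 8000, writing the bytes returned by `post_speech(voice_design_payload("Hello, how are you?", "A warm, natural young adult voice."))` to `output.wav` mirrors the curl command above.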
For natural-sounding Fish Speech S2-Pro results, use Voice Cloning with a reference audio clip.
Voice Cloning
The examples below use a sample clip from seed-tts-eval-mini. Each entry in the references list takes audio_path (a local path or HTTP URL) and text (the transcript of that audio).
Non-streaming request
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Get the trust fund to the bank early.",
    "references": [{
      "audio_path": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
      "text": "We asked over twenty different people, and they all said it was his."
    }]
  }' \
  --output output.wav
Streaming
Enable streaming to receive audio chunks in real time via Server-Sent Events (SSE). Set "stream": true:
curl -N -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Get the trust fund to the bank early.",
    "references": [{
      "audio_path": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
      "text": "We asked over twenty different people, and they all said it was his."
    }],
    "stream": true
  }'
The server returns a stream of SSE events. Each event contains an audio.speech.chunk object with a base64-encoded audio chunk. The stream ends with data: [DONE].
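To make the event format concrete, the sketch below decodes a single SSE line into raw audio bytes. It assumes the base64 payload sits under the audio key, then data, which is how the Python streaming example in this guide reads it. The `object` key in the synthetic event and the helper name `decode_sse_line` are assumptions for illustration, not confirmed details of the server's schema.

```python
import base64
import json


def decode_sse_line(line):
    """Return the audio bytes carried by one SSE data line, or None otherwise."""
    if not line.startswith("data:"):
        return None  # blank keep-alives, comments, and other SSE fields
    data = line[len("data:"):].strip()
    if data == "[DONE]":
        return None  # end-of-stream sentinel
    event = json.loads(data)
    b64 = (event.get("audio") or {}).get("data")
    return base64.b64decode(b64) if b64 else None


# Synthetic event for illustration only; real events come from the server.
sample = "data: " + json.dumps(
    {
        "object": "audio.speech.chunk",
        "audio": {"data": base64.b64encode(b"RIFF").decode("ascii")},
    }
)
```

Feeding `sample` to `decode_sse_line` yields the four bytes that were encoded, while the `[DONE]` sentinel and blank lines yield `None`, so a caller can simply skip `None` results.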
Use Python
Basic TTS
This no-reference request applies to Fish Speech S2-Pro, Voxtral TTS, and Qwen3-TTS CustomVoice. Qwen3-TTS Base requires reference audio.
import requests

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={"input": "Hello, how are you?"},
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)
Voice Cloning
REFERENCE_AUDIO = "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav"
REFERENCE_TEXT = "We asked over twenty different people, and they all said it was his."
SPEECH_INPUT = "Get the trust fund to the bank early."
Non-streaming Request
import requests

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "input": SPEECH_INPUT,
        "references": [{"audio_path": REFERENCE_AUDIO, "text": REFERENCE_TEXT}],
    },
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)
Streaming Request
import base64, io, json, wave
import requests

payload = {
    "input": SPEECH_INPUT,
    "references": [{"audio_path": REFERENCE_AUDIO, "text": REFERENCE_TEXT}],
    "stream": True,
    "response_format": "wav",
}
chunks = []  # PCM frames from every chunk, in arrival order
fmt = None   # (channels, sample width, frame rate) from the first chunk
with requests.post(
    "http://localhost:8000/v1/audio/speech",
    json=payload,
    stream=True,
    timeout=600,
) as stream:
    stream.raise_for_status()
    for line in stream.iter_lines(decode_unicode=True):
        # Skip blank lines and anything that is not an SSE data field.
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data:"):].lstrip()
        if data == "[DONE]":  # end-of-stream sentinel
            break
        b64 = (json.loads(data).get("audio") or {}).get("data")
        if not b64:
            continue
        # Each chunk is parsed as its own WAV: keep the frames, drop the header.
        with wave.open(io.BytesIO(base64.b64decode(b64)), "rb") as w:
            if fmt is None:
                fmt = w.getnchannels(), w.getsampwidth(), w.getframerate()
            chunks.append(w.readframes(w.getnframes()))
assert fmt, "no audio chunks received"
nc, sw, fr = fmt
# Write all collected frames back out under a single WAV header.
with wave.open("output_stream.wav", "wb") as w:
    w.setnchannels(nc)
    w.setsampwidth(sw)
    w.setframerate(fr)
    w.writeframes(b"".join(chunks))
Request Parameters
The table below lists all parameters accepted by the /v1/audio/speech endpoint.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | string | (required) | Text to synthesize |
| | string | | Voice identifier |
| | string | | Output audio format |
| | float | | Playback speed multiplier |
| | bool | | Enable streaming via SSE |
| | list | | Reference audio for voice cloning; each item has |
| | string | | Reference audio path / URL / base64 string; equivalent to |
| | string | | Transcript for |
| | string | | Model-specific language hint; Qwen3-TTS Base defaults to |
| | string | | Qwen3-TTS task type: |
| | string | | Qwen3-TTS style or VoiceDesign instructions |
| | int | | Maximum number of generated tokens |
| | float | | Sampling temperature |
| | float | | Top-p sampling |
| | int | | Top-k sampling |
| | float | | Repetition penalty |
| | int | | Model-specific; Qwen3-TTS Base accepts request-scoped seed, Voxtral TTS currently rejects seed |
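The table lists a reference-audio parameter that accepts a path, a URL, or a base64 string, so a local clip can in principle be inlined in the request body. The sketch below shows one way to build such a body. It rests on two assumptions this guide does not confirm: that the parameter in question is `ref_audio` (as used in the curl example for Qwen3-TTS Base), and that the server expects plain standard base64 of the file bytes with no `data:` URI prefix. The helper names are hypothetical.

```python
import base64
import io
import wave


def make_silent_wav(seconds=0.1, rate=16000):
    """Build a tiny mono 16-bit WAV in memory, standing in for a real clip."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(round(seconds * rate)))
    return buf.getvalue()


def inline_reference_payload(text, wav_bytes, ref_text):
    """Request body with the reference clip embedded as base64 (assumed format)."""
    return {
        "input": text,
        "ref_audio": base64.b64encode(wav_bytes).decode("ascii"),  # assumption: no prefix
        "ref_text": ref_text,
    }
```

In practice `wav_bytes` would come from reading a real recording, for example with `pathlib.Path("clip.wav").read_bytes()`; the silent clip exists only so the sketch runs without any audio file on disk.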
H200 SeedTTS Benchmark Commands
Download the full SeedTTS set first:
python -m benchmarks.dataset.prepare --dataset seedtts
Run the EN and ZH benchmarks after launching the target model's server on port 8000. Do not add benchmark results to the docs until the full H200 runs complete.
python -m benchmarks.eval.benchmark_tts_seedtts \
  --meta zhaochenyang20/seed-tts-eval-arrow \
  --model Qwen/Qwen3-TTS-12Hz-0.6B-Base \
  --port 8000 \
  --output-dir results/qwen3_tts_0_6b_en \
  --lang en \
  --max-concurrency 16
python -m benchmarks.eval.benchmark_tts_seedtts \
  --meta zhaochenyang20/seed-tts-eval-arrow \
  --model Qwen/Qwen3-TTS-12Hz-0.6B-Base \
  --port 8000 \
  --output-dir results/qwen3_tts_0_6b_zh \
  --lang zh \
  --max-concurrency 16
python -m benchmarks.eval.benchmark_tts_seedtts \
  --meta zhaochenyang20/seed-tts-eval-arrow \
  --model Qwen/Qwen3-TTS-12Hz-1.7B-Base \
  --port 8000 \
  --output-dir results/qwen3_tts_1_7b_en \
  --lang en \
  --max-concurrency 16
python -m benchmarks.eval.benchmark_tts_seedtts \
  --meta zhaochenyang20/seed-tts-eval-arrow \
  --model Qwen/Qwen3-TTS-12Hz-1.7B-Base \
  --port 8000 \
  --output-dir results/qwen3_tts_1_7b_zh \
  --lang zh \
  --max-concurrency 16
python -m benchmarks.eval.benchmark_tts_seedtts \
  --meta zhaochenyang20/seed-tts-eval-arrow \
  --model mistralai/Voxtral-4B-TTS-2603 \
  --port 8000 \
  --output-dir results/voxtral_en \
  --lang en \
  --max-new-tokens 4096 \
  --max-concurrency 16 \
  --no-ref-audio \
  --voice cheerful_female
python -m benchmarks.eval.benchmark_tts_seedtts \
  --meta zhaochenyang20/seed-tts-eval-arrow \
  --model mistralai/Voxtral-4B-TTS-2603 \
  --port 8000 \
  --output-dir results/voxtral_zh \
  --lang zh \
  --max-new-tokens 4096 \
  --max-concurrency 16 \
  --no-ref-audio \
  --voice cheerful_female
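The six commands above form a model-by-language matrix with a few Voxtral-specific flags. The sketch below generates equivalent argument lists from a small table, which can make it harder to miss a run. `BASE`, `RUNS`, and `benchmark_commands` are illustrative names; every flag and value is copied from the commands above, although the Voxtral-specific flags are appended at the end rather than interleaved.

```python
BASE = [
    "python", "-m", "benchmarks.eval.benchmark_tts_seedtts",
    "--meta", "zhaochenyang20/seed-tts-eval-arrow",
]

# (model, output-dir prefix, extra flags) copied from the commands above.
RUNS = [
    ("Qwen/Qwen3-TTS-12Hz-0.6B-Base", "qwen3_tts_0_6b", []),
    ("Qwen/Qwen3-TTS-12Hz-1.7B-Base", "qwen3_tts_1_7b", []),
    (
        "mistralai/Voxtral-4B-TTS-2603",
        "voxtral",
        ["--max-new-tokens", "4096", "--no-ref-audio", "--voice", "cheerful_female"],
    ),
]


def benchmark_commands(port=8000, concurrency=16, langs=("en", "zh")):
    """Return one argv list per (model, language) pair."""
    commands = []
    for model, prefix, extra in RUNS:
        for lang in langs:
            commands.append(
                BASE
                + [
                    "--model", model,
                    "--port", str(port),
                    "--output-dir", f"results/{prefix}_{lang}",
                    "--lang", lang,
                    "--max-concurrency", str(concurrency),
                ]
                + extra
            )
    return commands
```

Each pair of commands still needs the matching model's server running on the chosen port, so the lists are best executed one model at a time rather than in a single loop.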
Interactive Playground
SGLang-Omni ships with a Gradio-based playground for interactive TTS experimentation:
./playground/s2pro/start.sh
The playground exposes two demo modes against the same S2-Pro backend:
- Non-Streaming starts a standard request and shows the final WAV after generation finishes.
- Streaming consumes the /v1/audio/speech SSE stream, starts playback from incremental WAV chunks, and also writes a final combined WAV artifact for inspection.
The launcher starts the backend first, waits for /health, then starts the Gradio UI with:
python -m playground.s2pro.app --api-base http://localhost:8000
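The wait-for-health step can be approximated outside the launcher as well. The following is a minimal sketch rather than the launcher's actual logic: it polls /health until it gets an HTTP 200 or a timeout expires. The function names, the rule that a 200 response means ready, and the timeout values are all assumptions.

```python
import time
import urllib.error
import urllib.request


def health_ok(base_url="http://localhost:8000"):
    """True if GET /health answers with HTTP 200; False on any error."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # server not up yet, or it answered with an error status


def wait_until(probe, timeout=300.0, interval=2.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or `timeout` seconds have passed."""
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A script could call `wait_until(health_ok)` after starting the backend and only launch the Gradio app when it returns `True`. The `sleep` and `clock` parameters exist so the loop can be exercised without real delays.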
A demo video is available here. We highly recommend the playground, since audio output is hard to inspect from the command line.