Voxtral TTS#
Voxtral-4B-TTS is an open-weights
text-to-speech model from Mistral AI built on a Ministral-3B backbone. It generates lifelike
24 kHz speech with natural prosody across 9 languages and ships with a set of preset named
voices. In SGLang-Omni, Voxtral runs as a preprocessing → tts_generation → vocoder pipeline
and is served through the OpenAI-compatible /v1/audio/speech endpoint.
Prerequisites#
Install sglang-omni by following Installation, then install the Voxtral-specific tokenizer and download the model:
# Voxtral preprocessing uses Mistral's Tekken tokenizer from mistral-common.
uv pip install 'mistral-common[audio]>=1.8.0'
hf download mistralai/Voxtral-4B-TTS-2603
The model repository is public, so no Hugging Face token is required.
Server Configuration#
The pipeline is preprocessing → tts_generation → vocoder.
sgl-omni serve \
--model-path mistralai/Voxtral-4B-TTS-2603 \
--config examples/configs/voxtral_tts.yaml \
--port 8000
Synthesizing Speech#
Zero-shot#
With no voice specified, Voxtral falls back to its default voice (cheerful_female).
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"input": "SGLang-Omni is a great project!"}' \
--output output.wav
Named Voices#
Voxtral speaks with preset named voices (it does not clone from a reference clip). Select
one with the voice field:
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"input": "Get the trust fund to the bank early.",
"voice": "casual_male",
"max_new_tokens": 4096
}' \
--output output.wav
The available voices ship inside the checkpoint as voice_embedding/*.pt files. List them
from your downloaded snapshot:
ls "$(hf download mistralai/Voxtral-4B-TTS-2603)/voice_embedding"
Python#
import requests
resp = requests.post(
"http://localhost:8000/v1/audio/speech",
json={
"input": "Get the trust fund to the bank early.",
"voice": "casual_male",
"max_new_tokens": 4096,
},
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
f.write(resp.content)
Streaming#
Set "stream": true to receive audio chunks in real time over Server-Sent Events (SSE):
curl -N -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"input": "Get the trust fund to the bank early.",
"voice": "casual_male",
"stream": true
}'
Each event carries a base64-encoded audio chunk; the stream ends with data: [DONE]. See the
Higgs TTS cookbook for a full Python SSE consumer.
Request Parameters#
Parameter |
Default |
Notes |
|---|---|---|
|
(required) |
Text to synthesize |
|
|
Preset voice name from the checkpoint’s |
|
|
Maximum number of generated acoustic tokens |
|
|
Output container |
|
|
Stream audio chunks over SSE |
Voxtral generation is deterministic: the engine fixes
temperatureto0.0, so sampling parameters such astop_p,top_k, andtemperatureare not used. Reference-clip voice cloning (references) is not supported for Voxtral — use a presetvoiceinstead.
Benchmark Results#
Seed-TTS EN (full set, 1088 utterances), bf16, max_new_tokens=4096,
--no-ref-audio --voice cheerful_female, concurrency 16, WER scored with HF
Whisper-large-v3. Hardware: 1× H200 SXM.
Metric |
Value |
|---|---|
WER (corpus micro-avg) |
1.20% |
WER (per-sample mean / median) |
1.22% / 0.00% |
WER (per-sample p95 / max) |
9.09% / 42.86% |
>50% WER samples |
0 / 1088 |
Latency mean / median (s) |
2.94 / 2.86 |
Latency p95 / p99 (s) |
4.56 / 5.37 |
RTF mean / median |
0.519 / 0.541 |
Output throughput (tok/s) |
383.7 |
Throughput (req/s) |
5.40 |
Completed / failed requests |
1088 / 0 |
Reproduce with the SeedTTS command in our seedTTS benchmark. The Voxtral model card also quotes ~70 ms first-audio latency at concurrency 1; the table above is a throughput-oriented run at concurrency 16, so its RTF reflects batched load rather than the latency-optimized single-stream figure. Output is 24 kHz.
Known Limitations#
Preset voices only. Voxtral selects from named voices baked into the checkpoint; it does not clone an arbitrary speaker from a reference clip in this engine.
Deterministic decoding.
temperatureis fixed at0.0; you cannot trade determinism for diversity through sampling parameters.Language coverage. Quality is tuned for the 9 supported languages (English, French, Spanish, German, Italian, Portuguese, Dutch, Arabic, Hindi).
Non-commercial license. The weights are CC BY-NC 4.0; commercial use is not permitted.