Voxtral TTS

Voxtral-4B-TTS is an open-weights text-to-speech model from Mistral AI built on a Ministral-3B backbone. It generates lifelike 24 kHz speech with natural prosody across 9 languages and ships with a set of preset named voices. In SGLang-Omni, Voxtral runs as a preprocessing → tts_generation → vocoder pipeline and is served through the OpenAI-compatible /v1/audio/speech endpoint.

Prerequisites

Install sglang-omni by following the Installation guide, then install the Voxtral-specific tokenizer and download the model:

# Voxtral preprocessing uses Mistral's Tekken tokenizer from mistral-common.
uv pip install 'mistral-common[audio]>=1.8.0'

hf download mistralai/Voxtral-4B-TTS-2603

The model repository is public, so no Hugging Face token is required.

Server Configuration

Launch the server with the Voxtral config; the pipeline stages are preprocessing → tts_generation → vocoder:

sgl-omni serve \
  --model-path mistralai/Voxtral-4B-TTS-2603 \
  --config examples/configs/voxtral_tts.yaml \
  --port 8000

Synthesizing Speech

Zero-shot

With no voice specified, Voxtral falls back to its default voice (cheerful_female).

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "SGLang-Omni is a great project!"}' \
  --output output.wav

Named Voices

Voxtral speaks with preset named voices (it does not clone from a reference clip). Select one with the voice field:

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Get the trust fund to the bank early.",
    "voice": "casual_male",
    "max_new_tokens": 4096
  }' \
  --output output.wav

The available voices ship inside the checkpoint as voice_embedding/*.pt files. List them from your downloaded snapshot:

ls "$(hf download mistralai/Voxtral-4B-TTS-2603)/voice_embedding"
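The same listing can be done from Python. This is a minimal sketch that relies only on the layout described above (one voice_embedding/<name>.pt file per voice). The helper name list_voices is made up for illustration, and snapshot_dir is whatever local path hf download printed. Treating the file stem as the value for the voice field is an assumption based on the parameter description, not something verified against the server.

```python
from pathlib import Path


def list_voices(snapshot_dir: str) -> list[str]:
    """Return preset voice names found under <snapshot_dir>/voice_embedding.

    Assumes each voice is stored as voice_embedding/<name>.pt.
    """
    voice_dir = Path(snapshot_dir) / "voice_embedding"
    if not voice_dir.is_dir():
        raise FileNotFoundError(f"no voice_embedding directory in {snapshot_dir}")
    # ASSUMPTION: the file stem (name without .pt) is what the "voice" field expects.
    return sorted(p.stem for p in voice_dir.glob("*.pt"))
```

Files in the directory that do not end in .pt are ignored, so stray metadata does not show up as a voice name.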

Python

import requests

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "input": "Get the trust fund to the bank early.",
        "voice": "casual_male",
        "max_new_tokens": 4096,
    },
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)
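To sanity-check a non-streaming response, the returned bytes can be inspected with the standard-library wave module. This is a sketch under the assumption that the server returns an uncompressed PCM WAV, which is the only kind wave can read; the helper name wav_info is illustrative.

```python
import io
import wave


def wav_info(data: bytes) -> tuple[int, int, float]:
    """Return (sample_rate_hz, channels, duration_seconds) for WAV bytes.

    Assumes uncompressed PCM; wave.open raises wave.Error otherwise.
    """
    with wave.open(io.BytesIO(data), "rb") as wf:
        rate = wf.getframerate()
        frames = wf.getnframes()
        return rate, wf.getnchannels(), frames / rate
```

Calling wav_info(resp.content) on the response above should report a 24000 Hz sample rate if the output matches the 24 kHz figure stated for Voxtral.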

Streaming

Set "stream": true to receive audio chunks in real time over Server-Sent Events (SSE):

curl -N -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Get the trust fund to the bank early.",
    "voice": "casual_male",
    "stream": true
  }'

Each event carries a base64-encoded audio chunk; the stream ends with data: [DONE]. See the Higgs TTS cookbook for a full Python SSE consumer.
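As a rough illustration of the framing described above, the sketch below pulls audio bytes out of a sequence of SSE lines and stops at the [DONE] sentinel. Two things are assumptions rather than documented behavior: that each data payload is a JSON object, and that the base64 audio sits under a key named audio (a placeholder; inspect a real event and adjust). The function name iter_audio_chunks is likewise illustrative.

```python
import base64
import json
from typing import Iterable, Iterator, Union

# ASSUMPTION: the JSON key holding the base64 audio is not specified in this
# guide; "audio" is a placeholder to be replaced after inspecting a real event.
AUDIO_KEY = "audio"


def iter_audio_chunks(
    lines: Iterable[Union[str, bytes]], audio_key: str = AUDIO_KEY
) -> Iterator[bytes]:
    """Yield decoded audio bytes from SSE lines until the [DONE] sentinel."""
    for line in lines:
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        event = json.loads(payload)  # ASSUMPTION: payload is a JSON object
        chunk_b64 = event.get(audio_key)
        if chunk_b64:
            yield base64.b64decode(chunk_b64)
```

With requests, the lines can come from resp.iter_lines() on a requests.post(..., stream=True) call whose JSON body sets "stream": true. How the decoded chunks should be concatenated or played depends on the chunk format, which this sketch does not assume.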

Request Parameters

| Parameter | Default | Notes |
| --- | --- | --- |
| input | (required) | Text to synthesize |
| voice | cheerful_female | Preset voice name from the checkpoint’s voice_embedding/ directory |
| max_new_tokens | 4096 | Maximum number of generated acoustic tokens |
| response_format | wav | Output container |
| stream | false | Stream audio chunks over SSE |

Voxtral generation is deterministic: the engine fixes temperature to 0.0, so sampling parameters such as top_p, top_k, and temperature are not used. Reference-clip voice cloning (references) is not supported for Voxtral — use a preset voice instead.
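The rules above can be enforced on the client before a request is sent. The following is a minimal sketch of such a guard, not part of sglang-omni: build_speech_payload and the three constant sets are hypothetical names, and how the server itself treats unknown or ignored fields is not stated here.

```python
from typing import Any

# Fields listed in the Request Parameters table.
ALLOWED_FIELDS = {"input", "voice", "max_new_tokens", "response_format", "stream"}
# Sampling fields the text says Voxtral does not use.
IGNORED_SAMPLING = {"temperature", "top_p", "top_k"}
# Fields the text says Voxtral does not support.
UNSUPPORTED = {"references"}


def build_speech_payload(text: str, **options: Any) -> dict[str, Any]:
    """Build a /v1/audio/speech JSON body, applying the Voxtral rules above."""
    if not text:
        raise ValueError("input text is required")
    unsupported = sorted(UNSUPPORTED.intersection(options))
    if unsupported:
        raise ValueError(f"not supported for Voxtral: {unsupported}")
    payload: dict[str, Any] = {"input": text}
    for key, value in options.items():
        if key in IGNORED_SAMPLING:
            continue  # dropped: decoding is deterministic, so these have no effect
        if key not in ALLOWED_FIELDS:
            raise ValueError(f"unknown field: {key}")
        payload[key] = value
    return payload
```

The result can be passed as the json argument of requests.post. Dropping sampling fields silently, rather than raising, is a design choice that lets the same calling code be reused across models that do accept them.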

Benchmark Results

Seed-TTS EN (full set, 1088 utterances), bf16, max_new_tokens=4096, --no-ref-audio --voice cheerful_female, concurrency 16, WER scored with HF Whisper-large-v3. Hardware: 1× H200 SXM.

| Metric | Value |
| --- | --- |
| WER (corpus micro-avg) | 1.20% |
| WER (per-sample mean / median) | 1.22% / 0.00% |
| WER (per-sample p95 / max) | 9.09% / 42.86% |
| Samples with WER > 50% | 0 / 1088 |
| Latency mean / median (s) | 2.94 / 2.86 |
| Latency p95 / p99 (s) | 4.56 / 5.37 |
| RTF mean / median | 0.519 / 0.541 |
| Output throughput (tok/s) | 383.7 |
| Throughput (req/s) | 5.40 |
| Completed / failed requests | 1088 / 0 |

Reproduce with the Seed-TTS command in our Seed-TTS benchmark. The Voxtral model card also quotes ~70 ms first-audio latency at concurrency 1; the table above is a throughput-oriented run at concurrency 16, so its latency and RTF reflect batched load rather than the latency-optimized single-stream figure. All benchmark audio is 24 kHz.
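To make the table's terms concrete, the sketch below shows how a corpus micro-average WER differs from a per-sample mean, and how a real-time factor is usually computed. It is an illustration under stated assumptions, not the benchmark's scoring code: words are split on whitespace with no text normalization, every reference is assumed non-empty, and RTF is taken to be generation time divided by audio duration. All function names are illustrative.

```python
def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, reference word count)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1], len(ref)


def wer_summary(pairs: list[tuple[str, str]]) -> tuple[float, float]:
    """Return (corpus micro-average WER, per-sample mean WER)."""
    stats = [word_errors(ref, hyp) for ref, hyp in pairs]
    micro = sum(e for e, _ in stats) / sum(n for _, n in stats)
    per_sample = [e / n for e, n in stats]
    return micro, sum(per_sample) / len(per_sample)


def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: values below 1.0 mean faster than real time."""
    return processing_seconds / audio_seconds
```

The micro-average weights every reference word equally, while the per-sample mean weights every utterance equally, so short utterances with a single error pull the per-sample mean up more than the micro-average. That is one plausible reason the two WER figures in the table differ slightly.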

Known Limitations

  • Preset voices only. Voxtral selects from named voices baked into the checkpoint; it does not clone an arbitrary speaker from a reference clip in this engine.

  • Deterministic decoding. temperature is fixed at 0.0; you cannot trade determinism for diversity through sampling parameters.

  • Language coverage. Quality is tuned for the 9 supported languages (English, French, Spanish, German, Italian, Portuguese, Dutch, Arabic, Hindi).

  • Non-commercial license. The weights are CC BY-NC 4.0; commercial use is not permitted.