Higgs TTS#
Higgs Audio v3 TTS is a chat-native text-to-speech model from Boson AI built on a Qwen3-4B backbone. It generates 24 kHz speech through 8 discrete codebooks and supports 100+ languages, voice cloning from a reference clip, and fine-grained inline control over emotion, style, sound effects, and prosody.
Highlights#
Chat-native, low-latency streaming multi-turn speech generation
Multilingual — 100+ languages and dialects, 90+ with single-digit WER/CER
Voice clone accuracy — high-fidelity zero-shot speaker cloning from reference clips
Inline control via <|emotion:…|>, <|style:…|>, <|sfx:…|>, and <|prosody:…|> tags
Architecture#

The Higgs autoregressive decoder consumes interleaved text and audio tokens. Audio is encoded by the Higgs Tokenizer into 8 codebooks at 25 fps, staggered via a delay pattern, then mapped to backbone hidden states through a multi-codebook fused embedding. Output codes pass through a multi-codebook fused head, are de-delayed, and are decoded back to a waveform. Multi-turn generation interleaves <|text|>…<|audio|>… chunks so that each new chunk is conditioned on the reference and all prior chunks.
| Component | Spec |
|---|---|
| Backbone | ~4B autoregressive decoder (36 L, hidden=2560, GQA 32/8) |
| Audio tokens | 8 codebooks × 1026 vocab, delay pattern |
| Multi-codebook embedding / head | Fused single-tensor, tied with text embedding |
| Context length | 8,192 tokens (training sequence length) |
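The delay pattern described above can be illustrated with a minimal sketch. It assumes the common layout in which codebook k is shifted right by k steps and the gaps are filled with a padding id; the exact shift schedule and padding id used by Higgs are not stated in this page, so `PAD`, `apply_delay`, and `remove_delay` are hypothetical names for illustration only.

```python
from typing import List

NUM_CODEBOOKS = 8
PAD = -1  # hypothetical padding id; the real model reserves its own special tokens


def apply_delay(codes: List[List[int]], pad: int = PAD) -> List[List[int]]:
    """Stagger a [T, K] grid so that codebook k is shifted right by k steps.

    Returns a [T + K - 1, K] grid; unfilled cells hold `pad`.
    """
    if not codes:
        return []
    num_frames = len(codes)
    num_codebooks = len(codes[0])
    delayed = [[pad] * num_codebooks for _ in range(num_frames + num_codebooks - 1)]
    for t, frame in enumerate(codes):
        for k, code in enumerate(frame):
            delayed[t + k][k] = code
    return delayed


def remove_delay(delayed: List[List[int]]) -> List[List[int]]:
    """Invert apply_delay and recover the aligned [T, K] grid."""
    if not delayed:
        return []
    num_codebooks = len(delayed[0])
    num_frames = len(delayed) - (num_codebooks - 1)
    return [
        [delayed[t + k][k] for k in range(num_codebooks)]
        for t in range(num_frames)
    ]


# Three frames of 8 codebooks; frame t holds the codes 10*t + k.
frames = [[10 * t + k for k in range(NUM_CODEBOOKS)] for t in range(3)]
staggered = apply_delay(frames)
restored = remove_delay(staggered)
```

Under this assumption, T aligned frames occupy T + 7 decoding steps. At 25 fps each frame covers 40 ms of audio.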
Prerequisites#
Install sglang-omni by following Installation, then download the model:
# Higgs TTS model is private; export your HF token before downloading.
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
hf download boson-sglang/higgs-audio-v3-TTS-4B-grpo05200410999
hf download bosonai/higgs-audio-v2-tokenizer
Server Configuration#
The pipeline is preprocessing → audio_encoder → tts_engine → vocoder.
sgl-omni serve \
--model-path boson-sglang/higgs-audio-v3-TTS-4B-grpo05200410999 \
--port 8000
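Model loading can take a while, so a client script may want to wait before sending the first request. The sketch below only checks that the TCP port accepts connections; it does not use any server health endpoint (none is documented here), and an open port is not a guarantee that the model has finished loading. `wait_for_port` is a hypothetical helper name.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 300.0,
                  interval_s: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

For the command above, `wait_for_port("localhost", 8000)` would block until the port opens or five minutes elapse.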
Synthesizing Speech#
Zero-shot#
Use curl
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"input": "Hello, how are you?"}' \
--output output.wav
Use Python
import requests

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={"input": "Hello, how are you?"},
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)
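A quick sanity check on the returned bytes is to read the WAV header. The sketch below uses Python's standard `wave` module and assumes the response is an integer-PCM WAV; `wave` rejects floating-point WAV data, so a different reader would be needed if the server returns that. `wav_info` is a hypothetical helper name.

```python
import io
import wave


def wav_info(wav_bytes: bytes) -> dict:
    """Return basic header fields and the duration of an in-memory PCM WAV."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": frames / rate,
        }
```

Calling `wav_info(resp.content)` after the request above should report a sample rate of 24000 if the output matches the 24 kHz figure stated at the top of this page.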
Voice Cloning#
Supplying the reference transcript (text) materially improves cloning quality.
Use curl
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"input": "Have a nice day and enjoy south california sunshine.",
"references": [{
"audio_path": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
"text": "We asked over twenty different people, and they all said it was his."
}],
"temperature": 0.8,
"top_k": 50,
"max_new_tokens": 1024
}' \
--output output.wav
Use Python
import requests

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "input": "Have a nice day and enjoy south california sunshine.",
        "references": [{
            "audio_path": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
            "text": "We asked over twenty different people, and they all said it was his.",
        }],
        "temperature": 0.8,
        "top_k": 50,
        "max_new_tokens": 1024,
    },
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)
Streaming#
Unlike a standard request where you wait for the full audio to be generated before receiving anything, streaming lets you start receiving and playing audio while generation is still in progress. This significantly reduces time-to-first-audio, which matters for real-time or interactive use cases.
Higgs TTS implements streaming via Server-Sent Events (SSE). Each SSE event carries a base64-encoded WAV chunk. Your client can decode and play each chunk as it arrives, rather than buffering the entire response.
Enable streaming by setting "stream": true in the request body. During generation, the vocoder emits incremental audio chunks; the terminal event is intentionally slim and carries metadata such as sample_rate and usage instead of repeating the full waveform. Inside the pipeline, audio chunks use the compact audio_waveform payload (bytes plus audio_waveform_shape, audio_waveform_dtype, and sample_rate), which the HTTP layer encodes into the SSE audio.data field.
Use curl
Set "stream": true in your request body:
curl -N -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"input": "Get the trust fund to the bank early.",
"references": [{
"audio_path": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
"text": "We asked over twenty different people, and they all said it was his."
}],
"stream": true
}'
The -N flag disables curl’s output buffering so SSE events are printed as they arrive.
Use Python
This example decodes each chunk and writes it to a WAV file incrementally. In a real application, you would pipe the decoded bytes directly to an audio player (e.g., via pyaudio or sounddevice).
import requests
import base64
import json

REFERENCE_AUDIO = "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav"
REFERENCE_TEXT = "We asked over twenty different people, and they all said it was his."
SPEECH_INPUT = "Get the trust fund to the bank early."

with requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "input": SPEECH_INPUT,
        "references": [{"audio_path": REFERENCE_AUDIO, "text": REFERENCE_TEXT}],
        "stream": True,
    },
    stream=True,
) as resp:
    resp.raise_for_status()
    with open("output_streaming.wav", "wb") as f:
        for line in resp.iter_lines():
            # Skip keep-alive blanks and the terminal sentinel.
            if not line or line == b"data: [DONE]":
                continue
            if not line.startswith(b"data: "):
                continue
            event = json.loads(line[len(b"data: "):])
            if event.get("finish_reason") == "stop":
                break
            audio_data = event.get("audio") or {}
            if audio_data.get("data"):
                chunk = base64.b64decode(audio_data["data"])
                f.write(chunk)
                # In a real app: feed `chunk` to your audio player here
What the SSE response looks like#
Each event follows the standard SSE format:
data: {"id": "speech-...", "object": "audio.speech.chunk", "index": 0, "audio": {"data": "<base64-encoded WAV bytes>", "format": "wav", ...}, "finish_reason": null}
data: {"id": "speech-...", "object": "audio.speech.chunk", "index": 1, "audio": null, "finish_reason": "stop", "usage": {...}}
data: [DONE]
Audio chunks have "finish_reason": null and carry audio data in audio.data. The final metadata event has "finish_reason": "stop" and "audio": null, followed by a [DONE] sentinel.
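The event format above can be parsed without a network connection, which is handy for unit-testing a client. The sketch below assumes that every `audio.data` payload is a self-contained integer-PCM WAV (header plus samples); under that assumption, appending raw chunks to one file would leave extra headers in the middle of the audio, so the frames are re-written under a single header instead. If the server sends header-less continuation chunks, this merge step does not apply. `iter_audio_chunks` and `merge_wav_chunks` are hypothetical helper names.

```python
import base64
import io
import json
import wave
from typing import Iterable, Iterator


def iter_audio_chunks(lines: Iterable[bytes]) -> Iterator[bytes]:
    """Yield decoded audio bytes from raw SSE lines, stopping at the final event."""
    for line in lines:
        if not line.startswith(b"data: "):
            continue  # blank separators and any non-data fields
        body = line[len(b"data: "):]
        if body == b"[DONE]":
            return
        event = json.loads(body)
        if event.get("finish_reason") == "stop":
            return
        audio = event.get("audio") or {}
        if audio.get("data"):
            yield base64.b64decode(audio["data"])


def merge_wav_chunks(chunks: Iterable[bytes]) -> bytes:
    """Concatenate the PCM frames of several WAV blobs under one header."""
    out = io.BytesIO()
    writer = None
    for chunk in chunks:
        with wave.open(io.BytesIO(chunk), "rb") as reader:
            if writer is None:
                # Copy channel count, sample width, and rate from the first chunk.
                writer = wave.open(out, "wb")
                writer.setparams(reader.getparams())
            writer.writeframes(reader.readframes(reader.getnframes()))
    if writer is not None:
        writer.close()  # patches the header with the final frame count
    return out.getvalue()
```

With `requests`, `merge_wav_chunks(iter_audio_chunks(resp.iter_lines()))` would produce one playable file once the stream ends; a low-latency player would instead consume `iter_audio_chunks` directly.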
Inline Control Tokens#
Embed control tokens directly in the input field. Tokens from different categories can be combined:
Demo
Emotion: surprise
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"input": "I cant believe it! <|emotion:surprise|> <|prosody:pause|> <|style:whispering|> Higgs Model and SGLang are absolutely incredible."
}' \
--output output.wav
Prosody: speed_slow
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"input": "<|emotion:enthusiasm|> Welcome to the show! <|prosody:pause|> <|prosody:speed_slow|> Today we have something truly special for you."
}' \
--output output.wav
Combine them together:
The following example combines several prosody tokens with voice cloning to produce a three-part dialogue:
Commands
Part 1 — female asks:
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"input": "<|prosody:pitch_high|> <|prosody:speed_slow|> Excuse me. Can you tell me how much the shirt is?",
"references": [{
"audio_path": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_103675.wav",
"text": "Excuse me. Can you tell me how much the shirt is?"
}],
"temperature": 0.5,
"top_k": 30,
"seed": 404
}' \
--output part1.wav
Part 2 — male answers:
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"input": "<|prosody:speed_very_slow|> <|prosody:expressive_low|> Yes, it is nine fifteen.",
"references": [{
"audio_path": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
"text": "We asked over twenty different people, and they all said it was his."
}],
"temperature": 0.5,
"top_k": 30,
"seed": 43
}' \
--output part2.wav
Part 3 — female reads the question:
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"input": "<|prosody:speed_slow|> <|prosody:expressive_low|> Question: How much is the shirt?",
"references": [{
"audio_path": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_103675.wav",
"text": "We asked over twenty different people, and they all said it was his."
}],
"temperature": 0.5,
"top_k": 30,
"seed": 44
}' \
--output part3.wav
Concatenate (~0.6 s gap between lines):
ffmpeg -y \
-i part1.wav -f lavfi -t 0.6 -i anullsrc=r=24000:cl=mono \
-i part2.wav -f lavfi -t 0.6 -i anullsrc=r=24000:cl=mono \
-i part3.wav \
-filter_complex "[0:a][1:a][2:a][3:a][4:a]concat=n=5:v=0:a=1" \
gaokao_listening.wav
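The same concatenation can be done without ffmpeg. The sketch below uses only the standard `wave` module and assumes that all parts are signed integer-PCM WAVs (16-bit or wider) with identical channel count, sample width, and sample rate; zero bytes are only silence for signed PCM. `concat_with_gaps` is a hypothetical helper name.

```python
import io
import wave
from typing import List


def concat_with_gaps(wav_blobs: List[bytes], gap_s: float = 0.6) -> bytes:
    """Join PCM WAV blobs, inserting gap_s seconds of silence between them."""
    out = io.BytesIO()
    writer = None
    first_fmt = None
    for index, blob in enumerate(wav_blobs):
        with wave.open(io.BytesIO(blob), "rb") as reader:
            fmt = (reader.getnchannels(), reader.getsampwidth(), reader.getframerate())
            if writer is None:
                first_fmt = fmt
                writer = wave.open(out, "wb")
                writer.setnchannels(fmt[0])
                writer.setsampwidth(fmt[1])
                writer.setframerate(fmt[2])
            elif fmt != first_fmt:
                raise ValueError(f"part {index} has format {fmt}, expected {first_fmt}")
            if index > 0:
                # Silence: zero-valued samples for every channel.
                gap_frames = int(round(gap_s * fmt[2]))
                writer.writeframes(b"\x00" * (gap_frames * fmt[0] * fmt[1]))
            writer.writeframes(reader.readframes(reader.getnframes()))
    if writer is not None:
        writer.close()
    return out.getvalue()
```

Reading `part1.wav`, `part2.wav`, and `part3.wav` as bytes and passing them in order would mirror the ffmpeg command above, provided the three files share one format.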
Emotion#
| Token | Description |
|---|---|
|  | Elation / joy |
|  | Amusement / playful laughter |
|  | Enthusiasm / excitement |
|  | Determination / firmness |
|  | Pride / confidence |
|  | Calm satisfaction |
|  | Warmth / affection |
|  | Relief |
|  | Thoughtful / reflective |
|  | Confused |
|  | Surprised |
|  | Awe / wonder |
|  | Longing / yearning |
|  | Heightened desire |
|  | Anger |
|  | Fear |
|  | Disgust |
|  | Bitterness |
|  | Sadness |
|  | Shame |
|  | Helplessness |
Style#
| Token | Description |
|---|---|
|  | Singing |
|  | Shouting / projected voice |
|  | Whisper |
Sound Effects#
| Token | Description |
|---|---|
|  | Cough |
|  | Laughter |
|  | Crying |
|  | Screaming |
|  | Burping |
|  | Humming |
|  | Sigh |
|  | Sniff |
|  | Sneeze |
Prosody#
| Token | Effect |
|---|---|
|  | ~0.65× speed |
|  | ~0.85× speed |
|  | ~1.2× speed |
|  | ~1.4× speed |
|  | ~−3 semitones |
|  | ~+2.5 semitones |
|  | ~400–700 ms pause |
|  | ~700–1500 ms pause |
|  | More expressive delivery |
|  | Flatter delivery |
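The `<|category:name|>` syntax used by the control tokens in this section is simple enough to build and inspect programmatically. The sketch below is a client-side convenience under the assumption that the four categories listed in the Highlights are the only ones; it does not check token names against the tables, and `tag`, `find_tags`, and `strip_tags` are hypothetical helper names.

```python
import re
from typing import List, Tuple

CATEGORIES = ("emotion", "style", "sfx", "prosody")
_TAG_RE = re.compile(r"<\|(emotion|style|sfx|prosody):([^|<>]+)\|>")


def tag(category: str, name: str) -> str:
    """Format one inline control token, e.g. tag("emotion", "surprise")."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    return f"<|{category}:{name}|>"


def find_tags(text: str) -> List[Tuple[str, str]]:
    """Return (category, name) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in _TAG_RE.finditer(text)]


def strip_tags(text: str) -> str:
    """Remove control tokens and collapse the whitespace they leave behind."""
    return re.sub(r"\s+", " ", _TAG_RE.sub(" ", text)).strip()


prompt = " ".join([
    tag("emotion", "enthusiasm"),
    "Welcome to the show!",
    tag("prosody", "pause"),
    tag("prosody", "speed_slow"),
    "Today we have something truly special for you.",
])
```

Here `prompt` reproduces the input string of the speed_slow demo above, and `strip_tags(prompt)` recovers the plain text, which can be useful when comparing the synthesized audio against a transcript.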
Pre-encoded reference codes#
For high-throughput pipelines (e.g. RL rollout) where the same reference audio is reused across many requests, you can encode the reference audio offline and pass the discrete codes directly via reference_codes — this skips the server-side codec encode step. Shape must be [T, num_codebooks=8].
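import requests

SPEECH_INPUT = "Get the trust fund to the bank early."
REFERENCE_TEXT = "We asked over twenty different people, and they all said it was his."
# codes_TN must be defined beforehand: the reference audio encoded offline
# into a nested int list of shape [T, 8] (before the delay pattern is applied).

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "input": SPEECH_INPUT,
        "reference_codes": codes_TN,  # [T, 8] int list, pre-delay-pattern
        "reference_text": REFERENCE_TEXT,
    },
)
resp.raise_for_status()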
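Because `reference_codes` bypasses the server-side encoder, it can help to validate the grid on the client before sending it. The sketch below checks the [T, 8] shape stated above and, as an assumption, treats the 1026-entry vocabulary from the architecture table as the exclusive upper bound for each code; whether the two ids beyond 1023 are valid in a reference is not stated here. `validate_reference_codes` and `to_time_major` are hypothetical helper names.

```python
from typing import List, Sequence

NUM_CODEBOOKS = 8
CODEBOOK_VOCAB = 1026  # from the architecture table; assumed exclusive upper bound


def to_time_major(codes_kt: Sequence[Sequence[int]]) -> List[List[int]]:
    """Transpose a codebook-major [8, T] grid into the expected [T, 8] layout."""
    return [list(frame) for frame in zip(*codes_kt)]


def validate_reference_codes(codes: Sequence[Sequence[int]]) -> int:
    """Raise if codes is not a non-empty [T, 8] grid of in-range ints; return T."""
    if len(codes) == 0:
        raise ValueError("reference_codes must contain at least one frame")
    for t, frame in enumerate(codes):
        if len(frame) != NUM_CODEBOOKS:
            raise ValueError(f"frame {t} has {len(frame)} codes, expected {NUM_CODEBOOKS}")
        for k, code in enumerate(frame):
            if isinstance(code, bool) or not isinstance(code, int):
                raise TypeError(f"code at [{t}, {k}] is not an int")
            if not 0 <= code < CODEBOOK_VOCAB:
                raise ValueError(f"code {code} at [{t}, {k}] is out of range")
    return len(codes)
```

A tokenizer that returns codes as [8, T] would go through `to_time_major` first; the resulting nested list is JSON-serializable as is.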
Request parameters#
| Parameter | Type | Default | Description |
|---|---|---|---|
|  | string | (required) | Text to synthesize |
|  | string |  | Voice identifier (ignored when |
|  | string |  | Output audio format |
|  | bool |  | Enable streaming via SSE |
|  | list |  | Reference audio for voice cloning; each item has |
|  | list[list[int]] |  | Pre-encoded discrete codes, shape |
|  | string |  | Transcript of reference audio when supplying |
|  | int |  | Maximum number of generated multi-codebook steps |
|  | float |  | Sampling temperature |
|  | float |  | Top-p sampling |
|  | int |  | Top-k sampling |
|  | int |  | Random seed for reproducibility |
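Since the generation limit is counted in multi-codebook steps, it can be related to audio duration through the 25 fps frame rate given in the Architecture section. The sketch below assumes one step per audio frame plus 7 extra steps for an 8-codebook delay pattern; any additional special steps (for example start or end markers) are not accounted for, so treat the numbers as rough estimates. Both function names are hypothetical.

```python
import math

FRAME_RATE_HZ = 25   # tokenizer frame rate from the Architecture section
NUM_CODEBOOKS = 8
DELAY_STEPS = NUM_CODEBOOKS - 1  # assumption: codebook k lags by k steps


def steps_for_duration(seconds: float) -> int:
    """Estimate the number of steps needed to generate `seconds` of audio."""
    return math.ceil(seconds * FRAME_RATE_HZ) + DELAY_STEPS


def max_duration_s(max_new_tokens: int) -> float:
    """Estimate the longest audio a given step budget can produce."""
    return max(0, max_new_tokens - DELAY_STEPS) / FRAME_RATE_HZ
```

Under these assumptions, the budget of 1024 steps used in the voice cloning example corresponds to roughly 40 seconds of audio.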
Throughput#
[TODO (yichi, Huapeng): This should be updated in the last minute.]
Throughput on seed-tts en (N=50 per concurrency, sequential thread pool, A100 40GB, bf16):
| Concurrency | Mean latency | RTF (per-req) | audio_s/s |
|---|---|---|---|
| 1 | 4637 ms | 0.526 | 1.90 |
| 16 | 7138 ms | 0.747 | 12.88 |
| 32 | 10188 ms | 0.865 | 16.94 |
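The exact definitions behind these columns are not spelled out here. A plausible reading, sketched below as an assumption, is that per-request RTF is latency divided by the duration of the generated audio (averaged over requests), and that audio_s/s is the total generated audio divided by the wall-clock time of the run. `RequestStat` and `summarize` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestStat:
    latency_s: float  # time from sending the request to receiving the full audio
    audio_s: float    # duration of the generated audio


def summarize(stats: List[RequestStat], wall_clock_s: float) -> Dict[str, float]:
    """Aggregate per-request measurements into the three table columns."""
    n = len(stats)
    if n == 0 or wall_clock_s <= 0:
        raise ValueError("need at least one request and a positive wall-clock time")
    return {
        "mean_latency_ms": 1000.0 * sum(s.latency_s for s in stats) / n,
        "mean_rtf": sum(s.latency_s / s.audio_s for s in stats) / n,
        "audio_s_per_s": sum(s.audio_s for s in stats) / wall_clock_s,
    }
```

At concurrency 1 the table is consistent with this reading: requests run back to back, so throughput is about 1 / RTF, and 1 / 0.526 ≈ 1.90.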
Evaluation Benchmarks#
We report WER / CER (↓, %) and WavLM speaker similarity (↑, ×100) on three zero-shot voice-cloning benchmarks.
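For readers unfamiliar with the metrics, WER and CER are edit distances between a reference transcript and the transcript recognized from the synthesized audio, normalized by the reference length. The sketch below shows the standard definition only; the ASR model and the text normalization used for the benchmarks are not described here and are not reproduced.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance counting substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference item
                cur[j - 1] + 1,          # insertion of a hypothesis item
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, using whitespace tokenization."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent, ignoring spaces."""
    ref_chars = reference.replace(" ", "")
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return 100.0 * edit_distance(ref_chars, hypothesis.replace(" ", "")) / len(ref_chars)
```

CER is typically the reported metric for languages written without word boundaries, such as Chinese or Japanese, which is presumably why both are listed.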
Seed-TTS#
| Lang | WER ↓ | SIM ↑ |
|---|---|---|
| en | 2.05 | 64.86 |
| zh | 2.00 | 70.96 |
| macro | 2.02 | 67.91 |
CV3 (9 langs)#
| Lang | WER ↓ | SIM ↑ |
|---|---|---|
| de | 8.62 | 65.43 |
| en | 6.73 | 60.37 |
| es | 5.03 | 68.18 |
| fr | 14.50 | 62.34 |
| it | 8.55 | 67.34 |
| ja | 7.96 | 67.91 |
| ko | 4.38 | 68.40 |
| ru | 9.38 | 66.77 |
| zh | 5.19 | 69.71 |
| macro | 7.82 | 66.27 |
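The macro row appears to be the unweighted mean over languages. The check below recomputes it from the per-language CV3 values in the table above; the interpretation of "macro" is an assumption, but the arithmetic reproduces the reported row.

```python
from statistics import mean

# Per-language (WER, SIM) values copied from the CV3 table.
CV3 = {
    "de": (8.62, 65.43),
    "en": (6.73, 60.37),
    "es": (5.03, 68.18),
    "fr": (14.50, 62.34),
    "it": (8.55, 67.34),
    "ja": (7.96, 67.91),
    "ko": (4.38, 68.40),
    "ru": (9.38, 66.77),
    "zh": (5.19, 69.71),
}

# Unweighted average across languages, rounded to two decimals.
macro_wer = round(mean(wer for wer, _ in CV3.values()), 2)
macro_sim = round(mean(sim for _, sim in CV3.values()), 2)
```

Each language counts equally under a macro average, so a single weak language such as fr (14.50) moves the aggregate more than it would under a sample-weighted mean.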
MiniMax-Multilingual (23 langs)#
| Lang | WER ↓ | SIM ↑ |
|---|---|---|
| ar | 2.59 | 74.77 |
| cs | 4.62 | 78.80 |
| de | 0.74 | 70.65 |
| el | 1.81 | 78.02 |
| en | 1.87 | 81.32 |
| es | 3.06 | 72.78 |
| fi | 4.62 | 82.69 |
| fr | 4.70 | 70.27 |
| hi | 6.81 | 80.94 |
| id | 2.38 | 72.42 |
| it | 2.07 | 74.56 |
| ja | 3.74 | 74.23 |
| ko | 3.57 | 74.86 |
| nl | 2.10 | 73.02 |
| pl | 2.08 | 83.16 |
| pt | 2.59 | 76.52 |
| ro | 3.64 | 77.10 |
| ru | 4.66 | 74.48 |
| th | 7.59 | 77.64 |
| tr | 2.09 | 77.72 |
| uk | 2.69 | 71.79 |
| vi | 1.18 | 73.46 |
| zh | 1.65 | 74.85 |
| macro | 3.17 | 75.92 |