Fish Audio S2-Pro
Fish Audio S2-Pro is a text-to-speech model served through /v1/audio/speech. It supports plain TTS, voice cloning with a reference clip, and streaming audio chunks.
Prerequisites
Install sglang-omni by following the Installation guide, then download the model:
hf download fishaudio/s2-pro
Server Configuration
sgl-omni serve \
--model-path fishaudio/s2-pro \
--config examples/configs/s2pro_tts.yaml \
--port 8000
Synthesize Speech
Plain TTS:
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"input": "Hello, how are you?"}' \
--output output.wav
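The same plain TTS request can also be issued from Python. The following is a minimal sketch using only the standard library; it assumes the server is reachable at `localhost:8000` as in the serve command above. `build_speech_request` and `synthesize_to_file` are illustrative helper names, not part of sglang-omni.

```python
import json
import urllib.request

# Assumption: the server started under "Server Configuration" listens on port 8000.
SPEECH_URL = "http://localhost:8000/v1/audio/speech"


def build_speech_request(text: str, url: str = SPEECH_URL) -> urllib.request.Request:
    """Build the same POST request as the curl command, without sending it."""
    body = json.dumps({"input": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text: str, out_path: str, url: str = SPEECH_URL) -> None:
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, url)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())


# Usage (requires a running server):
# synthesize_to_file("Hello, how are you?", "output.wav")
```

Separating request construction from sending makes it possible to inspect the payload without a live server.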
Voice cloning:
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"input": "Get the trust fund to the bank early.",
"references": [{
"audio_path": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
"text": "We asked over twenty different people, and they all said it was his."
}]
}' \
--output output.wav
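The voice cloning body has a small nested structure that is easy to get wrong by hand. The sketch below builds it in Python; `build_clone_payload` is a hypothetical helper name, and the reading of `text` as the transcript of the reference clip is an inference from the example above rather than a documented guarantee.

```python
import json


def build_clone_payload(text: str, ref_audio: str, ref_text: str) -> dict:
    """Return a request body with one reference clip, mirroring the curl example.

    ref_audio: location of the reference clip (the example above uses a URL).
    ref_text:  text paired with that clip (assumed to be its transcript).
    """
    return {
        "input": text,
        "references": [{"audio_path": ref_audio, "text": ref_text}],
    }


payload = build_clone_payload(
    "Get the trust fund to the bank early.",
    "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/"
    "resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
    "We asked over twenty different people, and they all said it was his.",
)

# Serialize to the JSON string that would be sent as the POST body.
body = json.dumps(payload)
```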
Streaming:
curl -N -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"input": "Get the trust fund to the bank early.",
"references": [{
"audio_path": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
"text": "We asked over twenty different people, and they all said it was his."
}],
"stream": true,
"response_format": "pcm"
}' \
--output output.pcm
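A raw PCM file carries no header, so most players cannot open `output.pcm` directly. The sketch below reads a byte stream in chunks and wraps it in a WAV container using Python's standard `wave` module. The sample rate, channel count, and sample width of the S2-Pro stream are not stated here, so `sample_rate` is a required argument and the `channels` and `sample_width` defaults are assumptions to check against your deployment. The helper names are illustrative.

```python
import io
import wave
from typing import BinaryIO, Iterator


def iter_chunks(stream: BinaryIO, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield successive chunks from a file-like object until it is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


def pcm_stream_to_wav(
    stream: BinaryIO,
    wav_path: str,
    sample_rate: int,
    channels: int = 1,      # assumption: mono output
    sample_width: int = 2,  # assumption: 16-bit samples (2 bytes each)
) -> int:
    """Copy raw PCM bytes from stream into a WAV file; return bytes written."""
    total = 0
    with wave.open(wav_path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        for chunk in iter_chunks(stream):
            wav.writeframes(chunk)
            total += len(chunk)
    return total
```

The `stream` argument can be any binary file-like object, for example a file opened with `open("output.pcm", "rb")` or the response object returned by `urllib.request.urlopen` for a streaming request.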
Request Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `input` | string | required | Text to synthesize |
| | string | | Voice identifier for non-reference requests |
| `response_format` | string | | Output audio format |
| | float | | Playback speed multiplier |
| `stream` | bool | | Stream raw PCM audio chunks |
| `references` | list | | Reference clip for voice cloning; each item has `audio_path` and `text` |
| | string | | Shorthand for |
| | int | | Maximum generated semantic tokens |
| | float | | Sampling temperature |
| | float | | Top-p sampling |
| `top_k` | int | | Top-k sampling; must be `-1` or `1..30` |
| | float | | Repetition penalty |
Known Limitations
- `top_k` is constrained to `-1` or `1..30`; keep requests inside this range, because invalid values currently fail the S2-Pro pipeline instead of returning a clean parameter error.
- Reference quality strongly affects cloned voice quality.
- Use streaming for interactive playback; CLI inspection of raw audio responses is awkward.
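Because an out-of-range `top_k` fails inside the pipeline rather than producing a clean parameter error, it can help to check the value on the client before sending a request. The sketch below is one way to do that; `validate_top_k` is a hypothetical client-side helper, not part of sglang-omni, and it encodes only the range stated above.

```python
def validate_top_k(top_k: int) -> int:
    """Return top_k unchanged if it is -1 or in 1..30, else raise ValueError."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(top_k, bool) or not isinstance(top_k, int):
        raise ValueError(f"top_k must be an integer, got {top_k!r}")
    if top_k == -1 or 1 <= top_k <= 30:
        return top_k
    raise ValueError(f"top_k must be -1 or in the range 1..30, got {top_k}")


# Example: attach a validated value to a request body before sending it.
payload = {"input": "Hello, how are you?", "top_k": validate_top_k(30)}
```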