Fish Audio S2-Pro

Fish Audio S2-Pro is a text-to-speech model served through the `/v1/audio/speech` endpoint. It supports plain TTS, voice cloning from a reference clip, and streaming audio output.

Prerequisites

Install sglang-omni by following the Installation guide, then download the model:

hf download fishaudio/s2-pro

Server Configuration

sgl-omni serve \
  --model-path fishaudio/s2-pro \
  --config examples/configs/s2pro_tts.yaml \
  --port 8000
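
Model loading can take a while, so a script that launches the server may want to wait before sending requests. The following is a minimal sketch that polls the TCP port chosen above. `wait_for_port` is a hypothetical helper, not part of sglang-omni, and an open port only shows that the process is accepting connections, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0,
                  interval: float = 0.5) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and retry.
            time.sleep(interval)
    return False
```

For the command above, `wait_for_port("localhost", 8000)` would be the matching call.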

Synthesize Speech

Plain TTS:

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "fishaudio/s2-pro",
    "voice": "default",
    "input": "Hello, how are you?"
  }' \
  --output output.wav
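
The same request can be sent from Python using only the standard library. This is a minimal sketch, assuming the server is listening on `localhost:8000` as configured above; `build_speech_request` and `synthesize` are hypothetical helper names, not part of sglang-omni.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: server started with --port 8000


def build_speech_request(text: str, voice: str = "default",
                         model: str = "fishaudio/s2-pro",
                         base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/audio/speech."""
    body = json.dumps({"model": model, "voice": voice, "input": text})
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, out_path: str = "output.wav") -> None:
    """Send the request and write the returned audio bytes to out_path."""
    with urllib.request.urlopen(build_speech_request(text)) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
```

Calling `synthesize("Hello, how are you?")` should produce the same `output.wav` as the curl command, provided the server is running.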

Voice cloning:

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "fishaudio/s2-pro",
    "voice": "default",
    "input": "Get the trust fund to the bank early.",
    "references": [{
      "audio_path": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
      "text": "We asked over twenty different people, and they all said it was his."
    }]
  }' \
  --output output.wav
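
When requests are generated programmatically, the cloning body can be assembled with a small helper. The sketch below mirrors the JSON shown above; `cloning_payload` is a hypothetical name, and the client-side check that both reference fields are present is an assumption for illustration, not documented server behavior.

```python
import json


def cloning_payload(text: str, ref_audio_path: str, ref_text: str,
                    model: str = "fishaudio/s2-pro") -> dict:
    """Return a JSON-serializable body for a voice-cloning request."""
    # Each reference item carries an audio path and the transcript of that clip.
    if not ref_audio_path or not ref_text.strip():
        raise ValueError("a reference needs both an audio path and its transcript")
    return {
        "model": model,
        "voice": "default",
        "input": text,
        "references": [{"audio_path": ref_audio_path, "text": ref_text}],
    }


if __name__ == "__main__":
    # "reference.wav" is a placeholder path used only for this demonstration.
    payload = cloning_payload(
        "Get the trust fund to the bank early.",
        "reference.wav",
        "We asked over twenty different people, and they all said it was his.",
    )
    print(json.dumps(payload, indent=2))
```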

Streaming:

curl -N -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "fishaudio/s2-pro",
    "voice": "default",
    "input": "Get the trust fund to the bank early.",
    "references": [{
      "audio_path": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
      "text": "We asked over twenty different people, and they all said it was his."
    }],
    "stream": true,
    "response_format": "pcm"
  }' \
  --output output.pcm
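
The streamed `output.pcm` is raw audio with no header describing its sample rate or layout. The sketch below wraps streamed chunks in a WAV container using Python's `wave` module. The stream's sample rate, sample width, and channel count are not stated in this document, so the sample rate is a required argument and the 16-bit mono defaults are assumptions to verify against your server's output. `iter_chunks` and `pcm_chunks_to_wav` are hypothetical helper names.

```python
import wave
from typing import BinaryIO, Iterable, Iterator


def iter_chunks(stream: BinaryIO, size: int = 4096) -> Iterator[bytes]:
    """Yield successive reads from a file-like object until it is exhausted."""
    while True:
        chunk = stream.read(size)
        if not chunk:
            break
        yield chunk


def pcm_chunks_to_wav(chunks: Iterable[bytes], out_path: str, sample_rate: int,
                      channels: int = 1, sample_width: int = 2) -> int:
    """Write raw PCM chunks into a WAV file and return the PCM bytes written."""
    total = 0
    with wave.open(out_path, "wb") as wav:
        wav.setnchannels(channels)      # assumption: mono
        wav.setsampwidth(sample_width)  # assumption: 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)   # must match the server's output rate
        for chunk in chunks:
            wav.writeframes(chunk)
            total += len(chunk)
    return total
```

An HTTP response object returned by `urllib.request.urlopen` is file-like, so it can be passed to `iter_chunks` directly; the same applies to an already saved `output.pcm` opened in binary mode.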

Request Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model` | string | served model | Served model identifier |
| `input` | string | required | Text to synthesize |
| `voice` | string | `default` | Voice identifier for non-reference requests |
| `response_format` | string | `wav` | Output audio format |
| `speed` | float | `1.0` | Playback speed multiplier |
| `stream` | bool | `false` | Stream raw PCM audio chunks |
| `references` | list | `null` | Reference clip for voice cloning. Each item has `audio_path` and `text` |
| `ref_audio` / `ref_text` | string | `null` | Shorthand for `references[0].audio_path` and `references[0].text` |
| `max_new_tokens` | int | `2048` | Maximum generated semantic tokens |
| `temperature` | float | `0.8` | Sampling temperature |
| `top_p` | float | `0.8` | Top-p sampling |
| `top_k` | int | `30` | Top-k sampling. It must be `-1` or between `1` and `30` |
| `repetition_penalty` | float | `1.1` | Repetition penalty |
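
Because every parameter except `input` has a default, a client only needs to send the values it changes. The sketch below encodes the defaults listed above and builds a minimal request body. `DEFAULTS`, `NO_DEFAULT`, and `build_payload` are hypothetical names, and rejecting unknown keys on the client is an illustrative choice, not documented server behavior.

```python
# Defaults copied from the parameter list in this section.
DEFAULTS = {
    "voice": "default",
    "response_format": "wav",
    "speed": 1.0,
    "stream": False,
    "max_new_tokens": 2048,
    "temperature": 0.8,
    "top_p": 0.8,
    "top_k": 30,
    "repetition_penalty": 1.1,
}
# Optional parameters whose default is null or chosen by the server.
NO_DEFAULT = {"model", "references", "ref_audio", "ref_text"}


def build_payload(text: str, **overrides) -> dict:
    """Return a request body holding `input` plus any non-default overrides."""
    unknown = set(overrides) - set(DEFAULTS) - NO_DEFAULT
    if unknown:
        raise TypeError(f"unknown parameter(s): {sorted(unknown)}")
    payload = {"input": text}
    for key, value in overrides.items():
        # Skip unset values and values equal to the documented default.
        if value is not None and value != DEFAULTS.get(key):
            payload[key] = value
    return payload
```

For example, `build_payload("Hi", temperature=0.8, top_k=10, stream=True)` keeps `top_k` and `stream` but drops `temperature`, since it matches the default.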

Known Limitations

`top_k` must be `-1` or in the range `1` to `30`. Keep requests inside this range: invalid values currently fail the S2-Pro pipeline instead of returning a clean parameter error.
Reference quality strongly affects cloned voice quality.
Use streaming for interactive playback. Raw audio responses are binary and awkward to inspect from the command line, so write them to a file with `--output` as in the examples above.
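
Since an out-of-range `top_k` fails inside the pipeline rather than producing a parameter error, it can help to check the value on the client before sending. This is a minimal sketch of such a guard, treating the range as inclusive at both ends; `validate_top_k` is a hypothetical helper, not part of sglang-omni.

```python
def validate_top_k(top_k: int) -> int:
    """Return top_k unchanged if it is -1 or within 1..30, else raise."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(top_k, bool) or not isinstance(top_k, int):
        raise TypeError("top_k must be an integer")
    if top_k != -1 and not 1 <= top_k <= 30:
        raise ValueError(f"top_k must be -1 or between 1 and 30, got {top_k}")
    return top_k
```

Calling this before building the request body turns a pipeline failure into an immediate, readable error on the client side.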