MOSS-TTS#

MOSS-TTS-v1.5 is a delay-pattern text-to-speech model from MOSI.AI and the OpenMOSS team. It reconstructs 24 kHz speech and supports zero-shot voice cloning from reference audio, reference-less synthesis, long-form speech generation, streaming, token-level duration control, Pinyin/IPA pronunciation control, multilingual synthesis, and code-switching. The model supports 31 languages, accepts language tags to guide multilingual generation, and supports inline pause markers such as [pause 3.2s] for explicit prosody control.

MOSS-TTS delay-pattern architecture

Architecturally, MOSS-TTS-v1.5 is the delay-pattern counterpart to MOSS-TTS-Local-Transformer-v1.5. The Qwen3-8B backbone predicts one text stream plus 32 residual-vector-quantization (RVQ) audio codebooks with delay-pattern scheduling; the generated codes are de-delayed and reconstructed into waveform by the vocoder. In SGLang-Omni it runs as a preprocessing → tts_engine → vocoder pipeline served through the OpenAI-compatible /v1/audio/speech endpoint.

Component	Spec
Architecture	`MossTTSDelayModel` (`moss_tts_delay`)
Backbone	Qwen3-8B autoregressive decoder (36 L, hidden=4096, GQA 32/8)
Audio tokens	32-codebook RVQ depth with delay-pattern scheduling
Output audio	24 kHz
Languages	31 languages with optional language tags
Controls	Voice reference, target duration tokens, Pinyin/IPA, pause markers, style instructions

Prerequisites#

Install sglang-omni by following Installation, then download the model (public, no token required):

hf download OpenMOSS-Team/MOSS-TTS-v1.5

The processor ships with the checkpoint, so no extra TTS package is needed. Decoding base64 (data-URI) reference audio additionally requires soundfile (uv pip install soundfile).

Server Configuration#

The pipeline is preprocessing → tts_engine → vocoder.

sgl-omni serve \
  --model-path OpenMOSS-Team/MOSS-TTS-v1.5 \
  --config examples/configs/moss_tts.yaml \
  --port 8000

Synthesizing Speech#

Basic Speech#

MOSS-TTS can synthesize speech without a reference clip:

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OpenMOSS-Team/MOSS-TTS-v1.5",
    "voice": "default",
    "input": "SGLang-Omni is a great project!"
  }' \
  --output output.wav

Voice Cloning#

Provide a reference clip when you want voice cloning. The references field accepts audio_path (a local path, HTTP URL, or base64 data URI) and text (the transcript of that clip). Supplying the transcript materially improves cloning quality.

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OpenMOSS-Team/MOSS-TTS-v1.5",
    "voice": "default",
    "input": "SGLang-Omni is a great project!",
    "references": [{
      "audio_path": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
      "text": "We asked over twenty different people, and they all said it was his."
    }]
  }' \
  --output output.wav

ref_audio and ref_text are accepted as shorthand for references[0].audio_path and references[0].text.

Python#

import requests

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "model": "OpenMOSS-Team/MOSS-TTS-v1.5",
        "voice": "default",
        "input": "Get the trust fund to the bank early.",
        "ref_audio": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
        "ref_text": "We asked over twenty different people, and they all said it was his.",
    },
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)

Reference Audio Sources#

audio_path / ref_audio may be a local filesystem path readable by the server, an HTTP(S) URL, or a base64 data URI (data:audio/wav;base64,<...>, decoded with soundfile):

{"ref_audio": "data:audio/wav;base64,UklGR.....", "ref_text": "Transcript of the clip."}

Streaming#

Set "stream": true and "response_format": "pcm" to receive raw PCM audio chunks in real time.

curl -N -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OpenMOSS-Team/MOSS-TTS-v1.5",
    "voice": "default",
    "input": "Get the trust fund to the bank early.",
    "ref_audio": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
    "ref_text": "We asked over twenty different people, and they all said it was his.",
    "stream": true,
    "response_format": "pcm"
  }' \
  --output output.pcm

Duration Control#

MOSS-TTS conditions on a target duration token count (codec frames; a larger count yields longer audio). Set it with an inline ${token:N} prefix on input (stripped before synthesis), or with a token_count (alias duration_tokens / tokens) parameter. The count must be a positive integer.

{
  "model": "OpenMOSS-Team/MOSS-TTS-v1.5",
  "voice": "default",
  "input": "${token:150}A sentence with an explicit duration target.",
  "ref_audio": "..."
}

If omitted, the model picks a duration on its own; the SeedTTS benchmark estimates one per sample with --token-count auto.

Text Markup, Style, and Language#

Inline text markup that the model understands (for example [pause Xs], pinyin, and IPA) is passed through unchanged. An optional instructions (alias instruct) field carries a free-text style directive, and an optional language hint biases the target language (omit it to let the model infer from the text):

{
  "model": "OpenMOSS-Team/MOSS-TTS-v1.5",
  "voice": "default",
  "input": "今天天气不错 [pause 0.5s] 就该出去晒晒太阳。",
  "ref_audio": "...", "ref_text": "...",
  "language": "Chinese",
}

Generation Parameters#

Parameter	Default	Notes
`model`	served model	Served model identifier
`input`	(required)	Text to synthesize. It may carry a `${token:N}` duration prefix and inline markup
`voice`	`default`	Voice identifier
`references`	`null`	Reference clip for cloning. Each item has `audio_path` and `text`
`ref_audio` / `ref_text`	`null`	Shorthand for `references[0].audio_path` / `references[0].text`
`stream`	`false`	Stream raw PCM audio chunks
`language`	`null`	Optional target-language hint. Omit to let the model infer
`instructions` / `instruct`	`null`	Optional free-text style directive
`token_count` / `duration_tokens` / `tokens`	`null`	Target duration in codec frames. It must be `> 0`
`max_new_tokens`	`4096`	Maximum generated frames. An explicit value must be `> 0`
`temperature`	`1.5` text / `1.7` audio	Sampling temperature. A single `temperature` overrides both channels
`top_p`	`1.0` text / `0.8` audio	Top-p sampling. A single `top_p` overrides both channels
`top_k`	`50` text / `25` audio	Top-k sampling. A single `top_k` overrides both channels
`repetition_penalty`	`1.0`	Audio repetition penalty
`seed`	`null`	Non-negative integer. See Seed Reproducibility

The per-channel fields (text_temperature, audio_temperature, text_top_p, audio_top_p, text_top_k, audio_top_k, audio_repetition_penalty) are also accepted and take precedence over the single-value aliases.

Seed Reproducibility#

MOSS-TTS samples each row, position, and codebook with multinomial_with_seed, deriving a per-request seed from the public seed and combining it with a per-(step, channel) position. A sampled token therefore depends only on its own seed and position — never on its batch neighbours — so a fixed seed is reproducible at any concurrency, not just batch size 1. Limitations:

Reproducibility holds for a fixed server configuration and hardware. Floating-point non-determinism in the backbone (different batch shapes, GPU models, or kernels) can still change the logits, and thus the sampled tokens, across deployments.
seed must be a non-negative integer; non-integer or negative values are rejected.
Requests without a seed draw a fresh random per-request seed, so they are not reproducible across runs (but are still independent of batch neighbours).

Benchmarking#

MOSS-TTS clones from each prompt (--ref-format references) and estimates a per-sample duration with --token-count auto. Run at --max-concurrency 8; higher concurrency regresses WER.

python -m benchmarks.eval.benchmark_tts_seedtts \
  --meta zhaochenyang20/seed-tts-eval-arrow \
  --model OpenMOSS-Team/MOSS-TTS-v1.5 --port 8000 \
  --ref-format references --token-count auto \
  --output-dir results/moss_tts_en --lang en --max-concurrency 8

Use --lang zh for the Chinese split. See benchmarks/README.md for the full workflow.

Benchmark Results#

Seed-TTS-Eval full set (EN = 1088, ZH = 2020) on 1× H200, concurrency 8, --token-count auto. WER is scored with HF Whisper-large-v3 (EN) / FunASR paraformer-zh (ZH). These are the reference numbers tabulated in benchmarks/eval/benchmark_tts_seedtts.py (source: PR #609) — reproducible references, not CI thresholds.

Lang	WER (corpus)	WER (excl. >50%)	Latency mean / p95 (s)	RTF mean	Throughput (qps)
EN	1.68%	1.32% (4 outliers)	3.449 / 4.141	0.811	2.312
ZH	1.36%	1.27% (2 outliers)	3.608 / 4.153	0.635	2.213

A handful of utterances run away into a repetition loop (> 50% WER) and dominate the raw micro-average; excluding them, corpus WER is ~1.3% in both languages, and per-sample median WER is 0.00%.

Known Limitations#

Voice cloning depends on the reference. Omit the reference for non-cloned speech; provide the transcript (text / ref_text) for the best speaker similarity when cloning.
Concurrency vs. WER. Quality is best around --max-concurrency 8; higher concurrency regresses WER.
Rare runaway generation. A small fraction of utterances can loop and generate up to max_new_tokens; setting a token_count (or lowering max_new_tokens) bounds the output.
Duration is a hint. ${token:N} / token_count steers length but is not an exact clip duration.

MOSS-TTS

Contents