SGLang-Omni

SGLang-Omni is a high-performance serving framework for omni and multimodal models, built on top of SGLang. It orchestrates multi-stage pipelines with low latency and exposes OpenAI-compatible APIs.

Modern omni models, such as speech-output LLMs and multimodal generation systems, decompose into heterogeneous stages with fundamentally different computational profiles: a compute-bound thinker, a memory-bound talker, and a latency-sensitive codec. SGLang-Omni is built around a computation-centric design: each stage runs its own independent scheduler tuned to its bottleneck, communicates through a shared inbox/outbox abstraction, and transfers tensors via zero-copy shared memory. This keeps any single stage from degrading the others, and it lets new models plug into the framework by declaring a pipeline topology rather than building an inference system from scratch.
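The inbox/outbox idea can be illustrated with a toy pipeline. This is only a minimal sketch under stated assumptions, not SGLang-Omni's actual API: the `Stage` class, `build_pipeline`, `run_demo`, and the stand-in stage functions are all hypothetical names, threads stand in for separate processes and GPUs, and plain Python queues stand in for the zero-copy shared-memory transport.

```python
import queue
import threading
from dataclasses import dataclass, field
from typing import Callable, Optional

_STOP = object()  # sentinel that tells a stage to shut down


@dataclass
class Stage:
    """One pipeline stage with its own worker loop (hypothetical, not the SGLang-Omni API)."""
    name: str
    fn: Callable[[object], object]
    inbox: queue.Queue = field(default_factory=queue.Queue)
    outbox: Optional[queue.Queue] = None  # assigned when the topology is wired

    def run(self) -> None:
        # Each stage drains only its own inbox, so a slow neighbour
        # never blocks this loop from making progress on queued work.
        while True:
            item = self.inbox.get()
            if item is _STOP:
                self.outbox.put(_STOP)  # propagate shutdown downstream
                return
            self.outbox.put(self.fn(item))


def build_pipeline(stages):
    """Wire a linear topology: each stage's outbox is the next stage's inbox."""
    sink = queue.Queue()
    for current, nxt in zip(stages, stages[1:]):
        current.outbox = nxt.inbox
    stages[-1].outbox = sink
    threads = [threading.Thread(target=s.run, daemon=True) for s in stages]
    for t in threads:
        t.start()
    return stages[0].inbox, sink, threads


def run_demo(prompts):
    """Push prompts through three stand-in stages and collect the results."""
    stages = [
        Stage("thinker", lambda text: text.split()),               # stand-in for the AR model
        Stage("talker", lambda tokens: [len(t) for t in tokens]),  # stand-in for speech tokens
        Stage("codec", lambda frames: sum(frames)),                # stand-in for waveform decode
    ]
    source, sink, threads = build_pipeline(stages)
    for prompt in prompts:
        source.put(prompt)
    source.put(_STOP)
    results = []
    while True:
        item = sink.get()
        if item is _STOP:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```

Calling `run_demo(["hello omni world"])` pushes one prompt through all three stages. The point of the sketch is the wiring: adding a model amounts to declaring a list of stages, and each stage loop only ever touches its own inbox and outbox.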

About

Core features:

- Multi-Stage Pipeline: Flexible framework for orchestrating preprocessing, AR engine, codec, and vocoder stages across processes and GPUs.
- Native SGLang Integration: Leverages SGLang’s RadixAttention, continuous batching, and CUDA Graph optimizations for the AR backbone.
- OpenAI-Compatible Server: Drop-in `/v1/audio/speech`, `/v1/audio/transcriptions`, and `/v1/chat/completions` endpoints with real-time streaming support.
- Broad Model Support: Covers a growing set of TTS, ASR, and omni models, including Higgs Audio, Fish Audio S2-Pro, Voxtral TTS, Qwen3 TTS, MOSS-TTS, Ming-Omni-TTS, Qwen3-ASR, Whisper ASR, Qwen3-Omni, Ming-Omni, and LLaDA2.0-Uni.
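As a rough illustration of the OpenAI-compatible surface, the sketch below builds a request for `/v1/audio/speech` using only the Python standard library. Several details are assumptions rather than documented behavior: the base URL `http://localhost:8000`, the helper names `build_speech_request` and `synthesize`, the `.wav` output extension, and the use of only the `model` and `input` fields (borrowed from the OpenAI speech API). Check the server's own documentation for the supported parameters.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: wherever the server is listening


def build_speech_request(text, model="fishaudio/s2-pro", base_url=BASE_URL):
    """Build (but do not send) a POST request for the speech endpoint."""
    # Field names follow the OpenAI speech API; extra fields such as a
    # voice or response format may be required by a given model.
    payload = {"model": model, "input": text}
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="speech.wav", **kwargs):
    """Send the request and write the returned audio bytes to disk."""
    req = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
    return out_path
```

With a server running, `synthesize("Hello from SGLang-Omni")` would write the response body to `speech.wav`. Because the endpoints are described as drop-in, an existing OpenAI client pointed at the server's base URL should be an alternative to hand-built requests.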

Supported Models

| Model | Type | Notes |
| --- | --- | --- |
| boson-sglang/higgs-audio-v3-tts-4b-base | TTS | Voice cloning, streaming, 100+ languages |
| fishaudio/s2-pro | TTS | Voice cloning, streaming |
| mistralai/Voxtral-4B-TTS-2603 | TTS | Named voices, streaming, 9 languages |
| Qwen/Qwen3-TTS-12Hz-Base | TTS | Voice cloning, streaming, 10 languages, 0.6B / 1.7B |
| OpenMOSS-Team/MOSS-TTS-v1.5 | TTS | Voice cloning, streaming, 31 languages |
| inclusionAI/Ming-omni-tts-16.8B-A3B | TTS | Text-to-speech and zero-shot voice cloning |
| Qwen/Qwen3-ASR-1.7B | ASR | Audio transcription through `/v1/audio/transcriptions` |
| openai/whisper-large-v3 | ASR | Experimental Whisper transcription route; response schema is served, correctness is not yet validated |
| Qwen/Qwen3-Omni-30B-A3B-Instruct | Omni | Text, image, audio, video → text + audio |
| inclusionAI/Ming-flash-omni-2.0 | Omni | Streaming TTS |
| inclusionAI/LLaDA2.0-Uni | Multimodal | Text + image understanding and generation |
