SGLang-Omni
SGLang-Omni is a high-performance serving framework for omni and multimodal models, built on top of SGLang. It orchestrates multi-stage pipelines with low latency and exposes them through OpenAI-compatible APIs.
Modern omni models, such as speech-output LLMs and multimodal generation systems, decompose into heterogeneous stages with fundamentally different computational profiles: a compute-bound thinker, a memory-bound talker, a latency-sensitive codec. SGLang-Omni is built around a computation-centric design: each stage runs its own independent scheduler tuned to its bottleneck, communicates through a shared inbox/outbox abstraction, and transfers tensors via zero-copy shared memory. This prevents any single stage from degrading the others and allows new models to plug into the framework by declaring a pipeline topology rather than building an inference system from scratch.
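The stage-per-scheduler idea can be illustrated with a small, self-contained sketch. This is not SGLang-Omni's API: the `Stage` class, `build_pipeline`, `run_pipeline`, and the toy thinker/talker/codec functions are hypothetical names invented for illustration, and plain Python queues stand in for the framework's inbox/outbox abstraction (zero-copy shared memory and per-stage batching are not modeled). It only shows how a topology declared as an ordered list of stages can be wired so that each stage runs its own loop.

```python
import queue
import threading
from dataclasses import dataclass, field
from typing import Callable, List, Optional

_STOP = object()  # sentinel that tells a stage to shut down


@dataclass
class Stage:
    """Hypothetical pipeline stage with its own inbox and worker loop."""
    name: str
    work: Callable[[object], object]
    inbox: queue.Queue = field(default_factory=queue.Queue)
    outbox: Optional[queue.Queue] = None  # assigned when the topology is wired

    def run(self) -> None:
        # Each stage drains its own inbox independently of the others.
        while True:
            item = self.inbox.get()
            if item is _STOP:
                if self.outbox is not None:
                    self.outbox.put(_STOP)  # propagate shutdown downstream
                return
            result = self.work(item)
            if self.outbox is not None:
                self.outbox.put(result)


def build_pipeline(stages: List[Stage]) -> queue.Queue:
    """Wire each stage's outbox to the next stage's inbox; return the sink."""
    sink: queue.Queue = queue.Queue()
    for upstream, downstream in zip(stages, stages[1:]):
        upstream.outbox = downstream.inbox
    stages[-1].outbox = sink
    return sink


def run_pipeline(stages: List[Stage], requests: List[object]) -> List[object]:
    """Run every stage in its own thread and collect results from the sink."""
    sink = build_pipeline(stages)
    threads = [threading.Thread(target=s.run) for s in stages]
    for t in threads:
        t.start()
    for request in requests:
        stages[0].inbox.put(request)
    stages[0].inbox.put(_STOP)
    results = []
    while True:
        item = sink.get()
        if item is _STOP:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


def make_demo_stages() -> List[Stage]:
    """Toy stand-ins for thinker, talker, and codec stages."""
    return [
        Stage("thinker", lambda text: text.split()),              # text -> tokens
        Stage("talker", lambda tokens: [len(t) for t in tokens]),  # tokens -> codes
        Stage("codec", lambda codes: sum(codes)),                  # codes -> "audio"
    ]
```

Calling `run_pipeline(make_demo_stages(), ["hello world"])` pushes one request through all three stages. Because each stage only touches its own inbox and outbox, a slow stage delays its downstream consumers but never blocks an upstream stage from accepting new work.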
About
Core features:
Multi-Stage Pipeline: Flexible framework for orchestrating preprocessing, AR engine, codec, and vocoder stages across processes and GPUs.
Native SGLang Integration: Leverages SGLang's RadixAttention, continuous batching, and CUDA Graph optimizations for the AR backbone.
OpenAI-Compatible Server: Drop-in /v1/audio/speech and /v1/chat/completions endpoints with real-time streaming support.
Broad Model Support: Supports a growing set of TTS and omni models including Higgs Audio, Fish Audio S2-Pro, Voxtral TTS, Qwen3 TTS, Qwen3-Omni, Ming-Omni, and LLaDA2.0-Uni.
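As a rough illustration of the OpenAI-compatible speech endpoint listed above, the sketch below builds a POST request to /v1/audio/speech using only the Python standard library. The `model`, `input`, and `voice` fields follow the OpenAI speech API; whether a given SGLang-Omni model accepts or requires each of them is an assumption to check against the Basic Usage pages. The helper names, host, port, and model name are placeholders, not values taken from this documentation.

```python
import json
import urllib.request
from typing import Optional


def build_speech_request(
    base_url: str,
    model: str,
    text: str,
    voice: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/audio/speech."""
    payload = {"model": model, "input": text}
    if voice is not None:
        payload["voice"] = voice  # OpenAI-style field; support is assumed
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(request: urllib.request.Request) -> bytes:
    """Send the request to a running server and return the raw audio bytes."""
    with urllib.request.urlopen(request) as response:
        return response.read()


# Example (requires a running server; host, port, and model are placeholders):
# req = build_speech_request("http://localhost:8000", "my-tts-model", "Hello!")
# audio_bytes = synthesize(req)
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any network call is made.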
Supported Models
| Model | Type | Notes |
|---|---|---|
| Higgs Audio | TTS | Voice cloning, streaming, 100+ languages |
| Fish Audio S2-Pro | TTS | Voice cloning, streaming |
| Voxtral TTS | TTS | Named voices, streaming, 9 languages |
| Qwen3 TTS | TTS | Voice cloning, streaming, 10 languages, 0.6B / 1.7B |
| Qwen3-Omni | Omni | Text, image, audio, video → text + audio |
| Ming-Omni | Omni | Streaming TTS |
| LLaDA2.0-Uni | Multimodal | Text + image understanding and generation |
Get Started
Basic Usage
Benchmarks
Developer Reference