SGLang-Omni

SGLang-Omni#

SGLang-Omni is a high-performance serving framework for omni and multimodal models, built on top of SGLang. It is designed to orchestrate multi-stage pipelines with low latency and OpenAI-compatible APIs.

Modern omni models β€” such as speech-output LLMs and multimodal generation systems β€” decompose into heterogeneous stages with fundamentally different computational profiles: a compute-bound thinker, a memory-bound talker, a latency-sensitive codec. SGLang-Omni is built around a computation-centric design: each stage runs its own independent scheduler tuned to its bottleneck, communicates through a shared inbox/outbox abstraction, and transfers tensors via zero-copy shared memory. This prevents any single stage from degrading the others and allows new models to plug into the framework by declaring a pipeline topology rather than building an inference system from scratch.

About#

Core features:

  • Multi-Stage Pipeline: Flexible framework for orchestrating preprocessing, AR engine, codec, and vocoder stages across processes and GPUs.

  • Native SGLang Integration: Leverages SGLang’s RadixAttention, continuous batching, and CUDA Graph optimizations for the AR backbone.

  • OpenAI-Compatible Server: Drop-in /v1/audio/speech and /v1/chat/completions endpoints with real-time streaming support.

  • Broad Model Support: Supports a growing set of TTS and omni models including Higgs Audio, Fish Audio S2-Pro, Voxtral TTS, Qwen3 TTS, Qwen3-Omni, Ming-Omni, and LLaDA2.0-Uni.

Supported Models#

Model

Type

Notes

boson-sglang/higgs-audio-v3-tts-4b-base

TTS

Voice cloning, streaming, 100+ languages

fishaudio/s2-pro

TTS

Voice cloning, streaming

mistralai/Voxtral-4B-TTS-2603

TTS

Named voices, streaming, 9 languages

Qwen/Qwen3-TTS-12Hz-Base

TTS

Voice cloning, streaming, 10 languages, 0.6B / 1.7B

Qwen/Qwen3-Omni-30B-A3B-Instruct

Omni

Text, image, audio, video β†’ text + audio

inclusionAI/Ming-flash-omni-2.0

Omni

Streaming TTS

inclusionAI/LLaDA2.0-Uni

Multimodal

Text + image understanding and generation

Get Started

Benchmarks