API Server Quickstart
This page is the shortest path from “I have the repo” to “the API server responds to a request”.
sglang-omni exposes an OpenAI-compatible API server on top of its multi-stage pipeline runtime. That server is the main HTTP entry point for:
chat completions
streaming responses
model listing
health checks
text-to-speech
If you want the internal design rather than the usage flow, see API Server Design.
What This Server Is
The API server is an adapter between HTTP clients and the internal pipeline runtime:
HTTP request → FastAPI app → Client → Coordinator → Pipeline stages
In other words, it does not run model logic directly. Its job is to translate OpenAI-style requests into internal requests and format the results back into HTTP responses.
Start the Server
The installed CLI entrypoint is sgl-omni.
The simplest way to start the server is to provide a model path and let sglang-omni build the pipeline config for you:
sgl-omni serve \
  --model-path Qwen/Qwen3-Omni-30B-A3B-Instruct \
  --host 0.0.0.0 \
  --port 8000
The most useful flags are:
--model-path: Hugging Face model ID or local model directory
--host: bind address, default 0.0.0.0
--port: bind port, default 8000
--model-name: override the model name returned by /v1/models
--log-level: logging level for the server process
If you already have a pipeline config file, you can also pass --config path/to/config.yaml. When the config file contains model_path, --model-path is optional and can be used as an override.
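If you start the server from a script rather than a shell, the flags above can be assembled programmatically. The following is a minimal sketch, not part of sglang-omni: the helper names build_serve_command and launch are hypothetical, and only the flags described in this section are used.

```python
import subprocess
from typing import List, Optional


def build_serve_command(
    model_path: Optional[str] = None,
    config: Optional[str] = None,
    host: str = "0.0.0.0",
    port: int = 8000,
    model_name: Optional[str] = None,
) -> List[str]:
    """Assemble the argv for `sgl-omni serve` (hypothetical helper)."""
    # At least one of the two sources of a model is needed. Note that a
    # config file without model_path still requires --model-path; this
    # helper cannot check that without reading the file.
    if model_path is None and config is None:
        raise ValueError("provide model_path, config, or both")
    cmd = ["sgl-omni", "serve"]
    if config is not None:
        cmd += ["--config", config]
    if model_path is not None:
        cmd += ["--model-path", model_path]
    cmd += ["--host", host, "--port", str(port)]
    if model_name is not None:
        cmd += ["--model-name", model_name]
    return cmd


def launch(cmd: List[str]) -> subprocess.Popen:
    """Start the server as a child process; the caller must stop it."""
    return subprocess.Popen(cmd)
```

Passing the command as a list avoids shell quoting issues with model paths that contain spaces.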
Check That It Works
Health check
curl -s http://localhost:8000/health
Example response:
{
  "status": "healthy",
  "running": true
}
The server returns:
200 when the runtime is healthy
503 when the HTTP server is up but the underlying runtime reports unhealthy status
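Because the HTTP server can be up before the runtime is healthy, a client may want to poll /health until it returns 200. A minimal sketch using only the Python standard library follows; the function names, timeout, and polling interval are illustrative assumptions, not part of sglang-omni.

```python
import json
import time
import urllib.request


def is_healthy(status_code: int, body: bytes) -> bool:
    """True only for a 200 response whose JSON body reports "healthy"."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        # Covers both invalid JSON and undecodable bytes.
        return False
    return isinstance(payload, dict) and payload.get("status") == "healthy"


def wait_until_healthy(
    base_url: str = "http://localhost:8000",
    timeout_s: float = 120.0,
    interval_s: float = 2.0,
) -> bool:
    """Poll /health until it reports healthy or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base_url + "/health", timeout=5) as resp:
                if is_healthy(resp.status, resp.read()):
                    return True
        except OSError:
            # urllib raises HTTPError for 503 and URLError while the port is
            # still closed; both are OSError subclasses, so keep polling.
            pass
        time.sleep(interval_s)
    return False
```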
List the served model
curl -s http://localhost:8000/v1/models
This endpoint returns a single-model list. The model ID comes from --model-name if you set it, otherwise from the pipeline name.
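The model ID returned here is the value to send in the model field of later requests. The sketch below reads it from the response, assuming the body follows the OpenAI list shape ({"object": "list", "data": [{"id": ...}]}); that shape is an assumption and is not confirmed by this page, so check it against your own server's output.

```python
import json
import urllib.request
from typing import Any, Dict


def fetch_models(base_url: str = "http://localhost:8000") -> Dict[str, Any]:
    """GET /v1/models and decode the JSON body."""
    with urllib.request.urlopen(base_url + "/v1/models", timeout=10) as resp:
        return json.load(resp)


def served_model_id(models_response: Dict[str, Any]) -> str:
    """Return the ID of the first (and, per this page, only) model.

    Assumes an OpenAI-style body with a "data" list of objects carrying "id".
    """
    data = models_response.get("data") or []
    if not data:
        raise ValueError("no models in response")
    return data[0]["id"]
```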
Send a Minimal Chat Request
The core endpoint is POST /v1/chat/completions.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-omni",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "max_tokens": 128,
    "stream": false
  }'
The response follows the OpenAI chat completion shape. In the common case, the text is in choices[0].message.content.
Besides model and messages, the most useful request fields are:
temperature
top_p
max_tokens
stop
seed
stream
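The same non-streaming request can be sent from Python with the standard library. This is a sketch under stated assumptions: the helper names are hypothetical, the payload builder only accepts the optional fields listed above, and the text is read from choices[0].message.content as described for the common case.

```python
import json
import urllib.request
from typing import Any, Dict

# Optional request fields named in this section, besides model and messages.
OPTIONAL_FIELDS = {"temperature", "top_p", "max_tokens", "stop", "seed", "stream"}


def build_chat_payload(model: str, prompt: str, **options: Any) -> Dict[str, Any]:
    """Build a single-turn chat completion request body."""
    unknown = set(options) - OPTIONAL_FIELDS
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    payload: Dict[str, Any] = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    payload.update(options)
    return payload


def post_chat(
    payload: Dict[str, Any],
    base_url: str = "http://localhost:8000",
    timeout_s: float = 120.0,
) -> Dict[str, Any]:
    """POST a non-streaming request to /v1/chat/completions."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return json.load(resp)


def first_message_text(response: Dict[str, Any]) -> str:
    """Extract the generated text from the first choice."""
    return response["choices"][0]["message"]["content"]
```

A typical call would be first_message_text(post_chat(build_chat_payload("qwen3-omni", "Hello!", max_tokens=128))), with the model name replaced by whatever /v1/models reports.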
Streaming and Multi-modal Requests
Streaming
Set stream to true to receive Server-Sent Events (SSE):
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-omni",
    "messages": [
      {"role": "user", "content": "Write a short greeting."}
    ],
    "stream": true
  }'
A few details matter here:
the response type is text/event-stream
the first chunk may contain only role="assistant"
the stream ends with data: [DONE]
usage is attached to the final completion chunk
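The details above translate into a small client-side parser. The sketch below works on already-decoded text lines and assumes each chunk follows the OpenAI streaming shape, with text under choices[i].delta.content; that field layout is an assumption based on the OpenAI-compatible claim, and the function names are hypothetical.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Optional, Tuple


def iter_sse_chunks(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield decoded JSON chunks from SSE lines until data: [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return  # end-of-stream sentinel
        yield json.loads(data)


def collect_text(
    chunks: Iterable[Dict[str, Any]],
) -> Tuple[str, Optional[Dict[str, Any]]]:
    """Concatenate delta text and keep the last usage object seen."""
    parts = []
    usage = None
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            # The first chunk may carry only the role, so content can be absent.
            content = delta.get("content")
            if content:
                parts.append(content)
        if chunk.get("usage"):
            usage = chunk["usage"]
    return "".join(parts), usage
```

To use it against a live server, open the streaming request with any HTTP client that exposes the body line by line, decode each line as UTF-8, and pass the resulting iterator to iter_sse_chunks.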
Multi-modal input and output
sglang-omni extends the standard OpenAI chat schema with a few extra fields:
images
audios
videos
modalities
audio
stage_sampling
stage_params
For example, a video request with text output looks like this:
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-omni",
    "messages": [
      {"role": "user", "content": "What is happening in this video?"}
    ],
    "videos": ["/absolute/path/to/demo.mp4"],
    "modalities": ["text"],
    "max_tokens": 128,
    "stream": false
  }'
The videos, images, and audios fields accept either local file paths or HTTP(S) URLs.
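A request body with media attachments can be built the same way as a plain chat request. The sketch below is illustrative only: the helper names are hypothetical, and the classification of a reference as URL or path is a client-side convenience, not something the server requires.

```python
from typing import Any, Dict, Optional, Sequence
from urllib.parse import urlparse


def media_kind(ref: str) -> str:
    """Classify a media reference as "url" (HTTP/HTTPS) or "path"."""
    scheme = urlparse(ref).scheme.lower()
    return "url" if scheme in ("http", "https") else "path"


def build_multimodal_payload(
    model: str,
    prompt: str,
    images: Optional[Sequence[str]] = None,
    audios: Optional[Sequence[str]] = None,
    videos: Optional[Sequence[str]] = None,
    modalities: Sequence[str] = ("text",),
    **options: Any,
) -> Dict[str, Any]:
    """Build a chat request that uses the images/audios/videos extensions."""
    payload: Dict[str, Any] = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "modalities": list(modalities),
    }
    # Only include the media fields that were actually provided.
    for field, refs in (("images", images), ("audios", audios), ("videos", videos)):
        if refs:
            payload[field] = list(refs)
    payload.update(options)  # e.g. max_tokens, stream
    return payload
```

The resulting dictionary can be serialized with json.dumps and posted to /v1/chat/completions like any other chat request.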
Text-to-Speech
The server also exposes POST /v1/audio/speech.
curl -s http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-omni",
    "input": "Hello from SGLang-Omni.",
    "voice": "default",
    "response_format": "wav"
  }' \
  -o speech.wav
Two things to remember:
the response body is audio bytes, not JSON
the actual output format may differ from the requested one if the encoder falls back to another supported codec
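Since the returned format may not match the requested one, a client can inspect the leading bytes before choosing a file extension. The sketch below checks a few well-known container signatures; the set of formats covered is an assumption about what an encoder might fall back to, and the helper names are hypothetical.

```python
def sniff_audio_format(data: bytes) -> str:
    """Guess the audio container from its magic bytes.

    Heuristic only: raw MP3 frames without an ID3 tag, and any format not
    listed here, are reported as "unknown".
    """
    if len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    if data[:4] == b"fLaC":
        return "flac"
    if data[:4] == b"OggS":
        return "ogg"
    if data[:3] == b"ID3":
        return "mp3"
    return "unknown"


def output_filename(stem: str, data: bytes, requested: str = "wav") -> str:
    """Pick an extension from the detected format, else the requested one."""
    detected = sniff_audio_format(data)
    extension = requested if detected == "unknown" else detected
    return f"{stem}.{extension}"
```

For example, output_filename("speech", body) yields speech.flac rather than speech.wav when the body starts with the FLAC signature.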
Common Errors
When requests fail, the server returns standard HTTP error codes:
400 Bad Request: malformed request body or invalid parameters
500 Internal Server Error: runtime error during generation (check server logs for details)
503 Service Unavailable: the runtime is not healthy (verify with /health)
For a 500 error, the server logs contain the full traceback. Common causes include:
unsupported media formats
out-of-memory errors
missing model files
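On the client side, the three status codes above call for different reactions. The following sketch maps each code to a suggested next step; the function name and the wording of the advice are illustrative assumptions, drawn only from the descriptions in this section.

```python
def next_step_for_status(status_code: int) -> str:
    """Map an HTTP status from the API server to a suggested next step."""
    if 200 <= status_code < 300:
        return "ok"
    if status_code == 400:
        # Malformed body or invalid parameters: resending unchanged will not help.
        return "fix the request body or parameters"
    if status_code == 500:
        # Runtime error during generation: the traceback is in the server logs.
        return "check the server logs"
    if status_code == 503:
        # HTTP server is up, runtime is not healthy.
        return "verify runtime health with /health"
    return "unexpected status"
```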