API Server Quickstart

This page is the shortest path from “I have the repo” to “the API server responds to a request”.

sglang-omni exposes an OpenAI-compatible API server on top of its multi-stage pipeline runtime. That server is the main HTTP entry point for:

chat completions
streaming responses
model listing
health checks
text-to-speech

If you want the internal design rather than the usage flow, see API Server Design.

What This Server Is

The API server is an adapter between HTTP clients and the internal pipeline runtime:

HTTP request → FastAPI app → Client → Coordinator → Pipeline stages

In other words, it does not run model logic directly. Its job is to translate OpenAI-style requests into internal requests and format the results back into HTTP responses.
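
To make the adapter role concrete, here is a toy sketch of the two translations described above. The names `InternalRequest`, `to_internal`, and `to_chat_response` are hypothetical and are not sglang-omni's actual classes or functions; the real interfaces are covered in API Server Design.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class InternalRequest:
    """Hypothetical internal request type, for illustration only."""

    prompt_messages: list[dict[str, Any]]
    sampling: dict[str, Any] = field(default_factory=dict)


def to_internal(body: dict[str, Any]) -> InternalRequest:
    """Translate an OpenAI-style chat body into the toy internal request."""
    sampling_keys = ("temperature", "top_p", "max_tokens", "stop", "seed")
    # Copy only the sampling fields that the caller actually supplied.
    sampling = {key: body[key] for key in sampling_keys if key in body}
    return InternalRequest(prompt_messages=list(body["messages"]), sampling=sampling)


def to_chat_response(model: str, text: str) -> dict[str, Any]:
    """Format generated text into the OpenAI chat completion shape."""
    return {
        "object": "chat.completion",
        "model": model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": text},
                "finish_reason": "stop",
            }
        ],
    }
```

The point of the sketch is only the direction of data flow: nothing in either function touches a model, which mirrors the server's role as a translation layer.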

Start the Server

The installed CLI entrypoint is sgl-omni.

The simplest way to start the server is to provide a model path and let sglang-omni build the pipeline config for you:

sgl-omni serve \
  --model-path Qwen/Qwen3-Omni-30B-A3B-Instruct \
  --host 0.0.0.0 \
  --port 8000

The most useful flags are:

--model-path: Hugging Face model ID or local model directory
--host: bind address, default 0.0.0.0
--port: bind port, default 8000
--model-name: override the model name returned by /v1/models
--log-level: logging level for the server process

If you already have a pipeline config file, you can also pass --config path/to/config.yaml. When the config file contains model_path, --model-path is optional and can be used as an override.
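
If you launch the server from a Python script (for example in a test harness), the flags above can be assembled programmatically. This is a minimal sketch: `build_serve_command` is a hypothetical helper, not part of sglang-omni, and it only uses the flags listed on this page.

```python
from typing import Optional


def build_serve_command(
    model_path: Optional[str] = None,
    config: Optional[str] = None,
    host: str = "0.0.0.0",
    port: int = 8000,
    model_name: Optional[str] = None,
    log_level: Optional[str] = None,
) -> list[str]:
    """Build an argv list for `sgl-omni serve` from the documented flags."""
    if model_path is None and config is None:
        raise ValueError("provide model_path, config, or both")
    cmd = ["sgl-omni", "serve"]
    if config is not None:
        cmd += ["--config", config]
    if model_path is not None:
        # With --config, this acts as an override of the config's model_path.
        cmd += ["--model-path", model_path]
    cmd += ["--host", host, "--port", str(port)]
    if model_name is not None:
        cmd += ["--model-name", model_name]
    if log_level is not None:
        cmd += ["--log-level", log_level]
    return cmd
```

The returned list can be passed to `subprocess.Popen` to run the server in the background. The helper does not check that the config file actually contains `model_path`; that check is left to the CLI.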

Check That It Works

Health check

curl -s http://localhost:8000/health

Example response:

{
  "status": "healthy",
  "running": true
}

The server returns:

200 when the runtime is healthy
503 when the HTTP server is up but the underlying runtime reports unhealthy status
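
Loading a model can take a while, so scripts often need to wait for the 200 before sending requests. The sketch below polls `/health` with the standard library only. The helper names, the timeout values, and the "unreachable" state (for the period before the HTTP server is listening) are assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request


def interpret_health(status_code: int) -> str:
    """Map the documented /health status codes to a short label."""
    if status_code == 200:
        return "healthy"
    if status_code == 503:
        return "unhealthy"
    return "unexpected"


def wait_until_healthy(
    base_url: str = "http://localhost:8000",
    timeout_s: float = 300.0,
    interval_s: float = 2.0,
) -> bool:
    """Poll /health until it returns 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                state = interpret_health(resp.status)
        except urllib.error.HTTPError as err:
            # urllib raises for non-2xx codes, including the 503 case.
            state = interpret_health(err.code)
        except OSError:
            # Connection refused or timed out: the server is not listening yet.
            state = "unreachable"
        if state == "healthy":
            return True
        time.sleep(interval_s)
    return False
```

Calling `wait_until_healthy()` right after starting the server returns `True` once the runtime reports healthy and `False` if it never does within the timeout.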

List the served model

curl -s http://localhost:8000/v1/models

This endpoint returns a single-model list. The model ID comes from --model-name if you set it, otherwise from the pipeline name.
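
A client can read the model ID from this endpoint instead of hard-coding it. The sketch below assumes the response uses the OpenAI list shape (`{"object": "list", "data": [{"id": ...}]}`); this page does not show the exact payload, so treat that shape and the helper names as assumptions.

```python
import json
import urllib.request
from typing import Any


def served_model_id(models_payload: dict[str, Any]) -> str:
    """Return the ID of the single served model from a /v1/models payload."""
    data = models_payload.get("data", [])
    if len(data) != 1:
        raise ValueError(f"expected exactly one model, got {len(data)}")
    return data[0]["id"]


def fetch_served_model_id(base_url: str = "http://localhost:8000") -> str:
    """Query /v1/models on a running server and return the served model ID."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        return served_model_id(json.load(resp))
```

The value returned by `fetch_served_model_id()` can then be used as the `model` field in later requests.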

Send a Minimal Chat Request

The core endpoint is POST /v1/chat/completions.

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-omni",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "max_tokens": 128,
    "stream": false
  }'

The response follows the OpenAI chat completion shape. In the common case, the text is in choices[0].message.content.
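
The same request can be sent from Python without extra dependencies. This is a minimal sketch using `urllib`; the function names are hypothetical, and `extract_text` covers only the common case where the text is in `choices[0].message.content`.

```python
import json
import urllib.request
from typing import Any


def build_chat_request(
    base_url: str, model: str, user_text: str, max_tokens: int = 128
) -> urllib.request.Request:
    """Build a non-streaming POST request for /v1/chat/completions."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response: dict[str, Any]) -> str:
    """Pull the generated text out of a chat completion response."""
    return response["choices"][0]["message"]["content"]


def chat(base_url: str, model: str, user_text: str) -> str:
    """Send one user message to a running server and return the reply text."""
    request = build_chat_request(base_url, model, user_text)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return extract_text(json.load(resp))
```

With a server running locally, `chat("http://localhost:8000", "qwen3-omni", "Hello!")` sends the same request as the curl command above.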

Besides model and messages, the most useful request fields are:

temperature
top_p
max_tokens
stop
seed
stream
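
Setting `stream` to true changes how the response has to be read. The sketch below assumes the server follows the OpenAI streaming convention: server-sent event lines of the form `data: {json}`, text fragments in `choices[0].delta.content`, and a final `data: [DONE]` line. This page does not spell out the wire format, so treat those details and the helper name as assumptions.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            # Skip blank separators and any non-data lines.
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        piece = choices[0].get("delta", {}).get("content")
        if piece:
            yield piece
```

The response object returned by `urllib.request.urlopen` can be iterated line by line as bytes, so it can be passed directly as `lines`; joining the yielded fragments gives the full reply.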

Text-to-Speech

The server also exposes POST /v1/audio/speech.

curl -s http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-omni",
    "input": "Hello from SGLang-Omni.",
    "voice": "default",
    "response_format": "wav"
  }' \
  -o speech.wav

Two things to remember:

the response body is audio bytes, not JSON
the actual output format may differ from the requested one if the encoder falls back to another supported codec
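
Because the body is raw audio and the codec may not match `response_format`, one option is to inspect the first bytes before choosing a file extension. The sketch below checks well-known container signatures (RIFF/WAVE, fLaC, OggS, ID3 or an MPEG frame sync). The helper names are hypothetical, and which codecs the server can actually fall back to is not stated on this page.

```python
def sniff_audio_format(data: bytes) -> str:
    """Guess the audio container from its leading magic bytes."""
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    if data[:4] == b"fLaC":
        return "flac"
    if data[:4] == b"OggS":
        return "ogg"
    if data[:3] == b"ID3":
        return "mp3"
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        # Heuristic: MPEG audio frame sync (ADTS AAC streams also match this).
        return "mp3"
    return "unknown"


def output_filename(data: bytes, stem: str, requested: str = "wav") -> str:
    """Pick a filename whose extension matches the detected format."""
    actual = sniff_audio_format(data)
    # Fall back to the requested extension when the bytes are not recognized.
    extension = actual if actual != "unknown" else requested
    return f"{stem}.{extension}"
```

A result of "unknown" on a supposedly successful response is also a hint that the body may be a JSON error message rather than audio.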

Common Errors

When requests fail, the server returns standard HTTP error codes:

400 Bad Request: malformed request body or invalid parameters
500 Internal Server Error: runtime error during generation (check server logs for details)
503 Service Unavailable: the runtime is not healthy (verify with /health)

A 500 response carries little detail on its own, so start from the full traceback in the server logs. Common causes include:

unsupported media formats
out-of-memory errors
missing model files
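
On the client side, the three status codes call for different reactions. The sketch below is one possible policy, not something the server prescribes: the action labels, the choice to retry only on 503, and the backoff parameters are all assumptions.

```python
def classify_failure(status: int) -> str:
    """Map a documented error status to a suggested client action."""
    if status == 400:
        return "fix_request"        # malformed body or invalid parameters
    if status == 500:
        return "check_server_logs"  # runtime error during generation
    if status == 503:
        return "wait_and_retry"     # runtime not healthy; poll /health first
    return "unexpected"


def backoff_delays(retries: int, base_s: float = 1.0, cap_s: float = 30.0) -> list[float]:
    """Return exponentially growing retry delays, capped at cap_s seconds."""
    return [min(cap_s, base_s * 2 ** attempt) for attempt in range(retries)]
```

Retrying a 400 or 500 with the same request is unlikely to help, which is why only the 503 case maps to a retry here.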
