API Server Design#
This page explains the API server at the level that is most useful for maintenance: where it sits in the system, which files matter, and how requests are mapped into the runtime.
If you only want to launch the server and call it, start with API Server Quickstart.
Role in the System#
The API server is the outer protocol layer on top of the sglang-omni pipeline runtime.
At a high level, the built-in server startup path is:
CLI / Python entrypoint → PipelineConfig → Pipeline Startup → Coordinator → Client → FastAPI
After startup, the request path is:
HTTP request → FastAPI route → Client → Coordinator → Stage pipeline → Client aggregation → HTTP/SSE response
That split keeps responsibilities clean:
the pipeline runtime handles orchestration and execution
the
Clientlayer submits requests and assembles resultsthe API server translates between HTTP/OpenAI-style payloads and those internal abstractions
Key Files#
For the current server implementation, these are the files that matter most.
File |
Role |
|---|---|
|
Defines the FastAPI app, routes, request conversion, and response formatting |
|
Defines request and response schemas |
|
Compiles the pipeline, starts the runtime, mounts the app, and runs Uvicorn |
|
Submits requests to the coordinator and aggregates text, audio, and stream results |
|
Defines the current CLI surface for |
If you are tracing endpoint behavior, openai_api.py and client.py are usually the best places to start.
create_app() vs launch_server()#
This is the most important distinction in the serving code.
create_app(client, model_name=...)#
create_app() only builds the FastAPI app and registers the core routes.
It does not:
compile the pipeline
start the runtime
create the coordinator
mount profiling routes
run Uvicorn
Use it when you already have a live Client and want to embed the HTTP layer yourself.
launch_server(pipeline_config, ...)#
launch_server() is the full built-in server lifecycle.
It:
compiles the pipeline config
starts the pipeline runtime
creates the
Clientcreates the FastAPI app
mounts profiling routes on the single-process path
runs Uvicorn
stops the runtime on shutdown
Use it when you want the standard out-of-the-box server path.
Route Surface#
The current server exposes these main routes:
Method |
Path |
Notes |
|---|---|---|
|
|
Health status from |
|
|
Single-model listing for the active pipeline |
|
|
Chat completions, including streaming and optional audio |
|
|
Text-to-speech, raw audio response or SSE chunks when |
|
|
Torch trace + (optional) request-level events. Added by the built-in launcher |
|
|
Stops both torch trace and request-level events |
|
|
Request-level event recorder only (no torch trace) |
|
|
Stops the request-level event recorder |
The profiling routes are mounted by the single-process launch_server() path. The
current multi-process launcher path does not mount them.
/start_profile accepts:
{
"run_id": "demo-run",
"trace_path_template": "/tmp/profiles/demo-run/trace", // torch trace template
"event_dir": "/tmp/profiles/demo-run/events", // request-event JSONL dir (optional)
"enable_torch": true // set false to skip torch trace
}
/stop_profile and /stop_request_profile both accept an optional
run_id. Omitting it is a wildcard: every stage stops whatever profiler
session is currently active.
Request-level events are emitted as JSON lines under
<event_dir>/events_<stage>_<pid>.jsonl. Use python -m sglang_omni.profiler <event_dir> to derive the timeline / stage / hop reports described in
docs/developer_reference/profiler.md.
Request Mapping#
The server does not pass OpenAI-style request bodies straight into the runtime. It first converts them into internal request objects.
Chat requests#
ChatCompletionRequest includes standard OpenAI-style fields such as:
modelmessagestemperaturetop_pmax_tokensstopseedstream
It also includes sglang-omni extensions such as:
imagesaudiosvideosvideo processing overrides such as
video_fpsand frame/pixel limitsmodalitiesaudiostage_samplingstage_paramstalker-specific generation overrides
request_id
Conversion into GenerateRequest#
_build_chat_generate_request() in openai_api.py is the key translation point. It:
normalizes stop sequences
builds
SamplingParamsconverts chat messages into internal
Messageobjectsmaps per-stage sampling overrides
passes per-stage runtime params through
stage_paramsstores media input, audio config, and video processing overrides in request metadata
stores talker-specific generation overrides in
extra_paramscopies
modalitiesintooutput_modalities
The route hands that GenerateRequest to Client. The client then converts it
to an OmniRequest before submitting it to the coordinator.
Response Paths#
Non-streaming chat#
For non-streaming chat, the path is roughly:
chat request → Client.completion() → OpenAI-style JSON response
Client.completion() aggregates:
text fragments
audio chunks
final usage
final finish reason
If audio is present, it is base64-encoded before being returned to the API layer.
Streaming chat#
For streaming chat, the server emits SSE events.
The important current semantics are:
the first chunk may contain only
role="assistant"text and audio are emitted as separate deltas
the final completion chunk includes the
finish_reasonthe stream ends with
data: [DONE]usageis attached to the final completion chunk
Speech / TTS#
The speech route reuses the same internal request path rather than introducing a separate serving stack.
CreateSpeechRequest is converted into a GenerateRequest with:
output_modalities=["audio"]task="tts"in metadataTTS-specific parameters stored under
tts_params
For non-streaming requests, Client.speech() collects audio chunks, encodes
them, and returns raw audio bytes to the HTTP layer.
For stream=true, the route returns SSE events. Each event carries a
base64-encoded audio chunk and format metadata; the stream ends with a final
chunk carrying finish_reason followed by data: [DONE].