# API Server Design

This page explains the API server at the level that is most useful for maintenance: where it sits in the system, which files matter, and how requests are mapped into the runtime.

If you only want to launch the server and call it, start with [API Server Quickstart](../get_started/apiserver_quickstart.md).

## Role in the System

The API server is the outer protocol layer on top of the `sglang-omni` pipeline runtime.

At a high level, the built-in server startup path is:

`CLI / Python entrypoint` → `PipelineConfig` → `Pipeline Startup` → `Coordinator` → `Client` → `FastAPI`

After startup, the request path is:

`HTTP request` → `FastAPI route` → `Client` → `Coordinator` → `Stage pipeline` → `Client aggregation` → `HTTP/SSE response`

That split keeps responsibilities clean:

- the pipeline runtime handles orchestration and execution
- the `Client` layer submits requests and assembles results
- the API server translates between HTTP/OpenAI-style payloads and those internal abstractions

## Key Files

For the current server implementation, these are the files that matter most.

| File | Role |
| --- | --- |
| `sglang_omni/serve/openai_api.py` | Defines the FastAPI app, routes, request conversion, and response formatting |
| `sglang_omni/serve/protocol.py` | Defines request and response schemas |
| `sglang_omni/serve/launcher.py` | Compiles the pipeline, starts the runtime, mounts the app, and runs Uvicorn |
| `sglang_omni/client/client.py` | Submits requests to the coordinator and aggregates text, audio, and stream results |
| `sglang_omni/cli/serve.py` | Defines the current CLI surface for `sgl-omni serve` |

If you are tracing endpoint behavior, `openai_api.py` and `client.py` are usually the best places to start.

## `create_app()` vs `launch_server()`

This is the most important distinction in the serving code.

### `create_app(client, model_name=...)`

`create_app()` only builds the FastAPI app and registers the core routes.

It does **not**:

- compile the pipeline
- start the runtime
- create the coordinator
- mount profiling routes
- run Uvicorn

Use it when you already have a live `Client` and want to embed the HTTP layer yourself.

### `launch_server(pipeline_config, ...)`

`launch_server()` is the full built-in server lifecycle.

It:

- compiles the pipeline config
- starts the pipeline runtime
- creates the `Client`
- creates the FastAPI app
- mounts profiling routes on the single-process path
- runs Uvicorn
- stops the runtime on shutdown

Use it when you want the standard out-of-the-box server path.

## Route Surface

The current server exposes these main routes:

| Method | Path | Notes |
| --- | --- | --- |
| `GET` | `/health` | Health status from `client.health()` |
| `GET` | `/v1/models` | Single-model listing for the active pipeline |
| `POST` | `/v1/chat/completions` | Chat completions, including streaming and optional audio |
| `POST` | `/v1/audio/speech` | Text-to-speech, raw audio response or SSE chunks when `stream=true` |
| `POST` | `/start_profile` | Torch trace + (optional) request-level events. Added by the built-in launcher |
| `POST` | `/stop_profile` | Stops both torch trace and request-level events |
| `POST` | `/start_request_profile` | Request-level event recorder only (no torch trace) |
| `POST` | `/stop_request_profile` | Stops the request-level event recorder |

The profiling routes are mounted by the single-process `launch_server()` path. The
current multi-process launcher path does not mount them.

`/start_profile` accepts:

```jsonc
{
  "run_id": "demo-run",
  "trace_path_template": "/tmp/profiles/demo-run/trace",  // torch trace template
  "event_dir": "/tmp/profiles/demo-run/events",            // request-event JSONL dir (optional)
  "enable_torch": true                                     // set false to skip torch trace
}
```

`/stop_profile` and `/stop_request_profile` both accept an optional
`run_id`. Omitting it is a wildcard: every stage stops whatever profiler
session is currently active.

Request-level events are emitted as JSON lines under
`<event_dir>/events_<stage>_<pid>.jsonl`. Use `python -m sglang_omni.profiler
<event_dir>` to derive the timeline / stage / hop reports described in
`docs/developer_reference/profiler.md`.

## Request Mapping

The server does not pass OpenAI-style request bodies straight into the runtime. It first converts them into internal request objects.

### Chat requests

`ChatCompletionRequest` includes standard OpenAI-style fields such as:

- `model`
- `messages`
- `temperature`
- `top_p`
- `max_tokens`
- `stop`
- `seed`
- `stream`

It also includes `sglang-omni` extensions such as:

- `images`
- `audios`
- `videos`
- video processing overrides such as `video_fps` and frame/pixel limits
- `modalities`
- `audio`
- `stage_sampling`
- `stage_params`
- talker-specific generation overrides
- `request_id`

### Conversion into `GenerateRequest`

`_build_chat_generate_request()` in `openai_api.py` is the key translation point. It:

- normalizes stop sequences
- builds `SamplingParams`
- converts chat messages into internal `Message` objects
- maps per-stage sampling overrides
- passes per-stage runtime params through `stage_params`
- stores media input, audio config, and video processing overrides in request metadata
- stores talker-specific generation overrides in `extra_params`
- copies `modalities` into `output_modalities`

The route hands that `GenerateRequest` to `Client`. The client then converts it
to an `OmniRequest` before submitting it to the coordinator.

## Response Paths

### Non-streaming chat

For non-streaming chat, the path is roughly:

`chat request` → `Client.completion()` → OpenAI-style JSON response

`Client.completion()` aggregates:

- text fragments
- audio chunks
- final usage
- final finish reason

If audio is present, it is base64-encoded before being returned to the API layer.

### Streaming chat

For streaming chat, the server emits SSE events.

The important current semantics are:

- the first chunk may contain only `role="assistant"`
- text and audio are emitted as separate deltas
- the final completion chunk includes the `finish_reason`
- the stream ends with `data: [DONE]`
- `usage` is attached to the final completion chunk

### Speech / TTS

The speech route reuses the same internal request path rather than introducing a separate serving stack.

`CreateSpeechRequest` is converted into a `GenerateRequest` with:

- `output_modalities=["audio"]`
- `task="tts"` in metadata
- TTS-specific parameters stored under `tts_params`

For non-streaming requests, `Client.speech()` collects audio chunks, encodes
them, and returns raw audio bytes to the HTTP layer.

For `stream=true`, the route returns SSE events. Each event carries a
base64-encoded audio chunk and format metadata; the stream ends with a final
chunk carrying `finish_reason` followed by `data: [DONE]`.
