Config
SGLang-Omni uses declarative config as the contract between model-specific
pipeline definitions and the model-agnostic runtime. PipelineConfig describes
the whole pipeline: model path, stage list, endpoints, relay backend, and global
runtime overrides. StageConfig describes one logical stage: how to construct
it, where it runs, where its normal results go, and whether it participates in
fan-in or streaming edges.
The config layer is intentionally static. It should make topology, placement, and stage construction visible before the runtime starts; request-time behavior belongs in stages, schedulers, model runners, and model-local payload logic.
Declarative Config
Pipelines are declared with PipelineConfig and StageConfig.
Example:
# Every non-TP stage must declare `process` explicitly; there is no implicit
# default. Each stage below runs in its own OS process; multiple stages can
# share an OS process by giving them the same `process` value (see
# `Qwen3OmniSpeechColocatedPipelineConfig` for that pattern).
stages = [
    StageConfig(
        name="preprocessing",
        process="preprocessing",
        factory="...create_preprocessing_executor",
        next=["image_encoder", "audio_encoder", "mm_aggregate"],
        project_payload={
            "image_encoder": "...project_preprocessing_to_image_encoder",
            "audio_encoder": "...project_preprocessing_to_audio_encoder",
            "mm_aggregate": "...project_preprocessing_to_mm_aggregate",
        },
    ),
    StageConfig(
        name="mm_aggregate",
        process="mm_aggregate",
        factory="...create_aggregate_executor",
        wait_for=["preprocessing", "image_encoder", "audio_encoder"],
        merge_fn="...merge_for_thinker",
        next="thinker",
    ),
    StageConfig(
        name="thinker",
        process="thinker",
        factory="...create_sglang_thinker_executor_from_config",
        factory_args={"speech_enabled": True},
        gpu=0,
        next=["decode", "talker_ar"],
        stream_to=["talker_ar"],
    ),
    StageConfig(
        name="decode",
        process="decode",
        factory="...create_decode_executor",
        terminal=True,
    ),
    StageConfig(
        name="code2wav",
        process="code2wav",
        factory="...create_code2wav_scheduler",
        gpu=1,
        terminal=True,
    ),
]
StageConfig Reference
| Field | Type | Default | Description |
|---|---|---|---|
| name |  | required | Unique stage identifier. |
| factory |  | required | Dotted import path to the stage factory. |
| factory_args |  |  | Arguments forwarded to the factory. Runtime prep may inject global values such as model_path and gpu_id when the factory accepts them. |
| next |  |  | Static downstream stage or stages for normal result routing. |
| terminal |  |  | Marks a stage as terminal; terminal results are sent to the coordinator. |
| route_fn |  |  | Dotted function path for request-aware result routing. The function receives … |
| gpu |  |  | GPU id for the stage. |
| tp_size |  |  | Number of tensor-parallel ranks. Must match … |
| process |  |  | OS process group identifier. Non-TP stages with the same process value share one OS process. |
| wait_for |  |  | Upstream stages required before this stage can execute a request. |
| wait_for_fn |  |  | Dotted function path for request-aware fan-in source selection. The function receives … |
| merge_fn |  |  | Dotted import path to the fan-in merge function. Required when … |
| stream_to |  |  | Static superset of streaming targets for chunks such as hidden states or codec codes. This is parallel to normal result routing. |
| stream_done_to_fn |  |  | Dotted function path for request-aware stream-completion targets. The function receives … |
| project_payload |  |  | Optional target-stage to dotted projection function mapping used before writing a downstream payload. |
| relay |  |  | Per-stage relay override. If unset, relay device and defaults are inferred from stage placement and relay_backend. |
Routing rule: set exactly one of next or terminal=True. route_fn is an
optional request-aware override for stages that already declare next; keep
next as the static topology declaration for validation. route_fn is not
valid on terminal stages and must return downstream stage targets, not None.
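The routing rule can be expressed as a small static check. The sketch below is illustrative only: `StageSketch` and `check_routing` are hypothetical names that model just the routing fields, not the library's own validation code.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Union


@dataclass
class StageSketch:
    """Hypothetical stand-in for StageConfig; only the routing fields."""
    name: str
    next: Union[str, Sequence[str], None] = None
    terminal: bool = False
    route_fn: Optional[str] = None


def check_routing(stage: StageSketch) -> list:
    """Return human-readable violations of the routing rule (empty if valid)."""
    errors = []
    has_next = bool(stage.next)
    # Exactly one of `next` / `terminal=True`: both set or both unset is an error.
    if has_next == stage.terminal:
        errors.append(f"{stage.name}: set exactly one of next or terminal=True")
    if stage.route_fn is not None and stage.terminal:
        errors.append(f"{stage.name}: route_fn is not valid on terminal stages")
    if stage.route_fn is not None and not has_next:
        errors.append(f"{stage.name}: route_fn requires a static next declaration")
    return errors


ok = check_routing(StageSketch("thinker", next=["decode", "talker_ar"]))
bad = check_routing(StageSketch("decode", terminal=True, route_fn="pkg.route"))
```

Here `ok` is empty, while `bad` reports both route_fn violations because the stage is terminal and has no static next.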
Fan-in follows the same static-superset pattern: keep wait_for as the full set
of possible upstream stages, and use wait_for_fn only to select the active
per-request subset. The returned subset must be non-empty and contained in
wait_for; returning None keeps the request pending until another upstream
payload can resolve the active set.
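A minimal sketch of the subset contract described above. The argument passed to the selector is not specified here, so a plain `dict` request is an assumption; `resolve_active_sources`, `is_ready`, and `pick_sources` are hypothetical names.

```python
from typing import Callable, Iterable, Optional


def resolve_active_sources(
    wait_for: Iterable[str],
    wait_for_fn: Optional[Callable[[dict], Optional[Iterable[str]]]],
    request: dict,
) -> Optional[frozenset]:
    """Return the active upstream set, or None while it cannot be resolved."""
    allowed = frozenset(wait_for)
    if wait_for_fn is None:
        return allowed  # no selector: wait for every declared upstream
    selected = wait_for_fn(request)
    if selected is None:
        return None  # keep the request pending
    active = frozenset(selected)
    if not active:
        raise ValueError("wait_for_fn returned an empty subset")
    if not active <= allowed:
        raise ValueError(f"not in wait_for: {sorted(active - allowed)}")
    return active


def is_ready(active: Optional[frozenset], arrived: set) -> bool:
    """A request can run once every active upstream payload has arrived."""
    return active is not None and active <= arrived


def pick_sources(request: dict):
    """Hypothetical selector: wait only for encoders the request needs."""
    if "modalities" not in request:
        return None  # not enough information yet
    return ["preprocessing"] + [f"{m}_encoder" for m in request["modalities"]]


WAIT_FOR = ["preprocessing", "image_encoder", "audio_encoder"]
active = resolve_active_sources(WAIT_FOR, pick_sources, {"modalities": ["image"]})
```

Keeping wait_for as the superset lets the subset be validated per request without changing the static topology.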
When using stream_done_to_fn, keep stream_to as the static superset because
runtime prep derives stream receivers and same-GPU stream paths from it.
Derived from stages:
entry_stage: defaults to the first stage unless explicitly set on PipelineConfig
terminal_stages: computed from stages with terminal=True
gpu_placement: computed from stages with gpu set
relay device: explicit StageConfig.relay.device when present; otherwise inferred by runtime prep from gpu and relay_backend
PipelineConfig Reference
| Field | Type | Default | Description |
|---|---|---|---|
| model_path |  | required | Hugging Face model id or local checkpoint path. |
| stages |  | required | Ordered logical stage definitions. The first stage is the default entry stage. |
|  |  |  | Pipeline name. Used for reporting and runtime identification. |
| entry_stage |  | first stage | Optional override for the stage that receives new requests. |
| relay_backend | one of … |  | Global relay backend used when creating per-stage relays. |
| fused_stages |  |  | Adjacent linear stage groups to colocate in one runtime process, enabling Stage-level local dispatch while preserving normal Stage ownership. |
| runtime_overrides |  |  | Per-stage factory argument overrides applied during runtime prep. |
|  |  |  | Environment defaults applied before stage factory imports. Existing process values take precedence. |
|  |  | IPC defaults | Endpoint allocation settings. |
|  |  |  | Dotted function path for request-aware terminal-stage resolution. The function receives the normalized … |
|  |  | class name | Stored automatically and used when loading a saved config file. |
Derived values are computed from stages, not manually maintained:
resolved_entry_stage: entry_stage if set, otherwise the first stage name
terminal_stages: all stages with terminal=True
gpu_placement: stage name to GPU id or TP GPU list for stages with gpu
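The derived values amount to simple projections over the stage list. The sketch below assumes a hypothetical `StageSketch` dataclass and `derive` helper; it illustrates the rules above, not the actual PipelineConfig code.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class StageSketch:
    """Hypothetical stand-in for StageConfig; only the fields used here."""
    name: str
    gpu: Union[int, List[int], None] = None
    terminal: bool = False


def derive(stages: List[StageSketch], entry_stage: Optional[str] = None) -> dict:
    """Compute the derived pipeline values from the stage list."""
    if not stages:
        raise ValueError("a pipeline needs at least one stage")
    return {
        "resolved_entry_stage": entry_stage or stages[0].name,
        "terminal_stages": [s.name for s in stages if s.terminal],
        # `is not None` matters: gpu=0 is a valid placement but is falsy.
        "gpu_placement": {s.name: s.gpu for s in stages if s.gpu is not None},
    }


derived = derive([
    StageSketch("preprocessing"),
    StageSketch("thinker", gpu=0),
    StageSketch("decode", terminal=True),
    StageSketch("code2wav", gpu=1, terminal=True),
])
```

Because these values are recomputed from stages, editing a stage's gpu or terminal flag cannot leave a stale copy elsewhere in the config.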
RelayConfig is the per-stage data-transfer override. It currently contains
slot_size_mb, credits, rank, world_size, and device.
Stage Fusion
fused_stages is a framework-level colocation hint. It keeps every listed
logical stage as a normal Stage; it does not create a synthetic scheduler or
move routing, relay, fan-in, streaming, abort, or terminal completion into the
scheduler layer.
At runtime prep, each fused group adds a process-colocation constraint. The process topology planner merges the process groups that contain those stages. Once colocated, ordinary Stage routing can use process-local object dispatch for eligible full-payload hops and process-local stream dispatch for same-process stream edges. Cross-process or unsafe fan-out edges still use the relay/control plane path.
The first supported fusion form is conservative: a group must be adjacent,
linear, non-TP, and fit on at most one GPU. Internal stages must route only to
the next stage in the group. Existing explicit process groups are not split;
if fusion connects two process groups, those groups are merged.
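The fusion constraints and the process-group merge can be sketched as follows. Everything here is hypothetical (`StageSketch`, `check_fused_group`, `merge_process_groups`); "adjacent" is read as "each stage routes directly to the next one in the group", and the choice of which process label survives a merge is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StageSketch:
    """Hypothetical stand-in for StageConfig; gpu is a single id because
    TP stages are rejected from fused groups anyway."""
    name: str
    process: str
    next: List[str] = field(default_factory=list)
    gpu: Optional[int] = None
    tp_size: int = 1


def check_fused_group(stages, group):
    """Return violations of the conservative fusion rules for one group."""
    by_name = {s.name: s for s in stages}
    missing = [n for n in group if n not in by_name]
    if missing:
        return [f"unknown stages: {missing}"]
    errors = []
    members = [by_name[n] for n in group]
    if any(s.tp_size > 1 for s in members):
        errors.append("fused groups must be non-TP")
    if len({s.gpu for s in members if s.gpu is not None}) > 1:
        errors.append("fused group spans more than one GPU")
    # Linear chain: every internal stage routes only to its successor.
    for cur, nxt in zip(members, members[1:]):
        if list(cur.next) != [nxt.name]:
            errors.append(f"{cur.name} must route only to {nxt.name}")
    return errors


def merge_process_groups(stages, fused_groups):
    """Map stage name -> process label after merging whole process groups."""
    label = {s.name: s.process for s in stages}
    for group in fused_groups:
        touched = {label[n] for n in group}
        target = label[group[0]]  # assumption: keep the first stage's label
        for name, proc in label.items():
            if proc in touched:
                label[name] = target  # existing groups merge, never split
    return label


stages = [
    StageSketch("a", "pa", next=["b"]),
    StageSketch("b", "pb", next=["c"], gpu=0),
    StageSketch("c", "pb"),
]
merged = merge_process_groups(stages, [["a", "b"]])
```

In this example stage c is pulled into the merged process even though it is not in the fused group, because it already shared a process with b.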
Runtime Prep and Runner
Runtime prep builds the resolved state used by the runner:
validate stage names and static topology
compute the entry stage and terminal stages
allocate ZMQ endpoints
resolve dotted factory, merge, and projection functions
merge factory_args with runtime_overrides
inject global values such as model_path and gpu_id into factory args when accepted by the factory
build relay config from stage placement and relay backend
wire stream targets and same-GPU stream fast paths
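Two of the steps above, resolving dotted paths and assembling factory arguments, can be sketched with the standard library. `resolve_dotted` and `build_factory_args` are hypothetical helpers; the precedence shown (overrides win over factory_args, injected globals never replace explicit values) is an assumption, not confirmed behavior.

```python
import importlib
import inspect


def resolve_dotted(path: str):
    """Import 'package.module.attr' and return the attribute."""
    module_name, _, attr = path.rpartition(".")
    if not module_name:
        raise ValueError(f"not a dotted path: {path!r}")
    return getattr(importlib.import_module(module_name), attr)


def build_factory_args(factory, factory_args, overrides, global_values):
    """Merge static args with overrides, then inject accepted globals."""
    merged = {**factory_args, **overrides}  # assumption: overrides win
    accepted = inspect.signature(factory).parameters
    for key, value in global_values.items():
        # Inject only parameters the factory names, and never clobber
        # a value that the config already set explicitly.
        if key in accepted and key not in merged:
            merged[key] = value
    return merged


def create_executor(model_path, speech_enabled=False):
    """Hypothetical factory used only for this example."""
    return (model_path, speech_enabled)


args = build_factory_args(
    create_executor,
    factory_args={"speech_enabled": True},
    overrides={},
    global_values={"model_path": "/models/demo", "gpu_id": 0},
)
executor = create_executor(**args)
```

Here gpu_id is not injected because `create_executor` does not declare it, which is one way to read "when accepted by the factory".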
Serving uses MultiProcessPipelineRunner for both single-process and
multi-process topologies. Runtime prep first resolves GPU placement, then
process topology:
every non-TP stage must declare process explicitly; there is no implicit default. Configs saved before this refactor are not auto-migrated: set process="pipeline" on every non-TP stage to recover the historical single-process behavior, or use any other shared/distinct process name to opt into the declarative multi-stage-per-process layout
explicit stage.process groups non-TP stages declaratively
A process group may contain CPU stages and stages on at most one GPU. Multiple process groups may share the same GPU only when GPU-stage memory budgets are explicit and fit the configured placement limit.
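A rough sketch of the grouping rule, assuming stages are given as `(name, process, gpu)` tuples. `plan_process_groups` is a hypothetical helper, and the memory-budget check for process groups that share a GPU is not modeled.

```python
from collections import defaultdict


def plan_process_groups(stages):
    """Group non-TP stages by their declared process name.

    `stages` is an iterable of (name, process, gpu) tuples, with gpu=None
    for CPU stages. Raises ValueError when a stage omits `process` or a
    process group would span more than one GPU.
    """
    groups = defaultdict(list)
    gpus = defaultdict(set)
    for name, process, gpu in stages:
        if process is None:
            raise ValueError(f"{name}: non-TP stages must declare process")
        groups[process].append(name)
        if gpu is not None:
            gpus[process].add(gpu)
    for process, used in gpus.items():
        if len(used) > 1:
            raise ValueError(f"{process}: spans GPUs {sorted(used)}")
    return dict(groups)


plan = plan_process_groups([
    ("preprocessing", "frontend", None),
    ("thinker", "frontend", 0),
    ("decode", "decode", None),
])
```

The result maps each process name to the stages it hosts, which is the shape a launcher would need to start one child process per group.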
pipeline/
|-- stage_workers.py # StageLaunchConfig, subprocess entrypoint, StageGroup
|-- runtime_config.py # endpoint/runtime-dir/placement prep
`-- mp_runner.py # Cross-stage orchestration and coordinator ownership
The child process does not recompile the pipeline. The main process builds
fully resolved, picklable stage/process specs; the child imports stage
factories, builds schedulers, constructs Stage objects, signals ready, and
runs one or more non-TP stages in the same event loop.
Tensor Parallelism
Tensor parallelism inside a stage is orthogonal to pipeline parallelism between stages.
StageConfig(
    name="thinker",
    factory="...",
    gpu=[0, 1, 2, 3],
    tp_size=4,
)
For tp_size > 1, the runner derives one process per TP rank. Each process runs
the stage scheduler and model worker with a different tp_rank and GPU. NCCL
collectives inside model forward keep TP ranks in lockstep. StageConfig.process
is optional for TP stages; if set, it acts as the prefix for the derived
per-rank process names ({process}_tp{rank}); if unset, the stage name is used
as the prefix. TP ranks always own their OS process exclusively — a TP stage’s
process group cannot host any other stage, regardless of whether process is
set or unset.
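The per-rank naming rule can be sketched as below. `tp_rank_layout` is a hypothetical helper; assigning `gpus[rank]` to each rank and requiring `len(gpus) == tp_size` are assumptions made for illustration.

```python
def tp_rank_layout(stage_name, gpus, tp_size, process=None):
    """Derive one process entry per TP rank for a stage with tp_size > 1."""
    if tp_size <= 1:
        raise ValueError("only meaningful for tp_size > 1")
    if len(gpus) != tp_size:
        # Assumption: one GPU per rank.
        raise ValueError(f"expected {tp_size} GPUs, got {len(gpus)}")
    # `process` is optional for TP stages; the stage name is the fallback prefix.
    prefix = process if process is not None else stage_name
    return [
        {
            "process": f"{prefix}_tp{rank}",
            "tp_rank": rank,
            "gpu": gpus[rank],
            # Only rank 0 talks to the coordinator and neighbouring stages.
            "owns_external_io": rank == 0,
        }
        for rank in range(tp_size)
    ]


layout = tp_rank_layout("thinker", gpus=[0, 1, 2, 3], tp_size=4)
names = [entry["process"] for entry in layout]
```

With the stage above, `names` follows the `{process}_tp{rank}` pattern with the stage name as prefix, since process is unset.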
Only rank 0 owns external stage IO:
rank 0 receives ZMQ messages from the coordinator or previous stage
rank 0 fans work and aborts out to follower ranks
all ranks make the same scheduling decisions
only rank 0 sends downstream results or terminal completions
Each TP stage gets its own NCCL port allocation so multiple TP groups can exist inside one pipeline.