Request-level profiler#

sglang-omni ships two complementary profilers that share the same run_id and are controlled by the same HTTP surface:

a request-level event recorder that writes a JSONL stream of per-request milestones (admission, preprocess, encoder, prefill, first token / first code chunk, hops, terminal response) — used to reconstruct a single request’s end-to-end timeline and to aggregate stage / hop costs across a batch;
a torch profiler that produces a Chrome trace of kernel-level CPU / CUDA activity — used to drill into a specific window once the event recorder has identified where the time is going.

Most diagnostics use the event recorder. The torch profiler is opt-in for deeper kernel investigation.

Event model#

Every instrumentation point appends a single line of JSON to a per-process JSONL file. The shape:

{
  "request_id": "req-123",
  "stage": "thinker",
  "event_name": "scheduler_first_emit",
  "timestamp_ns": 1717000000123456789,
  "run_id": "demo-run",
  "pid": 42,
  "metadata": {"chunk_id": 0}
}

Files are written under <event_dir>/events_<stage>_<pid>.jsonl. Multiple co-located stages in the same OS process share one JSONL file — the filename uses the first stage to start, and the per-event stage field identifies the owner. The views layer merges files from every process by request_id.

Standard event names#

The recorder always attaches the active stage name to every event, so the same scheduler_prefill_start becomes “thinker prefill start” when emitted from the thinker process and “talker prefill start” when emitted from the talker process. scheduler_queue_enter marks a built request entering the scheduler queue; scheduler_prefill_start is emitted later, when the request’s first executable prefill / extend batch is selected.

Pipeline milestone	Concrete event	Source
Request admission	`request_admission`	`Coordinator._submit_request`
Preprocessing start / end	`preprocess_start` / `preprocess_end`	model preprocessor `__call__`
Encoder start / end	`encoder_start` / `encoder_end` (metadata `modality`, `batch_size`)	image / audio encoder executors
Aggregate ready	`stage_aggregate_ready`	`Stage._on_data_ready` after `InputHandler.receive` returns a merged payload
Thinker prefill start	`scheduler_prefill_start` (stage = thinker)	`OmniScheduler.run_batch`
Thinker first token	`stage_first_stream_chunk_sent` (stage = thinker)	`Stage._send_stream_to_target` / `_send_stream_to_coordinator`
First stream chunk to client	`stage_first_stream_chunk_sent` (terminal stage → coordinator)	same
Talker request build execution start / end	`scheduler_request_build_start` / `_end` (stage = talker)	`OmniScheduler._run_request_builder`
Talker prefill start	`scheduler_prefill_start` (stage = talker)	same
First code chunk	`stage_first_stream_chunk_sent` (stage = talker)	`Stage._send_stream_to_target`
Code2Wav first audio	`code2wav_first_audio`	`Code2WavScheduler.decode_delta` / `run_step`
Terminal response	`terminal_response`	`Coordinator._handle_completion`

Supporting events used for finer-grained breakdown:

Layer	Event	Notes
Coordinator	`coordinator_stream_received`	Each `StreamMessage` received on the coordinator
Stage	`stage_input_received`	Submit or relay payload accepted (metadata `from_stage`)
Stage	`stage_dispatch`	Scheduler inbox put
Stage	`stage_complete`	Scheduler result routed onward (metadata `terminal`, `next`)
Stage	`stage_hop_sent`	Payload `DataReadyMessage` sent to next stage
Stage	`stage_stream_chunk_sent`	Each stream chunk (metadata `to_stage`, `chunk_id`, `modality`)
Stage	`stage_stream_chunk_received`	Each stream chunk materialized and ready for the receiver scheduler, including coordinator terminal chunks
AR scheduler	`scheduler_queue_enter`	Built request entered the scheduler queue
AR scheduler	`scheduler_first_emit`	First `stream_output_builder` emission per request
Code2Wav	`code2wav_decode_start`	Serial decode start: trigger, start/end/new/context/window frames, active and threshold-ready requests, inbox depth
Code2Wav	`code2wav_decode_end`	Repeats the start metadata and adds audio samples plus execution metadata
Code2Wav	`code2wav_batch_start`	Coalesced step start: batch and bucket shape, new/window frames, active requests, inbox depth, oldest wait, fire reason, due-bucket count, and sub-batch decomposition
Code2Wav	`code2wav_batch_end`	Repeats the start metadata and adds audio samples, execution mode, graph key, and fallback reason

Custom callsites can call sglang_omni.profiler.event_recorder.emit(...) to add domain-specific events. Events from inactive recorders are no-ops, so instrumentation sites do not need to guard against the disabled case.

Active-stage attribution#

emit(...) accepts an explicit stage=... parameter; when the caller can’t plumb the stage name down (preprocessor __call__, encoder callables, OmniScheduler / Code2WavScheduler internals), it can pass stage=None and the recorder fills it in from the per-thread / per-task active stage.

Stage._run_scheduler binds set_active_stage(self.name) on the scheduler thread before invoking the scheduler. The binding uses both a threading.local slot (for plain threading.Thread workers) and a contextvars.ContextVar (so it propagates through asyncio.to_thread / loop.run_in_executor, which copy contextvars but not thread-local). Explicit stage=... on emit always wins; the active-stage binding is only consulted when the caller passes stage=None.

To bind / unbind manually from your own thread:

from sglang_omni.profiler.event_recorder import set_active_stage, reset_active_stage

token = set_active_stage("my_stage")
try:
    ...
finally:
    reset_active_stage(token)

reset_active_stage(None) is the “scrub” form (used by test fixtures) and clears both the thread-local slot and the contextvar.

Lifecycle#

The recorder is process-local. It is started on every stage and on the coordinator when POST /start_profile (or POST /start_request_profile) is hit:

Launcher receives the HTTP request.
Coordinator starts its local recorder pointed at <event_dir>.
Launcher broadcasts ProfilerStartMessage over ZMQ to every stage, carrying both the torch trace template and the event_dir.
Each stage joins the per-process recorder. In a shared-process topology the first stage to call start() wins the filename; every subsequent stage in the same process writes to the same file and the per-event stage field disambiguates.
On POST /stop_profile, the recorder is closed everywhere; files remain on disk under <event_dir>.

POST /stop_profile and POST /stop_request_profile accept an optional run_id field. When omitted, the request is a wildcard: every stage stops whatever profiler session is currently active. When set, only stages whose active run matches stop. This makes the common case (caller didn’t specify a run_id on either start or stop) work without ceremony.

The torch profiler and the event recorder share a run_id. Setting enable_torch=false on the start request records JSONL events without paying for a kernel trace.

Generating reports#

Use the views module directly:

from sglang_omni.profiler.views import build_report
report = build_report("/tmp/profiles/demo-run/events")
print(report["request_count"], len(report["stage_breakdown"]))

…or via the CLI:

python -m sglang_omni.profiler /tmp/profiles/demo-run/events --format table
python -m sglang_omni.profiler /tmp/profiles/demo-run/events --format json --out report.json

The CLI / build_report returns three views derived from the same event stream:

Timeline — per-request event list with t_rel_ms anchored at admission.
Stage breakdown — (open_event, close_event) interval durations aggregated per stage (count, total, avg, p50, p95, max). The same opener can participate in multiple pairs (e.g. scheduler_prefill_start closes against both scheduler_first_emit AND stage_first_stream_chunk_sent); every pair gets its own pending stack so a close event for pair A does not consume the opener of pair B.
Hop breakdown — stage_hop_sent / stage_input_received and stage_stream_chunk_sent / stage_stream_chunk_received durations per (source, destination, kind). Terminal stage stream chunks are paired the same way with destination coordinator.

Hop pairs match across processes by (request_id, source_stage, dest_stage, chunk_id?), so a single request’s path through subprocesses can be reconstructed even when each stage runs in its own process.

Torch profiler#

The torch profiler runs alongside the event recorder when enable_torch=true (the default for /start_profile). It records continuously between start() and stop() — no schedule(...), no step() requirement — and exports a Chrome trace *.trace.json.gz on stop.

The expensive introspection flags are opt-in via env vars so the default trace stays small enough to load in chrome://tracing or ui.perfetto.dev:

Env var	Effect
`SGLANG_TORCH_PROFILER_RECORD_SHAPES=1`	Record input tensor shapes per op
`SGLANG_TORCH_PROFILER_PROFILE_MEMORY=1`	Track every CUDA caching-allocator alloc / free
`SGLANG_TORCH_PROFILER_WITH_STACK=1`	Record the Python (and C++) call stack per op
`SGLANG_TORCH_PROFILER_WITH_FLOPS=1`	Estimate FLOPs per op

With all four off (the default), a typical 10-sample MMMU run produces a trace in the tens of MB. With all four on, the same workload can produce a multi-GB trace — only opt in when you need that specific information.

HTTP surface#

Method	Path	Body	Notes
POST	`/start_profile`	`{"run_id": ?, "trace_path_template": ?, "event_dir": ?, "enable_torch": true \| false, "config": ?}`	Starts torch trace + event recorder. `run_id` auto-generated if omitted.
POST	`/stop_profile`	`{"run_id": ?}`	Stops torch trace + event recorder. Omitting `run_id` is a wildcard (“stop whatever’s active”).
POST	`/start_request_profile`	`{"run_id": ?, "event_dir": ?}`	Event recorder only — no torch trace. Lower overhead; safer to leave on.
POST	`/stop_request_profile`	`{"run_id": ?}`	Same wildcard semantics as `/stop_profile`.

Example: record cheap events on every request without a kernel trace:

curl -X POST http://localhost:8000/start_request_profile \
     -d '{"run_id":"demo","event_dir":"/tmp/profiles/demo/events"}'
# … run traffic …
curl -X POST http://localhost:8000/stop_request_profile -d '{}'
python -m sglang_omni.profiler /tmp/profiles/demo/events --format table

Discipline#

Profiling must never break serving. The emitter swallows write errors and counts drops; the first failure is logged once.
Tensors and large blobs stay out of event metadata. Keep metadata to small scalars (ids, counts, durations, modality, error strings). The recorder enforces this defensively: if a tensor / numpy array ends up in metadata, _json_default serializes a summary ({"__tensor_summary__": true, "type": ..., "shape": [...], "dtype": "...", "device": "..."}) instead of materializing the contents. 0-D tensors / numpy scalars still serialize as plain scalars.
Event naming. Lowercase snake_case, prefix with the layer that owns the event (stage_*, scheduler_*, encoder_*, etc.). Use the stage name (not the event name) to distinguish “thinker prefill start” from “talker prefill start”.

Request-level profiler

Contents