Request-level profiler#
sglang-omni ships two complementary profilers that share the same run_id
and are controlled by the same HTTP surface:
a request-level event recorder that writes a JSONL stream of per-request milestones (admission, preprocess, encoder, prefill, first token / first code chunk, hops, terminal response) — used to reconstruct a single request’s end-to-end timeline and to aggregate stage / hop costs across a batch;
a torch profiler that produces a Chrome trace of kernel-level CPU / CUDA activity — used to drill into a specific window once the event recorder has identified where the time is going.
Most diagnostics use the event recorder. The torch profiler is opt-in for deeper kernel investigation.
Event model#
Every instrumentation point appends a single line of JSON to a per-process JSONL file. The shape:
{
"request_id": "req-123",
"stage": "thinker",
"event_name": "scheduler_first_emit",
"timestamp_ns": 1717000000123456789,
"run_id": "demo-run",
"pid": 42,
"metadata": {"chunk_id": 0}
}
Files are written under <event_dir>/events_<stage>_<pid>.jsonl. Multiple
co-located stages in the same OS process share one JSONL file — the
filename uses the first stage to start, and the per-event stage field
identifies the owner. The views layer merges files from every process by
request_id.
Standard event names#
The recorder always attaches the active stage name to every event, so the
same scheduler_prefill_start becomes “thinker prefill start” when emitted
from the thinker process and “talker prefill start” when emitted from the
talker process. scheduler_queue_enter marks a built request entering the
scheduler queue; scheduler_prefill_start is emitted later, when the request’s
first executable prefill / extend batch is selected.
Pipeline milestone |
Concrete event |
Source |
|---|---|---|
Request admission |
|
|
Preprocessing start / end |
|
model preprocessor |
Encoder start / end |
|
image / audio encoder executors |
Aggregate ready |
|
|
Thinker prefill start |
|
|
Thinker first token |
|
|
First stream chunk to client |
|
same |
Talker request build start / end |
|
|
Talker prefill start |
|
same |
First code chunk |
|
|
Code2Wav first audio |
|
|
Terminal response |
|
|
Supporting events used for finer-grained breakdown:
Layer |
Event |
Notes |
|---|---|---|
Coordinator |
|
Each |
Stage |
|
Submit or relay payload accepted (metadata |
Stage |
|
Scheduler inbox put |
Stage |
|
Scheduler result routed onward (metadata |
Stage |
|
Payload |
Stage |
|
Each stream chunk (metadata |
Stage |
|
Each stream chunk materialized and ready for the receiver scheduler, including coordinator terminal chunks |
AR scheduler |
|
Built request entered the scheduler queue |
AR scheduler |
|
First |
Custom callsites can call sglang_omni.profiler.event_recorder.emit(...) to
add domain-specific events. Events from inactive recorders are no-ops, so
instrumentation sites do not need to guard against the disabled case.
Active-stage attribution#
emit(...) accepts an explicit stage=... parameter; when the caller can’t
plumb the stage name down (preprocessor __call__, encoder callables,
OmniScheduler / Code2WavScheduler internals), it can pass stage=None
and the recorder fills it in from the per-thread / per-task active
stage.
Stage._run_scheduler binds set_active_stage(self.name) on the scheduler
thread before invoking the scheduler. The binding uses both a
threading.local slot (for plain threading.Thread workers) and a
contextvars.ContextVar (so it propagates through asyncio.to_thread /
loop.run_in_executor, which copy contextvars but not thread-local).
Explicit stage=... on emit always wins; the active-stage binding is only
consulted when the caller passes stage=None.
To bind / unbind manually from your own thread:
from sglang_omni.profiler.event_recorder import set_active_stage, reset_active_stage
token = set_active_stage("my_stage")
try:
...
finally:
reset_active_stage(token)
reset_active_stage(None) is the “scrub” form (used by test fixtures) and
clears both the thread-local slot and the contextvar.
Lifecycle#
The recorder is process-local. It is started on every stage and on the
coordinator when POST /start_profile (or POST /start_request_profile)
is hit:
Launcher receives the HTTP request.
Coordinator starts its local recorder pointed at
<event_dir>.Launcher broadcasts
ProfilerStartMessageover ZMQ to every stage, carrying both the torch trace template and theevent_dir.Each stage joins the per-process recorder. In a shared-process topology the first stage to call
start()wins the filename; every subsequent stage in the same process writes to the same file and the per-eventstagefield disambiguates.On
POST /stop_profile, the recorder is closed everywhere; files remain on disk under<event_dir>.
POST /stop_profile and POST /stop_request_profile accept an optional
run_id field. When omitted, the request is a wildcard: every stage
stops whatever profiler session is currently active. When set, only
stages whose active run matches stop. This makes the common case (caller
didn’t specify a run_id on either start or stop) work without ceremony.
The torch profiler and the event recorder share a run_id. Setting
enable_torch=false on the start request records JSONL events without
paying for a kernel trace.
Generating reports#
Use the views module directly:
from sglang_omni.profiler.views import build_report
report = build_report("/tmp/profiles/demo-run/events")
print(report["request_count"], len(report["stage_breakdown"]))
…or via the CLI:
python -m sglang_omni.profiler /tmp/profiles/demo-run/events --format table
python -m sglang_omni.profiler /tmp/profiles/demo-run/events --format json --out report.json
The CLI / build_report returns three views derived from the same event
stream:
Timeline — per-request event list with
t_rel_msanchored at admission.Stage breakdown —
(open_event, close_event)interval durations aggregated per stage (count, total, avg, p50, p95, max). The same opener can participate in multiple pairs (e.g.scheduler_prefill_startcloses against bothscheduler_first_emitANDstage_first_stream_chunk_sent); every pair gets its own pending stack so a close event for pair A does not consume the opener of pair B.Hop breakdown —
stage_hop_sent/stage_input_receivedandstage_stream_chunk_sent/stage_stream_chunk_receiveddurations per (source, destination, kind). Terminal stage stream chunks are paired the same way with destinationcoordinator.
Hop pairs match across processes by (request_id, source_stage, dest_stage, chunk_id?), so a single request’s path through subprocesses can be
reconstructed even when each stage runs in its own process.
Torch profiler#
The torch profiler runs alongside the event recorder when
enable_torch=true (the default for /start_profile). It records
continuously between start() and stop() — no schedule(...), no
step() requirement — and exports a Chrome trace *.trace.json.gz on stop.
The expensive introspection flags are opt-in via env vars so the default
trace stays small enough to load in chrome://tracing or
ui.perfetto.dev:
Env var |
Effect |
|---|---|
|
Record input tensor shapes per op |
|
Track every CUDA caching-allocator alloc / free |
|
Record the Python (and C++) call stack per op |
|
Estimate FLOPs per op |
With all four off (the default), a typical 10-sample MMMU run produces a trace in the tens of MB. With all four on, the same workload can produce a multi-GB trace — only opt in when you need that specific information.
HTTP surface#
Method |
Path |
Body |
Notes |
|---|---|---|---|
POST |
|
|
Starts torch trace + event recorder. |
POST |
|
|
Stops torch trace + event recorder. Omitting |
POST |
|
|
Event recorder only — no torch trace. Lower overhead; safer to leave on. |
POST |
|
|
Same wildcard semantics as |
Example: record cheap events on every request without a kernel trace:
curl -X POST http://localhost:8000/start_request_profile \
-d '{"run_id":"demo","event_dir":"/tmp/profiles/demo/events"}'
# … run traffic …
curl -X POST http://localhost:8000/stop_request_profile -d '{}'
python -m sglang_omni.profiler /tmp/profiles/demo/events --format table
Discipline#
Profiling must never break serving. The emitter swallows write errors and counts drops; the first failure is logged once.
Tensors and large blobs stay out of event metadata. Keep metadata to small scalars (ids, counts, durations, modality, error strings). The recorder enforces this defensively: if a tensor / numpy array ends up in metadata,
_json_defaultserializes a summary ({"__tensor_summary__": true, "type": ..., "shape": [...], "dtype": "...", "device": "..."}) instead of materializing the contents. 0-D tensors / numpy scalars still serialize as plain scalars.Event naming. Lowercase snake_case, prefix with the layer that owns the event (
stage_*,scheduler_*,encoder_*, etc.). Use the stage name (not the event name) to distinguish “thinker prefill start” from “talker prefill start”.