# Adaptive Speculative Decoding
Adaptive speculative decoding lets SGLang adjust `speculative_num_steps`/`speculative_num_draft_tokens` at runtime instead of keeping a single fixed value for the whole server lifetime.
It is designed for workloads whose accept length changes over time, where one static step count is rarely optimal.
## Current support
- Only `--speculative-algorithm EAGLE`
- Only `--speculative-eagle-topk 1`

If either condition is not met, SGLang falls back to static speculative settings.
## Why adaptive steps help

`speculative_num_steps` controls how many draft-model autoregressive steps run in each speculative round. In practice, the best value depends on the current workload.
- If `num_steps` is too small, the draft model could have produced more accepted tokens, but the round stops too early.
- If `num_steps` is too large, the draft model produces many candidate tokens that the target model rejects, so extra draft work is wasted.
Real traffic often moves between high-acceptance and low-acceptance phases, so one fixed step count is usually a compromise. Adaptive mode tries to follow the workload instead of hard-coding a single global `num_steps`.
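The trade-off can be made concrete with a toy model. Assuming each drafted token is accepted independently with probability `p` and verification stops at the first rejection (a simplification; real acceptance is not i.i.d.), the expected number of accepted tokens per round is:

```python
def expected_accepted(p: float, num_steps: int) -> float:
    """Expected accepted draft tokens per round under a simplified model:
    each token accepted independently with probability p, verification
    stops at the first rejection."""
    return sum(p ** i for i in range(1, num_steps + 1))

# High-acceptance phase: deeper drafting keeps paying off.
# Low-acceptance phase: the marginal gain of extra steps is nearly zero.
for p in (0.9, 0.4):
    print(p, {k: round(expected_accepted(p, k), 2) for k in (1, 3, 7)})
```

At `p = 0.9` the jump from 1 to 7 steps roughly quintuples the expected yield, while at `p = 0.4` almost all of the gain is already captured by 3 steps, which is why a single static `num_steps` cannot fit both phases.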
## Design overview
The adaptive mechanism has three pieces:
- `AdaptiveSpeculativeParams`: the EMA-based policy
- `SpecRuntimeState`: the per-tier runtime state bundle
- `AdaptiveController`: the coordinator that chooses a tier and activates the matching runtime state
At startup, SGLang pre-builds one runtime state per candidate tier. By default, the candidate tiers are `candidate_steps = [1, 3, 7]`.
```
┌──────────────────────────────────────────────────────────┐
│ SpecRuntimeState                                         │
│                                                          │
│ speculative_num_steps / speculative_num_draft_tokens     │
│                                                          │
│ ┌────────────────┐ ┌────────────────┐ ┌──────────────┐   │
│ │ Draft stage    │ │ Verify stage   │ │ Extend stage │   │
│ │                │ │                │ │              │   │
│ │ attn_backend   │ │ attn_backend   │ │ attn_backend │   │
│ │ cuda_graph     │ │ cuda_graph     │ │ cuda_graph   │   │
│ └────────────────┘ └────────────────┘ └──────────────┘   │
└──────────────────────────────────────────────────────────┘
```
This matters because `CudaGraphRunner` is shape-dependent. Each candidate tier owns its own graph and backend state, so runtime switching is a reference swap, not an online graph recapture.
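The pool-of-states idea can be sketched as follows. The class and field names mirror the docs, but this is an illustrative stand-in, not the real SGLang implementation (plain strings stand in for attention backends and captured CUDA graphs):

```python
from dataclasses import dataclass

@dataclass
class SpecRuntimeState:
    """Illustrative per-tier bundle; real states hold per-stage
    attention backends and captured CUDA graphs."""
    speculative_num_steps: int
    speculative_num_draft_tokens: int
    draft_backend: str
    verify_backend: str
    extend_backend: str

def build_state_pool(candidate_steps):
    """Pre-build one runtime state per candidate tier at startup."""
    return {
        steps: SpecRuntimeState(
            speculative_num_steps=steps,
            # With top-k = 1, each step drafts one token, plus the root token.
            speculative_num_draft_tokens=steps + 1,
            draft_backend=f"draft_graph_{steps}",
            verify_backend=f"verify_graph_{steps}",
            extend_backend=f"extend_graph_{steps}",
        )
        for steps in candidate_steps
    }

pool = build_state_pool([1, 3, 7])
active = pool[3]  # switching tiers is just a reference swap, no recapture
```

Because every tier's graphs are captured once up front, activating a different tier later never pays graph-capture latency on the serving path.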
## Runtime flow
The adaptive update happens after verify and affects the next round, not the current one:
```
┌─────────────────────────────────────────────────────────────────────┐
│ EAGLEWorker.forward_batch_generation() — decode path                │
│                                                                     │
│ ① draft(batch)                                                      │
│ │ draft model multi-step generation with current tier               │
│ v                                                                   │
│ ② verify(batch, spec_info)                                          │
│ │ target model tree verification                                    │
│ │ → produces accept_length_per_req                                  │
│ v                                                                   │
│ ③ forward_draft_extend_after_decode(batch)                          │
│ │ draft model KV-cache catch-up                                     │
│ v                                                                   │
│ ④ adaptive_controller.on_verify_complete(accept_lengths)            │
│ │                                                                   │
│ │ update EMA, apply warmup / interval / hysteresis gates            │
│ │ if tier changed, select a pre-built state from pool               │
│ v                                                                   │
│ worker.apply_runtime_state(state)                                   │
│                                                                     │
│ Tier switch happens after the current round completes.              │
│ Backends and CUDA graphs are never swapped mid-round.               │
└─────────────────────────────────────────────────────────────────────┘
```
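The four steps above can be sketched in a few lines. The names mirror the diagram, but these are illustrative signatures, not the real SGLang API:

```python
def forward_batch_generation(worker, controller, batch):
    """Illustrative decode-path sketch of the diagram above."""
    spec_info = worker.draft(batch)                    # 1. multi-step draft
    accept_lengths = worker.verify(batch, spec_info)   # 2. target-model tree verify
    worker.forward_draft_extend_after_decode(batch)    # 3. draft KV-cache catch-up
    # 4. policy update; assume it returns a pre-built state only on a tier change
    new_state = controller.on_verify_complete(accept_lengths)
    if new_state is not None:
        worker.apply_runtime_state(new_state)          # takes effect next round
```

The key property is that the state swap happens at the very end of the round, so the draft/verify/extend stages of one round always run against a single consistent tier.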
## How the policy decides
After each verify pass, SGLang reads the accepted draft length per request, computes the batch average, smooths it with an exponential moving average (EMA), and switches among the pre-built candidate tiers [1, 3, 7] by default.
The decision logic is intentionally conservative:
- `warmup_batches` skips the first few batches
- `update_interval` avoids switching every batch
- `down_hysteresis` and `up_hysteresis` reduce oscillation
Conceptually, the policy probes one step beyond the observed acceptance:
```
target_steps ≈ clamp(round(ema_accept_len) + 1, min(candidate_steps), max(candidate_steps))
```
So if recent requests consistently accept more drafted tokens, the policy tends to move up. If they start rejecting earlier, it tends to move down.
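Putting the EMA, the gates, and the clamp together, the policy could look roughly like this. This is a minimal sketch: the class name, the `0.5` hysteresis defaults, and the one-tier-at-a-time stepping are assumptions, and the probe uses the unrounded EMA so the hysteresis margins have something to bite on:

```python
class AdaptivePolicy:
    """Sketch of the EMA-based tier policy; parameter names follow the
    config keys, but the real AdaptiveSpeculativeParams may differ."""

    def __init__(self, candidate_steps=(1, 3, 7), initial_steps=3,
                 ema_alpha=0.2, warmup_batches=10, update_interval=5,
                 down_hysteresis=0.5, up_hysteresis=0.5):
        self.tiers = sorted(candidate_steps)
        self.ema_alpha = ema_alpha
        self.warmup_batches = warmup_batches
        self.update_interval = update_interval
        self.down_hysteresis = down_hysteresis
        self.up_hysteresis = up_hysteresis
        # The configured num_steps is snapped to the nearest tier.
        self.current = min(self.tiers, key=lambda t: abs(t - initial_steps))
        self.ema = None
        self.batches_seen = 0

    def on_verify_complete(self, accept_lengths):
        avg = sum(accept_lengths) / len(accept_lengths)
        self.ema = avg if self.ema is None else (
            self.ema_alpha * avg + (1 - self.ema_alpha) * self.ema)
        self.batches_seen += 1
        if self.batches_seen <= self.warmup_batches:       # warmup gate
            return self.current
        if self.batches_seen % self.update_interval != 0:  # interval gate
            return self.current
        # Probe one step beyond observed acceptance, clamped to the tier range.
        target = min(max(self.ema + 1, self.tiers[0]), self.tiers[-1])
        i = self.tiers.index(self.current)
        if i + 1 < len(self.tiers) and target > self.current + self.up_hysteresis:
            self.current = self.tiers[i + 1]               # move up one tier
        elif i > 0 and target < self.current - self.down_hysteresis:
            self.current = self.tiers[i - 1]               # move down one tier
        return self.current

policy = AdaptivePolicy(warmup_batches=0, update_interval=1)
for _ in range(10):
    policy.on_verify_complete([6.0] * 8)  # long acceptance → climbs toward 7
```

With the gates disabled (`warmup_batches=0`, `update_interval=1`) the policy reacts immediately; with the defaults it waits out the warmup window and then only re-decides every few verify batches.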
## Usage
`--speculative-adaptive-config` is optional, but the speculative setup still needs to be valid for adaptive mode.
```
python3 -m sglang.launch_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --speculative-algorithm EAGLE \
    --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B \
    --speculative-eagle-topk 1 \
    --speculative-num-steps 3 \
    --speculative-num-draft-tokens 4 \
    --speculative-adaptive
```
If you want to override the defaults, add `--speculative-adaptive-config /path/to/adaptive_spec.json`.
Example config:
```json
{
    "candidate_steps": [1, 3, 7],
    "ema_alpha": 0.2,
    "warmup_batches": 10,
    "update_interval": 5
}
```
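Since omitted keys fall back to defaults, loading such a file amounts to a dict merge. A minimal sketch, with the caveat that only `candidate_steps = [1, 3, 7]` is documented as a default above; the other values here simply echo the example config:

```python
import json

# Illustrative defaults: only candidate_steps is documented as a default;
# the remaining values echo the example config, not guaranteed defaults.
DEFAULTS = {
    "candidate_steps": [1, 3, 7],
    "ema_alpha": 0.2,
    "warmup_batches": 10,
    "update_interval": 5,
}

def load_adaptive_config(path=None):
    """Merge a user JSON file over the defaults; omitted keys keep defaults."""
    cfg = dict(DEFAULTS)
    if path is not None:
        with open(path) as f:
            cfg.update(json.load(f))
    return cfg
```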
## Config file reference
The config file is optional. Any omitted keys use defaults.
| Key | Meaning |
|---|---|
| `candidate_steps` | Discrete candidate step tiers (default `[1, 3, 7]`) |
| `ema_alpha` | EMA smoothing factor for accepted draft length |
| `update_interval` | Recompute interval, in verify batches, after warmup |
| `warmup_batches` | Number of verify batches to observe before switching |
| `down_hysteresis` | Extra margin before moving to a smaller step |
| `up_hysteresis` | Extra margin before moving to a larger step |
The initial `--speculative-num-steps` is snapped to the nearest value in `candidate_steps`.
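Nearest-value snapping is a one-liner; this sketch shows the behavior (the function name is illustrative, and with sorted tiers a tie resolves to the smaller one):

```python
def snap_to_tier(num_steps, candidate_steps):
    """Pick the candidate tier closest to the configured value;
    min() resolves ties to the first (smaller) sorted tier."""
    return min(candidate_steps, key=lambda s: abs(s - num_steps))

print(snap_to_tier(4, [1, 3, 7]))  # → 3
```

So the `--speculative-num-steps 3` in the launch command above maps directly onto the 3-step tier, while a configured value of 4 or 5 would also start at that tier.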
## Monitoring
You can inspect the active tier and acceptance metric via /server_info:
```
curl -s http://127.0.0.1:30000/server_info | jq '.internal_states[0] | {speculative_num_steps, avg_spec_accept_length}'
```
- `speculative_num_steps` is the current active tier
- `avg_spec_accept_length` helps explain whether the server is likely to move up or down
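For programmatic monitoring, the same endpoint can be polled from Python. This is a hypothetical helper built on the `/server_info` response shape shown above, nothing more:

```python
import json
import urllib.request

def parse_spec_state(info: dict):
    """Pull the active tier and acceptance metric out of a /server_info payload."""
    state = info["internal_states"][0]
    return state["speculative_num_steps"], state["avg_spec_accept_length"]

def current_spec_tier(base_url="http://127.0.0.1:30000"):
    # Hypothetical polling helper; assumes the endpoint shown above.
    with urllib.request.urlopen(f"{base_url}/server_info") as resp:
        return parse_spec_state(json.load(resp))
```

Sampling this periodically makes it easy to correlate tier switches with traffic phases when tuning the config.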
## Tuning tips
- Start with the default candidate tiers `[1, 3, 7]`
- Use fewer tiers if you want lower startup and graph-memory overhead
- Increase `ema_alpha` to react faster, or lower it for more stability
- Increase `warmup_batches` or `update_interval` if tier switching is too noisy
- If your workload is already stable and one static setting is well tuned, adaptive mode may not help much