Attention Backends#

This document describes the attention backends available in sglang diffusion (sglang.multimodal_gen) and how to select them.

Overview#

Attention backends are defined by AttentionBackendEnum (sglang.multimodal_gen.runtime.platforms.interface.AttentionBackendEnum) and selected via the CLI flag --attention-backend.

Backend selection is performed by the shared attention layers (e.g. LocalAttention / USPAttention / UlyssesAttention in sglang.multimodal_gen.runtime.layers.attention.layer) and therefore applies to any model component using these layers (e.g. diffusion transformer / DiT and encoders).

When using the diffusers backend, --attention-backend is passed through to diffusers’ set_attention_backend (e.g., flash, _flash_3_hub, sage, xformers, native).

When no backend is specified, auto selection depends on the platform:

  • CUDA: prefers FlashAttention (FA3/FA4) when supported; otherwise falls back to PyTorch SDPA.

  • ROCm: uses FlashAttention when available; otherwise falls back to PyTorch SDPA.

  • Intel XPU: uses the XPU Flash Attention backend for fp16/bf16 inputs with head size 64, 96, 128, 192, or 256; otherwise falls back to PyTorch SDPA.

  • MUSA: uses FlashAttention when available; otherwise falls back to PyTorch SDPA.

  • MPS: always uses PyTorch SDPA.

  • NPU: uses FA for ring attention; otherwise uses PyTorch SDPA.
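The platform defaults above can be summarized as a small decision function. This is a minimal sketch, not the selector's actual code: `default_backend`, its parameters, and the lowercase platform strings are hypothetical names chosen for illustration, and the CUDA/ROCm/MUSA support checks are collapsed into a single `fa_available` flag.

```python
# Head sizes and dtypes listed above for the XPU Flash Attention backend.
XPU_HEAD_SIZES = {64, 96, 128, 192, 256}
HALF_DTYPES = {"fp16", "bf16"}


def default_backend(
    platform: str,
    *,
    fa_available: bool = False,
    dtype: str = "bf16",
    head_size: int = 128,
    ring_attention: bool = False,
) -> str:
    """Hypothetical helper: return "fa" or "torch_sdpa" following the list above."""
    if platform == "mps":
        # MPS always uses PyTorch SDPA.
        return "torch_sdpa"
    if platform == "npu":
        # NPU uses FA only for ring attention.
        return "fa" if ring_attention else "torch_sdpa"
    if platform == "xpu":
        supported = dtype in HALF_DTYPES and head_size in XPU_HEAD_SIZES
        return "fa" if supported else "torch_sdpa"
    if platform in {"cuda", "rocm", "musa"}:
        # Simplification: one flag stands in for "FlashAttention is supported/available".
        return "fa" if fa_available else "torch_sdpa"
    raise ValueError(f"unknown platform: {platform!r}")
```

For example, `default_backend("xpu", head_size=80)` falls back to `torch_sdpa` because 80 is not among the listed head sizes.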

Backend options#

For SGLang-native pipelines, the CLI accepts the lowercase names of AttentionBackendEnum. The table below lists the backends implemented by the built-in platforms. fa3/fa4 are accepted as aliases for fa.

| CLI value | Enum value | Notes |
|---|---|---|
| fa / fa3 / fa4 | FA | FlashAttention. fa3/fa4 are normalized to fa during argument parsing (ServerArgs.__post_init__). |
| torch_sdpa | TORCH_SDPA | PyTorch scaled_dot_product_attention. |
| sliding_tile_attn | SLIDING_TILE_ATTN | Sliding Tile Attention (STA). Requires st_attn. Configure via --attention-backend-config. |
| sage_attn | SAGE_ATTN | Requires sageattention. Upstream SageAttention CUDA extensions target SM80/SM86/SM89/SM90/SM120 (compute capability 8.0/8.6/8.9/9.0/12.0); see upstream setup.py: https://github.com/thu-ml/SageAttention/blob/main/setup.py. |
| sage_attn_3 | SAGE_ATTN_3 | Requires SageAttention3 installed per upstream instructions. |
| video_sparse_attn | VIDEO_SPARSE_ATTN | Requires vsa. Configure sparsity via --attention-backend-config. |
| vmoba_attn | VMOBA_ATTN | Requires kernel.attn.vmoba_attn.vmoba. Configure via --attention-backend-config. |
| aiter | AITER | Requires aiter. |
| aiter_sage | AITER_SAGE | Requires aiter. |
| sla_attn | SLA_ATTN | Sparse Linear Attention. Requires SpargeAttn. Install with pip install git+https://github.com/thu-ml/SpargeAttn.git --no-build-isolation. |
| sage_sla_attn | SAGE_SLA_ATTN | SageAttention + Sparse Linear Attention. Requires SpargeAttn (same install as SLA). |
| sparse_video_gen_2_attn | SPARSE_VIDEO_GEN_2_ATTN | Requires svg. See installation instructions at https://github.com/svg-project/Sparse-VideoGen. |
| laser_attn | LASER_ATTN | Requires attentions which can be installed with sgl_kernel_npu; available only for NPU. |
| block_sparse_attn | BLOCK_SPARSE_ATTN | Requires attentions which can be installed with sgl_kernel_npu; available only for NPU. |
| rain_fusion_attn | RAIN_FUSION_ATTN | Requires attentions which can be installed with sgl_kernel_npu; available only for NPU. |
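The relationship between CLI values, aliases, and enum names can be illustrated with a short sketch. The helpers below are hypothetical stand-ins, not the code in ServerArgs.__post_init__; trimming whitespace and lowercasing the input are assumptions made for the example.

```python
# Aliases described above; every other name is passed through unchanged.
_ALIASES = {"fa3": "fa", "fa4": "fa"}


def normalize_backend_name(name: str) -> str:
    """Hypothetical helper: map fa3/fa4 to fa, leave other CLI values as-is."""
    key = name.strip().lower()  # assumption: input is trimmed and lowercased
    return _ALIASES.get(key, key)


def to_enum_name(cli_value: str) -> str:
    """CLI values are the lowercase enum names, so the enum name is the uppercase form."""
    return normalize_backend_name(cli_value).upper()
```

With these definitions, `to_enum_name("fa4")` yields `"FA"` and `to_enum_name("sage_attn_3")` yields `"SAGE_ATTN_3"`.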

Selection priority#

The selection order in runtime/layers/attention/selector.py is (highest priority first):

  1. global_force_attn_backend(...) / global_force_attn_backend_context_manager(...)

  2. Component override from --component-attention-backends while that component is being constructed

  3. CLI --attention-backend (ServerArgs.attention_backend)

  4. Auto selection (platform capability, dtype, and installed packages)
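The four-level priority above amounts to a "first non-empty source wins" lookup. The following is a minimal sketch under that reading; `resolve_backend` and its parameter names are hypothetical and do not mirror the real signatures in selector.py.

```python
from typing import Mapping, Optional


def resolve_backend(
    *,
    forced: Optional[str] = None,
    component: Optional[str] = None,
    component_overrides: Optional[Mapping[str, str]] = None,
    cli_backend: Optional[str] = None,
    auto_backend: str = "torch_sdpa",
) -> str:
    """Hypothetical resolver: return the backend from the highest-priority source."""
    # 1. A globally forced backend beats everything else.
    if forced is not None:
        return forced
    # 2. A per-component override applies while that component is being built.
    if component is not None and component_overrides:
        override = component_overrides.get(component)
        if override is not None:
            return override
    # 3. The value of --attention-backend.
    if cli_backend is not None:
        return cli_backend
    # 4. Whatever auto selection picked for this platform.
    return auto_backend
```

For instance, with `cli_backend="fa"` and `component_overrides={"text_encoder": "torch_sdpa"}`, the text encoder resolves to `torch_sdpa` while the transformer resolves to `fa`.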

Configuration#

Some backends require additional configuration. You can pass these parameters via --attention-backend-config. This argument accepts:

  • A path to a JSON or YAML configuration file.

  • A JSON string (e.g., '{"sparsity": 0.5}').

  • Key-value pairs (e.g., "sparsity=0.5,enable_x=true").
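To make the three accepted forms concrete, here is a rough sketch of how such an argument could be turned into a dictionary. `parse_backend_config` is a hypothetical helper, not sglang's parser: it reads only JSON files (YAML is left out to stay within the standard library), and the way key-value strings are coerced to numbers and booleans is an assumption.

```python
import json
import os
from typing import Any, Dict


def _coerce(text: str) -> Any:
    """Turn "0.5", "15", or "true" into Python values; keep other text as str."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return text


def parse_backend_config(arg: str) -> Dict[str, Any]:
    """Hypothetical parser for a config file path, a JSON string, or key=value pairs."""
    arg = arg.strip()
    # Form 1: path to a configuration file (this sketch handles JSON only).
    if os.path.isfile(arg):
        if not arg.endswith(".json"):
            raise ValueError("this sketch only reads JSON files")
        with open(arg, "r", encoding="utf-8") as f:
            return json.load(f)
    # Form 2: inline JSON object.
    if arg.startswith("{"):
        return json.loads(arg)
    # Form 3: comma-separated key=value pairs.
    config: Dict[str, Any] = {}
    for pair in arg.split(","):
        if not pair.strip():
            continue
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        config[key.strip()] = _coerce(value.strip())
    return config
```

Because this sketch splits on commas, values that themselves contain commas (such as lists) are easier to pass in the JSON form.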

Supported Configuration Parameters#

Sliding Tile Attention (sliding_tile_attn)

| Parameter | Type | Description | Default |
|---|---|---|---|
| mask_strategy_file_path | str | Required. Path to the mask strategy JSON file. | - |
| sta_mode | str | Mode of STA. | STA_inference |
| skip_time_steps | int | Number of steps to use full attention before switching to sparse attention. | 15 |
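One way to read skip_time_steps is as a per-step switch between full and sparse attention. The sketch below illustrates that reading only; `use_sparse_attention` is a hypothetical helper, and the 0-based step index is an assumption rather than a documented detail of the STA backend.

```python
def use_sparse_attention(step_index: int, skip_time_steps: int = 15) -> bool:
    """Hypothetical check: True once the first `skip_time_steps` steps have used full attention.

    Assumes `step_index` counts denoising steps from 0.
    """
    return step_index >= skip_time_steps


# Example schedule for a 50-step run with the default of 15.
schedule = ["sparse" if use_sparse_attention(i) else "full" for i in range(50)]
```

Under these assumptions, steps 0 through 14 run with full attention and the remaining 35 steps use sparse attention.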

Video Sparse Attention (video_sparse_attn)

| Parameter | Type | Description | Default |
|---|---|---|---|
| sparsity | float | Validation sparsity (0.0 - 1.0). | 0.0 |

V-MoBA (vmoba_attn)

| Parameter | Type | Description | Default |
|---|---|---|---|
| temporal_chunk_size | int | Chunk size for temporal dimension. | - |
| temporal_topk | int | Top-K tokens to select in temporal dimension. | - |
| spatial_chunk_size | list[int] | Chunk size for spatial dimension (H, W). | - |
| spatial_topk | int | Top-K tokens to select in spatial dimension. | - |
| st_chunk_size | list[int] | Chunk size for spatiotemporal dimension (T, H, W). | - |
| st_topk | int | Top-K tokens to select in spatiotemporal dimension. | - |
| moba_select_mode | str | Selection mode (e.g., threshold). | threshold |
| moba_threshold | float | Threshold value for selection. | 0.25 |
| moba_threshold_type | str | Type of thresholding (e.g., query_head). | query_head |
| first_full_step | int | Number of initial steps to use full attention. | 12 |
| first_full_layer | int | Number of initial layers to use full attention. | 0 |
| temporal_layer | int | Number of temporal layers. | 1 |
| spatial_layer | int | Number of spatial layers. | 1 |
| st_layer | int | Number of spatiotemporal layers. | 1 |

Block Sparse Attention (block_sparse_attn) and Rain Fusion Attention (rain_fusion_attn)

| Parameter | Type | Description | Default |
|---|---|---|---|
| skip_first_steps | int | Number of steps to use laser attention before switching to sparse attention. | 10 |
| sparsity | float | The sparsity coefficient must be in the range (0, 1). | 0.2 |

Platform support matrix#

Platforms covered: CUDA, ROCm, XPU, MUSA, MPS, NPU.

| Backend | Notes |
|---|---|
| fa | CUDA requires SM80+ and fp16/bf16. XPU uses its own flash attention backend. FlashAttention is only used when the required runtime is installed; otherwise it falls back to torch_sdpa. No extra installations are required for NPU. |
| torch_sdpa | Most compatible option across platforms. |
| sliding_tile_attn | CUDA-only. Requires st_attn. Configure via --attention-backend-config. |
| sage_attn | CUDA-only (optional dependency). |
| sage_attn_3 | CUDA-only (optional dependency). |
| video_sparse_attn | CUDA-only. Requires vsa. Configure sparsity via --attention-backend-config. |
| sla_attn | CUDA-only. Requires SpargeAttn. |
| sage_sla_attn | CUDA-only. Requires SpargeAttn. |
| vmoba_attn | CUDA-only. Requires kernel.attn.vmoba_attn.vmoba. Configure via --attention-backend-config. |
| aiter | Requires aiter. |
| aiter_sage | Requires aiter. |
| sparse_video_gen_2_attn | CUDA-only. Requires svg. |
| laser_attn | NPU-only. Requires attentions from sgl_kernel_npu. Uses SDPA if seqlen is less than 2048. |
| block_sparse_attn | NPU-only. Requires attentions from sgl_kernel_npu. Configure via --attention-backend-config. |
| rain_fusion_attn | NPU-only. Requires attentions from sgl_kernel_npu. Configure via --attention-backend-config. |

Usage#

Select a backend via CLI#

# Use FlashAttention
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend fa

# Use PyTorch SDPA
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend torch_sdpa

Override one component#

Use component overrides when a specific module needs a different attention backend from the main transformer:

sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend fa \
  --component-attention-backends text_encoder=torch_sdpa

Component keys match pipeline module names from model_index.json, such as text_encoder, text_encoder_2, transformer, transformer_2, or connectors.

Using Sliding Tile Attention (STA)#

# Pass the mask strategy file path via config
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend sliding_tile_attn \
  --attention-backend-config "mask_strategy_file_path=/abs/path/to/mask_strategy.json"

Notes for ROCm / MPS#

  • ROCm: use --attention-backend torch_sdpa or fa depending on what is available in your environment.

  • MPS: the platform implementation always uses torch_sdpa.