Quantization on Ascend#
To load an already quantized model, simply load the model weights and config. If the model has been quantized offline, there is no need to pass the --quantization argument when starting the engine; the quantization method is parsed automatically from the downloaded quant_model_description.json or config.json.
SGLang supports mixed-bits quantization: each layer is defined and loaded independently according to the quantization type specified in quant_model_description.json. Advanced mixed-bits support for MoE is in progress and will add independent quantization selection for the w13 (up-gate) and w2 (down) layers.
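For example, a checkpoint quantized offline (e.g. with msModelSlim) can be served directly. The following is a minimal sketch using the standard launch_server entry point; the checkpoint path is a placeholder:

```bash
# Serve an offline-quantized checkpoint (placeholder path).
# No --quantization flag is needed: the scheme is read from the checkpoint's
# quant_model_description.json (or config.json).
python -m sglang.launch_server \
    --model-path /path/to/modelslim-quantized-checkpoint \
    --tp 2
```
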
| Quantization scheme | Scheme class | Layer type | A2 Supported | A3 Supported | A5 Supported | Diffusion models |
|---|---|---|---|---|---|---|
| W4A4 dynamic |  | Linear | √ | √ | TBD | √ |
| W8A8 static |  | Linear | √ | √ | TBD | √ |
| W8A8 dynamic |  | Linear | √ | √ | TBD | √ |
|  |  | Linear | x | x | WIP | √ (A5) |
| W4A4 dynamic |  | MoE | √ | √ | TBD | x |
| W4A8 dynamic |  | MoE | √ | √ | TBD | x |
| W8A8 dynamic |  | MoE | √ | √ | TBD | x |
|  |  | MoE | x | x | WIP | x |

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
| W4A16 | Linear | √ | √ | TBD |
| W8A16 | Linear | √ | √ | TBD |
| W4A16 | MoE | √ | √ | TBD |
GPTQ on Ascend support:
| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
|  | Linear | √ | √ | TBD |
|  | Linear | √ | √ | TBD |
|  | MoE | √ | √ | TBD |
|  | MoE | √ | √ | TBD |

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
| W4A16 | Linear | √ | √ | TBD |
| W8A16 | Linear | √ | √ | TBD |
| W4A16 | MoE | √ | √ | TBD |
| W8A16 | MoE | √ | √ | TBD |
Compressed-tensors (LLM Compressor) on Ascend support:
| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
|  | Linear | √ | √ | TBD |
|  | MoE | √ | √ | TBD |
|  | MoE | √ | √ | TBD |
|  | MoE | √ | √ | TBD |

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
|  | Linear | √ | √ | TBD |
|  | MoE | √ | √ | TBD |
Note: On Ascend, GGUF weights are pre-dequantized to FP16/BF16 during model loading to ensure optimal inference performance. This enables support for all GGUF quantization types (Q2_K, Q4_K_M, IQ4_XS, etc.) while maintaining high inference speed.
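A sketch of serving a GGUF checkpoint, assuming placeholder paths; depending on your SGLang version you may also need to point --tokenizer-path at the original (non-GGUF) model repository:

```bash
# Serve a GGUF checkpoint; on Ascend the weights are dequantized to FP16/BF16
# at load time (placeholder paths).
python -m sglang.launch_server \
    --model-path /path/to/model-Q4_K_M.gguf \
    --tokenizer-path /path/to/original-hf-model
```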
Diffusion Model Quantization on Ascend NPU#
SGLang-Diffusion supports MXFP8 online and offline quantization for diffusion models (such as Wan2.2) on Ascend NPUs. MXFP8 requires A5; the ModelSlim W8A8/W4A4 schemes work on A2/A3.
Requirements for MXFP8: CANN ≥ 8.0.RC3, Ascend A5
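A quick way to sanity-check the environment before enabling MXFP8 (assuming the default CANN install location; adjust the toolkit path if CANN is installed elsewhere):

```bash
# List NPU devices and driver/firmware information
npu-smi info

# Print the installed CANN toolkit version (default install path; may differ)
cat /usr/local/Ascend/ascend-toolkit/latest/*/ascend_toolkit_install.info
```
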
| Quantization method | Scheme class | Mode | A2/A3 Supported | A5 Supported | Trigger |
|---|---|---|---|---|---|
| MXFP8 (W8A8) | — | Online | x | √ | --quantization mxfp8 |
| MXFP8 (W8A8) |  | Offline | x | √ | auto-detected from quant_model_description.json |
| W8A8 static |  | Offline | √ | TBD | auto-detected from quant_model_description.json |
| W8A8 dynamic |  | Offline | √ | TBD | auto-detected from quant_model_description.json |
| W4A4 dynamic |  | Offline | √ | TBD | auto-detected from quant_model_description.json |
Online MXFP8 Quantization#
Online quantization dynamically quantizes FP16/BF16 weights to MXFP8 at load time using the npu_dynamic_mx_quant and npu_quant_matmul CANN kernels. Pass --quantization mxfp8 to override auto-detection.
```bash
# Start the diffusion server with online MXFP8 quantization
sglang serve \
    --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
    --quantization mxfp8 \
    --num-gpus 4

# One-shot generation
sglang generate \
    --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
    --quantization mxfp8 \
    --prompt "a beautiful sunset over the mountains" \
    --save-output
```
Offline MXFP8 Quantization (ModelSlim)#
For offline quantization, pre-quantize the model with msModelSlim and load the resulting checkpoint. The quantization scheme is auto-detected from quant_model_description.json, so no extra --quantization flag is needed.
Step 1: Quantize with msModelSlim
```bash
msmodelslim quant \
    --model_path /path/to/wan2_2_float_weights \
    --save_path /path/to/wan2_2_mxfp8_weights \
    --device npu \
    --model_type Wan2_2 \
    --quant_type mxfp8 \
    --trust_remote_code True
```
Note: SGLang does not support quantized embeddings; disable embedding quantization when using msmodelslim.
Step 2: Convert to Diffusers format
msModelSlim saves quantized Wan2.2 weights in the original Wan format. Convert to Diffusers format using the provided repack script:
```bash
python python/sglang/multimodal_gen/tools/wan_repack.py \
    --input-path /path/to/wan2_2_mxfp8_weights \
    --output-path /path/to/wan2_2_mxfp8_diffusers
```
Then copy all files from the original Diffusers checkpoint (except the transformer/transformer_2 folders) into the output directory.
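For example, with rsync (the source directory is a placeholder for a local copy of the original Diffusers checkpoint):

```bash
# Copy everything from the original Diffusers checkpoint except the
# transformer/transformer_2 folders, which now hold the repacked MXFP8 weights.
rsync -a \
    --exclude 'transformer/' \
    --exclude 'transformer_2/' \
    /path/to/Wan2.2-T2V-A14B-Diffusers/ \
    /path/to/wan2_2_mxfp8_diffusers/
```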
Step 3: Run inference
```bash
sglang generate \
    --model-path /path/to/wan2_2_mxfp8_diffusers \
    --prompt "a beautiful sunset over the mountains" \
    --save-output
```
For pre-quantized checkpoints available on ModelScope, see modelscope/Eco-Tech.