Quantization on Ascend#
To load an already quantized model, simply load the model weights and config. If the model has been quantized offline, there is no need to pass the --quantization argument when starting the engine; the quantization method is parsed automatically from the downloaded quant_model_description.json or config.json.
SGLang supports mixed-bits quantization: each layer is defined and loaded independently according to the quantization type specified in quant_model_description.json. Advanced mixed-bits support for MoE is in progress and will add independent quantization selection for the w13 (up-gate) and w2 (down) layers.
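For example, a checkpoint quantized offline (e.g. with msModelSlim) can be served directly. The following is a minimal sketch using the standard launch_server entry point; the checkpoint path is a placeholder:

```bash
# Serve an offline-quantized checkpoint (placeholder path).
# No --quantization flag is needed: the scheme is read from the checkpoint's
# quant_model_description.json (or config.json).
python -m sglang.launch_server \
    --model-path /path/to/modelslim-quantized-checkpoint \
    --tp 2
```
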
| Quantization scheme | Scheme class | Layer type | A2 Supported | A3 Supported | A5 Supported | Diffusion models |
|---|---|---|---|---|---|---|
| W4A4 dynamic |  | Linear | √ | √ | TBD | √ |
| W8A8 static |  | Linear | √ | √ | TBD | √ |
| W8A8 dynamic |  | Linear | √ | √ | TBD | √ |
|  |  | Linear | x | x | WIP | √ (A5) |
| W4A4 dynamic |  | MoE | √ | √ | TBD | x |
| W4A8 dynamic |  | MoE | √ | √ | TBD | x |
| W8A8 dynamic |  | MoE | √ | √ | TBD | x |
|  |  | MoE | x | x | WIP | x |

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
| W4A16 | Linear | √ | √ | TBD |
| W8A16 | Linear | √ | √ | TBD |
| W4A16 | MoE | √ | √ | TBD |
GPTQ on Ascend support:
| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
|  | Linear | √ | √ | TBD |
|  | Linear | √ | √ | TBD |
|  | MoE | √ | √ | TBD |
|  | MoE | √ | √ | TBD |

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
| W4A16 | Linear | √ | √ | TBD |
| W8A16 | Linear | √ | √ | TBD |
| W4A16 | MoE | √ | √ | TBD |
| W8A16 | MoE | √ | √ | TBD |
Compressed-tensors (LLM Compressor) on Ascend support:
| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
|  | Linear | √ | √ | TBD |
|  | MoE | √ | √ | TBD |
|  | MoE | √ | √ | TBD |
|  | MoE | √ | √ | TBD |

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
|  | Linear | √ | √ | TBD |
|  | MoE | √ | √ | TBD |
Note: On Ascend, GGUF weights are pre-dequantized to FP16/BF16 during model loading to ensure optimal inference performance. This enables support for all GGUF quantization types (Q2_K, Q4_K_M, IQ4_XS, etc.) while maintaining high inference speed.
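A sketch of serving a GGUF checkpoint, assuming placeholder paths; depending on your SGLang version you may also need to point --tokenizer-path at the original (non-GGUF) model repository:

```bash
# Serve a GGUF checkpoint; on Ascend the weights are dequantized to FP16/BF16
# at load time (placeholder paths).
python -m sglang.launch_server \
    --model-path /path/to/model-Q4_K_M.gguf \
    --tokenizer-path /path/to/original-hf-model
```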
Diffusion Model Quantization on Ascend NPU#
SGLang-Diffusion supports MXFP8 online and offline quantization for diffusion models (such as Wan2.2) on Ascend NPUs. MXFP8 requires A5; the ModelSlim W8A8/W4A4 schemes work on A2/A3.
Requirements for MXFP8: CANN ≥ 8.0.RC3, Ascend A5
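A quick way to sanity-check the environment before enabling MXFP8 (assuming the default CANN install location; adjust the toolkit path if CANN is installed elsewhere):

```bash
# List NPU devices and driver/firmware information
npu-smi info

# Print the installed CANN toolkit version (default install path; may differ)
cat /usr/local/Ascend/ascend-toolkit/latest/*/ascend_toolkit_install.info
```
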
| Quantization method | Scheme class | Mode | A2/A3 Supported | A5 Supported | Trigger |
|---|---|---|---|---|---|
| MXFP8 (W8A8) | — | Online | x | √ | --quantization mxfp8 |
| MXFP8 (W8A8) |  | Offline | x | √ | auto-detected from quant_model_description.json |
| W8A8 static |  | Offline | √ | TBD | auto-detected from quant_model_description.json |
| W8A8 dynamic |  | Offline | √ | TBD | auto-detected from quant_model_description.json |
| W4A4 dynamic |  | Offline | √ | TBD | auto-detected from quant_model_description.json |
Online MXFP8 Quantization#
Online quantization dynamically quantizes FP16/BF16 weights to MXFP8 at load time using the npu_dynamic_mx_quant and npu_quant_matmul CANN kernels. Pass --quantization mxfp8 to override auto-detection.
```bash
# Start the diffusion server with online MXFP8 quantization
sglang serve \
    --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
    --quantization mxfp8 \
    --num-gpus 4

# One-shot generation
sglang generate \
    --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
    --quantization mxfp8 \
    --prompt "a beautiful sunset over the mountains" \
    --save-output
```
Offline MXFP8 Quantization (ModelSlim)#
For offline quantization, pre-quantize the model with msModelSlim and load the resulting checkpoint. The quantization scheme is auto-detected from quant_model_description.json, so no extra --quantization flag is needed.
Step 1: Quantize with msModelSlim
```bash
msmodelslim quant \
    --model_path /path/to/wan2_2_float_weights \
    --save_path /path/to/wan2_2_mxfp8_weights \
    --device npu \
    --model_type Wan2_2 \
    --quant_type mxfp8 \
    --trust_remote_code True
```
Note: SGLang does not support quantized embeddings; disable embedding quantization when using msmodelslim.
Step 2: Convert to Diffusers format
msModelSlim saves quantized Wan2.2 weights in the original Wan format. Convert to Diffusers format using the provided repack script:
```bash
python python/sglang/multimodal_gen/tools/wan_repack.py \
    --input-path /path/to/wan2_2_mxfp8_weights \
    --output-path /path/to/wan2_2_mxfp8_diffusers
```
Then copy all files from the original Diffusers checkpoint (except the transformer/transformer_2 folders) into the output directory.
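For example, with rsync (the source directory is a placeholder for a local copy of the original Diffusers checkpoint):

```bash
# Copy everything from the original Diffusers checkpoint except the
# transformer/transformer_2 folders, which now hold the repacked MXFP8 weights.
rsync -a \
    --exclude 'transformer/' \
    --exclude 'transformer_2/' \
    /path/to/Wan2.2-T2V-A14B-Diffusers/ \
    /path/to/wan2_2_mxfp8_diffusers/
```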
Step 3: Run inference
```bash
sglang generate \
    --model-path /path/to/wan2_2_mxfp8_diffusers \
    --prompt "a beautiful sunset over the mountains" \
    --save-output
```
For pre-quantized checkpoints available on ModelScope, see modelscope/Eco-Tech.