Ascend NPU Quickstart#
Prerequisites#
Supported Devices#
Atlas 800I A2 inference series (Atlas 800I A2)
Atlas 800I A3 inference series (Atlas 800I A3)
Setup environment using container#
Notice: The following commands are based on Atlas 800I A3 machines. If you are using Atlas 800I A2, some changes are needed.
The image tag needs to be
main-cann8.5.0-a3for Atlas 800I A3 andmain-cann8.5.0-910bfor Atlas 800I A2.The device mapping in
docker runcommand needs to be changed todavinci[0-7]for Atlas 800I A2.
# For Atlas 800I A3
export IMAGE=quay.io/ascend/sglang:main-cann8.5.0-a3
docker run -it --rm --privileged --network=host --ipc=host --shm-size=16g \
--device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \
--device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \
--device=/dev/davinci8 --device=/dev/davinci9 --device=/dev/davinci10 --device=/dev/davinci11 \
--device=/dev/davinci12 --device=/dev/davinci13 --device=/dev/davinci14 --device=/dev/davinci15 \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--volume /usr/local/sbin:/usr/local/sbin \
--volume /usr/local/Ascend/driver:/usr/local/Ascend/driver \
--volume /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
--volume /etc/ascend_install.info:/etc/ascend_install.info \
--volume /var/queue_schedule:/var/queue_schedule \
--volume ~/.cache/:/root/.cache/ \
--entrypoint=bash \
$IMAGE
Usage#
The SGLang server is installed in the container by default. You can use pip show sglang to check the version.
Start SGLang server#
SGLang will automatically download the model from Hugging Face.
# Set HF_ENDPOINT to a mirror site if network is not available
export HF_ENDPOINT=https://hf-mirror.com
# Set your own HF_TOKEN to download restricted models
export HF_TOKEN=<secret>
# Start SGLang server
# It may take several minutes to download the model on the first run
sglang serve --model-path Qwen/Qwen2.5-7B-Instruct --attention-backend ascend &
If you see output like the following, the server is running.
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
The server is fired up and ready to roll!
Send a test request#
You can do inference using the server:
curl -X POST http://localhost:30000/generate \
-H "Content-Type: application/json" \
-d '{
"text": "The capital of France is",
"sampling_params": {
"temperature": 0,
"max_new_tokens": 16
}
}'
If the “text” field in the response contains “Paris”, the server is working as expected.
Stop server and exit container#
The SGLang server is running as a background process. You can send a SIGINT signal to stop it.
SGLANG_PID=$(pgrep -f "sglang serve")
kill -SIGINT $SGLANG_PID
The output should be like the following:
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [25310]
The server has now stopped. You can verify it with ps -ef | grep sglang, then exit the container by pressing Ctrl+D.