Backend: SGLang Runtime (SRT)#
The SGLang Runtime (SRT) is an efficient serving engine.
Quick Start#
Launch a server
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
Send a request
curl http://localhost:30000/generate \
-H "Content-Type: application/json" \
-d '{
"text": "Once upon a time,",
"sampling_params": {
"max_new_tokens": 16,
"temperature": 0
}
}'
Learn more about the argument specification, streaming, and multi-modal support here.
OpenAI Compatible API#
In addition, the server supports OpenAI-compatible APIs.
import openai
client = openai.Client(
base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
# Text completion
response = client.completions.create(
model="default",
prompt="The capital of France is",
temperature=0,
max_tokens=32,
)
print(response)
# Chat completion
response = client.chat.completions.create(
model="default",
messages=[
{"role": "system", "content": "You are a helpful AI assistant"},
{"role": "user", "content": "List 3 countries and their capitals."},
],
temperature=0,
max_tokens=64,
)
print(response)
# Text embedding
response = client.embeddings.create(
model="default",
input="How are you today",
)
print(response)
It supports streaming, vision, and almost all features of the Chat/Completions/Models/Batch endpoints specified by the OpenAI API Reference.
Additional Server Arguments#
To enable multi-GPU tensor parallelism, add
--tp 2
. If it reports the error “peer access is not supported between these two devices”, add--enable-p2p-check
to the server launch command.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2
To enable multi-GPU data parallelism, add
--dp 2
. Data parallelism is better for throughput if there is enough memory. It can also be used together with tensor parallelism. The following command uses 4 GPUs in total.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --dp 2 --tp 2
If you see out-of-memory errors during serving, try to reduce the memory usage of the KV cache pool by setting a smaller value of
--mem-fraction-static
. The default value is0.9
.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --mem-fraction-static 0.7
See hyperparameter tuning on tuning hyperparameters for better performance.
If you see out-of-memory errors during prefill for long prompts, try to set a smaller chunked prefill size.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
To enable the experimental overlapped scheduler, add
--enable-overlap-scheduler
. It overlaps CPU scheduler with GPU computation and can accelerate almost all workloads. This does not work for constrained decoding currenly.To enable torch.compile acceleration, add
--enable-torch-compile
. It accelerates small models on small batch sizes. This does not work for FP8 currenly.To enable torchao quantization, add
--torchao-config int4wo-128
. It supports various quantization strategies.To enable fp8 weight quantization, add
--quantization fp8
on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.To enable fp8 kv cache quantization, add
--kv-cache-dtype fp8_e5m2
.If the model does not have a chat template in the Hugging Face tokenizer, you can specify a custom chat template.
To run tensor parallelism on multiple nodes, add
--nnodes 2
. If you have two nodes with two GPUs on each node and want to run TP=4, letsgl-dev-0
be the hostname of the first node and50000
be an available port, you can use the following commands. If you meet deadlock, please try to add--disable-cuda-graph
# Node 0
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init sgl-dev-0:50000 --nnodes 2 --node-rank 0
# Node 1
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init sgl-dev-0:50000 --nnodes 2 --node-rank 1
Engine Without HTTP Server#
We also provide an inference engine without a HTTP server. For example,
import sglang as sgl
def main():
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
print("===============================")
print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
if __name__ == "__main__":
main()
This can be used for offline batch inference and building custom servers. You can view the full example here.
Supported Models#
Generative Models
Llama / Llama 2 / Llama 3 / Llama 3.1
Mistral / Mixtral / Mistral NeMo
Gemma / Gemma 2
Qwen / Qwen 2 / Qwen 2 MoE / Qwen 2 VL
DeepSeek / DeepSeek 2
OLMoE
-
python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov --port=30000 --chat-template=chatml-llava
python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-72b-ov --port=30000 --tp-size=8 --chat-template=chatml-llava
Query the server with the OpenAI Vision API. See examples at test/srt/test_vision_openai_server.py
LLaVA 1.5 / 1.6 / NeXT
python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --port=30000 --tp-size=1 --chat-template=llava_llama_3
python -m sglang.launch_server --model-path lmms-lab/llava-next-72b --port=30000 --tp-size=8 --chat-template=chatml-llava
Query the server with the OpenAI Vision API. See examples at test/srt/test_vision_openai_server.py
Yi-VL
StableLM
Command-R
DBRX
Grok
ChatGLM
InternLM 2
Exaone 3
BaiChuan2
MiniCPM / MiniCPM 3
XVERSE / XVERSE MoE
SmolLM
GLM-4
Embedding Models
e5-mistral
gte-Qwen2
python -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-7B-instruct --is-embedding
Instructions for supporting a new model are here.
Use Models From ModelScope#
More
To use a model from ModelScope, set the environment variable SGLANG_USE_MODELSCOPE.
export SGLANG_USE_MODELSCOPE=true
Launch Qwen2-7B-Instruct Server
SGLANG_USE_MODELSCOPE=true python -m sglang.launch_server --model-path qwen/Qwen2-7B-Instruct --port 30000
Or start it by docker.
docker run --gpus all \
-p 30000:30000 \
-v ~/.cache/modelscope:/root/.cache/modelscope \
--env "SGLANG_USE_MODELSCOPE=true" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 30000
Run Llama 3.1 405B#
More
# Run 405B (fp8) on a single node
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8
# Run 405B (fp16) on two nodes
## on the first node, replace the `172.16.4.52:20000` with your own first node ip address and port
GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 0 --disable-cuda-graph
## on the first node, replace the `172.16.4.52:20000` with your own first node ip address and port
GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 1 --disable-cuda-graph
Benchmark Performance#
Benchmark a single static batch by running the following command without launching a server. The arguments are the same as for
launch_server.py
. Note that this is not a dynamic batching server, so it may run out of memory for a batch size that a real server can handle. A real server truncates the prefill into several batches, while this unit test does not. For accurate large batch testing, please usesglang.bench_serving
instead.python -m sglang.bench_latency --model-path meta-llama/Meta-Llama-3-8B-Instruct --batch 32 --input-len 256 --output-len 32
Benchmark online serving. Launch a server first and run the following command.
python3 -m sglang.bench_serving --backend sglang --num-prompt 10