Quick Start: Sending Requests#
This notebook provides a quick-start guide for using SGLang after installation.
Launch a server#
This code block is equivalent to executing
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--port 30000 --host 0.0.0.0
in your terminal and waiting for the server to be ready.
[1]:
from sglang.utils import (
    execute_shell_command,
    wait_for_server,
    terminate_process,
    print_highlight,
)

server_process = execute_shell_command(
    """
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
 --port 30000 --host 0.0.0.0
"""
)

wait_for_server("http://localhost:30000")
[2024-11-02 03:47:51] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Meta-Llama-3.1-8B-Instruct', chat_template=None, is_embedding=False, host='0.0.0.0', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=236831321, constrained_json_whitespace_pattern=None, decode_log_interval=40, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, watchdog_timeout=600, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_penalizer=False, disable_nan_detection=False, enable_overlap_schedule=False, enable_mixed_chunk=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1)
[2024-11-02 03:48:07 TP0] Init torch distributed begin.
[2024-11-02 03:48:08 TP0] Load weight begin. avail mem=78.59 GB
[2024-11-02 03:48:09 TP0] lm_eval is not installed, GPTQ may not be usable
INFO 11-02 03:48:09 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:02, 1.17it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:01<00:01, 1.04it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:02<00:00, 1.01it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.34it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.22it/s]
[2024-11-02 03:48:13 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=63.50 GB
[2024-11-02 03:48:13 TP0] Memory pool end. avail mem=8.37 GB
[2024-11-02 03:48:13 TP0] Capture cuda graph begin. This can take up to several minutes.
[2024-11-02 03:48:21 TP0] max_total_num_tokens=442913, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072
[2024-11-02 03:48:21] INFO: Started server process [3456625]
[2024-11-02 03:48:21] INFO: Waiting for application startup.
[2024-11-02 03:48:21] INFO: Application startup complete.
[2024-11-02 03:48:21] INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2024-11-02 03:48:22] INFO: 127.0.0.1:50582 - "GET /get_model_info HTTP/1.1" 200 OK
[2024-11-02 03:48:22 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-11-02 03:48:22] INFO: 127.0.0.1:50604 - "GET /v1/models HTTP/1.1" 200 OK
[2024-11-02 03:48:22] INFO: 127.0.0.1:50590 - "POST /generate HTTP/1.1" 200 OK
[2024-11-02 03:48:22] The server is fired up and ready to roll!
NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
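If you prefer to check readiness yourself instead of relying on wait_for_server, you can poll the server directly. The sketch below is illustrative, not part of the official quick start: it uses the /get_model_info endpoint that appears in the server logs above, and the helper name wait_until_ready is our own.

import time

import requests

def wait_until_ready(base_url: str, timeout: float = 120.0) -> dict:
    # Poll /get_model_info (the endpoint visible in the logs above) until it answers.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            resp = requests.get(f"{base_url}/get_model_info", timeout=5)
            if resp.status_code == 200:
                return resp.json()
        except requests.ConnectionError:
            pass  # server process is not accepting connections yet
        time.sleep(1)
    raise TimeoutError(f"Server at {base_url} did not become ready within {timeout}s")

# Example: info = wait_until_ready("http://localhost:30000")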
Send a Request#
Once the server is up, you can send test requests using curl. The server implements the OpenAI-compatible API.
[2]:
import subprocess

curl_command = """
curl http://localhost:30000/v1/chat/completions \\
  -H "Content-Type: application/json" \\
  -H "Authorization: Bearer None" \\
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is an LLM? Tell me in one sentence."
      }
    ]
  }'
"""

response = subprocess.check_output(curl_command, shell=True).decode()
print_highlight(response)
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
[2024-11-02 03:48:27 TP0] Prefill batch. #new-seq: 1, #new-token: 52, #cached-token: 1, cache hit rate: 1.67%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-11-02 03:48:27 TP0] Decode batch. #running-req: 1, #token: 86, token usage: 0.00, gen throughput (token/s): 5.87, #queue-req: 0
[2024-11-02 03:48:27] INFO: 127.0.0.1:50650 - "POST /v1/chat/completions HTTP/1.1" 200 OK
100 891 100 613 100 278 1498 679 --:--:-- --:--:-- --:--:-- 2178
{"id":"30375befa17a47c690a4f2a00380d149","object":"chat.completion","created":1730519307,"model":"meta-llama/Meta-Llama-3.1-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"An LLM, short for Large Language Model, is a type of artificial intelligence (AI) designed to process, analyze, and generate human-like language, often used in applications such as chatbots, voice assistants, and natural language processing tasks."},"logprobs":null,"finish_reason":"stop","matched_stop":128009}],"usage":{"prompt_tokens":53,"total_tokens":103,"completion_tokens":50,"prompt_tokens_details":null}}
Using OpenAI Python Client#
You can use the OpenAI Python API library to send requests.
[3]:
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print_highlight(response)
[2024-11-02 03:48:28 TP0] Prefill batch. #new-seq: 1, #new-token: 20, #cached-token: 29, cache hit rate: 27.52%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-11-02 03:48:28 TP0] Decode batch. #running-req: 1, #token: 73, token usage: 0.00, gen throughput (token/s): 56.67, #queue-req: 0
[2024-11-02 03:48:28] INFO: 127.0.0.1:40886 - "POST /v1/chat/completions HTTP/1.1" 200 OK
ChatCompletion(id='a67d781ba00f424699a92975b3ec752c', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Here are 3 countries and their capitals:\n\n1. **Country:** Japan\n**Capital:** Tokyo\n\n2. **Country:** Australia\n**Capital:** Canberra\n\n3. **Country:** Brazil\n**Capital:** Brasília', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None), matched_stop=128009)], created=1730519308, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=46, prompt_tokens=49, total_tokens=95, completion_tokens_details=None, prompt_tokens_details=None))
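The OpenAI client can also stream tokens as they are generated. The following sketch uses the standard stream=True flag of the openai library and assumes the server's OpenAI-compatible endpoint supports streaming; treat it as a minimal illustration rather than part of the official quick start.

# Streaming sketch: same client as above, but with stream=True.
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
    stream=True,  # yield incremental chunks instead of one full response
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)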
Using Native Generation APIs#
You can also use the native /generate endpoint, which provides more flexibility. An API reference is available at Sampling Parameters.
[4]:
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print_highlight(response.json())
[2024-11-02 03:48:28 TP0] Prefill batch. #new-seq: 1, #new-token: 3, #cached-token: 3, cache hit rate: 28.70%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-11-02 03:48:28 TP0] Decode batch. #running-req: 1, #token: 25, token usage: 0.00, gen throughput (token/s): 130.91, #queue-req: 0
[2024-11-02 03:48:28] INFO: 127.0.0.1:40896 - "POST /generate HTTP/1.1" 200 OK
{'text': ' a city of romance, art, fashion, and history. Paris is a must-visit destination for anyone who loves culture, architecture, and cuisine. From the', 'meta_info': {'prompt_tokens': 6, 'completion_tokens': 32, 'completion_tokens_wo_jump_forward': 32, 'cached_tokens': 3, 'finish_reason': {'type': 'length', 'length': 32}, 'id': '8d3c9c4af8e44548bcc42bb316b82fb6'}, 'index': 0}
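The /generate endpoint can also stream output. The sketch below is an assumption-laden illustration: it sets "stream": True in the request body and expects server-sent-event lines prefixed with "data: " ending in a "[DONE]" sentinel; consult the Sampling Parameters reference for the authoritative options if your version behaves differently.

import json

import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0, "max_new_tokens": 32},
        "stream": True,  # assumption: stream generation results back as SSE lines
    },
    stream=True,
)

for line in response.iter_lines(decode_unicode=True):
    # Each event line is expected to look like: data: {"text": "...", "meta_info": {...}}
    if line and line.startswith("data:"):
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        print(json.loads(payload)["text"])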
[5]:
terminate_process(server_process)