Vision Language Model#
SGLang supports vision language models in the same way as completion models. This tutorial uses meta-llama/Llama-3.2-11B-Vision-Instruct as the example model.
Launch A Server#
The following code is equivalent to running this in the shell:
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
--port=30010 --chat-template=llama_3_vision
Remember to add --chat-template=llama_3_vision to specify the vision chat template; otherwise, the server only supports text input.
[1]:
from sglang.utils import (
    execute_shell_command,
    wait_for_server,
    terminate_process,
    print_highlight,
)

# Launch the vision server in the background and block until it is ready.
vision_process = execute_shell_command(
    """
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
--port=30010 --chat-template=llama_3_vision
"""
)

wait_for_server("http://localhost:30010")
[2024-11-01 07:55:46] server_args=ServerArgs(model_path='meta-llama/Llama-3.2-11B-Vision-Instruct', tokenizer_path='meta-llama/Llama-3.2-11B-Vision-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Llama-3.2-11B-Vision-Instruct', chat_template='llama_3_vision', is_embedding=False, host='127.0.0.1', port=30010, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=606882579, constrained_json_whitespace_pattern=None, decode_log_interval=40, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, watchdog_timeout=600, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_penalizer=False, disable_nan_detection=False, enable_overlap_schedule=False, enable_mixed_chunk=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1)
[2024-11-01 07:55:53] Use chat template for the OpenAI-compatible API server: llama_3_vision
[2024-11-01 07:56:03 TP0] Automatically turn off --chunked-prefill-size and adjust --mem-fraction-static for multimodal models.
[2024-11-01 07:56:03 TP0] Init torch distributed begin.
[2024-11-01 07:56:04 TP0] Load weight begin. avail mem=78.59 GB
[2024-11-01 07:56:04 TP0] lm_eval is not installed, GPTQ may not be usable
INFO 11-01 07:56:04 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:01<00:04, 1.20s/it]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:02<00:03, 1.26s/it]
Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:03<00:02, 1.28s/it]
Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:05<00:01, 1.31s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:05<00:00, 1.07s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:05<00:00, 1.16s/it]
[2024-11-01 07:56:10 TP0] Load weight end. type=MllamaForConditionalGeneration, dtype=torch.bfloat16, avail mem=58.43 GB
[2024-11-01 07:56:10 TP0] Memory pool end. avail mem=11.80 GB
[2024-11-01 07:56:10 TP0] Capture cuda graph begin. This can take up to several minutes.
[2024-11-01 07:56:21 TP0] max_total_num_tokens=298440, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072
[2024-11-01 07:56:22] INFO: Started server process [1241850]
[2024-11-01 07:56:22] INFO: Waiting for application startup.
[2024-11-01 07:56:22] INFO: Application startup complete.
[2024-11-01 07:56:22] INFO: Uvicorn running on http://127.0.0.1:30010 (Press CTRL+C to quit)
[2024-11-01 07:56:22] INFO: 127.0.0.1:44614 - "GET /v1/models HTTP/1.1" 200 OK
[2024-11-01 07:56:23] INFO: 127.0.0.1:44616 - "GET /get_model_info HTTP/1.1" 200 OK
[2024-11-01 07:56:23 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-11-01 07:56:23] INFO: 127.0.0.1:44620 - "POST /generate HTTP/1.1" 200 OK
[2024-11-01 07:56:23] The server is fired up and ready to roll!
NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
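Once the startup logs show that Uvicorn is running, you can sanity-check the server by listing the available models via the OpenAI-compatible /v1/models endpoint (the same endpoint queried in the logs above). A minimal sketch, assuming the server from the cell above is still running:

import requests

# List the models served by the local SGLang server.
models = requests.get("http://localhost:30010/v1/models").json()
print_highlight(models)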
Use Curl#
[2]:
import subprocess, json, os

curl_command = """
curl http://localhost:30010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer None" \
  -d '{
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What’s in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 300
  }'
"""

response = json.loads(subprocess.check_output(curl_command, shell=True))
print_highlight(response)
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 559 0 0 100 559 0 174 0:00:03 0:00:03 --:--:-- 174
[2024-11-01 07:56:31 TP0] Prefill batch. #new-seq: 1, #new-token: 6463, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
100 559 0 0 100 559 0 132 0:00:04 0:00:04 --:--:-- 132
[2024-11-01 07:56:32 TP0] Decode batch. #running-req: 1, #token: 6496, token usage: 0.02, gen throughput (token/s): 3.66, #queue-req: 0
[2024-11-01 07:56:32 TP0] Decode batch. #running-req: 1, #token: 6536, token usage: 0.02, gen throughput (token/s): 99.00, #queue-req: 0
100 559 0 0 100 559 0 107 0:00:05 0:00:05 --:--:-- 107
[2024-11-01 07:56:33] INFO: 127.0.0.1:44624 - "POST /v1/chat/completions HTTP/1.1" 200 OK
100 1434 100 875 100 559 160 102 0:00:05 0:00:05 --:--:-- 269
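The lines of percentages above are curl's transfer progress meter; the JSON body itself follows the OpenAI chat-completions schema, so the generated text can be read directly from the parsed response. A minimal sketch, reusing the response dict from the cell above:

# The assistant's reply lives under choices[0].message.content.
print_highlight(response["choices"][0]["message"]["content"])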
Using the OpenAI-Compatible API#
[3]:
import base64, requests
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30010/v1", api_key="None")


def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


def download_image(image_url, image_path):
    response = requests.get(image_url)
    response.raise_for_status()
    with open(image_path, "wb") as f:
        f.write(response.content)


image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
image_path = "boardwalk.jpeg"

download_image(image_url, image_path)
base64_image = encode_image(image_path)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this image?",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
            ],
        }
    ],
    max_tokens=300,
)
print_highlight(response.choices[0].message.content)
[2024-11-01 07:56:39 TP0] Prefill batch. #new-seq: 1, #new-token: 6463, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-11-01 07:56:39 TP0] Decode batch. #running-req: 1, #token: 6473, token usage: 0.02, gen throughput (token/s): 6.15, #queue-req: 0
[2024-11-01 07:56:39] INFO: 127.0.0.1:47312 - "POST /v1/chat/completions HTTP/1.1" 200 OK
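Encoding the image as a base64 data URL is convenient when the file is local. If the image is already hosted, the same client call also accepts a plain HTTP(S) URL in the image_url field, just like the curl example above. A minimal sketch, reusing the client and image_url defined in the cell above:

# Pass the image by URL instead of embedding it as base64.
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ],
    max_tokens=300,
)
print_highlight(response.choices[0].message.content)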
Multiple-Image Input#
[4]:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30010/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Are there any differences between these two images?",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    },
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0])
[2024-11-01 07:56:40 TP0] Prefill batch. #new-seq: 1, #new-token: 12871, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-11-01 07:56:40 TP0] Decode batch. #running-req: 1, #token: 12891, token usage: 0.04, gen throughput (token/s): 24.88, #queue-req: 0
[2024-11-01 07:56:41 TP0] Decode batch. #running-req: 1, #token: 12931, token usage: 0.04, gen throughput (token/s): 104.23, #queue-req: 0
[2024-11-01 07:56:41] INFO: 127.0.0.1:47314 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='The two images depict a serene and peaceful natural setting, with the first featuring a serene lake and the second, a boardwalk in a field. Both images share a sense of tranquility and calmness, inviting the viewer to step into the natural world. The lake image evokes feelings of relaxation and tranquility, while the boardwalk image suggests a sense of adventure and exploration. Both images evoke a sense of connection to nature and the great outdoors.', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None), matched_stop=128009)
[5]:
terminate_process(vision_process)
os.remove(image_path)
Chat Template#
As mentioned before, if you do not specify a vision model's --chat-template, the server uses Hugging Face's default template, which only supports text.
You can add your own template by referring to the custom chat template documentation.
We list popular vision models with their chat templates:

meta-llama/Llama-3.2-Vision uses llama_3_vision.
LLaVA-NeXT uses chatml-llava.
llama3-llava-next uses llava_llama_3.
llava-onevision uses chatml-llava.
liuhaotian/llava-v1.5 / 1.6 uses vicuna_v1.1.
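For example, to serve one of the LLaVA-family models above, pass its matching template at launch time, following the same pattern as the Llama-3.2-Vision example in this tutorial. A minimal sketch; the model path below is illustrative, so substitute the checkpoint you actually want to serve:

# Illustrative example: a llava-onevision checkpoint served with its matching
# chatml-llava template (the model path here is an assumed placeholder).
vision_process = execute_shell_command(
    """
python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov \
--port=30010 --chat-template=chatml-llava
"""
)
wait_for_server("http://localhost:30010")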