OpenAI APIs - Vision#
SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models. A complete reference for the API is available in the OpenAI API Reference. This tutorial covers the vision APIs for vision language models.
SGLang supports various vision language models such as Llama 3.2, LLaVA-OneVision, Qwen2.5-VL, Gemma3 and more.
As an alternative to the OpenAI API, you can also use the SGLang offline engine.
Launch A Server#
Launch the server in your terminal and wait for it to initialize.
[1]:
from sglang.test.doc_patch import launch_server_cmd
from sglang.utils import wait_for_server, print_highlight, terminate_process
example_image_url = "https://raw.githubusercontent.com/sgl-project/sglang/main/examples/assets/example_image.png"
logo_image_url = (
"https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png"
)
vision_process, port = launch_server_cmd("""
python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --log-level warning
""")
wait_for_server(f"http://localhost:{port}", process=vision_process)
Skipping import of cpp extensions due to incompatible torch version. Please upgrade to torch >= 2.11.0 (found 2.9.1+cu130).
Skipping import of cpp extensions due to incompatible torch version. Please upgrade to torch >= 2.11.0 (found 2.9.1+cu130).
/actions-runner/_work/sglang/sglang/python/sglang/launch_server.py:54: UserWarning: 'python -m sglang.launch_server' is still supported, but 'sglang serve' is the recommended entrypoint.
Example: sglang serve --model-path <model> [options]
warnings.warn(
[2026-04-24 11:40:42] No platform detected. Using base SRTPlatform with defaults.
`torch_dtype` is deprecated! Use `dtype` instead!
[2026-04-24 11:40:44] `torch_dtype` is deprecated! Use `dtype` instead!
`BaseImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `BaseImageProcessor` instead.
[2026-04-24 11:40:45] `BaseImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `BaseImageProcessor` instead.
The `use_fast` parameter is deprecated and will be removed in a future version. Use `backend="torchvision"` instead of `use_fast=True`, or `backend="pil"` instead of `use_fast=False`.
[2026-04-24 11:40:48] The `use_fast` parameter is deprecated and will be removed in a future version. Use `backend="torchvision"` instead of `use_fast=True`, or `backend="pil"` instead of `use_fast=False`.
Skipping import of cpp extensions due to incompatible torch version. Please upgrade to torch >= 2.11.0 (found 2.9.1+cu130).
Skipping import of cpp extensions due to incompatible torch version. Please upgrade to torch >= 2.11.0 (found 2.9.1+cu130).
No platform detected. Using base SRTPlatform with defaults.
No platform detected. Using base SRTPlatform with defaults.
`BaseImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `BaseImageProcessor` instead.
`BaseImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `BaseImageProcessor` instead.
`torch_dtype` is deprecated! Use `dtype` instead!
[2026-04-24 11:40:53] `torch_dtype` is deprecated! Use `dtype` instead!
The `use_fast` parameter is deprecated and will be removed in a future version. Use `backend="torchvision"` instead of `use_fast=True`, or `backend="pil"` instead of `use_fast=False`.
[2026-04-24 11:40:54] The `use_fast` parameter is deprecated and will be removed in a future version. Use `backend="torchvision"` instead of `use_fast=True`, or `backend="pil"` instead of `use_fast=False`.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
Multi-thread loading shards: 100% Completed | 5/5 [00:03<00:00, 1.63it/s]
/usr/local/lib/python3.10/dist-packages/fastapi/routing.py:120: FastAPIDeprecationWarning: ORJSONResponse is deprecated, FastAPI now serializes data directly to JSON bytes via Pydantic when a return type or response model is set, which is faster and doesn't need a custom response class. Read more in the FastAPI docs: https://fastapi.tiangolo.com/advanced/custom-response/#orjson-or-response-model and https://fastapi.tiangolo.com/tutorial/response-model/
response = await f(request)
2026-04-24 11:41:05,965 - CUTE_DSL - WARNING - [handle_import_error] - Unexpected error during package walk: cutlass.cute.experimental
[2026-04-24 11:41:05] Unexpected error during package walk: cutlass.cute.experimental
NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
To reduce the log length, we set the log level to warning for the server, the default log level is info.
We are running those notebooks in a CI environment, so the throughput is not representative of the actual performance.
Using cURL#
Once the server is up, you can send test requests using curl or requests.
[2]:
import subprocess
curl_command = f"""
curl -s http://localhost:{port}/v1/chat/completions \\
-H "Content-Type: application/json" \\
-d '{{
"model": "Qwen/Qwen2.5-VL-7B-Instruct",
"messages": [
{{
"role": "user",
"content": [
{{
"type": "text",
"text": "What’s in this image?"
}},
{{
"type": "image_url",
"image_url": {{
"url": "{example_image_url}"
}}
}}
]
}}
],
"max_tokens": 300
}}'
"""
response = subprocess.check_output(curl_command, shell=True).decode()
print_highlight(response)
response = subprocess.check_output(curl_command, shell=True).decode()
print_highlight(response)
/actions-runner/_work/sglang/sglang/python/sglang/srt/utils/common.py:811: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1581.)
encoded_image = torch.frombuffer(image_bytes, dtype=torch.uint8)
{"id":"1247d51e486e42a7963ba8e1c41b9c44","object":"chat.completion","created":1777030872,"model":"Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The image shows a man standing on the back of a yellow taxi, ironing a blue shirt. The taxi is parked on a city street with other vehicles and buildings in the background. The man appears to be balancing on the tailgate while performing this task. The scene suggests an unusual or humorous situation, as ironing clothes outdoors is not a typical activity.","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":307,"total_tokens":380,"completion_tokens":73,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
{"id":"e2a09bb0ff2d496fac68169cc68d8364","object":"chat.completion","created":1777030873,"model":"Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The image shows a man standing on the back of a yellow taxi, ironing a blue shirt. The taxi is parked on a city street with other vehicles and buildings in the background. The man appears to be balancing on the tailgate while performing the task. The scene suggests an unusual or humorous situation, as it is not typical for someone to iron clothes from the back of a moving vehicle.","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":307,"total_tokens":387,"completion_tokens":80,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
Using Python Requests#
[3]:
import requests
url = f"http://localhost:{port}/v1/chat/completions"
data = {
"model": "Qwen/Qwen2.5-VL-7B-Instruct",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "What’s in this image?"},
{
"type": "image_url",
"image_url": {"url": example_image_url},
},
],
}
],
"max_tokens": 300,
}
response = requests.post(url, json=data)
print_highlight(response.text)
{"id":"ae020d88828b4aa3910ac5aa1e2a39b8","object":"chat.completion","created":1777030874,"model":"Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The image shows a man standing on the back of a yellow taxi, ironing a piece of clothing. The taxi is parked on a city street, and there are other taxis visible in the background. The man appears to be balancing on the tailgate while ironing, which is an unusual and humorous scene. The setting suggests an urban environment with buildings and trees in the background.","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":307,"total_tokens":384,"completion_tokens":77,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
Using OpenAI Python Client#
[4]:
from openai import OpenAI
client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="None")
response = client.chat.completions.create(
model="Qwen/Qwen2.5-VL-7B-Instruct",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?",
},
{
"type": "image_url",
"image_url": {"url": example_image_url},
},
],
}
],
max_tokens=300,
)
print_highlight(response.choices[0].message.content)
The image shows a man standing on the back of a yellow taxi, ironing a piece of clothing. The taxi is parked on a city street, and there are other taxis visible in the background. The man appears to be balancing on the tailgate while ironing, which is an unusual and humorous scene. The setting suggests it might be in a busy urban area with tall buildings and flags in the background.
Multiple-Image Inputs#
The server also supports multiple images and interleaved text and images if the model supports it.
[5]:
from openai import OpenAI
client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="None")
response = client.chat.completions.create(
model="Qwen/Qwen2.5-VL-7B-Instruct",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": example_image_url,
},
},
{
"type": "image_url",
"image_url": {
"url": logo_image_url,
},
},
{
"type": "text",
"text": "I have two very different images. They are not related at all. "
"Please describe the first image in one sentence, and then describe the second image in another sentence.",
},
],
}
],
temperature=0,
)
print_highlight(response.choices[0].message.content)
The first image shows a man ironing clothes on the back of a taxi in an urban setting. The second image is a stylized logo featuring the letters "SGL" with a book and a computer icon incorporated into the design.
[6]:
terminate_process(vision_process)