Query VLM with Offline Engine#

This tutorial shows how to use SGLang’s offline Engine API to query VLMs, using Qwen2.5-VL and Llama 4 as examples. It covers three calling approaches:

  1. Basic Call: Directly pass images and text.

  2. Processor Output: Use HuggingFace processor for data preprocessing.

  3. Precomputed Embeddings: Pre-calculate image features to improve inference efficiency.

Understanding the Three Input Formats#

SGLang supports three ways to pass visual data, each optimized for different scenarios:

1. Raw Images - Simplest approach#

  • Pass PIL Images, file paths, URLs, or base64 strings directly

  • SGLang handles all preprocessing automatically

  • Best for: Quick prototyping, simple applications

2. Processor Output - For custom preprocessing#

  • Pre-process images with HuggingFace processor

  • Pass the complete processor output dict with format: "processor_output"

  • Best for: Custom image transformations, integration with existing pipelines

  • Requirement: Must use input_ids instead of text prompt

3. Precomputed Embeddings - For maximum performance#

  • Pre-calculate visual embeddings using the vision encoder

  • Pass embeddings with format: "precomputed_embedding"

  • Best for: Repeated queries on same images, caching, high-throughput serving

  • Performance gain: Avoids redundant vision encoder computation (30-50% speedup)

Key Rule: Within a single request, use only one format for all images. Don’t mix formats.
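This rule can be checked mechanically before sending a request. A minimal sketch (the payload dicts and the `uses_single_format` helper are hypothetical, for illustration only — they are not part of the SGLang API):

```python
# Hypothetical request payloads illustrating the one-format rule:
# every entry in image_data must use the same format within a request.
ok_request = {
    "image_data": [
        {"format": "precomputed_embedding", "feature": "emb_a"},
        {"format": "precomputed_embedding", "feature": "emb_b"},
    ]
}
bad_request = {
    "image_data": [
        {"format": "precomputed_embedding", "feature": "emb_a"},
        {"format": "processor_output"},  # mixed formats: not allowed
    ]
}

def uses_single_format(request: dict) -> bool:
    # Items without an explicit format are raw images.
    formats = {item.get("format", "raw") for item in request["image_data"]}
    return len(formats) <= 1

print(uses_single_format(ok_request), uses_single_format(bad_request))  # True False
```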

The examples below demonstrate all three approaches with both Qwen2.5-VL and Llama 4 models.
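For the raw-image format, a base64 string is one of the accepted encodings. A minimal sketch of producing one (the payload bytes here are a stand-in beginning with the PNG magic number, not a real image):

```python
import base64

# Stand-in bytes; in practice these would come from a file read
# or an HTTP response body.
image_bytes = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16

# Encode to a base64 string, which can be passed in image_data
# just like a PIL Image, file path, or URL.
image_b64 = base64.b64encode(image_bytes).decode("ascii")
print(image_b64[:11])  # prints iVBORw0KGgo
```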

Querying Qwen2.5-VL Model#

[1]:
import nest_asyncio

nest_asyncio.apply()

import sglang.test.doc_patch  # noqa: F401

model_path = "Qwen/Qwen2.5-VL-3B-Instruct"
chat_template = "qwen2-vl"
example_image_url = "https://raw.githubusercontent.com/sgl-project/sglang/main/examples/assets/example_image.png"
Skipping import of cpp extensions due to incompatible torch version. Please upgrade to torch >= 2.11.0 (found 2.9.1+cu130).
[2]:
from io import BytesIO
import requests
from PIL import Image

from sglang.srt.parser.conversation import chat_templates

image = Image.open(BytesIO(requests.get(example_image_url).content))

conv = chat_templates[chat_template].copy()
conv.append_message(conv.roles[0], f"What's shown here: {conv.image_token}?")
conv.append_message(conv.roles[1], "")
conv.image_data = [image]

print("Generated prompt text:")
print(conv.get_prompt())
print(f"\nImage size: {image.size}")
image
Generated prompt text:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What's shown here: <|vision_start|><|image_pad|><|vision_end|>?<|im_end|>
<|im_start|>assistant


Image size: (570, 380)
[2]:
(The example image is displayed here: ../_images/advanced_features_vlm_query_4_1.png)

Basic Offline Engine API Call#

[3]:
from sglang import Engine

llm = Engine(model_path=model_path, chat_template=chat_template, log_level="warning")
No platform detected. Using base SRTPlatform with defaults.
`BaseImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `BaseImageProcessor` instead.
`torch_dtype` is deprecated! Use `dtype` instead!
The `use_fast` parameter is deprecated and will be removed in a future version. Use `backend="torchvision"` instead of `use_fast=True`, or `backend="pil"` instead of `use_fast=False`.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
Multi-thread loading shards: 100% Completed | 2/2 [00:01<00:00,  1.60it/s]
[4]:
out = llm.generate(prompt=conv.get_prompt(), image_data=[image])
print("Model response:")
print(out["text"])
2026-04-24 09:48:44,781 - CUTE_DSL - WARNING - [handle_import_error] - Unexpected error during package walk: cutlass.cute.experimental
Model response:
The image shows a scene from New York City with two yellow taxis parked along the street. The person in the image is hanging clothes on the back of a car. The surrounding environment and the presence of flags suggest that this is a busy urban area.

Call with Processor Output#

Use a HuggingFace processor to preprocess the text and image, then pass the processor output directly to Engine.generate.

[5]:
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(model_path, use_fast=True)
processor_output = processor(
    images=[image], text=conv.get_prompt(), return_tensors="pt"
)

out = llm.generate(
    input_ids=processor_output["input_ids"][0].detach().cpu().tolist(),
    image_data=[dict(processor_output, format="processor_output")],
)
print("Response using processor output:")
print(out["text"])
Response using processor output:
The image shows a yellow taxi cab driving down a city street. The taxi is equipped with two sandwich boards attached to its sides. The sandwich boards appear to be displaying blank or unloaded displays, suggesting that they are not currently in use for advertising or other purposes.

In the foreground, a person in a yellow shirt is standing on the sidewalk, holding a clothes rack. This suggests that the person might be involved in some type of event or activity related to the sandwich boards, possibly setting them up or keeping them in order. The background includes a building with various shop signs and windows, indicating an urban environment, likely a busy city area.

Call with Precomputed Embeddings#

You can pre-calculate image features so the visual encoding step does not have to be repeated.

[6]:
from transformers import AutoProcessor
from transformers import Qwen2_5_VLForConditionalGeneration

processor = AutoProcessor.from_pretrained(model_path, use_fast=True)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_path).eval()
vision = model.model.visual.cuda()
[7]:
processor_output = processor(
    images=[image], text=conv.get_prompt(), return_tensors="pt"
)

input_ids = processor_output["input_ids"][0].detach().cpu().tolist()

precomputed_embeddings = vision(
    processor_output["pixel_values"].cuda(), processor_output["image_grid_thw"].cuda()
)
precomputed_embeddings = precomputed_embeddings.pooler_output

multi_modal_item = dict(
    processor_output,
    format="precomputed_embedding",
    feature=precomputed_embeddings,
)

out = llm.generate(input_ids=input_ids, image_data=[multi_modal_item])
print("Response using precomputed embeddings:")
print(out["text"])

llm.shutdown()
Response using precomputed embeddings:
This image shows a scene in a busy urban area with several key elements:

1. **Yellow Cabs**: There are two yellow taxis in the foreground, parked on the street.
2. **Human Activity**: A person dressed in a yellow shirt and glasses is standing near the taxis, seemingly preparing to load or unload items from the back of the vehicles. The person is using a foldable table and some cloth or blankets.
3. **Cityscape**: The background features tall buildings and signs, indicative of a city environment, possibly Manhattan, given the specific type of yellow taxi and the cityscape.

The image appears to depict a common scenario
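The caching benefit mentioned earlier can be sketched as a plain-Python pattern: hash the image bytes, compute the embedding once, and reuse it for later requests on the same image. Everything here is hypothetical scaffolding — `compute_embedding` stands in for the expensive vision-encoder forward pass shown in the cell above:

```python
import hashlib

# Hypothetical in-process cache keyed by a hash of the image bytes.
_cache: dict[str, list[float]] = {}

def compute_embedding(image_bytes: bytes) -> list[float]:
    # Placeholder for the expensive vision-encoder forward pass.
    return [float(b) for b in image_bytes[:4]]

def cached_embedding(image_bytes: bytes) -> list[float]:
    key = hashlib.sha256(image_bytes).hexdigest()
    if key not in _cache:
        _cache[key] = compute_embedding(image_bytes)
    return _cache[key]

first = cached_embedding(b"demo")
second = cached_embedding(b"demo")  # served from the cache, encoder not rerun
print(first is second)  # True
```

In a real pipeline the cached value would be the `precomputed_embedding` item passed to `llm.generate`, so repeated queries on the same image skip the vision encoder entirely.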

Querying Llama 4 Vision Model#

model_path = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
chat_template = "llama-4"

from io import BytesIO
import requests
from PIL import Image

from sglang.srt.parser.conversation import chat_templates

# Download the same example image
image = Image.open(BytesIO(requests.get(example_image_url).content))

conv = chat_templates[chat_template].copy()
conv.append_message(conv.roles[0], f"What's shown here: {conv.image_token}?")
conv.append_message(conv.roles[1], "")
conv.image_data = [image]

print("Llama 4 generated prompt text:")
print(conv.get_prompt())
print(f"Image size: {image.size}")

image

Llama 4 Basic Call#

Llama 4 requires more computational resources, so it is configured with multi-GPU tensor parallelism (tp_size=4) and a larger context length.

llm = Engine(
    model_path=model_path,
    enable_multimodal=True,
    attention_backend="fa3",
    tp_size=4,
    context_length=65536,
)

out = llm.generate(prompt=conv.get_prompt(), image_data=[image])
print("Llama 4 response:")
print(out["text"])

Call with Processor Output#

Using a HuggingFace processor to preprocess the data ahead of time can reduce computational overhead during inference.

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(model_path, use_fast=True)
processor_output = processor(
    images=[image], text=conv.get_prompt(), return_tensors="pt"
)

out = llm.generate(
    input_ids=processor_output["input_ids"][0].detach().cpu().tolist(),
    image_data=[dict(processor_output, format="processor_output")],
)
print("Response using processor output:")
print(out)

Call with Precomputed Embeddings#

from transformers import AutoProcessor
from transformers import Llama4ForConditionalGeneration

processor = AutoProcessor.from_pretrained(model_path, use_fast=True)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_path, dtype="auto"  # `torch_dtype` is deprecated in recent transformers
).eval()

vision = model.vision_model.cuda()
multi_modal_projector = model.multi_modal_projector.cuda()

print(f'Image pixel values shape: {processor_output["pixel_values"].shape}')
input_ids = processor_output["input_ids"][0].detach().cpu().tolist()

# Process image through vision encoder
image_outputs = vision(
    processor_output["pixel_values"].to("cuda"),
    aspect_ratio_ids=processor_output["aspect_ratio_ids"].to("cuda"),
    aspect_ratio_mask=processor_output["aspect_ratio_mask"].to("cuda"),
    output_hidden_states=False
)
image_features = image_outputs.last_hidden_state

# Flatten image features and pass through multimodal projector
vision_flat = image_features.view(-1, image_features.size(-1))
precomputed_embeddings = multi_modal_projector(vision_flat)

# Build precomputed embedding data item
mm_item = dict(
    processor_output,
    format="precomputed_embedding",
    feature=precomputed_embeddings
)

# Use precomputed embeddings for efficient inference
out = llm.generate(input_ids=input_ids, image_data=[mm_item])
print("Llama 4 precomputed embedding response:")
print(out["text"])