LLaDA2.0-Uni#

LLaDA2.0-Uni is a multi-modal model that accepts text, and image input and can produce text-only or image-only or text + image output.

Highlights#

  • Unified dLLM-MoE Backbone — Built on LLaDA 2.0, unifying multimodal understanding and generation.

  • Top-Tier Understanding & Generation — Matches dedicated VLMs in visual QA and document understanding, while generating high-quality images.

  • Interleaved Generation & Reasoning — Empowered by unified discrete representations, unlocking interleaved generation and reasoning.

Architecture#

LLaDA2.0-Uni Architecture

LLaDA2.0-Uni unifies multimodal understanding and generation into a simple Mask Token Prediction paradigm. Visual inputs are encoded by the SigLIP-VQ tokenizer into discrete semantic tokens, then mapped alongside text tokens to backbone hidden states under a unified mask prediction objective. Output tokens are decoded back to text via the Text De-Tokenizer, or reconstructed into high-fidelity images through the Diffusion Decoder. Empowered by unified discrete representations, it effortlessly handles complex interleaved generation and unlocks advanced interleaved reasoning, interleaving <|image|>…<|/image|> chunks to enable end-to-end training and inference within a single coherent framework.

Prerequisites#

Install sglang-omni by following Installation.

Server Configuration#

LLaDA2.0-Uni runs a 4-stage pipeline (preprocessing → image_encoder → thinker → decode) on a single GPU.

sgl-omni serve --model-path inclusionAI/LLaDA2.0-Uni --port 8000

Text Input#

Send a text-only prompt and get a text response.

cURL

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "inclusionAI/LLaDA2.0-Uni",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 256
  }'

Python

import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "inclusionAI/LLaDA2.0-Uni",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 256,
    },
)
resp.raise_for_status()
result = resp.json()
print(result["choices"][0]["message"]["content"])

Image and Text Input#

Send an image with a text prompt to get a text response.

cURL

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "inclusionAI/LLaDA2.0-Uni",
    "messages": [{"role": "user", "content": "Briefly describe the cars in this image."}],
    "images": ["tests/data/cars.jpg"],
    "modalities": ["text"],
    "max_tokens": 16
  }'

Python

import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "inclusionAI/LLaDA2.0-Uni",
        "messages": [{"role": "user", "content": "Briefly describe the cars in this image."}],
        "images": ["tests/data/cars.jpg"],
        "modalities": ["text"],
        "max_tokens": 16,
    },
)
resp.raise_for_status()
result = resp.json()
print(result["choices"][0]["message"]["content"])

Images can also be passed inline using the OpenAI multi-content format:

import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "inclusionAI/LLaDA2.0-Uni",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": "tests/data/cars.jpg"}},
                    {"type": "text", "text": "Briefly describe the cars in this image."},
                ],
            }
        ],
        "modalities": ["text"],
        "max_tokens": 16,
    },
)
resp.raise_for_status()
result = resp.json()
print(result["choices"][0]["message"]["content"])

Request Parameters#

The table below lists all parameters accepted by the /v1/chat/completions endpoint for LLaDA2.0-Uni.

Parameter

Type

Default

Description

model

string

null

Model identifier

messages

list

(required)

List of chat messages, each with role and content

modalities

list

["text"]

Output modalities (only ["text"] is supported)

images

list

null

List of image file paths (local paths or URLs)

max_tokens

int

null

Maximum number of tokens to generate

Incoming Features#

  • Text-to-image generation

  • Text-to-Image Generation with Thinking

  • Interleaved Generation