Offline Engine API#

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

Offline Batch Inference
Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

Non-streaming synchronous generation
Streaming synchronous generation
Non-streaming asynchronous generation
Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in custom_server.
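As a rough illustration of the custom-server idea, the sketch below wraps an engine-like object in a request handler using only the Python standard library. It is not the custom_server example. `StubEngine`, `handle_generate`, and the JSON request shape are hypothetical names chosen for illustration, and the stub stands in for `sgl.Engine` so the snippet runs without a GPU.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs anywhere."""

    def generate(self, prompts, sampling_params=None):
        # Mirrors the batch return shape used in this document:
        # one dict with a "text" field per prompt.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def handle_generate(engine, body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response body."""
    request = json.loads(body)
    params = request.get("sampling_params", {})
    # Send the single prompt as a one-element batch and unwrap the result.
    output = engine.generate([request["prompt"]], params)[0]
    return json.dumps({"text": output["text"]}).encode("utf-8")


engine = StubEngine()
raw_request = json.dumps(
    {"prompt": "The capital of France is", "sampling_params": {"temperature": 0.8}}
).encode("utf-8")
response = json.loads(handle_generate(engine, raw_request))
print(response["text"])
```

In a real server, `handle_generate` would be called from the request handler of whichever HTTP framework you choose, with the stub replaced by an `sgl.Engine` instance.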

Offline Batch Inference#

The SGLang offline engine supports batch inference with efficient scheduling.

[1]:

# launch the offline engine

import sglang as sgl
from sglang.utils import print_highlight
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

INFO 11-18 02:51:05 weight_utils.py:243] Using model weights format ['*.safetensors']

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.36it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.23it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.21it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.64it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.47it/s]

Non-streaming Synchronous Generation#

[2]:

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")

===============================

Prompt: Hello, my name is
Generated text: Andy Lewis and I am a professional singer and musician based in the Midlands. I have been performing for many years and have a wealth of experience in a variety of styles including rock, pop, musical theatre and jazz.
As a seasoned musician, I have performed at numerous high-profile events, including weddings, corporate functions, and festivals. I am confident that my skills and experience would make me an excellent addition to your event.
I am available to perform as a solo artist, or as part of a band. I can also arrange for a live band to accompany me, depending on your preferences.
I have a wide repertoire of songs and can learn

===============================

Prompt: The president of the United States is
Generated text: making a public statement about the Russian hacking of the 2016 presidential election, but he's doing so in a way that is intentionally vague and open to multiple interpretations.
"It could be Russia, and it could be China, and it could be another country," Trump said. "It also could be somebody sitting on their bed who weighs 400 pounds, okay?"
This response has raised eyebrows among those who have been following the issue, as it seems to downplay the severity of the Russian hacking and suggest that it might be an inside job.
"It's not just a matter of Russia or China," said one former national security official. "

===============================

Prompt: The capital of France is
Generated text: also known for its exquisite cuisine, artistic treasures, and rich history. This is a city of broad boulevards, grand museums, and romantic Seine River views. It has a certain charm, known as 'joie de vivre,' that makes it so attractive to tourists and locals alike. Whether you're a history buff, a foodie, or an art lover, there's something for everyone in the City of Light. Here are some of the top things to do in Paris:
1. Visit the Eiffel Tower: The Eiffel Tower is an iconic Parisian landmark and one of the most recognizable structures in the

===============================

Prompt: The future of AI is
Generated text: all about ‘bots, but not just any bots – bots that can learn, interact and understand human emotions and needs. And the crème de la crème of this tech is a new breed of AI assistants, often referred to as ‘social robots’.
Also known as social chatbots or conversational AI, social robots are a type of AI designed to interact with humans in a more empathetic and engaging way. They use natural language processing (NLP) and machine learning algorithms to understand human emotions, intentions and context, allowing them to respond in a more human-like way.
Some social robots are designed to be physical, like robots
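Offline batch jobs usually persist their results rather than print them. The sketch below pairs prompts with outputs and writes them as JSON Lines. It relies only on the `output["text"]` field shown above. The helper name `write_jsonl`, the record keys, and the placeholder data are assumptions made for illustration.

```python
import io
import json


def write_jsonl(prompts, outputs, fp):
    """Write one JSON record per prompt/output pair and return the record count."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    for prompt, output in zip(prompts, outputs):
        fp.write(json.dumps({"prompt": prompt, "completion": output["text"]}) + "\n")
    return len(prompts)


# Placeholder data with the same shape as the outputs printed above.
demo_prompts = ["Hello, my name is", "The capital of France is"]
demo_outputs = [{"text": " Ada."}, {"text": " Paris."}]

buffer = io.StringIO()  # swap for open("results.jsonl", "w") to write a file
written = write_jsonl(demo_prompts, demo_outputs, buffer)
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(written, records[1]["completion"])
```

With a live engine, the same helper could be called as `write_jsonl(prompts, outputs, fp)` on the values from the cell above.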

Streaming Synchronous Generation#

[3]:

prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print_highlight(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    # With stream=True, generate returns an iterator of chunks instead of a final result.
    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()

=== Testing synchronous streaming generation ===

Prompt: Hello, my name is

Generated text:  Michelle. I am a 27 year old student who is currently studying to become a teacher. I love reading, writing, and learning new things. I have a small dog named Max who is my best friend. I enjoy spending time outdoors, whether it be hiking, biking, or just taking a walk around the block. I am a bit of a movie buff and love watching old classics. I am looking forward to connecting with like-minded individuals and sharing my thoughts and ideas with you.
Welcome to the forums, Michelle! I'm glad you're here. Feel free to share your interests and ask questions - we're a friendly community.

Prompt: The capital of France is

Generated text:  the most visited city in the world with over 23 million visitors in 2019, according to the Mastercard Global Destination Cities Index. Paris is a popular destination for tourists due to its iconic landmarks, rich history, and cultural attractions.
France is a popular destination for tourists, with the most visited city in the world being Paris. The city attracts over 23 million visitors each year, making it a top destination for travelers. Paris is home to famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum, which houses the Mona Lisa. The city also has a rich history and culture, with

Prompt: The future of AI is

Generated text:  not a single technology, but an ecosystem of interconnected technologies that will transform how we live and work. This year’s AI conference will bring together some of the brightest minds in the field to explore the potential of AI to drive business innovation, improve societal outcomes, and address some of the most pressing challenges facing our world.
What are the key areas of focus for AI in 2024?
The 2024 AI conference will focus on the following key areas:
1. **Human-Centric AI**: How to design AI systems that are fair, transparent, and accountable, and that prioritize human values and well-being.
2. **Explainable
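If you need the complete text as well as the live output, the chunks can be accumulated while they are displayed. The sketch below assumes each chunk carries only the newly generated text, as the printing loop above implies. If your SGLang version returns cumulative text in each chunk, keep the last chunk instead of joining them. `collect_stream` and `fake_stream` are hypothetical names, and the fake stream replaces the real engine call.

```python
def collect_stream(chunks, on_text=None):
    """Consume a chunk iterator and return the concatenated text."""
    pieces = []
    for chunk in chunks:
        text = chunk["text"]
        if on_text is not None:
            on_text(text)  # e.g. print the piece as soon as it arrives
        pieces.append(text)
    return "".join(pieces)


def fake_stream():
    # Hypothetical stand-in for llm.generate(prompt, sampling_params, stream=True).
    for text in [" Paris", ",", " the City of Light."]:
        yield {"text": text}


seen = []
full_text = collect_stream(fake_stream(), on_text=seen.append)
print(full_text)
```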

Non-streaming Asynchronous Generation#

[4]:

prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print_highlight(f"\nPrompt: {prompt}")
        print_highlight(f"Generated text: {output['text']}")


asyncio.run(main())

=== Testing asynchronous batch generation ===

Prompt: Hello, my name is

Generated text: Tanya and I am an Interior Decorator. I am so glad you have chosen to visit my website. My passion is to help my clients create beautiful spaces that inspire and nurture their lifestyle. With over 10 years of experience, I have honed my skills in creating cohesive, functional and visually stunning interiors. My approach is collaborative, and I love working with my clients to understand their unique tastes, needs and preferences. I believe that interior design should be a reflection of who you are, and I am dedicated to helping you achieve your vision.
I work with a diverse range of clients, from residential homeowners to commercial property owners, and

Prompt: The capital of France is

Generated text: Paris, located in the northern part of the country, along the Seine River. The city is famous for its stunning architecture, art museums, fashion, cuisine, and romantic atmosphere. Here are 10 interesting facts about Paris:
1. The Eiffel Tower was originally intended to be a temporary structure.
Built for the 1889 World's Fair, the iconic tower was meant to be dismantled after the event. However, it became an instant symbol of Paris and was left standing.
2. Paris has the most museums in the world.
With over 150 museums, Paris has more museums per square mile than any other city in

Prompt: The future of AI is

Generated text: not just about technology, but also about its ethical and societal implications. This requires a multidisciplinary approach, involving not just computer scientists and engineers, but also ethicists, sociologists, philosophers, and other experts. The book "Algorithms to Live By: The Computer Science of Human Decisions" by Brian Christian and Tom Griffiths, and the documentary "The Machine" directed by Sarah Teigh and David Bolter, both explore the intersection of human decision-making and artificial intelligence. The former is a more theoretical approach, while the latter is a more practical and documentary-based exploration of AI's potential impact on society.
We need to
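One reason to use the asynchronous API is to issue several independent requests from the same event loop. The sketch below runs multiple prompt batches concurrently with `asyncio.gather` and caps the number in flight with a semaphore. `StubAsyncEngine` is a hypothetical stand-in that only mimics the `async_generate` call shape used above, and `run_batches` and `max_in_flight` are illustrative names. The engine does its own scheduling, so treat the cap as a caller-side convenience rather than a tuning recommendation.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; only async_generate is mimicked."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0)  # yield control, as a real request would
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params, max_in_flight=2):
    """Run several prompt batches concurrently, at most max_in_flight at a time."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def run_one(batch):
        async with semaphore:
            return await engine.async_generate(batch, sampling_params)

    # gather returns results in the same order as the input batches.
    return await asyncio.gather(*(run_one(batch) for batch in batches))


batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(run_batches(StubAsyncEngine(), batches, {"temperature": 0.8}))
print([len(r) for r in results])
```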

Streaming Asynchronous Generation#

[5]:

prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print_highlight(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # With stream=True, awaiting async_generate returns an async generator of chunks.
        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())

=== Testing asynchronous streaming generation ===

Prompt: Hello, my name is

Generated text:  Rebecca, and I am a proud dog mom to my furry best friend, a gorgeous golden retriever named Luna. My love for dogs has taken me on a journey to help others understand the importance of mental health and the role that dogs can play in supporting it. As a mental health advocate and a certified animal lover, I've seen firsthand the transformative power of the human-animal bond.
Luna, my loyal companion, has been with me through thick and thin, providing comfort, companionship, and a constant reminder that love and companionship are always available. Together, we've explored the beautiful world of canine companionship, and I

Prompt: The capital of France is

Generated text:  home to some of the world's most famous landmarks, including the Eiffel Tower and the Louvre Museum. This romantic city offers a wide variety of activities and experiences for visitors, including a visit to the famous Notre Dame Cathedral, a stroll along the Seine River, and a night of fine dining and wine tasting.
France is a country with a rich history, art, fashion, and cuisine, and Paris is its beating heart. A must-visit destination for anyone interested in art, history, culture, and romance.
The city is a treasure trove of art, history, and architecture, with famous landmarks like the Eiff

Prompt: The future of AI is

Generated text:  bright, and it will be even brighter when we can share our thoughts, feelings and knowledge with machines in our own voice. The good news is that we are already halfway there.
This week, Google’s AI-powered chatbot, LaMDA, made a breakthrough that could finally allow us to communicate with machines in our own voice. Google’s AI team has been working on improving the chatbot’s ability to understand and respond to complex human language, and their latest achievement is a major leap forward.
LaMDA, which stands for Large Language Model for Dialogue Applications, is a highly advanced language processing model that can understand and respond to natural
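The cell above streams one prompt at a time. As a sketch of how several streams could be consumed concurrently, the example below starts one coroutine per prompt and gathers the results. `StubStreamingEngine` is a hypothetical stand-in that follows the same pattern as above (awaiting `async_generate(..., stream=True)` returns an async generator), and it assumes each chunk holds only the new text.

```python
import asyncio


class StubStreamingEngine:
    """Hypothetical stand-in for sgl.Engine with a streaming async_generate."""

    async def async_generate(self, prompt, sampling_params=None, stream=False):
        async def chunks():
            for word in [" one", " two", " three"]:
                await asyncio.sleep(0)  # let other streams make progress
                yield {"text": word}

        return chunks()


async def stream_to_text(engine, prompt, sampling_params):
    """Consume one stream and return the prompt plus its generated text."""
    generator = await engine.async_generate(prompt, sampling_params, stream=True)
    pieces = []
    async for chunk in generator:
        pieces.append(chunk["text"])
    return prompt + "".join(pieces)


async def stream_all(engine, prompts, sampling_params):
    """Consume one stream per prompt concurrently, preserving prompt order."""
    tasks = [stream_to_text(engine, p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


texts = asyncio.run(stream_all(StubStreamingEngine(), ["A:", "B:"], {"temperature": 0.8}))
print(texts)
```

Printing chunks from concurrent streams would interleave their text, which is why this sketch buffers each stream and returns the full strings.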

[6]:

llm.shutdown()
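In a script, as opposed to a notebook, it can be convenient to guarantee that `shutdown()` runs even if generation raises an exception. The sketch below wraps engine creation in a context manager. `engine_session` is a hypothetical helper, not part of SGLang, and `StubEngine` stands in for `sgl.Engine` so the example runs without a model.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(make_engine):
    """Create an engine, hand it to the caller, and always shut it down."""
    engine = make_engine()
    try:
        yield engine
    finally:
        engine.shutdown()


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records shutdown calls."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params=None):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


with engine_session(StubEngine) as llm_stub:
    outputs = llm_stub.generate(["Hello, my name is"], {"temperature": 0.8})

print(llm_stub.is_shut_down)
```

With the real engine, the factory argument could be a lambda that calls `sgl.Engine(model_path=...)` as in the first cell.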
