# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
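
As a rough illustration of the custom-server idea, the sketch below shows a framework-agnostic request handler that validates a request body and forwards it to an engine-like object. It is not the linked `custom_server.py` example. `StubEngine`, `handle_request`, and the payload keys `prompts` and `sampling_params` are hypothetical names chosen for this sketch; the stub only mimics the `generate(prompts, sampling_params)` call shape used in this document so the code runs without a GPU.

```python
from typing import Any, Dict, List, Protocol


class EngineLike(Protocol):
    """Anything exposing generate(prompts, sampling_params) like the engine in this document."""

    def generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, Any]]: ...


class StubEngine:
    """Hypothetical stand-in so the sketch runs without a GPU or model weights."""

    def generate(self, prompts, sampling_params):
        # Return one dict per prompt, each with a "text" key.
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


def handle_request(engine: EngineLike, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Validate a JSON-like request body and forward it to the engine."""
    prompts = payload.get("prompts")
    if not isinstance(prompts, list) or not all(isinstance(p, str) for p in prompts):
        return {"status": 400, "error": "'prompts' must be a list of strings"}
    sampling_params = payload.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    return {"status": 200, "texts": [o["text"] for o in outputs]}


response = handle_request(StubEngine(), {"prompts": ["Hello, my name is"]})
print(response)
```

In a real server, a function like this would sit behind whatever web framework you choose, with a real engine instance passed in place of the stub.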

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.

import asyncio  # used by the asynchronous examples below

import sglang as sgl
from sglang.utils import print_highlight

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

INFO 11-18 02:51:05 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.36it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.23it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.21it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.64it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.47it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")
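
For offline batch jobs it is often convenient to persist results rather than print them. The sketch below pairs each prompt with its output and serializes the pairs as JSON Lines. The helper `to_jsonl` and the record keys `prompt` and `completion` are hypothetical; the only assumption taken from the cell above is that each output is a dict with a `"text"` key. The example data is a hand-written stand-in for the result of `llm.generate(...)`.

```python
import json
from typing import Any, Dict, Iterable, List


def to_jsonl(prompts: List[str], outputs: Iterable[Dict[str, Any]]) -> str:
    """Serialize (prompt, output) pairs as one JSON object per line."""
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Hypothetical data standing in for prompts and the result of llm.generate(...).
example_prompts = ["The capital of France is"]
example_outputs = [{"text": " Paris."}]

jsonl = to_jsonl(example_prompts, example_outputs)
print(jsonl)
```

The returned string can be written to a file in one call, which keeps the generation loop free of I/O concerns.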

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print_highlight(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()

Generated text:  Michelle. I am a 27 year old student who is currently studying to become a teacher. I love reading, writing, and learning new things. I have a small dog named Max who is my best friend. I enjoy spending time outdoors, whether it be hiking, biking, or just taking a walk around the block. I am a bit of a movie buff and love watching old classics. I am looking forward to connecting with like-minded individuals and sharing my thoughts and ideas with you.
Welcome to the forums, Michelle! I'm glad you're here. Feel free to share your interests and ask questions - we're a friendly community.

Generated text:  the most visited city in the world with over 23 million visitors in 2019, according to the Mastercard Global Destination Cities Index. Paris is a popular destination for tourists due to its iconic landmarks, rich history, and cultural attractions.
France is a popular destination for tourists, with the most visited city in the world being Paris. The city attracts over 23 million visitors each year, making it a top destination for travelers. Paris is home to famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum, which houses the Mona Lisa. The city also has a rich history and culture, with

Generated text:  not a single technology, but an ecosystem of interconnected technologies that will transform how we live and work. This year’s AI conference will bring together some of the brightest minds in the field to explore the potential of AI to drive business innovation, improve societal outcomes, and address some of the most pressing challenges facing our world.
What are the key areas of focus for AI in 2024?
The 2024 AI conference will focus on the following key areas:
1. **Human-Centric AI**: How to design AI systems that are fair, transparent, and accountable, and that prioritize human values and well-being.
2. **Explainable
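
If you need the complete response as a single string rather than printing chunks as they arrive, the streamed pieces can be accumulated. The sketch below assumes each chunk's `"text"` field carries only the newly generated piece, which is what the output above suggests; if your installed version behaves differently, adjust accordingly. `collect_stream` and `fake_stream` are hypothetical helpers, and `fake_stream` merely stands in for `llm.generate(prompt, sampling_params, stream=True)`.

```python
from typing import Any, Dict, Iterable, Iterator


def collect_stream(chunks: Iterable[Dict[str, Any]]) -> str:
    """Join incremental chunk["text"] pieces into one string."""
    return "".join(chunk["text"] for chunk in chunks)


def fake_stream() -> Iterator[Dict[str, Any]]:
    # Hypothetical stand-in for llm.generate(prompt, sampling_params, stream=True).
    for piece in [" the", " most", " visited", " city"]:
        yield {"text": piece}


full_text = collect_stream(fake_stream())
print(repr(full_text))
```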




### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print_highlight(f"\nPrompt: {prompt}")
        print_highlight(f"Generated text: {output['text']}")


asyncio.run(main())
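
The asynchronous interface also makes it possible to issue several independent requests and await them together. The sketch below uses `asyncio.gather` for that. `StubAsyncEngine` is a hypothetical stand-in that only mimics an awaitable `async_generate(prompt, sampling_params)` call so the example runs without a model; whether per-prompt requests are preferable to passing the whole list in one call, as in the cell above, depends on your workload.

```python
import asyncio
from typing import Any, Dict, List


class StubAsyncEngine:
    """Hypothetical stand-in mimicking an awaitable async_generate call."""

    async def async_generate(
        self, prompt: str, sampling_params: Dict[str, Any]
    ) -> Dict[str, Any]:
        await asyncio.sleep(0)  # yield control, as a real request would while waiting
        return {"text": f" <completion for {prompt!r}>"}


async def generate_concurrently(
    engine, prompts: List[str], sampling_params: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """Start one request per prompt and wait for all of them."""
    requests = [engine.async_generate(p, sampling_params) for p in prompts]
    # asyncio.gather returns results in the same order as the inputs.
    return await asyncio.gather(*requests)


results = asyncio.run(
    generate_concurrently(StubAsyncEngine(), ["a", "b"], {"temperature": 0.8})
)
for result in results:
    print(result["text"])
```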

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print_highlight(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())

Generated text:  Rebecca, and I am a proud dog mom to my furry best friend, a gorgeous golden retriever named Luna. My love for dogs has taken me on a journey to help others understand the importance of mental health and the role that dogs can play in supporting it. As a mental health advocate and a certified animal lover, I've seen firsthand the transformative power of the human-animal bond.
Luna, my loyal companion, has been with me through thick and thin, providing comfort, companionship, and a constant reminder that love and companionship are always available. Together, we've explored the beautiful world of canine companionship, and I

Generated text:  home to some of the world's most famous landmarks, including the Eiffel Tower and the Louvre Museum. This romantic city offers a wide variety of activities and experiences for visitors, including a visit to the famous Notre Dame Cathedral, a stroll along the Seine River, and a night of fine dining and wine tasting.
France is a country with a rich history, art, fashion, and cuisine, and Paris is its beating heart. A must-visit destination for anyone interested in art, history, culture, and romance.
The city is a treasure trove of art, history, and architecture, with famous landmarks like the Eiff

Generated text:  bright, and it will be even brighter when we can share our thoughts, feelings and knowledge with machines in our own voice. The good news is that we are already halfway there.
This week, Google’s AI-powered chatbot, LaMDA, made a breakthrough that could finally allow us to communicate with machines in our own voice. Google’s AI team has been working on improving the chatbot’s ability to understand and respond to complex human language, and their latest achievement is a major leap forward.
LaMDA, which stands for Large Language Model for Dialogue Applications, is a highly advanced language processing model that can understand and respond to natural
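
The asynchronous stream can be accumulated into a single string in the same way as the synchronous one, using `async for`. The sketch below again assumes each chunk's `"text"` field holds only the newly generated piece, as the output above suggests. `collect_async_stream` and `fake_async_stream` are hypothetical helpers; the latter stands in for the async generator obtained from `llm.async_generate(prompt, sampling_params, stream=True)`.

```python
import asyncio
from typing import Any, AsyncIterator, Dict


async def fake_async_stream() -> AsyncIterator[Dict[str, Any]]:
    # Hypothetical stand-in for the async generator returned by the engine.
    for piece in [" bright", ",", " and"]:
        await asyncio.sleep(0)  # simulate waiting for the next chunk
        yield {"text": piece}


async def collect_async_stream(chunks: AsyncIterator[Dict[str, Any]]) -> str:
    """Consume an async stream and join its incremental text pieces."""
    parts = []
    async for chunk in chunks:
        parts.append(chunk["text"])
    return "".join(parts)


async_text = asyncio.run(collect_async_stream(fake_async_stream()))
print(repr(async_text))
```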




In [6]:
llm.shutdown()
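
In a script, an exception raised during generation would skip an explicit `shutdown()` call placed at the end. One way to guard against that is a small context manager that always calls `shutdown()` on exit. This is a sketch: `engine_session` is a hypothetical helper, not part of SGLang, and `StubEngine` stands in for the real engine so the example runs anywhere. The only assumption about the engine is the `shutdown()` method used above.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in exposing only the shutdown() method."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    """Yield the engine and call shutdown() even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


stub = StubEngine()
try:
    with engine_session(stub) as engine:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(stub.is_shut_down)
```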