Offline Engine API#
SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:
Offline Batch Inference
Custom Server on Top of the Engine
This document focuses on offline batch inference and demonstrates four inference modes:
Non-streaming synchronous generation
Streaming synchronous generation
Non-streaming asynchronous generation
Streaming asynchronous generation
Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in custom_server.
Offline Batch Inference#
The SGLang offline engine supports batch inference with efficient scheduling.
[1]:
# launch the offline engine
import sglang as sgl
import asyncio
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.32it/s]
Non-streaming Synchronous Generation#
[2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
===============================
Prompt: Hello, my name is
Generated text: David and I am a 30-year-old Software Engineer from Buenos Aires, Argentina. I am here to share my experience with the community. I am passionate about Artificial Intelligence, Machine Learning, and Cloud Computing, and I enjoy sharing my knowledge and learning from others.
My professional experience started about 8 years ago as a junior developer, working on web and mobile applications using various technologies like Java, PHP, and JavaScript. Over time, I became interested in AI and ML, and I started to learn and work with libraries like TensorFlow and PyTorch. I also got involved with cloud computing, and I gained experience with AWS and Google Cloud
===============================
Prompt: The president of the United States is
Generated text: an elected official, as well as the head of state and head of government for the country. The president is responsible for executing the laws of the land and overseeing the federal government. The president is elected through the Electoral College system, where electors from each state cast votes for the president and vice president.
The president serves a four-year term, with a limit of two terms. The president has a range of powers and responsibilities, including signing or vetoing bills, commanding the armed forces, and making appointments to the Supreme Court and other federal offices.
The president is also responsible for conducting foreign policy, meeting with foreign leaders, and negotiating treaties
===============================
Prompt: The capital of France is
Generated text: Paris, located in the north-central part of the country. Paris is the largest city in France and is considered one of the most romantic cities in the world.
The city is famous for its stunning architecture, museums, art galleries, fashion, cuisine, and historic landmarks. The Eiffel Tower is the most iconic landmark in Paris and a symbol of the city. The tower was built for the 1889 World’s Fair and stands at a height of 324 meters (1,063 feet). Visitors can take an elevator to the top of the tower for panoramic views of the city.
Paris is also home to many famous museums and art
===============================
Prompt: The future of AI is
Generated text: in the hands of humans
The AI Revolution is in full swing, and it’s easy to get caught up in the hype. But beneath the surface, a more nuanced and human-centric story is unfolding. Here’s a glimpse into the evolving relationship between humans and AI.
The AI Revolution is in full swing, and it’s easy to get caught up in the hype. But beneath the surface, a more nuanced and human-centric story is unfolding. Here’s a glimpse into the evolving relationship between humans and AI.
In the realm of artificial intelligence, the lines between human and machine are blurring. No longer a distant fantasy, AI has become
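For offline batch jobs, it is often convenient to keep each prompt together with its completion and serialize the results. The following is a minimal sketch of that pattern, not part of SGLang itself: `run_batch`, `to_jsonl`, and the `EchoEngine` stand-in are hypothetical names introduced here. It assumes only what the example above shows, namely that `generate` accepts a list of prompts and returns one dict with a `"text"` key per prompt.

```python
import json


def run_batch(engine, prompts, sampling_params):
    """Generate one completion per prompt and pair each with its prompt."""
    outputs = engine.generate(prompts, sampling_params)
    return [
        {"prompt": prompt, "text": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


def to_jsonl(records):
    """Serialize records as JSON Lines, one object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]


prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

records = run_batch(EchoEngine(), prompts, sampling_params)
print(to_jsonl(records))
```

With a real engine, `llm` would be passed in place of `EchoEngine()`; the helper functions do not depend on the stand-in.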
Streaming Synchronous Generation#
[3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}
print("\n=== Testing synchronous streaming generation ===")
for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)
    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()
=== Testing synchronous streaming generation ===
Prompt: Hello, my name is
Generated text: Dr. Siamak PourgholamRoozbahani, and I am an Assistant Professor in the Department of Computer Science and Engineering at the University of Nebraska-Lincoln. I received my Ph.D. in Electrical Engineering from the University of California, Los Angeles (UCLA) in 2014. My research interests are in the areas of computer networks, network security, and machine learning. I have published numerous papers in top-tier conferences and journals, including IEEE Infocom, ACM Mobicom, and IEEE Transactions on Information Theory.
My current research focuses on designing and analyzing machine learning-based solutions for various network security and
Prompt: The capital of France is
Generated text: a city that is famous for its stunning beauty, rich history, and world-class attractions. From the iconic Eiffel Tower to the Louvre Museum, there are countless things to see and do in Paris. Here are some of the top attractions and experiences to add to your Paris itinerary:
1. The Eiffel Tower: This iron lattice tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city from its observation decks.
2. The Louvre Museum: One of the world's largest and most famous museums, the Louvre is home to an impressive collection of art and artifacts, including the Mona Lisa
Prompt: The future of AI is
Generated text: shrouded in uncertainty
Artificial intelligence (AI) has revolutionized the way we live and work, transforming industries and altering the economic landscape. However, the future of AI is uncertain and shrouded in controversy.
There are several reasons why the future of AI is uncertain:
1. Lack of regulation: There is a lack of regulation and oversight in the development and deployment of AI systems, which raises concerns about their safety and accountability.
2. Job displacement: AI has the potential to automate many jobs, which could lead to widespread unemployment and social unrest.
3. Bias and discrimination: AI systems can perpetuate and amplify biases and
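When the streamed text is needed as a single string as well as on screen, the chunks can be accumulated while they arrive. The sketch below is illustrative only: `stream_to_string` and `FakeStreamingEngine` are hypothetical names, and it assumes each chunk's `"text"` holds only the newly generated piece, as the printing loop above implies. If a given SGLang version emits the cumulative text in every chunk instead, the join step would need to change.

```python
def stream_to_string(engine, prompt, sampling_params, on_chunk=None):
    """Consume a synchronous stream and return the concatenated text.

    on_chunk, if given, is called with each piece as it arrives
    (for example, to print it immediately).
    """
    pieces = []
    for chunk in engine.generate(prompt, sampling_params, stream=True):
        piece = chunk["text"]
        if on_chunk is not None:
            on_chunk(piece)
        pieces.append(piece)
    return "".join(pieces)


class FakeStreamingEngine:
    """Hypothetical stand-in that yields fixed chunks instead of model output."""

    def generate(self, prompt, sampling_params, stream=False):
        for piece in [" Paris", ",", " of", " course"]:
            yield {"text": piece}


seen = []
text = stream_to_string(
    FakeStreamingEngine(),
    "The capital of France is",
    {"temperature": 0.8, "top_p": 0.95},
    on_chunk=seen.append,
)
print(text)
```

Passing `on_chunk=lambda piece: print(piece, end="", flush=True)` would reproduce the live printing shown above while still returning the full text.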
Non-streaming Asynchronous Generation#
[4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}
print("\n=== Testing asynchronous batch generation ===")
async def main():
    outputs = await llm.async_generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")
asyncio.run(main())
=== Testing asynchronous batch generation ===
Prompt: Hello, my name is
Generated text: Kyle and I am the lead author of this blog. I am a co-founder of the company Agiloft, and my main area of expertise is in Contract Lifecycle Management (CLM) and Enterprise Legal Management (ELM). I have over 20 years of experience in the software industry, with a strong background in product development, operations, and customer success.
Over the years, I have worked with numerous companies and organizations to develop and implement contract management solutions that meet their specific needs and goals. My experience has shown me the importance of having a robust contract management system in place to streamline processes, reduce costs, and increase efficiency.
In
Prompt: The capital of France is
Generated text: Paris. Paris is a major city located in the north-central part of the country, where the Seine River flows into the English Channel. The city is famous for its fashion, art, cuisine, and historical landmarks, such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum.
Paris is also a popular tourist destination, attracting millions of visitors each year. The city has a rich history, dating back to the Roman era, and has been the center of several empires throughout history, including the Roman Empire, the Frankish Empire, and the French Empire. Today, Paris is a global hub for business
Prompt: The future of AI is
Generated text: here, and it's now more accessible than ever. With the help of powerful AI platforms like Google Cloud AI Platform, Microsoft Azure Machine Learning, and Amazon SageMaker, businesses and developers can quickly build, deploy, and manage AI models without requiring extensive expertise in machine learning.
However, AI is not without its challenges. One of the major concerns is the issue of explainability, which refers to the ability to understand and interpret the decisions made by AI models. In this blog post, we'll explore the importance of explainability in AI and how it can be achieved with the help of AI platforms.
Why is Explainability Important in AI?
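Because `async_generate` is awaitable, several independent batches can be submitted at once and awaited together. The following sketch shows one way to do that with `asyncio.gather`; `generate_groups` and `FakeAsyncEngine` are hypothetical names introduced for illustration, and the sketch relies only on the list-in, list-out form of `async_generate` used above. Whether concurrent submission improves throughput on a real engine is not something this sketch demonstrates.

```python
import asyncio


async def generate_groups(engine, groups, sampling_params):
    """Submit each named group of prompts concurrently.

    groups maps a group name to a list of prompts; the result maps the
    same names to the list of generated texts, in prompt order.
    """
    names = list(groups)
    results = await asyncio.gather(
        *(engine.async_generate(groups[name], sampling_params) for name in names)
    )
    return {
        name: [output["text"] for output in outputs]
        for name, outputs in zip(names, results)
    }


class FakeAsyncEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real async call would
        return [{"text": f" [{p}]"} for p in prompts]


groups = {
    "greetings": ["Hello, my name is"],
    "facts": ["The capital of France is", "The future of AI is"],
}
texts = asyncio.run(
    generate_groups(FakeAsyncEngine(), groups, {"temperature": 0.8, "top_p": 0.95})
)
print(texts)
```

`asyncio.gather` returns results in the order the awaitables were passed, which is why the group names can be zipped back onto the results.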
Streaming Asynchronous Generation#
[5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}
print("\n=== Testing asynchronous streaming generation ===")
async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)
        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()
asyncio.run(main())
=== Testing asynchronous streaming generation ===
Prompt: Hello, my name is
Generated text: John. I am a survivor of the hurricane that hit your city. I lost my home and all my belongings in the storm. My family and I are now living in a shelter. We are in need of assistance. We need food, clothing, and other essential items to get us back on our feet. We are grateful for any help we can get.
I can be reached at the shelter address below. Please contact me to arrange for assistance.
I am just a regular guy trying to make a living in the city. I don't have any connections or resources to fall back on. I am struggling to make ends meet as it is
Prompt: The capital of France is
Generated text: a city of beauty, romance, and intellectualism. From the iconic Eiffel Tower to the majestic Notre Dame Cathedral, Paris is a city that has inspired artists, writers, and musicians for centuries. Visitors can stroll along the Seine River, visit the famous Louvre Museum, and indulge in the city's culinary delights, including croissants, cheese, and wine. Whether you're interested in history, art, fashion, or food, Paris is a city that has something for everyone.
Paris is a popular tourist destination, and many people visit the city every year. However, it can be overwhelming to navigate the city, especially
Prompt: The future of AI is
Generated text: now – and it’s all about virtual assistants
In 2011, the world of artificial intelligence (AI) was on the cusp of a revolution. Virtual assistants were about to take center stage, and they would change the game forever. From Siri on Apple devices to Google Assistant and Alexa, the AI-powered virtual assistant has become an integral part of our daily lives.
So, what exactly is a virtual assistant?
A virtual assistant, also known as a conversational AI, is a computer program that uses natural language processing (NLP) and machine learning to interact with humans in a conversational manner. These AI-powered assistants can perform
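The loop above handles prompts one after another. As a rough sketch of an alternative, the streams can be consumed concurrently and each collected into a string. `collect_stream`, `collect_all`, and `FakeAsyncStreamingEngine` are hypothetical names; the sketch follows the call shape used above (awaiting `async_generate(..., stream=True)` yields an async generator) and assumes each chunk's `"text"` carries only the new piece.

```python
import asyncio


async def collect_stream(engine, prompt, sampling_params):
    """Consume one asynchronous stream and return the concatenated text."""
    generator = await engine.async_generate(prompt, sampling_params, stream=True)
    pieces = []
    async for chunk in generator:
        pieces.append(chunk["text"])
    return "".join(pieces)


async def collect_all(engine, prompts, sampling_params):
    """Consume one stream per prompt concurrently; map prompt -> full text."""
    texts = await asyncio.gather(
        *(collect_stream(engine, prompt, sampling_params) for prompt in prompts)
    )
    return dict(zip(prompts, texts))


class FakeAsyncStreamingEngine:
    """Hypothetical stand-in that streams fixed chunks instead of model output."""

    async def async_generate(self, prompt, sampling_params, stream=False):
        async def chunks():
            for piece in (" one", " two", " three"):
                await asyncio.sleep(0)  # let other streams make progress
                yield {"text": piece}

        return chunks()


result = asyncio.run(
    collect_all(
        FakeAsyncStreamingEngine(),
        ["Hello, my name is", "The future of AI is"],
        {"temperature": 0.8, "top_p": 0.95},
    )
)
print(result)
```

Printing chunks from several concurrent streams directly would interleave their text, which is why this sketch buffers each stream and returns the finished strings.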
[6]:
llm.shutdown()
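Because `llm.shutdown()` is a separate call, an exception raised earlier in a script could skip it. One possible way to make the cleanup unconditional is a small context manager, sketched below. `managed_engine` and `FakeEngine` are hypothetical names introduced here, not part of SGLang; the only engine method the sketch relies on is `shutdown()`, as used above.

```python
from contextlib import contextmanager


@contextmanager
def managed_engine(factory):
    """Create an engine via factory() and always shut it down on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    """Hypothetical stand-in that only records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


created = []


def make_engine():
    engine = FakeEngine()
    created.append(engine)
    return engine


try:
    with managed_engine(make_engine) as engine:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(created[0].closed)
```

With a real engine, the factory could be something like `lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")`, mirroring the launch cell at the top of this page.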