Qwen3-ASR#

Qwen3-ASR is an audio transcription model served through the OpenAI-compatible /v1/audio/transcriptions endpoint. It accepts one uploaded audio file per request and returns text.

Prerequisites#

Install sglang-omni by following Installation, then download the model:

hf download Qwen/Qwen3-ASR-1.7B

Server Configuration#

Qwen3-ASR runs a single ASR stage on one GPU.

sgl-omni serve \
  --model-path Qwen/Qwen3-ASR-1.7B \
  --port 8000

Transcribe Audio#

curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F model=Qwen/Qwen3-ASR-1.7B \
  -F file=@tests/data/query_to_cars.wav \
  -F language=en \
  -F response_format=json

import requests

with open("tests/data/query_to_cars.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/v1/audio/transcriptions",
        data={
            "model": "Qwen/Qwen3-ASR-1.7B",
            "language": "en",
            "response_format": "json",
        },
        files={"file": ("query_to_cars.wav", f, "audio/wav")},
        timeout=300,
    )

resp.raise_for_status()
print(resp.json()["text"])

Request Parameters#

Parameter	Type	Default	Description
`file`	file	required	Audio file uploaded as multipart form data
`model`	string	server default	Model identifier
`language`	string	`en`	Language hint; `zh`/`cn` select Chinese, other values use English prompting
`response_format`	string	`json`	`json`, `verbose_json`, or `text`
`temperature`	float	`0.01` effective	Sampling temperature; `0` is converted to near-greedy `0.01`

verbose_json is accepted, but currently returns the same minimal JSON shape as json: {"text": "..."}.

max_new_tokens is supported inside the model request builder, but the public transcription endpoint does not currently expose it as a form field. The route uses the ASR stage default unless the pipeline is configured another way.

Benchmarking#

Use benchmarks/eval/benchmark_asr_seedtts.py to sweep ASR concurrency on SeedTTS reference audio through /v1/audio/transcriptions. It defaults to --model-path Qwen/Qwen3-ASR-1.7B; the shared request and metric logic lives in benchmarks.tasks.asr and also supports Fun-ASR through --model-path. The report includes RTF (processing time divided by audio duration) and RTFx (successful input-audio seconds divided by wall-clock seconds).

sgl-omni serve --model-path Qwen/Qwen3-ASR-1.7B --port 8000

# Sweep the full SeedTTS EN set (1088 clips) at 1..64 concurrency, 3 repeats:
python -m benchmarks.eval.benchmark_asr_seedtts \
  --port 8000 --concurrencies 1,2,4,8,16,32,64 --repeats 3 --warmup

The ASR CI gate runs Fun-ASR-Nano on this same benchmark entry point (tests/test_model/test_asr_ci_fun_asr.py). Qwen3-ASR remains the transcriber for the TTS and talker WER stages.

Known Limitations#

The endpoint accepts one uploaded file per request.
prompt is accepted by the HTTP endpoint for OpenAI compatibility, but Qwen3-ASR currently ignores it.
Audio is resampled to 16 kHz before transcription.

Qwen3-ASR

Contents