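1. Example code
This article uses vLLM's AsyncLLM (the V1 asynchronous inference engine) to implement streaming text generation.
The complete code is as follows: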
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.sampling_params import RequestOutputKind
from vllm.v1.engine.async_llm import AsyncLLM


async def stream_response(engine: AsyncLLM, prompt: str, request_id: str) -> None:
    """
    Stream response from AsyncLLM and display tokens as they arrive.

    This function demonstrates the core streaming pattern:
    1. Create SamplingParams with DELTA output kind
    2. Call engine.generate() and iterate over the async generator
    3. Print new tokens as they arrive
    4. Handle the finished flag to know when generation is complete
    """
    print(f"\nPrompt: {prompt!r}")
    print("Response: ", end="", flush=True)

    # Configure sampling parameters for streaming
    sampling_params = SamplingParams(
        max_tokens=100,
        temperature=0.8,
        top_p=0.95,
        seed=42,  # For reproducible results
        output_kind=RequestOutputKind.DELTA,  # Get only new tokens each iteration
    )

    try:
        # Stream tokens from AsyncLLM
        async for output in engine.generate(
            request_id=request_id, prompt=prompt, sampling_params=sampling_params
        ):
            # Process each completion in the output
            for completion in output.outputs:
                # In DELTA mode, we get only new tokens generated since last iteration
                new_text = completion.text
                if new_text:
                    print(new_text, end="", flush=True)

            # Check if generation is finished
            if output.finished:
                print("\nGeneration complete!")
                break
    except Exception as e:
        print(f"\nError during streaming: {e}")
        raise


async def main():
    print("Initializing AsyncLLM...")

    # Create AsyncLLM engine with simple configuration
    engine_args = AsyncEngineArgs(
        model="/data/xiehao/workspace/models/Qwen/Qwen2.5-1.5B-Instruct",
        enforce_eager=True,  # Faster startup for examples
    )
    engine = AsyncLLM.from_engine_args(engine_args)

    try:
        # Example prompts to demonstrate streaming
        prompts = [
            "The future of artificial intelligence is",
            "In a galaxy far, far away",
            "The key to happiness is",
        ]
        print(f"Running {len(prompts)} streaming examples...")

        # Process each prompt
        for i, prompt in enumerate(prompts, 1):
            print(f"\n{'=' * 60}")
            print(f"Example {i}/{len(prompts)}")
            print(f"{'=' * 60}")
            request_id = f"stream-example-{i}"
            await stream_response(engine, prompt, request_id)

            # Brief pause between examples
            if i < len(prompts):
                await asyncio.sleep(0.5)

        print("\nAll streaming examples completed!")
    finally:
        # Always clean up the engine
        print("Shutting down engine...")
        engine.shutdown()


if __name__ == "__main__":
    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        print("\nInterrupted by user")
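2. Core steps of AsyncLLM
(1) Standard pattern
The standard pattern for streaming generation on the vLLM V1 engine:
configure arguments -> create engine -> set sampling parameters -> stream asynchronously -> clean up resources
(2) The 5 core steps
1) Configure the engine arguments
Use AsyncEngineArgs to specify the model and runtime options:
engine_args = AsyncEngineArgs(model="/data/xiehao/workspace/models/Qwen/Qwen2.5-1.5B-Instruct", enforce_eager=True)
model: a Hugging Face model ID or a local path
enforce_eager: a vLLM option that controls the model execution mode. Setting it to True disables CUDA Graph capture, so the model runs in eager execution mode. Turn it on when startup speed and stability matter; turn it off for maximum throughput.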
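2) Create the AsyncLLM engine instance
Initialize the engine with from_engine_args():
engine = AsyncLLM.from_engine_args(engine_args)
At this point the model is loaded into GPU memory and is ready for inference.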
3) Set the streaming sampling parameters
Use SamplingParams with output_kind=RequestOutputKind.DELTA to enable incremental output mode:
sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.8,
    top_p=0.95,
    seed=42,
    output_kind=RequestOutputKind.DELTA,  # Key setting: return only newly generated tokens
)
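To make the effect of output_kind concrete, here is a small pure-Python simulation that does not call vLLM at all. OutputKind and simulate_outputs are hypothetical stand-ins written for this illustration; only the three member names mirror vLLM's RequestOutputKind. It also assumes one text piece per step, whereas a real engine may deliver several tokens in one step.

```python
from enum import Enum


class OutputKind(Enum):
    """Hypothetical stand-in; member names mirror vLLM's RequestOutputKind."""
    CUMULATIVE = 0
    DELTA = 1
    FINAL_ONLY = 2


def simulate_outputs(pieces: list[str], kind: OutputKind) -> list[str]:
    """Return the text a consumer would see at each step for the given mode."""
    if kind is OutputKind.DELTA:
        # Each step carries only the newly generated piece.
        return list(pieces)

    cumulative = []
    text = ""
    for piece in pieces:
        text += piece
        cumulative.append(text)  # each step carries everything generated so far

    if kind is OutputKind.CUMULATIVE:
        return cumulative
    # FINAL_ONLY: a single output once generation has finished.
    return cumulative[-1:]


pieces = ["The", " future", " is", " bright"]
delta_view = simulate_outputs(pieces, OutputKind.DELTA)
cumulative_view = simulate_outputs(pieces, OutputKind.CUMULATIVE)
final_view = simulate_outputs(pieces, OutputKind.FINAL_ONLY)
```

Joining the DELTA pieces reproduces the final text, which is why DELTA mode is convenient for printing tokens as they arrive without re-printing earlier text.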
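4) Call engine.generate() and iterate over the output asynchronously
Pass a unique request_id, the prompt, and sampling_params, then consume the results as a stream with async for. Keyword arguments are used here because the positional order of generate() starts with prompt, not request_id:
async for output in engine.generate(
    request_id=request_id, prompt=prompt, sampling_params=sampling_params
):
    for completion in output.outputs:
        new_text = completion.text  # In DELTA mode, only the text added in this iteration
        print(new_text, end="", flush=True)
    if output.finished:
        break  # Generation complete
Each iteration receives only the tokens generated since the previous iteration;
output.finished indicates that the whole request has completed.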
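Because DELTA mode yields fragments, a caller that needs the full text has to accumulate them. The sketch below shows one way to do that per completion index. FakeCompletion, FakeOutput and fake_generate are hypothetical stand-ins that only imitate the outputs, text, index and finished attributes used above, so the example runs without a GPU or vLLM installed.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeCompletion:
    """Hypothetical stand-in for one completion inside a streamed output."""
    index: int
    text: str


@dataclass
class FakeOutput:
    """Hypothetical stand-in for one streamed output object."""
    outputs: list
    finished: bool = False


async def fake_generate():
    """Hypothetical stand-in for engine.generate() in DELTA mode."""
    yield FakeOutput([FakeCompletion(0, "Hello"), FakeCompletion(1, "Hi")])
    yield FakeOutput([FakeCompletion(0, ""), FakeCompletion(1, " there")])
    yield FakeOutput([FakeCompletion(0, " world")], finished=True)


async def collect_deltas(stream) -> dict:
    """Accumulate delta fragments into full text, keyed by completion index."""
    parts: dict = {}
    async for output in stream:
        for completion in output.outputs:
            if completion.text:  # skip empty deltas
                parts.setdefault(completion.index, []).append(completion.text)
        if output.finished:
            break
    return {index: "".join(chunks) for index, chunks in parts.items()}


result = asyncio.run(collect_deltas(fake_generate()))
```

Keying by index matters only when more than one completion is requested per prompt; with a single completion the dictionary has one entry.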
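5) Clean up resources
Before the program exits, be sure to call shutdown() to release GPU memory and background resources:
engine.shutdown()
This call usually goes in a finally block so that it always runs.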
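The value of the finally block is easiest to see when a stream fails midway. The following sketch uses FakeEngine, a hypothetical stand-in whose generate() and shutdown() only mimic the shape of the calls above, to check that shutdown() runs both on success and on error.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the engine; records whether shutdown() ran."""

    def __init__(self) -> None:
        self.shut_down = False

    async def generate(self, fail: bool):
        yield "partial"
        if fail:
            raise RuntimeError("simulated failure mid-stream")
        yield " output"

    def shutdown(self) -> None:
        self.shut_down = True


async def run_once(engine: FakeEngine, fail: bool) -> list:
    received = []
    try:
        async for chunk in engine.generate(fail):
            received.append(chunk)
    finally:
        # Runs whether the stream completed or raised.
        engine.shutdown()
    return received


def demo(fail: bool):
    """Run one request and report (was shutdown called, error if any)."""
    engine = FakeEngine()
    error = None
    try:
        asyncio.run(run_once(engine, fail))
    except RuntimeError as exc:
        error = exc
    return engine.shut_down, error
```

Calling demo(False) and demo(True) both report that shutdown ran; in the failing case the exception still propagates to the caller after cleanup.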
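(3) Additional notes
The whole flow must run in an asynchronous context, i.e. async def + await + asyncio.run()
request_id must be globally unique; it is used to tell concurrent requests apart
This pattern suits offline streaming inference (one input, incremental output)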
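As a rough illustration of the request_id note, the sketch below generates IDs with uuid and runs two requests concurrently with asyncio.gather. fake_generate is a hypothetical stand-in for engine.generate() that just upper-cases the prompt word by word; the ID scheme ("req-" plus a UUID) is one possible choice, not something the original code prescribes.

```python
import asyncio
import uuid


async def fake_generate(request_id: str, prompt: str):
    """Hypothetical stand-in for engine.generate(); yields word-sized deltas."""
    for word in prompt.split():
        await asyncio.sleep(0)  # hand control back so requests can interleave
        yield word.upper() + " "


async def run_request(prompt: str) -> tuple:
    # A fresh UUID per request avoids collisions between concurrent requests.
    request_id = f"req-{uuid.uuid4().hex}"
    chunks = [chunk async for chunk in fake_generate(request_id, prompt)]
    return request_id, "".join(chunks).strip()


async def run_all(prompts: list) -> list:
    # gather() returns results in the same order as the prompts.
    return await asyncio.gather(*(run_request(p) for p in prompts))


results = asyncio.run(run_all(["the key to happiness", "in a galaxy far away"]))
```

The sequential IDs in the full example (stream-example-1, stream-example-2, ...) work because requests run one at a time; generated IDs are a safer default once requests overlap.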
