vLLM Inference Engine Tutorial 2: Async LLM Streaming

This content is licensed under the CC 4.0 BY-SA license.

1. Example code

This article uses vLLM's AsyncLLM (the V1 asynchronous inference engine) to generate text as a stream.

The complete code is as follows:

import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.sampling_params import RequestOutputKind
from vllm.v1.engine.async_llm import AsyncLLM


async def stream_response(engine: AsyncLLM, prompt: str, request_id: str) -> None:
    """
    Stream response from AsyncLLM and display tokens as they arrive.

    This function demonstrates the core streaming pattern:
    1. Create SamplingParams with DELTA output kind
    2. Call engine.generate() and iterate over the async generator
    3. Print new tokens as they arrive
    4. Handle the finished flag to know when generation is complete
    """
    print(f"\nPrompt: {prompt!r}")
    print("Response: ", end="", flush=True)

    # Configure sampling parameters for streaming
    sampling_params = SamplingParams(
        max_tokens=100,
        temperature=0.8,
        top_p=0.95,
        seed=42,  # For reproducible results
        output_kind=RequestOutputKind.DELTA,  # Get only new tokens each iteration
    )

    try:
        # Stream tokens from AsyncLLM
        async for output in engine.generate(
            request_id=request_id, prompt=prompt, sampling_params=sampling_params
        ):
            # Process each completion in the output
            for completion in output.outputs:
                # In DELTA mode, we get only new tokens generated since last iteration
                new_text = completion.text
                if new_text:
                    print(new_text, end="", flush=True)

            # Check if generation is finished
            if output.finished:
                print("\nGeneration complete!")
                break

    except Exception as e:
        print(f"\nError during streaming: {e}")
        raise


async def main():
    print("Initializing AsyncLLM...")

    # Create AsyncLLM engine with simple configuration
    engine_args = AsyncEngineArgs(
        model="/data/xiehao/workspace/models/Qwen/Qwen2.5-1.5B-Instruct",
        enforce_eager=True,  # Faster startup for examples
    )
    engine = AsyncLLM.from_engine_args(engine_args)

    try:
        # Example prompts to demonstrate streaming
        prompts = [
            "The future of artificial intelligence is",
            "In a galaxy far, far away",
            "The key to happiness is",
        ]

        print(f"Running {len(prompts)} streaming examples...")

        # Process each prompt
        for i, prompt in enumerate(prompts, 1):
            print(f"\n{'=' * 60}")
            print(f"Example {i}/{len(prompts)}")
            print(f"{'=' * 60}")

            request_id = f"stream-example-{i}"
            await stream_response(engine, prompt, request_id)

            # Brief pause between examples
            if i < len(prompts):
                await asyncio.sleep(0.5)

        print("\nAll streaming examples completed!")

    finally:
        # Always clean up the engine
        print("Shutting down engine...")
        engine.shutdown()


if __name__ == "__main__":
    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        print("\nInterrupted by user")
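
The example above streams its prompts one at a time. As a rough sketch of how several requests could be streamed concurrently, each under its own request_id, the snippet below replaces engine.generate() with a hypothetical stand-in generator (fake_generate) so that it runs without a GPU. The helper names stream_one and run_all are also illustrative assumptions and are not part of vLLM.

```python
import asyncio
import uuid


async def fake_generate(request_id, prompt):
    """Hypothetical stand-in for engine.generate(): yields (delta_text, finished)."""
    words = f"echo of {prompt}".split()
    for i, word in enumerate(words):
        await asyncio.sleep(0.01)  # Simulate per-token latency
        yield word + " ", i == len(words) - 1


async def stream_one(prompt):
    """Consume one stream and return its request ID with the assembled text."""
    request_id = f"stream-{uuid.uuid4().hex}"  # Unique ID per in-flight request
    chunks = []
    async for delta_text, finished in fake_generate(request_id, prompt):
        chunks.append(delta_text)
        if finished:
            break
    return request_id, "".join(chunks).strip()


async def run_all(prompts):
    # gather() runs the streams concurrently and returns results in input order
    return await asyncio.gather(*(stream_one(p) for p in prompts))


results = asyncio.run(run_all(["first prompt", "second prompt", "third prompt"]))
for request_id, text in results:
    print(request_id, "->", text)
```

With a real engine, the same asyncio.gather pattern would presumably apply, provided that every in-flight request uses a distinct request_id.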

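2. Core steps of AsyncLLM

(1) Standard pattern

The standard pattern for streaming generation on the vLLM V1 engine:

Configure arguments -> create engine -> set sampling -> stream asynchronously -> clean up resources

(2) The five core steps

1) Configure the engine arguments

Use AsyncEngineArgs to specify the model and runtime options: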

engine_args = AsyncEngineArgs(model="/data/xiehao/workspace/models/Qwen/Qwen2.5-1.5B-Instruct", enforce_eager=True)

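model: a Hugging Face model ID or a local path

enforce_eager: a vLLM option that controls the model execution mode. It forcibly disables CUDA Graph capture so that the model runs in eager execution mode. Turn it on when fast startup and stability matter most; turn it off for maximum throughput.

2) Create the AsyncLLM engine instance

Initialize the engine with from_engine_args():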

engine = AsyncLLM.from_engine_args(engine_args)

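At this point the model is loaded into GPU memory and is ready for inference.

3) Set the streaming sampling parameters

Use SamplingParams and set output_kind=RequestOutputKind.DELTA to enable incremental output mode: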

sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.8,
    top_p=0.95,
    seed=42,
    output_kind=RequestOutputKind.DELTA,  # Key: return only newly generated tokens
)
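
For intuition about what temperature and top_p do, here is a minimal sketch in plain Python. It is an illustration under simplifying assumptions, not vLLM's actual sampler; the function name top_p_sample and the toy logits are hypothetical.

```python
import math
import random


def top_p_sample(logits, temperature=0.8, top_p=0.95, seed=None):
    """Sample one token index using temperature scaling and nucleus (top-p) filtering."""
    rng = random.Random(seed)  # A fixed seed makes the draw reproducible
    # Temperature scaling: values below 1.0 sharpen the distribution
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most-probable tokens whose cumulative
    # probability reaches top_p
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for idx in ranked:
        kept.append(idx)
        cumulative += probs[idx]
        if cumulative >= top_p:
            break
    # Sample among the kept tokens in proportion to their probabilities
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # Hypothetical scores for a 4-token vocabulary
print(top_p_sample(toy_logits, seed=42))
```

Lowering top_p shrinks the candidate set toward the single most likely token, while raising temperature flattens the distribution and makes less likely tokens more competitive.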

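4) Call engine.generate() and iterate over the output asynchronously

Pass a unique request_id, the prompt, and sampling_params, then consume the results as a stream with async for:

# Pass arguments by keyword: generate() takes prompt as its first positional parameter
async for output in engine.generate(
    request_id=request_id, prompt=prompt, sampling_params=sampling_params
):
    for completion in output.outputs:
        new_text = completion.text  # In DELTA mode, only the text added in this iteration
        print(new_text, end="", flush=True)

    if output.finished:
        break  # Generation complete

Each iteration receives only the tokens generated since the previous one.

output.finished indicates that the entire request has completed.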
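
To make the DELTA semantics concrete, the following sketch contrasts rebuilding the full response from delta chunks with reading it from cumulative outputs. It does not call vLLM: FakeOutput and fake_generate are hypothetical stand-ins that only mimic the text and finished fields used above.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeOutput:
    """Hypothetical stand-in for one streamed output: text plus a finished flag."""
    text: str
    finished: bool


async def fake_generate(pieces, delta=True):
    """Yield each piece as a delta, or as the cumulative text so far."""
    so_far = ""
    for i, piece in enumerate(pieces):
        await asyncio.sleep(0)  # Yield control, as a real engine would while decoding
        so_far += piece
        yield FakeOutput(text=piece if delta else so_far,
                         finished=(i == len(pieces) - 1))


async def collect(pieces, delta):
    """Rebuild the full response text from a stream."""
    full_text = ""
    async for output in fake_generate(pieces, delta=delta):
        if delta:
            full_text += output.text  # Delta mode: append only the new piece
        else:
            full_text = output.text   # Cumulative mode: each output replaces the last
        if output.finished:
            break
    return full_text


pieces = ["Hello", ", ", "world", "!"]
delta_text = asyncio.run(collect(pieces, delta=True))
cumulative_text = asyncio.run(collect(pieces, delta=False))
print(delta_text)
print(delta_text == cumulative_text)
```

In delta mode the caller is responsible for accumulating the chunks if the full text is needed later; printing each chunk as it arrives, as the tutorial code does, needs no accumulation.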

5) Clean up resources

Before the program exits, always call shutdown() to release GPU memory and background resources: