vLLM Inference Engine Tutorial 2: Async LLM Streaming


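1. Example Code

This article uses vLLM's AsyncLLM (the V1 asynchronous inference engine) to generate text in streaming mode.

The complete code is as follows: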

import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.sampling_params import RequestOutputKind
from vllm.v1.engine.async_llm import AsyncLLM


async def stream_response(engine: AsyncLLM, prompt: str, request_id: str) -> None:
    """
    Stream response from AsyncLLM and display tokens as they arrive.

    This function demonstrates the core streaming pattern:
    1. Create SamplingParams with DELTA output kind
    2. Call engine.generate() and iterate over the async generator
    3. Print new tokens as they arrive
    4. Handle the finished flag to know when generation is complete
    """
    print(f"\nPrompt: {prompt!r}")
    print("Response: ", end="", flush=True)

    # Configure sampling parameters for streaming
    sampling_params = SamplingParams(
        max_tokens=100,
        temperature=0.8,
        top_p=0.95,
        seed=42,  # For reproducible results
        output_kind=RequestOutputKind.DELTA,  # Get only new tokens each iteration
    )

    try:
        # Stream tokens from AsyncLLM
        async for output in engine.generate(
            request_id=request_id, prompt=prompt, sampling_params=sampling_params
        ):
            # Process each completion in the output
            for completion in output.outputs:
                # In DELTA mode, we get only new tokens generated since last iteration
                new_text = completion.text
                if new_text:
                    print(new_text, end="", flush=True)

            # Check if generation is finished
            if output.finished:
                print("\nGeneration complete!")
                break

    except Exception as e:
        print(f"\nError during streaming: {e}")
        raise


async def main():
    print("Initializing AsyncLLM...")

    # Create AsyncLLM engine with simple configuration
    engine_args = AsyncEngineArgs(
        model="/data/xiehao/workspace/models/Qwen/Qwen2.5-1.5B-Instruct",
        enforce_eager=True,  # Faster startup for examples
    )
    engine = AsyncLLM.from_engine_args(engine_args)

    try:
        # Example prompts to demonstrate streaming
        prompts = [
            "The future of artificial intelligence is",
            "In a galaxy far, far away",
            "The key to happiness is",
        ]

        print(f"Running {len(prompts)} streaming examples...")

        # Process each prompt
        for i, prompt in enumerate(prompts, 1):
            print(f"\n{'=' * 60}")
            print(f"Example {i}/{len(prompts)}")
            print(f"{'=' * 60}")

            request_id = f"stream-example-{i}"
            await stream_response(engine, prompt, request_id)

            # Brief pause between examples
            if i < len(prompts):
                await asyncio.sleep(0.5)

        print("\nAll streaming examples completed!")

    finally:
        # Always clean up the engine
        print("Shutting down engine...")
        engine.shutdown()


if __name__ == "__main__":
    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        print("\nInterrupted by user")

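2. Core Steps of AsyncLLM

(1) Standard pattern

The standard pattern for streaming generation on the vLLM V1 engine is:

configure arguments -> create engine -> set sampling -> stream asynchronously -> release resources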

(2) The 5 core steps

1) Configure the engine arguments

Use AsyncEngineArgs to specify the model and runtime options:

engine_args = AsyncEngineArgs(model="/data/xiehao/workspace/models/Qwen/Qwen2.5-1.5B-Instruct", enforce_eager=True)

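model: a Hugging Face model ID or a local path.

enforce_eager: a vLLM option that controls the model execution mode. Setting it to True disables CUDA Graph capture and runs the model in eager execution mode. Enable it when fast startup and stability matter more (as in this example); leave it disabled when you want maximum throughput.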

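2) Create the AsyncLLM engine instance

Initialize the engine with from_engine_args():

engine = AsyncLLM.from_engine_args(engine_args)

At this point the model is loaded into GPU memory and the engine is ready for inference.

3) Set the streaming sampling parameters

Use SamplingParams with output_kind=RequestOutputKind.DELTA to enable incremental output mode: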

sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.8,
    top_p=0.95,
    seed=42,
    output_kind=RequestOutputKind.DELTA,  # Key setting: return only newly generated tokens
)
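
To make the effect of DELTA concrete, here is a minimal sketch in plain Python with no vLLM dependency. The chunk lists are made-up data, and the helper names join_deltas and cumulative_to_deltas are hypothetical. It only illustrates the difference in shape: with RequestOutputKind.CUMULATIVE each output repeats the full text so far, while with DELTA the caller has to concatenate the chunks.

```python
from typing import List


def join_deltas(deltas: List[str]) -> str:
    """Rebuild the full text from DELTA-style chunks by concatenation."""
    return "".join(deltas)


def cumulative_to_deltas(snapshots: List[str]) -> List[str]:
    """Convert CUMULATIVE-style snapshots into the chunks DELTA mode would give."""
    deltas = []
    seen = 0  # number of characters already emitted
    for snapshot in snapshots:
        deltas.append(snapshot[seen:])
        seen = len(snapshot)
    return deltas


# Made-up data standing in for three successive outputs of one request.
cumulative = ["The", "The future", "The future is"]
delta = cumulative_to_deltas(cumulative)
print(delta)               # ['The', ' future', ' is']
print(join_deltas(delta))  # The future is
```

For printing to a terminal, DELTA is the convenient choice because each chunk can be written out directly without tracking what was already shown.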

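4) Call engine.generate() and iterate over the output asynchronously

Pass a unique request_id, the prompt, and sampling_params, then consume the results as a stream with async for. Keyword arguments are used here, as in the complete code, so the call does not depend on the parameter order of generate():

async for output in engine.generate(
    request_id=request_id, prompt=prompt, sampling_params=sampling_params
):
    for completion in output.outputs:
        new_text = completion.text  # In DELTA mode, only the tokens added in this step
        print(new_text, end="", flush=True)

    if output.finished:
        break  # Generation is complete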

Each iteration delivers only the tokens generated since the previous one.

output.finished indicates that the whole request is complete.
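
The consumption loop above can be tried without a GPU using a stand-in stream. The sketch below is an assumption-laden mock, not vLLM code: FakeCompletion, FakeOutput, and fake_generate are hypothetical names that only mimic the fields the loop reads (outputs, text, finished).

```python
import asyncio
from dataclasses import dataclass, field
from typing import AsyncIterator, List


@dataclass
class FakeCompletion:
    text: str  # delta text for this step


@dataclass
class FakeOutput:
    outputs: List[FakeCompletion] = field(default_factory=list)
    finished: bool = False


async def fake_generate(chunks: List[str]) -> AsyncIterator[FakeOutput]:
    """Hypothetical stand-in for engine.generate(): yields one delta per step."""
    for i, chunk in enumerate(chunks):
        await asyncio.sleep(0)  # give control back to the event loop
        yield FakeOutput([FakeCompletion(chunk)], finished=(i == len(chunks) - 1))


async def collect_stream(chunks: List[str]) -> str:
    """Consume the stream like the loop above and return the joined text."""
    parts = []
    async for output in fake_generate(chunks):
        for completion in output.outputs:
            if completion.text:  # skip empty deltas
                parts.append(completion.text)
        if output.finished:
            break
    return "".join(parts)


print(asyncio.run(collect_stream(["Hello", "", ", ", "world"])))  # Hello, world
```

Skipping empty deltas mirrors the `if new_text:` check in the complete example, since a step may carry no new text.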

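5) Release resources

Before the program exits, be sure to call shutdown() to release GPU memory and background resources:

engine.shutdown()

This call usually goes in a finally block so that it always runs.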
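
The following sketch shows why the finally block matters, using a hypothetical FakeEngine that only records whether shutdown() was called. It is an illustration under that assumption and does not touch real GPU resources.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for AsyncLLM; only tracks whether shutdown() ran."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def shutdown(self) -> None:
        self.is_shut_down = True


async def run_with_cleanup(engine: FakeEngine, fail: bool) -> None:
    try:
        await asyncio.sleep(0)  # placeholder for the streaming work
        if fail:
            raise RuntimeError("simulated failure during generation")
    finally:
        engine.shutdown()  # runs on success and on error


ok_engine, bad_engine = FakeEngine(), FakeEngine()
asyncio.run(run_with_cleanup(ok_engine, fail=False))
try:
    asyncio.run(run_with_cleanup(bad_engine, fail=True))
except RuntimeError as exc:
    print(f"caught: {exc}")
print(ok_engine.is_shut_down, bad_engine.is_shut_down)  # True True
```

The error still propagates to the caller, but the engine is shut down in both cases.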

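(3) Additional notes

The whole flow must run in an asynchronous context, that is, with async def + await + asyncio.run().

request_id must be globally unique; it is used to tell concurrent requests apart.

This pattern suits offline streaming inference (one input, incremental output).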
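
One possible way to keep request IDs unique when several requests run at once is to derive them from random UUIDs. The sketch below is a mock under stated assumptions: new_request_id, fake_request, and run_concurrently are hypothetical helpers, and fake_request merely upper-cases the prompt in place of real generation.

```python
import asyncio
import uuid
from typing import Dict, List


def new_request_id(prefix: str = "stream") -> str:
    """Build a request ID that is unique for practical purposes (random UUID)."""
    return f"{prefix}-{uuid.uuid4().hex}"


async def fake_request(request_id: str, prompt: str, results: Dict[str, str]) -> None:
    """Hypothetical stand-in for one streaming request; stores output by ID."""
    parts: List[str] = []
    for word in prompt.split():
        await asyncio.sleep(0)  # let other requests interleave
        parts.append(word.upper())
    results[request_id] = " ".join(parts)


async def run_concurrently(prompts: List[str]) -> Dict[str, str]:
    results: Dict[str, str] = {}
    ids = [new_request_id() for _ in prompts]
    await asyncio.gather(
        *(fake_request(rid, p, results) for rid, p in zip(ids, prompts))
    )
    # Return results keyed by ID, in the same order as the prompts
    return {rid: results[rid] for rid in ids}


outputs = asyncio.run(run_concurrently(["hello world", "good morning"]))
for rid, text in outputs.items():
    print(rid, "->", text)
```

Because each result is stored under its own ID, interleaved execution does not mix up the outputs of different requests.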
