Engineering Streaming Responses for AI APIs: From the SSE Protocol to Token-Level Output Control

Originally published 2026-06-29 16:47:00 · 259 views

This content is licensed under CC 4.0 BY-SA

Tags: #Python

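Streaming output is central to the experience of AI applications. This article breaks down the engineering details of streaming responses at three levels: the underlying SSE protocol, asynchronous stream handling in Python, and token-level output control.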

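1. Why Streaming Output Matters

The problem with non-streaming calls is obvious: after sending a request, the user waits several seconds, sometimes more than ten, before the complete reply appears.

The value of streaming lies in time to first token. The first characters can appear within roughly 200-500 ms of the request, and the rest follows incrementally. Psychologically this turns waiting into reading, which makes the application feel far more responsive.

Non-streaming: request → [wait 3 s] → complete reply appears at once
Streaming:     request → [200 ms] first token → incremental output → done

Streaming is, however, considerably more complex to implement than non-streaming: it involves the SSE protocol, asynchronous processing, backpressure control, and error recovery.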
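
To make the first-token latency discussed above measurable, here is a minimal sketch that times any asynchronous stream of text pieces. The names `fake_token_stream` and `measure_latency` are hypothetical helpers introduced for illustration, and the fake stream only simulates a model with `asyncio.sleep`; in practice you would pass in the pieces extracted from a real API stream.

```python
import asyncio
import time
from typing import AsyncIterator, List, Tuple


async def fake_token_stream(
    first_delay: float, gap: float, tokens: List[str]
) -> AsyncIterator[str]:
    """Hypothetical stand-in for a model stream: wait, then emit tokens."""
    await asyncio.sleep(first_delay)
    for i, tok in enumerate(tokens):
        if i:
            await asyncio.sleep(gap)
        yield tok


async def measure_latency(stream: AsyncIterator[str]) -> Tuple[float, float, str]:
    """Return (time_to_first_token, total_time, full_text), times in seconds."""
    start = time.perf_counter()
    first = None
    parts = []
    async for tok in stream:
        if first is None:
            # Record the moment the first piece arrives
            first = time.perf_counter() - start
        parts.append(tok)
    total = time.perf_counter() - start
    # An empty stream has no first token; fall back to the total time
    return (first if first is not None else total), total, "".join(parts)


if __name__ == "__main__":
    ttft, total, text = asyncio.run(
        measure_latency(fake_token_stream(0.2, 0.02, ["Hel", "lo", "!"]))
    )
    print(f"first token after {ttft:.3f}s, done after {total:.3f}s: {text}")
```

Logging both numbers separately is useful because a change that improves total time does not necessarily improve the first-token latency that users perceive.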

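2. The SSE Protocol: The Foundation of Streaming Output

2.1 What SSE Is

SSE (Server-Sent Events) is a one-way, long-lived streaming mechanism built on plain HTTP. The server keeps the response open and pushes data to the client continuously, so the client does not need to poll.

HTTP request: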
POST /v1/chat/completions HTTP/1.1
Content-Type: application/json

{"model": "gpt-4o", "messages": [...], "stream": true}

HTTP response:
HTTP/1.1 200 OK
Content-Type: text/event-stream

data: {"id":"chatcmpl-xxx","choices":[{"delta":{"content":"你"}}]}

data: {"id":"chatcmpl-xxx","choices":[{"delta":{"content":"好"}}]}

data: {"id":"chatcmpl-xxx","choices":[{"delta":{},"finish_reason":"stop"}]}

data: [DONE]

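Each data: line carries one JSON object, and events are separated by a blank line. delta.content holds the incremental text. The final data: [DONE] line is a plain sentinel rather than JSON and marks the end of the stream.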
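
The format above can be parsed by hand in a few lines. The following is a minimal sketch, not the SDK's actual parser: `iter_sse_deltas` is a hypothetical helper that assumes one `data:` line per event, as in the sample response, and does not join multi-line `data:` fields, which the SSE specification also allows.

```python
import json
from typing import Iterable, Iterator


def iter_sse_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield delta.content strings from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith(":"):
            continue  # blank event separator or SSE comment line
        if not line.startswith("data:"):
            continue  # ignore event:/id:/retry: fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end-of-stream sentinel, not JSON
        event = json.loads(payload)
        for choice in event.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                yield content


sample = [
    'data: {"id":"chatcmpl-xxx","choices":[{"delta":{"content":"Hel"}}]}',
    "",
    'data: {"id":"chatcmpl-xxx","choices":[{"delta":{"content":"lo"}}]}',
    "",
    'data: {"id":"chatcmpl-xxx","choices":[{"delta":{},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]

print("".join(iter_sse_deltas(sample)))  # prints: Hello
```

The final chunk with an empty delta is skipped rather than treated as an error, which matches how the later examples in this article guard on `delta.content`.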

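2.2 SSE vs WebSocket

Why do AI APIs use SSE rather than WebSocket?

Dimension	SSE	WebSocket
Direction	One-way (server → client)	Two-way
Protocol	HTTP	Separate protocol (started via an HTTP Upgrade)
Complexity	Low	High
Compatibility	Good (works with existing HTTP infrastructure)	Needs extra support
Reconnection	Automatic in the browser EventSource API	Must be implemented manually

In an AI chat scenario the client sends a single request and then receives a continuous reply: a typical one-way data flow, which SSE fits well. Note that automatic reconnection is a feature of the browser's EventSource API, which only issues GET requests; a POST-based stream such as the one above still needs its own retry logic.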

3. Asynchronous Stream Handling in Python

3.1 Basic Implementation

Using the asynchronous streaming interface of the OpenAI SDK:

import asyncio
from openai import AsyncOpenAI

async def stream_chat():
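    # Configure the client.
    # You can connect to the official API directly or go through a relay (proxy) service.
    # This example uses the 魔芋AI relay (registration link in the comment inside the client block).
    client = AsyncOpenAI(
        api_key="your-api-key",
        base_url="https://api.moyu.info/v1"
        # 魔芋AI registration URL (kept in a code comment):
        # https://www.moyu.info/register?aff=CRB8
        # New users get a free quota; GPT/Claude/Gemini/DeepSeek and other models are supported.
    )
    
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Implement quicksort in Python"}],
        stream=True
    )
    
    async for chunk in stream:
        # Some chunks (for example a trailing usage chunk) carry no choices,
        # so guard before indexing.
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
    
    print()  # final newline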

asyncio.run(stream_chat())

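3.2 Cancelling a Stream Midway

A user may cancel the request while output is still arriving, so cancellation needs to be handled correctly: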

async def stream_chat_with_cancel():
    client = AsyncOpenAI(
        api_key="your-api-key",
        base_url="https://api.moyu.info/v1"
    )
    
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Write a long article"}],
        stream=True
    )
    
    collected = []
    try:
        async for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                collected.append(content)
                print(content, end="", flush=True)
                
                # If the task running this coroutine is cancelled
                # (task.cancel(), or Ctrl+C under asyncio.run on Python 3.11+),
                # asyncio.CancelledError is raised at the pending await.
    
    except asyncio.CancelledError:
        print(f"\n\n[Cancelled after receiving {len(collected)} chunks]")
        # Clean-up goes here: close the connection, save "".join(collected), etc.
        await stream.close()
        raise  # re-raise so the caller sees the cancellation
    
    return "".join(collected)
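
The cancellation path is easier to see without a network call. The sketch below is an illustration under stated assumptions: `fake_stream` is a hypothetical async generator standing in for the API stream, and the caller cancels the consuming task after a short delay. It shows the pattern of cleaning up, re-raising `asyncio.CancelledError`, and still keeping the partial output by sharing a list with the caller.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(n: int, delay: float = 0.01) -> AsyncIterator[str]:
    """Hypothetical stand-in for an API stream that emits n pieces."""
    for i in range(n):
        await asyncio.sleep(delay)
        yield f"tok{i} "


async def consume(collected: List[str]) -> None:
    """Append pieces to a caller-owned list so partial output survives cancel."""
    stream = fake_stream(1000)
    try:
        async for piece in stream:
            collected.append(piece)
    except asyncio.CancelledError:
        # Release resources first, then let the cancellation propagate
        await stream.aclose()
        raise


async def run_with_cancel(after: float) -> List[str]:
    collected: List[str] = []
    task = asyncio.create_task(consume(collected))
    await asyncio.sleep(after)
    task.cancel()  # e.g. the user pressed a "stop" button
    try:
        await task
    except asyncio.CancelledError:
        pass  # expected: the task was cancelled on purpose
    return collected


if __name__ == "__main__":
    parts = asyncio.run(run_with_cancel(0.1))
    print(f"received {len(parts)} pieces before cancel")
```

Passing the list in from outside is one design choice among several; it avoids swallowing the cancellation just to return the partial text.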

3.3 Backpressure Control

If the consumer is slower than the producer, backpressure is needed to keep memory from growing without bound:

import asyncio
from openai import AsyncOpenAI

async def stream_with_backpressure():
    client = AsyncOpenAI(
        api_key="your-api-key",
        base_url="https://api.moyu.info/v1"
    )
    
    # Use a Queue as the buffer, with a maximum capacity
    buffer = asyncio.Queue(maxsize=100)
    
    async def producer():
        """Receive data from the API and put it on the queue."""
        try:
            stream = await client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": "Tell a long story"}],
                stream=True
            )
            async for chunk in stream:
                if chunk.choices and chunk.choices[0].delta.content:
                    # put() suspends while the queue is full; that is the backpressure
                    await buffer.put(chunk.choices[0].delta.content)
        finally:
            await buffer.put(None)  # end-of-stream sentinel
    
    async def consumer():
        """Take data off the queue and process it."""
        total = 0
        while True:
            content = await buffer.get()
            if content is None:
                break
            # Simulate slow consumption (writing a file, calling another API, ...)
            await asyncio.sleep(0.01)
            total += len(content)
        print(f"\nProcessed {total} characters in total")
    
    # Run producer and consumer concurrently
    await asyncio.gather(producer(), consumer())
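
To check that the bounded queue really caps memory, the same producer/consumer shape can be run against synthetic data. This is a sketch with hypothetical names (`run_pipeline`, integer items instead of text chunks); it records the largest queue size observed so the bound can be verified.

```python
import asyncio
from typing import List, Tuple


async def run_pipeline(
    n_items: int, maxsize: int, consume_delay: float
) -> Tuple[List[int], int]:
    """Run a fast producer against a slow consumer; return (items, peak size)."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    peak = 0

    async def producer() -> None:
        nonlocal peak
        try:
            for i in range(n_items):
                await queue.put(i)  # suspends while the queue is full
                peak = max(peak, queue.qsize())
        finally:
            await queue.put(None)  # end-of-stream sentinel

    async def consumer() -> List[int]:
        seen = []
        while True:
            item = await queue.get()
            if item is None:
                break
            await asyncio.sleep(consume_delay)  # simulate slow processing
            seen.append(item)
        return seen

    _, seen = await asyncio.gather(producer(), consumer())
    return seen, peak


if __name__ == "__main__":
    items, peak = asyncio.run(run_pipeline(50, maxsize=5, consume_delay=0.001))
    print(f"processed {len(items)} items, peak queue size {peak}")
```

With a real API stream, a producer that stops reading for too long may hit a server-side or proxy timeout, so the queue size and consumer speed should be chosen with that in mind.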

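4. Token-Level Output Control

4.1 Token Accounting in Streaming Mode

In a non-streaming call, token counts are returned directly in response.usage. A streaming call does not return usage by default; passing stream_options={"include_usage": True} asks the server to append a final chunk that carries it: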

async def stream_with_usage():
    client = AsyncOpenAI(
        api_key="your-api-key",
        base_url="https://api.moyu.info/v1"
    )
    
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Explain quantum computing"}],
        stream=True,
        stream_options={"include_usage": True}  # the key parameter
    )
    
    prompt_tokens = 0
    completion_tokens = 0
    
    async for chunk in stream:
        # Content chunk
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
        
        # Usage chunk (the last chunk; its choices list is empty)
        if chunk.usage:
            prompt_tokens = chunk.usage.prompt_tokens
            completion_tokens = chunk.usage.completion_tokens
    
    print(f"\n\nInput tokens: {prompt_tokens}")
    print(f"Output tokens: {completion_tokens}")
    print(f"Total: {prompt_tokens + completion_tokens}")

4.2 Limiting Output Length

Sometimes generation should stop once the output reaches a certain length:

async def stream_with_limit(max_chars=500):
    client = AsyncOpenAI(
        api_key="your-api-key",
        base_url="https://api.moyu.info/v1"
    )
    
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Write an essay"}],
        stream=True
    )
    
    char_count = 0
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            content = chunk.choices[0].delta.content
            char_count += len(content)
            
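            if char_count > max_chars:
                print(f"\n[Reached the {max_chars}-character limit, stopping]")
                # Call close() to shut down the stream; the chunk that crossed
                # the limit is not printed.
                await stream.close()
                break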
            
            print(content, end="", flush=True)
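
The limit logic can also be factored into a reusable wrapper that cuts the output at exactly the requested length. The sketch below is one possible approach, not part of the SDK: `limit_chars` is a hypothetical helper over any async iterator of strings, and it closes the source only if the source exposes `aclose()` (the SDK stream used above exposes `close()` instead, so the caller would close it separately).

```python
import asyncio
from typing import AsyncIterator, List


async def limit_chars(stream: AsyncIterator[str], max_chars: int) -> AsyncIterator[str]:
    """Yield pieces from stream until exactly max_chars characters were emitted."""
    remaining = max_chars
    try:
        async for piece in stream:
            if len(piece) >= remaining:
                if remaining > 0:
                    yield piece[:remaining]  # emit only the part that still fits
                return
            remaining -= len(piece)
            yield piece
    finally:
        # Close the source if it supports aclose() (async generators do)
        closer = getattr(stream, "aclose", None)
        if closer is not None:
            await closer()


async def fake_stream(pieces: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for the text pieces of an API stream."""
    for piece in pieces:
        await asyncio.sleep(0)
        yield piece


async def collect(pieces: List[str], max_chars: int) -> str:
    return "".join([p async for p in limit_chars(fake_stream(pieces), max_chars)])


if __name__ == "__main__":
    print(asyncio.run(collect(["abc", "def", "ghi"], 5)))  # prints: abcde
```

Cutting the stream on the client only stops reception. If the goal is to cap generation cost, the request's `max_tokens` parameter is the more direct control.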

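4.3 Triggering Actions on Keywords

Detect specific keywords in the streamed output and trigger an action, for example highlighting when a code block is detected: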

import re

async def stream_with_keyword_detection():
    client = AsyncOpenAI(
        api_key="your-api-key",
        base_url="https://api.moyu.info/v1"
    )
    
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Write a Python sorting function and explain it"}],
        stream=True
    )
    
    buffer = ""
    in_code_block = False
    
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            content = chunk.choices[0].delta.content
            buffer += content
            
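            # Detect the start/end of a code block.
            # Note: the language tag may arrive in a later chunk than the fence,
            # in which case it is reported as "text".
            if "```" in buffer:
                if not in_code_block:
                    # Code block starts
                    lang_match = re.search(r'```(\w+)', buffer)
                    lang = lang_match.group(1) if lang_match else "text"
                    print(f"\n[Code block start: {lang}]")
                    in_code_block = True
                else:
                    # Code block ends
                    print("\n[Code block end]")
                    in_code_block = False
                buffer = ""
            
            print(content, end="", flush=True)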
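
Because chunk boundaries are arbitrary, a fence marker or its language tag can be split across two chunks. One way to make detection robust is to buffer until a full line is available and only then inspect it. The following is a sketch under that assumption; `FenceTracker` is a hypothetical class, and it only recognizes fences that start a line.

```python
from typing import List, Tuple

FENCE = "`" * 3  # the Markdown code-fence marker


class FenceTracker:
    """Track Markdown code fences in text that arrives in arbitrary chunks."""

    def __init__(self) -> None:
        self._pending = ""  # text received after the last newline
        self.in_code_block = False

    def feed(self, chunk: str) -> List[Tuple[str, str]]:
        """Return ("open", lang) / ("close", "") events completed by this chunk."""
        events: List[Tuple[str, str]] = []
        self._pending += chunk
        # Everything before the last newline is a complete line
        *lines, self._pending = self._pending.split("\n")
        for line in lines:
            stripped = line.strip()
            if not stripped.startswith(FENCE):
                continue
            if self.in_code_block:
                events.append(("close", ""))
            else:
                events.append(("open", stripped[len(FENCE):].strip() or "text"))
            self.in_code_block = not self.in_code_block
        return events


if __name__ == "__main__":
    tracker = FenceTracker()
    # The opening fence and its language tag are split across three chunks
    for chunk in ["Intro\n`", "``py", "thon\nprint(1)\n``", "`\nDone\n"]:
        for kind, lang in tracker.feed(chunk):
            print(kind, lang)
```

A fence on the very last line is reported only once a newline follows it, so a caller that needs the final event can feed one extra "\n" when the stream ends.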

5. Error Handling and Retries

5.1 Error Types in Streaming Requests

import asyncio

from openai import (
    AsyncOpenAI,
    APITimeoutError,
    APIConnectionError,
    RateLimitError,
    InternalServerError
)

async def stream_with_retry(prompt, max_retries=3):
    client = AsyncOpenAI(
        api_key="your-api-key",
        base_url="https://api.moyu.info/v1",
        timeout=30.0  # request timeout in seconds
    )
    
    for attempt in range(max_retries):
        emitted = False  # True once any content has been yielded to the caller
        try:
            stream = await client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
                stream=True
            )
            
            async for chunk in stream:
                if chunk.choices and chunk.choices[0].delta.content:
                    emitted = True
                    yield chunk.choices[0].delta.content
            return  # success: stop retrying
            
        except RateLimitError:
            # 429: rate limited, back off and retry
            if emitted:
                raise  # retrying now would replay text the caller already has
            wait = 2 ** attempt
            print(f"\n[Rate limited, retrying in {wait}s]")
            await asyncio.sleep(wait)
            
        except APITimeoutError:
            # Timeout: retry immediately (a lower max_tokens may also help)
            if emitted:
                raise
            print("\n[Timeout, retrying]")
            continue
            
        except APIConnectionError:
            # Connection error: check the relay's status
            if emitted:
                raise
            print("\n[Connection error, retrying]")
            await asyncio.sleep(1)
            continue
            
        except InternalServerError:
            # 5xx: server-side error
            if emitted:
                raise
            print("\n[Server error, retrying]")
            await asyncio.sleep(2)
            continue
    
    raise RuntimeError(f"Still failing after {max_retries} attempts")
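
The fixed waits above can be generalized. When many clients retry on the same schedule they tend to collide again, so a common refinement is exponential backoff with random jitter. The sketch below is illustrative only: `backoff_delay` and `retry_async` are hypothetical helpers, and the defaults (1 s base, 30 s cap) are assumptions rather than values from any API documentation.

```python
import asyncio
import random
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 30.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


async def retry_async(
    op: Callable[[], Awaitable[T]],
    retry_on: Tuple[Type[BaseException], ...],
    max_retries: int = 3,
    base: float = 1.0,
) -> T:
    """Await op(), retrying on the given exception types with jittered backoff."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(max_retries):
        try:
            return await op()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(backoff_delay(attempt, base=base))
    raise AssertionError("unreachable")


if __name__ == "__main__":
    calls = {"n": 0}

    async def flaky() -> str:
        # Fails twice, then succeeds
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("simulated failure")
        return "ok"

    print(asyncio.run(retry_async(flaky, (ConnectionError,), base=0.01)))
```

This wrapper suits operations that can be restarted from scratch, such as opening the stream. Once text has been delivered to the caller, a plain retry would duplicate it.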

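5.2 Resuming After a Dropped Stream

If the stream drops midway, you can issue a new request and ask the model to continue from where it stopped. The continuation is approximate: the model may repeat or rephrase part of the earlier text.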

async def stream_with_resume(prompt, max_chars=10000):
    client = AsyncOpenAI(
        api_key="your-api-key",
        base_url="https://api.moyu.info/v1"
    )
    
    collected = ""
    retries = 0
    
    while len(collected) < max_chars and retries < 3:
        try:
            # If some content has already been received, ask the model to continue from there
            messages = [{"role": "user", "content": prompt}]
            if collected:
                messages = [
                    {"role": "user", "content": prompt},
                    {"role": "assistant", "content": collected},
                    {"role": "user", "content": "Please continue"}
                ]
            
            stream = await client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                stream=True
            )
            
            async for chunk in stream:
                if chunk.choices and chunk.choices[0].delta.content:
                    content = chunk.choices[0].delta.content
                    collected += content
                    print(content, end="", flush=True)
            
            break  # finished normally
            
        except Exception as e:
            retries += 1
            print(f"\n[Stream dropped, retry {retries}/3: {e}]")
            await asyncio.sleep(2)
    
    return collected
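
A resumed reply sometimes begins by repeating the last few words of the text already received. One heuristic for stitching the two parts together is to drop the longest prefix of the continuation that duplicates the tail of the collected text. The sketch below is an assumption-laden illustration: `merge_with_overlap` is a hypothetical helper, and the `min_overlap` and `max_overlap` thresholds are arbitrary values chosen to avoid trimming short coincidental matches.

```python
def merge_with_overlap(
    collected: str,
    continuation: str,
    min_overlap: int = 4,
    max_overlap: int = 200,
) -> str:
    """Append continuation, dropping a prefix that repeats the tail of collected."""
    limit = min(len(collected), len(continuation), max_overlap)
    # Try the longest candidate overlap first
    for size in range(limit, min_overlap - 1, -1):
        if collected.endswith(continuation[:size]):
            return collected + continuation[size:]
    return collected + continuation


if __name__ == "__main__":
    print(merge_with_overlap("The quick brown", "brown fox jumps"))
    # prints: The quick brown fox jumps
```

This only handles verbatim repetition. If the model rephrases the overlap, the duplicate remains and has to be handled some other way, for example by asking the model to continue without repeating.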

6. Performance Tips

6.1 Reusing the Connection Pool

import httpx
from openai import AsyncOpenAI

# Create a reusable HTTP client
http_client = httpx.AsyncClient(
    limits=httpx.Limits(
        max_connections=100,
        max_keepalive_connections=20
    ),
    timeout=httpx.Timeout(30.0, connect=5.0)
)

client = AsyncOpenAI(
    api_key="your-api-key",
    base_url="https://api.moyu.info/v1",
    http_client=http_client  # reuse the connection pool
)

6.2 Concurrent Streaming Requests

Issue several streaming requests at once and combine their output:

async def concurrent_streams(prompts: list):
    client = AsyncOpenAI(
        api_key="your-api-key",
        base_url="https://api.moyu.info/v1"
    )
    
    async def single_stream(prompt, index):
        stream = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            stream=True
        )
        result = ""
        async for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                result += chunk.choices[0].delta.content
        return index, result
    
    # Run concurrently
    tasks = [single_stream(p, i) for i, p in enumerate(prompts)]
    results = await asyncio.gather(*tasks)
    
    # Print in input order
    results.sort(key=lambda x: x[0])
    for _, text in results:
        print(text)
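
Launching every request at once can trip rate limits when the prompt list is long. A common way to bound concurrency is an `asyncio.Semaphore`. The sketch below uses hypothetical names (`gather_limited`, zero-argument job callables) and simulated jobs in place of real API calls; `asyncio.gather` already returns results in input order, so no index bookkeeping is needed.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def gather_limited(
    jobs: List[Callable[[], Awaitable[str]]], limit: int
) -> List[str]:
    """Run jobs concurrently, with at most `limit` in flight at any time."""
    sem = asyncio.Semaphore(limit)

    async def run(job: Callable[[], Awaitable[str]]) -> str:
        async with sem:  # wait for a free slot before starting the job
            return await job()

    # gather preserves the order of its arguments in the result list
    return list(await asyncio.gather(*(run(job) for job in jobs)))


def make_job(i: int, state: Dict[str, int]) -> Callable[[], Awaitable[str]]:
    """Build a simulated request that records how many jobs run at once."""
    async def job() -> str:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # stand-in for a streaming API call
        state["active"] -= 1
        return f"r{i}"
    return job


if __name__ == "__main__":
    state = {"active": 0, "peak": 0}
    jobs = [make_job(i, state) for i in range(10)]
    results = asyncio.run(gather_limited(jobs, limit=3))
    print(results, "peak concurrency:", state["peak"])
```

In the article's example, each job would wrap a call to `single_stream` for one prompt.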

7. Complete Example: A Streaming Chat Backend

Combining the components above gives a complete streaming chat backend:

# app.py - a complete streaming chat service
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from openai import AsyncOpenAI
import json
import asyncio

app = FastAPI()

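# Client configuration
# Works with a direct connection or through a relay service
# This example uses the 魔芋AI relay (OpenAI-compatible protocol)
client = AsyncOpenAI(
    api_key="your-api-key",
    base_url="https://api.moyu.info/v1",
    # Relay registration: https://www.moyu.info/register?aff=CRB8
    timeout=60.0
)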

class ChatRequest(BaseModel):
    message: str
    model: str = "gpt-4o-mini"

@app.post("/chat")
async def chat(req: ChatRequest):
    async def generate():
        try:
            stream = await client.chat.completions.create(
                model=req.model,
                messages=[
                    {"role": "system", "content": "You are a technical assistant"},
                    {"role": "user", "content": req.message}
                ],
                stream=True,
                stream_options={"include_usage": True}
            )
            
            total_tokens = 0
            async for chunk in stream:
                if chunk.choices and chunk.choices[0].delta.content:
                    data = {"content": chunk.choices[0].delta.content}
                    yield f"data: {json.dumps(data)}\n\n"
                
                if chunk.usage:
                    total_tokens = chunk.usage.completion_tokens
            
            yield f"data: {json.dumps({'done': True, 'tokens': total_tokens})}\n\n"
            
        except Exception as e:
            yield f"data: {json.dumps({'error': str(e)})}\n\n"
    
    return StreamingResponse(generate(), media_type="text/event-stream")

# Run with: uvicorn app:app --reload

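8. Summary

Implementing streaming output involves four layers:

Protocol layer: understand the SSE format and parse data: lines and the [DONE] marker correctly
Async layer: process the stream with async for and handle cancellation and backpressure correctly
Control layer: token accounting, length limits, keyword detection
Fault-tolerance layer: timeout retries, resuming dropped streams, connection pool reuse

With these in place you can build stable, reliable streaming AI applications. The code in this article uses the OpenAI-compatible protocol and works with a direct connection or any compatible relay service. Questions are welcome in the comments.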