Calling the Qwen API

Last modified 2024-03-14 11:01:54

First published 2024-03-11 22:20:44

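This post introduces vLLM, a GitHub project that provides a high-throughput, memory-efficient inference and serving engine for large language models (LLMs), and shows how to use FastChat to deploy and call different model variants locally, such as Qwen-72B and quantized models.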

GitHub - QwenLM/vllm-gptq: A high-throughput and memory-efficient inference and serving engine for LLMs

pip install fschat

python -m fastchat.serve.controller

python -m fastchat.serve.vllm_worker --model-path $model_path --tensor-parallel-size 2 --trust-remote-code

python -m fastchat.serve.openai_api_server --host localhost --port 8000
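
Before sending requests, it helps to confirm that the API server is reachable and to see which model name the worker registered, because the `model` value in a request has to match that name. The sketch below assumes the server exposes the OpenAI-style `GET /v1/models` endpoint at the host and port used above. `parse_model_ids` and `list_models` are hypothetical helper names introduced here for illustration, not part of FastChat.

```python
import json
import urllib.request

# Assumption: same host and port as the openai_api_server command above.
API_BASE = "http://localhost:8000/v1"


def parse_model_ids(payload: dict) -> list:
    """Extract model ids from an OpenAI-style /models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(api_base: str = API_BASE, timeout: float = 5.0) -> list:
    """Ask the API server which model names are currently registered."""
    with urllib.request.urlopen(f"{api_base}/models", timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))
```

Once the controller, worker, and API server are all running, calling `list_models()` should return a list of names. If the name you plan to pass as `model` is not in it, the worker most likely registered under a different name, often one derived from the model path.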

pip install --upgrade openai==0.28

import openai
# to get proper authentication, make sure to use a valid key that's listed in
# the --api-keys flag. if no flag value is provided, the `api_key` will be ignored.
openai.api_key = "EMPTY"
openai.api_base = "http://localhost:8000/v1"

model = "qwen"
call_args = {
    'temperature': 1.0,
    'top_p': 1.0,
    'top_k': -1,
    'max_tokens': 2048, # output-len
    'presence_penalty': 1.0,
    'fre