zzh/vllm-npu-plugin

mirror of https://github.com/handsomezhuzhu/vllm-npu-plugin.git synced 2026-02-20 11:42:30 +00:00

handsomezhuzhu c00c47a5b2 feat: add environment setup guide plus verification and benchmarking docs

2026-02-11 00:13:26 +08:00

vllm-npu-plugin Verification and Benchmarking Guide

1. Functional Verification (confirm it works first)

1.1 Offline inference test

cd /workspace/mnt/vllm_ascend/vllm-npu-plugin

python -c "
from vllm import LLM
llm = LLM(
    model='/workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct',
    dtype='float16',
    trust_remote_code=True,
)
outputs = llm.generate(['你好，请简单介绍一下自己'])
for out in outputs:
    print(out.outputs[0].text)
"

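Expected: the model loads without errors and prints a coherent reply in Chinese (the prompt asks the model to briefly introduce itself). ✅ Verified.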

1.2 OpenAI API server test

# Terminal 1: start the server
python -m vllm.entrypoints.openai.api_server \
  --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
  --dtype float16 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000

# Terminal 2: send a request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "什么是大语言模型？"}],
    "max_tokens": 256
  }'
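The same request can also be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_payload`, `extract_reply`, and `chat` are hypothetical, and the response layout (`choices[0].message.content`) assumes the server follows the OpenAI chat-completions format.

```python
import json
import urllib.request

# Model path and endpoint are taken from the curl example in this section.
MODEL = "/workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct"
URL = "http://localhost:8000/v1/chat/completions"


def build_chat_payload(prompt, max_tokens=256, model=MODEL):
    """Build the JSON body for a single-turn chat completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def extract_reply(response):
    """Pull the assistant text out of an OpenAI-style chat completion response."""
    return response["choices"][0]["message"]["content"]


def chat(prompt, url=URL):
    """Send one request to a running server and return the reply text."""
    body = json.dumps(build_chat_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as resp:
        return extract_reply(json.load(resp))
```

Calling `chat("什么是大语言模型？")` requires the server from Terminal 1 to be running; the two helper functions can be exercised without it.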

1.3 Streaming output test

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "用Python写一个快速排序"}],
    "max_tokens": 512,
    "stream": true
  }'
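With "stream": true the server returns server-sent events, one `data: {...}` line per chunk, ending with `data: [DONE]`. To check that the chunks reassemble into sensible text, a small parser along these lines may help. It is a sketch that assumes the OpenAI streaming chunk layout (`choices[].delta.content`); `collect_stream_text` is a hypothetical helper name.

```python
import json


def collect_stream_text(lines):
    """Concatenate the content deltas from an iterable of SSE lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and anything that is not a data event
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)
```

The function accepts any iterable of strings, so it can be fed lines saved from the curl output (for example with `curl -N ... > stream.txt`).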

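2. Performance Benchmarks

vLLM v0.11.0 ships built-in benchmarking tools, invoked through the vllm bench command.

2.1 Latency test

Measures end-to-end latency for one batch (here a single request, since --batch-size is 1), from submission to the last output token: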

vllm bench latency \
  --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
  --dtype float16 \
  --trust-remote-code \
  --input-len 128 \
  --output-len 128 \
  --batch-size 1 \
  --num-iters 10

Metrics to watch:

avg latency: mean latency in seconds
p99 latency: 99th-percentile latency
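To make the two metrics concrete, the sketch below computes a mean and a linearly interpolated percentile from a list of per-iteration latencies. It is an illustration of the definitions, not vLLM's own code, and the function names are hypothetical.

```python
import math


def percentile(values, pct):
    """Return the pct-th percentile of values using linear interpolation."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def summarize(latencies):
    """Summarize per-iteration latencies (in seconds) as mean and P99."""
    return {
        "avg": sum(latencies) / len(latencies),
        "p99": percentile(latencies, 99),
    }
```

With only 10 iterations, as in the command above, the P99 value is interpolated between the two slowest runs, so it is close to the worst case rather than a stable tail estimate. Increasing --num-iters gives a more meaningful P99.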

2.2 Throughput test

Measures aggregate throughput for batched processing:

vllm bench throughput \
  --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
  --dtype float16 \
  --trust-remote-code \
  --input-len 256 \
  --output-len 256 \
  --num-prompts 100

Metrics to watch:

Throughput: X.XX requests/s
Throughput: X.XX tokens/s (output tokens per second)
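The relationship between these numbers is simple arithmetic, sketched below. The elapsed time used in the example is a made-up value for illustration, not a measured result, and `throughput` is a hypothetical helper.

```python
def throughput(num_prompts, output_len, elapsed_s, input_len=0):
    """Derive request and token throughput from a fixed-length benchmark run."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    output_tokens = num_prompts * output_len
    total_tokens = num_prompts * (input_len + output_len)
    return {
        "requests_per_s": num_prompts / elapsed_s,
        "output_tokens_per_s": output_tokens / elapsed_s,
        "total_tokens_per_s": total_tokens / elapsed_s,
    }


# Hypothetical run: the 100 prompts from the command above finishing in 50 s.
example = throughput(num_prompts=100, output_len=256, elapsed_s=50.0, input_len=256)
```

When comparing runs, check whether a reported tokens/s figure counts output tokens only or input plus output tokens, since the two differ by a factor of two for this configuration.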

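2.3 Online serving test (Serving Benchmark)

Simulates a realistic online-serving workload (the API server must be started first):

# Terminal 1: start the server (same as 1.2)

# Terminal 2: run the load test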
vllm bench serve \
  --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
  --base-url http://localhost:8000 \
  --dataset-name random \
  --random-input-len 256 \
  --random-output-len 128 \
  --num-prompts 200 \
  --request-rate 5

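Metrics to watch:

Request throughput: requests per second (req/s)
Output token throughput: output tokens per second (tok/s)
TTFT (Time to First Token): latency until the first token (reflects prefill performance)
TPOT (Time per Output Token): average generation time per token after the first (reflects decode performance)
ITL (Inter-Token Latency): gap between consecutive tokens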
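The three latency metrics can be derived from per-token arrival timestamps for a single request. The sketch below uses the common definitions; the benchmark tool's own bookkeeping may differ in detail, and `token_metrics` is a hypothetical name.

```python
def token_metrics(request_start, token_times):
    """Compute TTFT, TPOT and ITL (seconds) from token arrival timestamps."""
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    ttft = token_times[0] - request_start
    end_to_end = token_times[-1] - request_start
    # Gaps between consecutive tokens.
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    n_tokens = len(token_times)
    # TPOT excludes the first token, which is accounted for by TTFT.
    tpot = (end_to_end - ttft) / (n_tokens - 1) if n_tokens > 1 else 0.0
    return {"ttft": ttft, "tpot": tpot, "itl": itl, "e2e": end_to_end}
```

TPOT is the mean of the ITL values for a single request, so the two only diverge when looking at distributions: ITL percentiles expose individual stalls that an averaged TPOT hides.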

2.4 Comparing batch sizes

# Batch size 1 (latency-oriented)
vllm bench latency --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
  --dtype float16 --trust-remote-code --input-len 128 --output-len 128 --batch-size 1

# Batch size 8 (throughput-oriented)
vllm bench latency --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
  --dtype float16 --trust-remote-code --input-len 128 --output-len 128 --batch-size 8

# Batch size 32 (high throughput)
vllm bench latency --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
  --dtype float16 --trust-remote-code --input-len 128 --output-len 128 --batch-size 32

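3. Suggested Test Matrix

3.1 Baseline performance survey

Test	input_len	output_len	batch_size	Purpose
Single-request latency	128	128	1	Baseline latency
Small-batch latency	128	128	8	Batching efficiency
Long input	1024	128	1	Prefill performance
Long output	128	1024	1	Decode performance
High throughput	256	256	32	Maximum throughput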
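One way to run the matrix consistently is to generate the commands from data. The sketch below assumes every row is run with vllm bench latency and reuses only the flags already shown in section 2; the names `MATRIX` and `latency_command` are hypothetical.

```python
import shlex

MODEL = "/workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct"

# (test name, input_len, output_len, batch_size), copied from the table above.
MATRIX = [
    ("single-request latency", 128, 128, 1),
    ("small-batch latency", 128, 128, 8),
    ("long input", 1024, 128, 1),
    ("long output", 128, 1024, 1),
    ("high throughput", 256, 256, 32),
]


def latency_command(input_len, output_len, batch_size, model=MODEL):
    """Build the argument list for one vllm bench latency run."""
    return [
        "vllm", "bench", "latency",
        "--model", model,
        "--dtype", "float16",
        "--trust-remote-code",
        "--input-len", str(input_len),
        "--output-len", str(output_len),
        "--batch-size", str(batch_size),
    ]


if __name__ == "__main__":
    # Print the commands for review; pass each list to subprocess.run to execute.
    for name, input_len, output_len, batch_size in MATRIX:
        print(f"# {name}")
        print(shlex.join(latency_command(input_len, output_len, batch_size)))
```

Printing first and executing later keeps the runs reproducible: the printed lines can be saved next to the results they produced.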

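3.2 Comparison with the old setup (if the old environment still exists)

If the old vllm_0.2.7_ascend environment is still usable, run it with the same parameters for comparison:

# Run in the old environment (0.2.7 uses the legacy script)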
cd /path/to/vllm_0.2.7_ascend
python benchmarks/benchmark_latency.py \
  --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
  --dtype float16 --input-len 128 --output-len 128 --batch-size 1

# Run in the new environment
vllm bench latency \
  --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
  --dtype float16 --trust-remote-code --input-len 128 --output-len 128 --batch-size 1

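Note: vLLM itself went through major architectural changes between 0.2.7 and 0.11.0 (the V1 engine, continuous batching improvements, and more), so the new setup should show a clear throughput advantage, especially under concurrent load.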

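3.3 Stability test

# Sustained load test. 1000 prompts at 10 req/s is only about 100 s of arrivals;
# for a 600 s (10 min) run at this rate, set --num-prompts 6000.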
vllm bench serve \
  --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
  --base-url http://localhost:8000 \
  --dataset-name random \
  --random-input-len 256 \
  --random-output-len 256 \
  --num-prompts 1000 \
  --request-rate 10

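What to watch:

Whether any requests fail
Whether latency grows over time (a possible sign of a memory leak)
Whether NPU memory usage stays stable (monitor with npu-smi info)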
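Whether latency grows over time can be checked numerically instead of by eye. The sketch below fits a least-squares line to per-request latencies in completion order and flags a run when the fitted growth across the run exceeds a fraction of the mean. The 10% threshold is an arbitrary assumption, and both function names are hypothetical.

```python
def latency_slope(latencies):
    """Least-squares slope of latency versus request index (seconds per request)."""
    n = len(latencies)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(latencies) / n
    covariance = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(latencies))
    variance = sum((i - mean_x) ** 2 for i in range(n))
    return covariance / variance


def is_drifting(latencies, threshold=0.10):
    """Return True if fitted latency growth over the run exceeds threshold * mean."""
    growth = latency_slope(latencies) * (len(latencies) - 1)
    mean = sum(latencies) / len(latencies)
    return growth > threshold * mean
```

A positive result is only a hint: latency also rises when the request rate exceeds what the server can sustain and requests queue up, so it should be read together with the memory readings.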

4. Monitoring Tools

# Monitor NPU utilization and memory in real time
watch -n 1 npu-smi info

# Show detailed NPU usage information
npu-smi info -t usages -i 0

# Get memory information from Python
python -c "
import torch
import torch_npu
print(f'Total: {torch.npu.get_device_properties(0).total_memory / 1024**3:.1f} GB')
print(f'Allocated: {torch.npu.memory_allocated(0) / 1024**3:.1f} GB')
print(f'Cached: {torch.npu.memory_reserved(0) / 1024**3:.1f} GB')
"
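For a stability run it can be useful to record memory at intervals rather than read it once. The sketch below takes the reader as a parameter so it can be tried without an NPU; `sample_memory` and `growth_gib` are hypothetical helpers.

```python
import time

GIB = 1024 ** 3


def sample_memory(read_bytes, samples, interval_s=1.0):
    """Call read_bytes() `samples` times and return the readings in GiB."""
    readings = []
    for i in range(samples):
        readings.append(read_bytes() / GIB)
        if i < samples - 1:
            time.sleep(interval_s)  # wait between samples, but not after the last one
    return readings


def growth_gib(readings):
    """Difference between the last and first reading, in GiB."""
    return readings[-1] - readings[0]
```

On an NPU host, after importing torch and torch_npu, the reader could be `lambda: torch.npu.memory_allocated(0)`. That call reports memory allocated by the current Python process only, so to watch a separately launched API server, npu-smi info remains the tool to use.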

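5. Suggested Next Steps

Short term (verification phase)

✅ Get basic inference running (done)
🔲 Run the latency benchmark and record baseline numbers
🔲 Run the throughput benchmark and evaluate throughput
🔲 Run the serving benchmark and evaluate online-serving performance
🔲 If the old environment is available, compare old and new

Medium term (optimization phase)

🔲 Test quantized models (W8A8) and compare accuracy and performance
🔲 Test whether Torchair graph compilation mode works
🔲 Test the effect of chunked prefill on long inputs
🔲 Test larger models (such as Qwen2.5-72B, which needs multi-card TP)

Long term (production deployment)

🔲 Long-running stability test (24h+)
🔲 Multi-card Tensor Parallel verification
🔲 Integration with the business API gateway
🔲 Build the C++ extension (vllm_npu_C) to get full performance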

6. Command Quick Reference

# Latency test
vllm bench latency --model <MODEL> --dtype float16 --trust-remote-code --input-len 128 --output-len 128 --batch-size 1

# Throughput test
vllm bench throughput --model <MODEL> --dtype float16 --trust-remote-code --input-len 256 --output-len 256 --num-prompts 100

# Serving load test (start the API server first)
vllm bench serve --model <MODEL> --base-url http://localhost:8000 --dataset-name random --random-input-len 256 --random-output-len 128 --num-prompts 200 --request-rate 5
