feat: add environment setup guide, verification and benchmarking docs
BENCHMARK.md · new file · 253 lines

# vllm-npu-plugin Verification & Benchmarking Guide

## 1. Functional Verification (confirm everything works first)

### 1.1 Offline Inference Test

```bash
cd /workspace/mnt/vllm_ascend/vllm-npu-plugin

python -c "
from vllm import LLM
llm = LLM(
    model='/workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct',
    dtype='float16',
    trust_remote_code=True,
)
# Prompt: 'Hello, please briefly introduce yourself'
outputs = llm.generate(['你好,请简单介绍一下自己'])
for out in outputs:
    print(out.outputs[0].text)
"
```

Expected: the model loads normally and produces a reasonable Chinese reply. ✅ Verified.
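
Before moving on, it can help to confirm the environment itself is sane. A minimal sketch, assuming `torch_npu` is installed and exposes the `torch.npu` namespace the same way the memory snippet in section 4 uses it:

```bash
# Check the vLLM version and that the NPU is visible to PyTorch
python -c "
import vllm
import torch
import torch_npu  # registers the torch.npu backend
print('vLLM version:', vllm.__version__)
print('NPU available:', torch.npu.is_available())
print('NPU count:', torch.npu.device_count())
"
```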

### 1.2 OpenAI API Server Test

```bash
# Terminal 1: start the server
python -m vllm.entrypoints.openai.api_server \
    --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
    --dtype float16 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000

# Terminal 2: send a request
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "什么是大语言模型?"}],
        "max_tokens": 256
    }'
```
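
Before sending the full chat request, a quick liveness check against the model-list route of the OpenAI-compatible server can save time:

```bash
# Confirm the server is up and see which model it serves
curl http://localhost:8000/v1/models
```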

### 1.3 Streaming Output Test

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "用Python写一个快速排序"}],
        "max_tokens": 512,
        "stream": true
    }'
```
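
With `stream: true` the server returns Server-Sent Events. If the `openai` Python package is installed (an assumption; the server itself does not require it), the same stream can be consumed programmatically:

```bash
# Consume the SSE stream with the openai client (pip install openai)
python -c "
from openai import OpenAI
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
stream = client.chat.completions.create(
    model='/workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct',
    messages=[{'role': 'user', 'content': '用Python写一个快速排序'}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end='', flush=True)
"
```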

---

## 2. Performance Benchmarking

vLLM v0.11.0 ships with built-in benchmarking tools, invoked through the `vllm bench` command.
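
The subcommands used below can be listed from the CLI itself:

```bash
# List available benchmark subcommands (latency / throughput / serve)
vllm bench --help
```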

### 2.1 Latency Test

Measures end-to-end latency of a single request, from input to output:

```bash
vllm bench latency \
    --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
    --dtype float16 \
    --trust-remote-code \
    --input-len 128 \
    --output-len 128 \
    --batch-size 1 \
    --num-iters 10
```

**Key metrics**:
- `avg latency` — average latency in seconds
- `p99 latency` — 99th-percentile latency
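
To keep baselines comparable across runs, it helps to archive each run's output; a minimal sketch using `tee` (the log-file naming scheme is arbitrary):

```bash
# Archive the benchmark output in a timestamped log file
ts=$(date +%Y%m%d_%H%M%S)
vllm bench latency \
    --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
    --dtype float16 --trust-remote-code \
    --input-len 128 --output-len 128 --batch-size 1 --num-iters 10 \
    2>&1 | tee "latency_${ts}.log"
```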

### 2.2 Throughput Test

Measures total throughput under batched processing:

```bash
vllm bench throughput \
    --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
    --dtype float16 \
    --trust-remote-code \
    --input-len 256 \
    --output-len 256 \
    --num-prompts 100
```

**Key metrics**:
- `Throughput: X.XX requests/s`
- `Throughput: X.XX tokens/s` (output tokens per second)
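
With fixed lengths, the two figures are linked: this run generates 100 × 256 = 25,600 output tokens, so output tok/s should be roughly requests/s × 256. A quick cross-check (the req/s figure here is a placeholder, not a measured value):

```bash
# Cross-check the two reported numbers (placeholder req/s value)
python -c "
num_prompts, output_len = 100, 256
req_per_s = 2.5  # placeholder: read this from the benchmark output
print('total output tokens:', num_prompts * output_len)
print('expected output tok/s:', req_per_s * output_len)
"
```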

### 2.3 Online Serving Benchmark

Simulates a realistic online serving workload (the API server must be running first):

```bash
# Terminal 1: start the server (same as 1.2)

# Terminal 2: run the load test
vllm bench serve \
    --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
    --base-url http://localhost:8000 \
    --dataset-name random \
    --random-input-len 256 \
    --random-output-len 128 \
    --num-prompts 200 \
    --request-rate 5
```

**Key metrics**:
- `Request throughput` — requests per second (req/s)
- `Output token throughput` — output tokens per second (tok/s)
- `TTFT (Time to First Token)` — latency to the first token (reflects prefill performance)
- `TPOT (Time per Output Token)` — time per generated token (reflects decode performance)
- `ITL (Inter-Token Latency)` — latency between consecutive tokens
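
These metrics are related by a commonly used approximation (a general definition, not specific to vLLM's reporting): for a request with end-to-end latency $T_{\text{e2e}}$ that produces $n$ output tokens,

$$
\text{TPOT} \approx \frac{T_{\text{e2e}} - \text{TTFT}}{n - 1},
\qquad
\text{TPOT} \approx \operatorname{mean}(\text{ITL})
$$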

### 2.4 Comparing Batch Sizes

```bash
# Batch size 1 (latency-oriented)
vllm bench latency --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
    --dtype float16 --trust-remote-code --input-len 128 --output-len 128 --batch-size 1

# Batch size 8 (throughput-oriented)
vllm bench latency --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
    --dtype float16 --trust-remote-code --input-len 128 --output-len 128 --batch-size 8

# Batch size 32 (high throughput)
vllm bench latency --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
    --dtype float16 --trust-remote-code --input-len 128 --output-len 128 --batch-size 32
```
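
The three invocations differ only in `--batch-size`, so a loop keeps them in sync (equivalent to the commands above):

```bash
# Sweep batch sizes 1 / 8 / 32 with otherwise identical settings
for bs in 1 8 32; do
  vllm bench latency \
    --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
    --dtype float16 --trust-remote-code \
    --input-len 128 --output-len 128 --batch-size "$bs"
done
```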

---

## 3. Suggested Test Matrix

### 3.1 Baseline Performance Sweep

| Test | input_len | output_len | batch_size | Purpose |
|------|-----------|------------|------------|---------|
| Single-request latency | 128 | 128 | 1 | baseline latency |
| Small-batch latency | 128 | 128 | 8 | batching efficiency |
| Long input | 1024 | 128 | 1 | prefill performance |
| Long output | 128 | 1024 | 1 | decode performance |
| High throughput | 256 | 256 | 32 | maximum throughput |
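
The whole matrix can be driven from one script; a sketch with the row values taken from the table above (using the latency benchmark for every row for simplicity; the high-throughput row could instead use `vllm bench throughput`):

```bash
# Run every row of the baseline matrix; fields: input_len output_len batch_size
while read -r inp out bs; do
  vllm bench latency \
    --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
    --dtype float16 --trust-remote-code \
    --input-len "$inp" --output-len "$out" --batch-size "$bs"
done <<'EOF'
128 128 1
128 128 8
1024 128 1
128 1024 1
256 256 32
EOF
```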

### 3.2 Comparison with the Old Setup (if the old environment still exists)

If the old vllm_0.2.7_ascend environment is still usable, run a comparison with identical parameters:

```bash
# Run in the old environment (0.2.7 uses the legacy benchmark scripts)
cd /path/to/vllm_0.2.7_ascend
python benchmarks/benchmark_latency.py \
    --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
    --dtype float16 --input-len 128 --output-len 128 --batch-size 1

# Run in the new environment
vllm bench latency \
    --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
    --dtype float16 --trust-remote-code --input-len 128 --output-len 128 --batch-size 1
```

> **Note**: vLLM itself saw major architectural upgrades from 0.2.7 to 0.11.0 (the V1 engine, continuous-batching improvements, and more),
> so the new setup should show a clear throughput advantage, especially under concurrency.

### 3.3 Stability Test

```bash
# Long-running load test (1000 requests arriving at ~10 req/s)
vllm bench serve \
    --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
    --base-url http://localhost:8000 \
    --dataset-name random \
    --random-input-len 256 \
    --random-output-len 256 \
    --num-prompts 1000 \
    --request-rate 10
```

Watch for:
- any failed requests
- latency growing over time (a sign of memory leaks)
- NPU memory staying stable (monitor with `npu-smi info`; see the sketch below)
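
To watch for drift over the run, one can log `npu-smi` periodically alongside the benchmark; a minimal sketch (the interval and log-file name are arbitrary choices):

```bash
# Snapshot NPU status every 30 s while the load test runs
while true; do
  echo "=== $(date -Is) ===" >> npu_watch.log
  npu-smi info >> npu_watch.log 2>&1
  sleep 30
done
```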

---

## 4. Monitoring Tools

```bash
# Live view of NPU utilization and memory
watch -n 1 npu-smi info

# Detailed per-device NPU information
npu-smi info -t usages -i 0

# Query NPU memory from Python
python -c "
import torch
import torch_npu
print(f'Total: {torch.npu.get_device_properties(0).total_memory / 1024**3:.1f} GB')
print(f'Allocated: {torch.npu.memory_allocated(0) / 1024**3:.1f} GB')
print(f'Cached: {torch.npu.memory_reserved(0) / 1024**3:.1f} GB')
"
```

---

## 5. Suggested Next Steps

### Short term (verification)
1. ✅ Get basic inference running (done)
2. 🔲 Run the latency benchmark and record baseline numbers
3. 🔲 Run the throughput benchmark to evaluate throughput
4. 🔲 Run the serving benchmark to evaluate online-serving performance
5. 🔲 If the old environment is available, compare old vs. new

### Mid term (optimization)
6. 🔲 Test quantized models (W8A8); compare accuracy and performance
7. 🔲 Test whether the Torchair graph-compilation mode works
8. 🔲 Test the effect of chunked prefill on long inputs
9. 🔲 Test larger models (e.g. Qwen2.5-72B; needs multi-card tensor parallelism)

### Long term (production deployment)
10. 🔲 Long-duration stability run (24h+)
11. 🔲 Multi-card tensor-parallel validation (see the sketch after this list)
12. 🔲 Integrate with the business API gateway
13. 🔲 Build the C++ extension (`vllm_npu_C`) for full performance
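
For items 9 and 11, a sketch of a multi-card launch, assuming the plugin supports vLLM's standard `--tensor-parallel-size` flag (the 72B model path below is hypothetical):

```bash
# Example: serve Qwen2.5-72B across 4 NPUs with tensor parallelism
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/Qwen2.5-72B-Instruct \
    --dtype float16 --trust-remote-code \
    --tensor-parallel-size 4 \
    --host 0.0.0.0 --port 8000
```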

---

## 6. Quick Command Reference

```bash
# Latency test
vllm bench latency --model <MODEL> --dtype float16 --trust-remote-code --input-len 128 --output-len 128 --batch-size 1

# Throughput test
vllm bench throughput --model <MODEL> --dtype float16 --trust-remote-code --input-len 256 --output-len 256 --num-prompts 100

# Serving load test (start the API server first)
vllm bench serve --model <MODEL> --base-url http://localhost:8000 --dataset-name random --random-input-len 256 --random-output-len 128 --num-prompts 200 --request-rate 5
```