feat: add environment installation guide, verification and benchmark docs

2026-02-20 11:42:30 +00:00 · 2026-02-11 00:13:26 +08:00
parent ae10ce68f0
commit c00c47a5b2
3 changed files with 635 additions and 0 deletions
--- a/BENCHMARK.md
+++ b/BENCHMARK.md
@@ -0,0 +1,253 @@
+# vllm-npu-plugin Verification and Benchmark Guide
+
+## 1. Functional Verification (confirm it works first)
+
+### 1.1 Offline Inference Test
+
+```bash
+cd /workspace/mnt/vllm_ascend/vllm-npu-plugin
+
+python -c "
+from vllm import LLM
+llm = LLM(
+    model='/workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct',
+    dtype='float16',
+    trust_remote_code=True,
+)
+outputs = llm.generate(['你好，请简单介绍一下自己'])
+for out in outputs:
+    print(out.outputs[0].text)
+"
+```
+
+Expected: the model loads normally and returns a sensible reply in Chinese. ✅ Verified.
+
+### 1.2 OpenAI API Server Test
+
+```bash
+# Terminal 1: start the server
+python -m vllm.entrypoints.openai.api_server \
+  --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
+  --dtype float16 \
+  --trust-remote-code \
+  --host 0.0.0.0 \
+  --port 8000
+
+# Terminal 2: send a request
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "/workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct",
+    "messages": [{"role": "user", "content": "什么是大语言模型？"}],
+    "max_tokens": 256
+  }'
+```
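+
+The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the server from terminal 1 is running; `build_chat_request` and `chat` are illustrative helper names, not part of vLLM:
+
+```python
+import json
+import urllib.request
+
+# Values copied from the curl example above.
+BASE_URL = "http://localhost:8000"
+MODEL = "/workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct"
+
+
+def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
+    """Build the same POST request that the curl command sends (hypothetical helper)."""
+    payload = {
+        "model": MODEL,
+        "messages": [{"role": "user", "content": prompt}],
+        "max_tokens": max_tokens,
+    }
+    return urllib.request.Request(
+        f"{BASE_URL}/v1/chat/completions",
+        data=json.dumps(payload).encode("utf-8"),
+        headers={"Content-Type": "application/json"},
+        method="POST",
+    )
+
+
+def chat(prompt: str) -> str:
+    """Send the request and return the assistant's reply text."""
+    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
+        body = json.load(resp)
+    return body["choices"][0]["message"]["content"]
+
+
+if __name__ == "__main__":
+    print(chat("什么是大语言模型？"))
+```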
+
+### 1.3 Streaming Output Test
+
+```bash
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "/workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct",
+    "messages": [{"role": "user", "content": "用Python写一个快速排序"}],
+    "max_tokens": 512,
+    "stream": true
+  }'
+```
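+
+With `"stream": true`, the server replies with OpenAI-style server-sent events: `data: {...}` lines carrying incremental `delta` content, ending with `data: [DONE]`. A rough sketch of how a client might reassemble the text is shown below; `iter_stream_text` is a hypothetical helper, and the sample lines are hand-written for illustration, not captured server output:
+
+```python
+import json
+from typing import Iterable, Iterator
+
+
+def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
+    """Yield text fragments from OpenAI-style streaming lines (hypothetical helper)."""
+    for line in lines:
+        line = line.strip()
+        if not line.startswith("data:"):
+            continue  # skip blank separator lines between events
+        data = line[len("data:"):].strip()
+        if data == "[DONE]":
+            break  # end-of-stream marker
+        chunk = json.loads(data)
+        for choice in chunk.get("choices", []):
+            text = choice.get("delta", {}).get("content")
+            if text:
+                yield text
+
+
+# Hand-written sample lines for illustration only.
+sample = [
+    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
+    "",
+    'data: {"choices": [{"delta": {"content": "def "}}]}',
+    'data: {"choices": [{"delta": {"content": "quicksort"}}]}',
+    "data: [DONE]",
+]
+print("".join(iter_stream_text(sample)))
+```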
+
+---
+
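+## 2. Performance Benchmarks
+
+vLLM v0.11.0 ships with built-in benchmarking tools, invoked through the `vllm bench` command.
+
+### 2.1 Latency Test
+
+Measures the end-to-end latency of processing one batch, from input to completed output (with `--batch-size 1`, that is a single request):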
+
+```bash
+vllm bench latency \
+  --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
+  --dtype float16 \
+  --trust-remote-code \
+  --input-len 128 \
+  --output-len 128 \
+  --batch-size 1 \
+  --num-iters 10
+```
+
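+**Key metrics**:
+- `avg latency`: mean latency (seconds)
+- `p99 latency`: 99th-percentile latency; with only 10 iterations this is close to the slowest iteration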
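+
+To make the two metrics concrete, here is a small sketch that computes a mean and a linearly interpolated percentile from a list of per-iteration latencies. The `percentile` helper and the sample numbers are invented for illustration; the benchmark tool's own computation may differ in detail:
+
+```python
+import math
+import statistics
+from typing import Sequence
+
+
+def percentile(values: Sequence[float], pct: float) -> float:
+    """Percentile with linear interpolation between the two closest ranks."""
+    if not values:
+        raise ValueError("values must not be empty")
+    ordered = sorted(values)
+    rank = (len(ordered) - 1) * pct / 100.0
+    lower = math.floor(rank)
+    upper = math.ceil(rank)
+    fraction = rank - lower
+    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction
+
+
+# Made-up latencies (seconds) for 10 iterations; one iteration is an outlier.
+latencies = [2.0, 2.1, 2.0, 2.2, 2.1, 2.0, 2.1, 2.0, 2.1, 3.0]
+
+print(f"avg latency: {statistics.fmean(latencies):.3f} s")
+print(f"p99 latency: {percentile(latencies, 99):.3f} s")
+```
+
+With so few samples, the P99 value sits just below the single slowest iteration, so raising `--num-iters` gives a more meaningful tail estimate.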
+
+### 2.2 Throughput Test
+
+Measures the total throughput of batch processing:
+
+```bash
+vllm bench throughput \
+  --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
+  --dtype float16 \
+  --trust-remote-code \
+  --input-len 256 \
+  --output-len 256 \
+  --num-prompts 100
+```
+
+**Key metrics**:
+- `Throughput: X.XX requests/s`
+- `Throughput: X.XX tokens/s` (output tokens per second)
+
+### 2.3 Serving Benchmark
+
+Simulates a realistic online serving scenario (start the API server first):
+
+```bash
+# Terminal 1: start the server (same as 1.2)
+
+# Terminal 2: run the load test
+vllm bench serve \
+  --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
+  --base-url http://localhost:8000 \
+  --dataset-name random \
+  --random-input-len 256 \
+  --random-output-len 128 \
+  --num-prompts 200 \
+  --request-rate 5
+```
+
+**Key metrics**:
+- `Request throughput`: request throughput (req/s)
+- `Output token throughput`: output token throughput (tok/s)
+- `TTFT (Time to First Token)`: latency to the first token (reflects prefill performance)
+- `TPOT (Time per Output Token)`: generation time per token (reflects decode performance)
+- `ITL (Inter-Token Latency)`: latency between consecutive tokens
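+
+The relationship between the three latency metrics can be sketched from per-token arrival timestamps. The definitions below are the commonly used ones (TPOT excludes the first token); the `RequestTiming` type and the sample timestamps are hypothetical, and the benchmark tool's exact computation may differ:
+
+```python
+from dataclasses import dataclass
+from typing import List
+
+
+@dataclass
+class RequestTiming:
+    """Hypothetical record of one streamed request (times in seconds)."""
+    send_time: float          # when the request was sent
+    token_times: List[float]  # arrival time of each output token
+
+
+def ttft(r: RequestTiming) -> float:
+    """Time to first token: dominated by prefill."""
+    return r.token_times[0] - r.send_time
+
+
+def itl(r: RequestTiming) -> List[float]:
+    """Inter-token latencies: gaps between consecutive tokens."""
+    return [b - a for a, b in zip(r.token_times, r.token_times[1:])]
+
+
+def tpot(r: RequestTiming) -> float:
+    """Time per output token: decode time averaged over tokens after the first."""
+    n = len(r.token_times)
+    if n < 2:
+        return 0.0
+    return (r.token_times[-1] - r.token_times[0]) / (n - 1)
+
+
+# Made-up timestamps for illustration.
+req = RequestTiming(send_time=0.0, token_times=[0.5, 0.6, 0.7, 0.9])
+print("TTFT:", ttft(req))
+print("TPOT:", tpot(req))
+print("ITL: ", itl(req))
+```
+
+Under these definitions TPOT is the mean of the ITL values for a single request, while the ITL distribution also exposes individual stalls that the average hides.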
+
+### 2.4 Comparing Batch Sizes
+
+```bash
+# Batch size 1 (latency-oriented)
+vllm bench latency --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
+  --dtype float16 --trust-remote-code --input-len 128 --output-len 128 --batch-size 1
+
+# Batch size 8 (throughput-oriented)
+vllm bench latency --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
+  --dtype float16 --trust-remote-code --input-len 128 --output-len 128 --batch-size 8
+
+# Batch size 32 (high throughput)
+vllm bench latency --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
+  --dtype float16 --trust-remote-code --input-len 128 --output-len 128 --batch-size 32
+```
+
+---
+
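+## 3. Suggested Test Matrix
+
+### 3.1 Baseline Performance Survey
+
+| Test | input_len | output_len | batch_size | Purpose |
+|------|-----------|------------|------------|---------|
+| Single-request latency | 128 | 128 | 1 | Baseline latency |
+| Small-batch latency | 128 | 128 | 8 | Batching efficiency |
+| Long input | 1024 | 128 | 1 | Prefill performance |
+| Long output | 128 | 1024 | 1 | Decode performance |
+| High throughput | 256 | 256 | 32 | Maximum throughput |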
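+
+One way to run the matrix consistently is to generate the commands from the table rather than typing them by hand. The sketch below only prints `vllm bench latency` commands using the flags already shown in section 2.1; the `MATRIX` list and `latency_command` helper are illustrative names, and whether the latency tool is the right driver for the high-throughput row is an assumption carried over from section 2.4:
+
+```python
+import shlex
+from typing import List
+
+MODEL = "/workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct"
+
+# (name, input_len, output_len, batch_size), copied from the table above.
+MATRIX = [
+    ("single-request latency", 128, 128, 1),
+    ("small-batch latency", 128, 128, 8),
+    ("long input", 1024, 128, 1),
+    ("long output", 128, 1024, 1),
+    ("high throughput", 256, 256, 32),
+]
+
+
+def latency_command(input_len: int, output_len: int, batch_size: int) -> List[str]:
+    """Build the argument list for one `vllm bench latency` run."""
+    return [
+        "vllm", "bench", "latency",
+        "--model", MODEL,
+        "--dtype", "float16",
+        "--trust-remote-code",
+        "--input-len", str(input_len),
+        "--output-len", str(output_len),
+        "--batch-size", str(batch_size),
+    ]
+
+
+for name, input_len, output_len, batch_size in MATRIX:
+    print(f"# {name}")
+    print(shlex.join(latency_command(input_len, output_len, batch_size)))
+```
+
+The printed lines can be pasted into a shell, or the argument lists can be passed to `subprocess.run` to execute the runs in sequence.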
+
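+### 3.2 Comparison with the Old Setup (if the old environment still exists)
+
+If the old vllm_0.2.7_ascend environment is still available, run both with identical parameters to compare: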
+
+```bash
+# Run in the old environment (0.2.7 uses the legacy script)
+cd /path/to/vllm_0.2.7_ascend
+python benchmarks/benchmark_latency.py \
+  --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
+  --dtype float16 --input-len 128 --output-len 128 --batch-size 1
+
+# Run in the new environment
+vllm bench latency \
+  --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
+  --dtype float16 --trust-remote-code --input-len 128 --output-len 128 --batch-size 1
+```
+
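+> **Note**: vLLM itself went through major architectural upgrades between 0.2.7 and 0.11.0 (the V1 engine, continuous batching improvements, and more),
+> so the new setup should show a significant throughput advantage, especially under concurrent load.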
+
+### 3.3 Stability Test
+
+```bash
+# Sustained load test: 6000 requests at 10 req/s is roughly 600 s (10 minutes) of traffic
+vllm bench serve \
+  --model /workspace/mnt/vllm_ascend/Qwen2.5-7B-Instruct \
+  --base-url http://localhost:8000 \
+  --dataset-name random \
+  --random-input-len 256 \
+  --random-output-len 256 \
+  --num-prompts 6000 \
+  --request-rate 10
+```
+
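+Watch for:
+- Whether any requests fail
+- Whether latency grows over time (a possible sign of a memory leak)
+- Whether NPU memory usage stays stable (monitor with `npu-smi info`)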
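+
+A simple way to check for latency growth is to compare the mean latency of the earliest requests with that of the latest ones. The sketch below assumes you have already collected per-request latencies in send order by some means; `latency_drift` and the sample data are hypothetical, and any alert threshold is a judgment call:
+
+```python
+import statistics
+from typing import Sequence
+
+
+def latency_drift(latencies: Sequence[float], fraction: float = 0.2) -> float:
+    """Ratio of mean latency in the last `fraction` of requests to the first `fraction`.
+
+    A value near 1.0 suggests stable latency; a value well above 1.0
+    suggests latency is growing over the course of the run.
+    """
+    n = max(1, int(len(latencies) * fraction))
+    if len(latencies) < 2 * n:
+        raise ValueError("not enough samples to compare head and tail")
+    head = statistics.fmean(latencies[:n])
+    tail = statistics.fmean(latencies[-n:])
+    return tail / head
+
+
+# Made-up per-request latencies (seconds), in send order.
+stable = [1.0, 1.1, 0.9, 1.0, 1.0, 1.1, 0.9, 1.0, 1.0, 1.0]
+growing = [1.0, 1.0, 1.2, 1.4, 1.5, 1.7, 1.8, 1.9, 2.0, 2.0]
+
+print(f"stable run drift:  {latency_drift(stable):.2f}")
+print(f"growing run drift: {latency_drift(growing):.2f}")
+```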
+
+---
+
+## 4. Monitoring Tools
+
+```bash
+# Monitor NPU utilization and memory in real time
+watch -n 1 npu-smi info
+
+# Show detailed NPU information
+npu-smi info -t usages -i 0
+
+# Query memory information from Python
+python -c "
+import torch
+import torch_npu
+print(f'Total: {torch.npu.get_device_properties(0).total_memory / 1024**3:.1f} GB')
+print(f'Allocated: {torch.npu.memory_allocated(0) / 1024**3:.1f} GB')
+print(f'Cached: {torch.npu.memory_reserved(0) / 1024**3:.1f} GB')
+"
+```
+
+---
+
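+## 5. Suggested Next Steps
+
+### Short Term (verification phase)
+1. ✅ Get basic inference running (done)
+2. 🔲 Run the latency benchmark and record baseline numbers
+3. 🔲 Run the throughput benchmark to assess throughput
+4. 🔲 Run the serving benchmark to assess online serving performance
+5. 🔲 If the old environment exists, compare old and new
+
+### Medium Term (optimization phase)
+6. 🔲 Test a quantized model (W8A8) and compare accuracy and performance
+7. 🔲 Test whether Torchair graph compilation mode is usable
+8. 🔲 Test the effect of chunked prefill on long inputs
+9. 🔲 Test larger models (e.g. Qwen2.5-72B, which requires multi-card TP)
+
+### Long Term (production deployment)
+10. 🔲 Long-running stability test (24h+)
+11. 🔲 Validate multi-card Tensor Parallel
+12. 🔲 Integrate with the business API gateway
+13. 🔲 Build the C++ extension (`vllm_npu_C`) to get full performance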
+
+---
+
+## 6. Command Quick Reference
+
+```bash
+# Latency test
+vllm bench latency --model <MODEL> --dtype float16 --trust-remote-code --input-len 128 --output-len 128 --batch-size 1
+
+# Throughput test
+vllm bench throughput --model <MODEL> --dtype float16 --trust-remote-code --input-len 256 --output-len 256 --num-prompts 100
+
+# Serving load test (start the API server first)
+vllm bench serve --model <MODEL> --base-url http://localhost:8000 --dataset-name random --random-input-len 256 --random-output-len 128 --num-prompts 200 --request-rate 5
+```