feat: initial vllm-npu-plugin for Ascend NPU adaptation

- NPUPlatform: device management, HCCL process group, config adaptation
- AscendAttentionBackend: npu_fusion_attention (prefill) + npu_incre_flash_attention (decode)
- NPUCommunicator: HCCL-based distributed communication
- NPUWorker: NPU device init, memory profiling
- Custom ops: SiluAndMul, RMS norm, rotary embedding
- Plugin registered via vllm.platform_plugins entry point

Based on the official vllm-ascend plugin pattern, targeting Ascend 910B
commit e75504df72 (2026-02-10 11:06:01 +08:00)
15 changed files with 1344 additions and 0 deletions

README.md (new file, 95 lines)
# vllm-npu-plugin
Ascend NPU platform plugin for vLLM v0.11.0.
## Overview
This package registers as an out-of-tree vLLM platform plugin via the `vllm.platform_plugins` entry-point group, enabling vLLM to run on Huawei Ascend NPU devices.
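In vLLM's plugin mechanism, the entry-point callable is invoked at startup and is expected to return the fully qualified name of the platform class. A minimal sketch of what that looks like (the module layout matches the table below; the exact packaging metadata is not shown in this README and is assumed here):

```python
# vllm_npu/__init__.py (illustrative sketch).
# The package also declares an entry point, roughly:
#   [project.entry-points."vllm.platform_plugins"]
#   npu = "vllm_npu:register"
def register() -> str:
    # vLLM calls every "vllm.platform_plugins" entry point and expects the
    # fully qualified Platform class name (or None to opt out on this host).
    return "vllm_npu.platform.NPUPlatform"
```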
### Components
| Module | Description |
|---|---|
| `vllm_npu/platform.py` | `NPUPlatform` — device management, attention backend routing, config adaptation |
| `vllm_npu/distributed/communicator.py` | `NPUCommunicator` — HCCL-based distributed communication |
| `vllm_npu/attention/attention_v1.py` | `AscendAttentionBackend` — FlashAttention NPU kernels (prefill + decode) |
| `vllm_npu/worker/worker_v1.py` | `NPUWorker` — NPU device initialization and memory profiling |
| `vllm_npu/ops/` | NPU-optimized ops (SiLU+Mul, RMS norm, rotary embedding) |
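A rough sketch of how `platform.py` might wire these components together. The base-class import path, method signatures, and config attribute below are assumptions about vLLM's platform interface rather than the shipped code:

```python
# vllm_npu/platform.py (illustrative sketch, not the actual implementation).
from vllm.platforms.interface import Platform, PlatformEnum


class NPUPlatform(Platform):
    _enum = PlatformEnum.OOT      # registered as an out-of-tree platform
    device_name = "npu"
    device_type = "npu"
    dispatch_key = "PrivateUse1"  # torch_npu dispatches through PrivateUse1

    @classmethod
    def get_attn_backend_cls(cls, *args, **kwargs) -> str:
        # Route attention to the Ascend backend from the table above.
        return "vllm_npu.attention.attention_v1.AscendAttentionBackend"

    @classmethod
    def get_device_communicator_cls(cls) -> str:
        return "vllm_npu.distributed.communicator.NPUCommunicator"

    @classmethod
    def check_and_update_config(cls, vllm_config) -> None:
        # Point vLLM at the NPU worker (attribute path assumed).
        vllm_config.parallel_config.worker_cls = "vllm_npu.worker.worker_v1.NPUWorker"
```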
## Prerequisites
- **Hardware**: Huawei Ascend 910B/910C or compatible NPU
- **Software**:
- CANN (Compute Architecture for Neural Networks) 8.0+
- `torch_npu` matching your PyTorch version
- vLLM v0.11.0 (installed from source)
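Before installing, it can help to confirm that `torch_npu` can see the device. A minimal sanity check, assuming CANN and `torch_npu` are already set up (`torch_npu` patches the `torch.npu` namespace on import):

```python
# check_npu.py -- quick environment sanity check (illustrative).
import torch
import torch_npu  # noqa: F401  (patches the torch.npu namespace)

print("torch:", torch.__version__)
print("NPU available:", torch.npu.is_available())
print("NPU device count:", torch.npu.device_count())
```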
## Installation
```bash
# 1. Install vLLM v0.11.0 from the feat/ascend-npu-adapt-v0.11.0 branch
cd /path/to/vllm
pip install -e .
# 2. Install this plugin
cd /path/to/vllm_npu_plugin
pip install -e .
```
## Verification
```bash
# Verify plugin is discoverable
python -c "
from vllm.plugins import load_plugins_by_group
plugins = load_plugins_by_group('vllm.platform_plugins')
print('Discovered plugins:', list(plugins.keys()))
assert 'npu' in plugins, 'NPU plugin not found!'
print('NPU plugin registered successfully!')
"
# Verify platform detection (requires NPU hardware)
python -c "
from vllm.platforms import current_platform
print(f'Current platform: {current_platform}')
print(f'Device type: {current_platform.device_type}')
"
```
## Usage
Once installed, vLLM will automatically detect the NPU platform if Ascend hardware is available:
```bash
# Run inference on NPU
python -m vllm.entrypoints.openai.api_server \
--model /path/to/model \
--tensor-parallel-size 1 \
--block-size 128
```
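The same run can also be expressed with vLLM's offline Python API; a minimal sketch (the model path is a placeholder, and `block_size=128` mirrors the server flag above):

```python
# Offline inference sketch on the NPU platform (model path is a placeholder).
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/model", tensor_parallel_size=1, block_size=128)
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["What is an Ascend NPU?"], params)
print(outputs[0].outputs[0].text)
```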
## Architecture
```
vllm (v0.11.0)                       vllm-npu-plugin
┌───────────────────┐               ┌──────────────────────────┐
│ Platform Plugin   │──entry_point──│ register()               │
│ Discovery         │               │  → NPUPlatform           │
├───────────────────┤               ├──────────────────────────┤
│ AttentionBackend  │◄──routing─────│ AscendAttentionBackend   │
│ Interface         │               │ ├─ npu_fusion_attention  │
│                   │               │ └─ npu_incre_flash_attn  │
├───────────────────┤               ├──────────────────────────┤
│ Worker Interface  │◄──worker_cls──│ NPUWorker                │
│                   │               │ ├─ HCCL distributed      │
│                   │               │ └─ NPU memory mgmt       │
└───────────────────┘               └──────────────────────────┘
```
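The routing shown above boils down to picking one kernel per phase. A heavily simplified sketch of that dispatch, assuming a "BSH" layout and showing only the argument subset the plugin is presumed to use (the real backend also handles attention metadata, masking, and the KV cache via `_npu_reshape_and_cache`):

```python
import torch
import torch_npu


def npu_attention(q, k, v, num_heads, scale, is_prefill, atten_mask=None):
    """Illustrative prefill/decode dispatch; not the plugin's actual forward()."""
    if is_prefill:
        # npu_fusion_attention returns a tuple; the attention output is element 0.
        return torch_npu.npu_fusion_attention(
            q, k, v, head_num=num_heads, input_layout="BSH",
            scale=scale, atten_mask=atten_mask)[0]
    # Decode path: single-token queries attend over the cached KV.
    return torch_npu.npu_incre_flash_attention(
        q, k, v, num_heads=num_heads, scale_value=scale, input_layout="BSH")
```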
## Key API References
- **`torch_npu.npu_fusion_attention`** — Fused multi-head attention (prefill)
- **`torch_npu.npu_incre_flash_attention`** — Incremental flash attention (decode)
- **`torch_npu._npu_reshape_and_cache`** — KV cache update
- **`torch_npu.npu_rms_norm`** / `npu_add_rms_norm` — RMS normalization (plain and fused with residual add)
- **`torch_npu.npu_swiglu`** — Fused SiLU + Mul activation
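For reference, a minimal sketch of how the normalization and activation ops are typically called (shapes, dtypes, and `epsilon` here are illustrative, not values taken from this plugin):

```python
import torch
import torch_npu

x = torch.randn(4, 1024, dtype=torch.float16).npu()
weight = torch.ones(1024, dtype=torch.float16).npu()

# npu_rms_norm returns (normalized_output, inverse_rms); most callers keep [0].
y = torch_npu.npu_rms_norm(x, weight, epsilon=1e-6)[0]

# npu_swiglu fuses SiLU(x1) * x2, splitting the given dimension in half.
gate_up = torch.randn(4, 2 * 1024, dtype=torch.float16).npu()
act = torch_npu.npu_swiglu(gate_up, dim=-1)
```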