feat: initial vllm-npu-plugin for Ascend NPU adaptation
- NPUPlatform: device management, HCCL process group, config adaptation
- AscendAttentionBackend: npu_fusion_attention (prefill) + npu_incre_flash_attention (decode)
- NPUCommunicator: HCCL-based distributed communication
- NPUWorker: NPU device init, memory profiling
- Custom ops: SiluAndMul, RMS norm, rotary embedding
- Plugin registered via the vllm.platform_plugins entry point

Based on the official vllm-ascend pattern, targeting Ascend 910B.
README.md (new file, 95 lines)

# vllm-npu-plugin

Ascend NPU platform plugin for vLLM v0.11.0.

## Overview

This package registers as an out-of-tree vLLM platform plugin via the `vllm.platform_plugins` entry-point group, enabling vLLM to run on Huawei Ascend NPU devices.
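
For orientation, the entry point is declared in the package metadata. Below is a minimal sketch of that declaration, written as a `setuptools`-style `setup.py`; the actual repo may declare the same group in `pyproject.toml`, and the `vllm_npu:register` target is an assumption rather than a confirmed path.

```python
# setup.py -- illustrative sketch only; the packaging files in this repo
# may differ. The entry-point *group* is vllm.platform_plugins; the plugin
# name 'npu' matches the verification snippet below, but the
# 'vllm_npu:register' target is an assumption.
from setuptools import find_packages, setup

setup(
    name="vllm-npu-plugin",
    version="0.1.0",
    packages=find_packages(),
    entry_points={
        # vLLM scans this group at startup and calls the referenced
        # callable to decide whether the platform is usable.
        "vllm.platform_plugins": ["npu = vllm_npu:register"],
    },
)
```
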
### Components

| Module | Description |
|---|---|
| `vllm_npu/platform.py` | `NPUPlatform` — device management, attention backend routing, config adaptation |
| `vllm_npu/distributed/communicator.py` | `NPUCommunicator` — HCCL-based distributed communication |
| `vllm_npu/attention/attention_v1.py` | `AscendAttentionBackend` — FlashAttention NPU kernels (prefill + decode) |
| `vllm_npu/worker/worker_v1.py` | `NPUWorker` — NPU device initialization and memory profiling |
| `vllm_npu/ops/` | NPU-optimized ops (SiLU+Mul, RMS norm, rotary embedding) |

## Prerequisites

- **Hardware**: Huawei Ascend 910B/910C or compatible NPU
- **Software**:
  - CANN (Compute Architecture for Neural Networks) 8.0+
  - `torch_npu` matching your PyTorch version
  - vLLM v0.11.0 (installed from source)

## Installation

```bash
# 1. Ensure vLLM v0.11.0 is installed from the feat/ascend-npu-adapt-v0.11.0 branch
cd /path/to/vllm
pip install -e .

# 2. Install this plugin
cd /path/to/vllm_npu_plugin
pip install -e .
```

## Verification

```bash
# Verify plugin is discoverable
python -c "
from vllm.plugins import load_plugins_by_group
plugins = load_plugins_by_group('vllm.platform_plugins')
print('Discovered plugins:', list(plugins.keys()))
assert 'npu' in plugins, 'NPU plugin not found!'
print('NPU plugin registered successfully!')
"

# Verify platform detection (requires NPU hardware)
python -c "
from vllm.platforms import current_platform
print(f'Current platform: {current_platform}')
print(f'Device type: {current_platform.device_type}')
"
```

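
If platform detection fails, it can help to first confirm that `torch_npu` itself sees a device. A minimal sanity check, assuming `torch_npu` is installed (this helper is not part of the plugin):

```python
# check_npu.py -- hypothetical helper script, not shipped with the plugin.
# Importing torch_npu attaches the torch.npu device namespace to torch.
import torch
import torch_npu  # noqa: F401

print("NPU available:", torch.npu.is_available())
print("NPU device count:", torch.npu.device_count())
```
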
## Usage

Once installed, vLLM will automatically detect the NPU platform if Ascend hardware is available:

```bash
# Run inference on NPU
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/model \
    --tensor-parallel-size 1 \
    --block-size 128
```

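
The offline `LLM` API works the same way once the platform is picked up. A minimal sketch, with the model path, sampling settings, and block size as placeholders:

```python
# Offline inference sketch; paths and parameters are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/model", tensor_parallel_size=1, block_size=128)
sampling = SamplingParams(temperature=0.7, max_tokens=64)

for output in llm.generate(["Hello, Ascend NPU!"], sampling):
    print(output.outputs[0].text)
```
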
## Architecture

```
vllm (v0.11.0)                     vllm-npu-plugin
┌──────────────────┐               ┌──────────────────────────┐
│ Platform Plugin  │──entry_point──│ register()               │
│ Discovery        │               │   → NPUPlatform          │
├──────────────────┤               ├──────────────────────────┤
│ AttentionBackend │◄──routing─────│ AscendAttentionBackend   │
│ Interface        │               │  ├─ npu_fusion_attention │
│                  │               │  └─ npu_incre_flash_attn │
├──────────────────┤               ├──────────────────────────┤
│ Worker Interface │◄──worker_cls──│ NPUWorker                │
│                  │               │  ├─ HCCL distributed     │
│                  │               │  └─ NPU memory mgmt      │
└──────────────────┘               └──────────────────────────┘
```

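
The `register()` hook in the diagram is the only symbol vLLM imports eagerly; following the out-of-tree platform-plugin convention, it returns the dotted path of the platform class, or `None` when no NPU is usable. A sketch of that shape (the module layout here is an assumption; see `vllm_npu/platform.py` for the real class):

```python
# vllm_npu/__init__.py -- illustrative sketch of the plugin hook; the
# actual implementation in this repo may perform additional checks.
def register() -> str | None:
    """Return the dotted path of the Platform subclass, or None if the
    Ascend runtime is not importable on this machine."""
    try:
        import torch_npu  # noqa: F401  # presence check only
    except ImportError:
        return None
    return "vllm_npu.platform.NPUPlatform"
```
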
## Key API References

- **`torch_npu.npu_fusion_attention`** — Fused multi-head attention (prefill)
- **`torch_npu.npu_incre_flash_attention`** — Incremental flash attention (decode)
- **`torch_npu._npu_reshape_and_cache`** — KV cache update
- **`torch_npu.npu_rms_norm`** / `npu_add_rms_norm` — RMS layer normalization
- **`torch_npu.npu_swiglu`** — Fused SiLU + Mul activation

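
To make the last two entries concrete, the custom ops can dispatch to these kernels roughly as follows. This is an illustrative sketch, not the code in `vllm_npu/ops/`; the tuple layouts returned by `torch_npu` kernels can vary between versions, so treat the indexing as an assumption:

```python
# Illustrative wrappers around the torch_npu kernels listed above.
# Function names are hypothetical; see vllm_npu/ops/ for the real ops.
import torch
import torch_npu


def silu_and_mul(x: torch.Tensor) -> torch.Tensor:
    # npu_swiglu fuses the SiLU gate and the elementwise multiply
    # over the gate/up projection output.
    return torch_npu.npu_swiglu(x)


def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float) -> torch.Tensor:
    # npu_rms_norm returns a tuple; the first element is the normalized
    # tensor (assumption based on common torch_npu usage).
    return torch_npu.npu_rms_norm(x, weight, eps)[0]
```
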