feat: initial vllm-npu-plugin for Ascend NPU adaptation
- NPUPlatform: device management, HCCL process group, config adaptation
- AscendAttentionBackend: npu_fusion_attention (prefill) + npu_incre_flash_attention (decode)
- NPUCommunicator: HCCL-based distributed communication
- NPUWorker: NPU device init, memory profiling
- Custom ops: SiluAndMul, RMS norm, rotary embedding
- Plugin registered via the vllm.platform_plugins entry point

Based on the official vllm-ascend pattern, targeting Ascend 910B.
README.md (new file, 95 lines)

# vllm-npu-plugin

Ascend NPU platform plugin for vLLM v0.11.0.

## Overview

This package registers as an out-of-tree vLLM platform plugin via the `vllm.platform_plugins` entry-point group, enabling vLLM to run on Huawei Ascend NPU devices.
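
For orientation, the entry point is declared in the package metadata. Below is a minimal sketch of that declaration, written as a `setuptools`-style `setup.py`; the actual repo may declare the same group in `pyproject.toml`, and the `vllm_npu:register` target is an assumption rather than a confirmed path.

```python
# setup.py -- illustrative sketch only; the packaging files in this repo
# may differ. The entry-point *group* is vllm.platform_plugins; the plugin
# name 'npu' matches the verification snippet below, but the
# 'vllm_npu:register' target is an assumption.
from setuptools import find_packages, setup

setup(
    name="vllm-npu-plugin",
    version="0.1.0",
    packages=find_packages(),
    entry_points={
        # vLLM scans this group at startup and calls the referenced
        # callable to decide whether the platform is usable.
        "vllm.platform_plugins": ["npu = vllm_npu:register"],
    },
)
```
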
### Components

| Module | Description |
|---|---|
| `vllm_npu/platform.py` | `NPUPlatform` — device management, attention backend routing, config adaptation |
| `vllm_npu/distributed/communicator.py` | `NPUCommunicator` — HCCL-based distributed communication |
| `vllm_npu/attention/attention_v1.py` | `AscendAttentionBackend` — FlashAttention NPU kernels (prefill + decode) |
| `vllm_npu/worker/worker_v1.py` | `NPUWorker` — NPU device initialization and memory profiling |
| `vllm_npu/ops/` | NPU-optimized ops (SiLU+Mul, RMS norm, rotary embedding) |

## Prerequisites

- **Hardware**: Huawei Ascend 910B/910C or compatible NPU
- **Software**:
  - CANN (Compute Architecture for Neural Networks) 8.0+
  - `torch_npu` matching your PyTorch version
  - vLLM v0.11.0 (installed from source)

## Installation

```bash
# 1. Ensure vLLM v0.11.0 is installed from the feat/ascend-npu-adapt-v0.11.0 branch
cd /path/to/vllm
pip install -e .

# 2. Install this plugin
cd /path/to/vllm_npu_plugin
pip install -e .
```

## Verification

```bash
# Verify plugin is discoverable
python -c "
from vllm.plugins import load_plugins_by_group
plugins = load_plugins_by_group('vllm.platform_plugins')
print('Discovered plugins:', list(plugins.keys()))
assert 'npu' in plugins, 'NPU plugin not found!'
print('NPU plugin registered successfully!')
"

# Verify platform detection (requires NPU hardware)
python -c "
from vllm.platforms import current_platform
print(f'Current platform: {current_platform}')
print(f'Device type: {current_platform.device_type}')
"
```

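
If platform detection fails, it can help to first confirm that `torch_npu` itself sees a device. A minimal sanity check, assuming `torch_npu` is installed (this helper is not part of the plugin):

```python
# check_npu.py -- hypothetical helper script, not shipped with the plugin.
# Importing torch_npu attaches the torch.npu device namespace to torch.
import torch
import torch_npu  # noqa: F401

print("NPU available:", torch.npu.is_available())
print("NPU device count:", torch.npu.device_count())
```
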
## Usage

Once installed, vLLM will automatically detect the NPU platform if Ascend hardware is available:

```bash
# Run inference on NPU
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/model \
    --tensor-parallel-size 1 \
    --block-size 128
```

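
The offline `LLM` API works the same way once the platform is picked up. A minimal sketch, with the model path, sampling settings, and block size as placeholders:

```python
# Offline inference sketch; paths and parameters are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/model", tensor_parallel_size=1, block_size=128)
sampling = SamplingParams(temperature=0.7, max_tokens=64)

for output in llm.generate(["Hello, Ascend NPU!"], sampling):
    print(output.outputs[0].text)
```
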
## Architecture

```
vllm (v0.11.0)                     vllm-npu-plugin
┌──────────────────┐               ┌──────────────────────────┐
│ Platform Plugin  │──entry_point──│ register()               │
│ Discovery        │               │   → NPUPlatform          │
├──────────────────┤               ├──────────────────────────┤
│ AttentionBackend │◄──routing─────│ AscendAttentionBackend   │
│ Interface        │               │  ├─ npu_fusion_attention │
│                  │               │  └─ npu_incre_flash_attn │
├──────────────────┤               ├──────────────────────────┤
│ Worker Interface │◄──worker_cls──│ NPUWorker                │
│                  │               │  ├─ HCCL distributed     │
│                  │               │  └─ NPU memory mgmt      │
└──────────────────┘               └──────────────────────────┘
```

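
The `register()` hook in the diagram is the only symbol vLLM imports eagerly; following the out-of-tree platform-plugin convention, it returns the dotted path of the platform class, or `None` when no NPU is usable. A sketch of that shape (the module layout here is an assumption; see `vllm_npu/platform.py` for the real class):

```python
# vllm_npu/__init__.py -- illustrative sketch of the plugin hook; the
# actual implementation in this repo may perform additional checks.
def register() -> str | None:
    """Return the dotted path of the Platform subclass, or None if the
    Ascend runtime is not importable on this machine."""
    try:
        import torch_npu  # noqa: F401  # presence check only
    except ImportError:
        return None
    return "vllm_npu.platform.NPUPlatform"
```
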
## Key API References

- **`torch_npu.npu_fusion_attention`** — Fused multi-head attention (prefill)
- **`torch_npu.npu_incre_flash_attention`** — Incremental flash attention (decode)
- **`torch_npu._npu_reshape_and_cache`** — KV cache update
- **`torch_npu.npu_rms_norm`** / `npu_add_rms_norm` — RMS layer normalization
- **`torch_npu.npu_swiglu`** — Fused SiLU + Mul activation

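
To make the last two entries concrete, the custom ops can dispatch to these kernels roughly as follows. This is an illustrative sketch, not the code in `vllm_npu/ops/`; the tuple layouts returned by `torch_npu` kernels can vary between versions, so treat the indexing as an assumption:

```python
# Illustrative wrappers around the torch_npu kernels listed above.
# Function names are hypothetical; see vllm_npu/ops/ for the real ops.
import torch
import torch_npu


def silu_and_mul(x: torch.Tensor) -> torch.Tensor:
    # npu_swiglu fuses the SiLU gate and the elementwise multiply
    # over the gate/up projection output.
    return torch_npu.npu_swiglu(x)


def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float) -> torch.Tensor:
    # npu_rms_norm returns a tuple; the first element is the normalized
    # tensor (assumption based on common torch_npu usage).
    return torch_npu.npu_rms_norm(x, weight, eps)[0]
```
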