# vllm-npu-plugin

Ascend NPU platform plugin for vLLM v0.11.0.

## Overview
This package registers as an out-of-tree vLLM platform plugin via the `vllm.platform_plugins` entry-point group, enabling vLLM to run on Huawei Ascend NPU devices.
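Concretely, the entry point maps a name (here `npu`) to a function that returns the dotted path of the platform class. Below is a minimal sketch of that contract, assuming the packaging declaration and function name follow vLLM's usual plugin convention; the exact names in this repo may differ.

```python
# Sketch of the vllm.platform_plugins contract (illustrative, not this repo's exact source).
#
# Assumed pyproject.toml declaration:
#   [project.entry-points."vllm.platform_plugins"]
#   npu = "vllm_npu:register"

def register() -> str:
    """Called by vLLM during plugin discovery.

    Returning the dotted path instead of importing the class eagerly lets vLLM
    defer loading NPU libraries until the platform is actually selected.
    """
    return "vllm_npu.platform.NPUPlatform"
```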
## Components

| Module | Description |
|---|---|
| `vllm_npu/platform.py` | `NPUPlatform` — device management, attention backend routing, config adaptation |
| `vllm_npu/distributed/communicator.py` | `NPUCommunicator` — HCCL-based distributed communication |
| `vllm_npu/attention/attention_v1.py` | `AscendAttentionBackend` — FlashAttention NPU kernels (prefill + decode) |
| `vllm_npu/worker/worker_v1.py` | `NPUWorker` — NPU device initialization and memory profiling |
| `vllm_npu/ops/` | NPU-optimized ops (SiLU+Mul, RMS norm, rotary embedding) |
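As a quick sanity check, the components listed above can be imported directly. This is a minimal sketch that takes the module paths and class names from the table, so adjust it if the package layout differs.

```python
# Import smoke test for the plugin components (paths and class names from the table above).
from vllm_npu.platform import NPUPlatform
from vllm_npu.distributed.communicator import NPUCommunicator
from vllm_npu.attention.attention_v1 import AscendAttentionBackend
from vllm_npu.worker.worker_v1 import NPUWorker

print(NPUPlatform, NPUCommunicator, AscendAttentionBackend, NPUWorker)
```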
## Prerequisites

- Hardware: Huawei Ascend 910B/910C or compatible NPU
- Software:
  - CANN (Compute Architecture for Neural Networks) 8.0+
  - `torch_npu` matching your PyTorch version (a quick check is sketched below)
  - vLLM v0.11.0 (installed from source)
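Before installing, you can confirm that PyTorch can see the Ascend stack. This is a minimal sketch that assumes importing `torch_npu` registers the `torch.npu` device namespace, which is its usual behavior.

```python
# Quick environment check: torch + torch_npu + NPU visibility.
import torch
import torch_npu  # noqa: F401  # the import side effect registers the NPU backend

print("torch version:", torch.__version__)
print("NPU available:", torch.npu.is_available())
print("NPU device count:", torch.npu.device_count())
```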
## Installation

```bash
# 1. Ensure vLLM v0.11.0 is installed from the feat/ascend-npu-adapt-v0.11.0 branch
cd /path/to/vllm
pip install -e .

# 2. Install this plugin
cd /path/to/vllm_npu_plugin
pip install -e .
```
## Verification

```bash
# Verify plugin is discoverable
python -c "
from vllm.plugins import load_plugins_by_group
plugins = load_plugins_by_group('vllm.platform_plugins')
print('Discovered plugins:', list(plugins.keys()))
assert 'npu' in plugins, 'NPU plugin not found!'
print('NPU plugin registered successfully!')
"

# Verify platform detection (requires NPU hardware)
python -c "
from vllm.platforms import current_platform
print(f'Current platform: {current_platform}')
print(f'Device type: {current_platform.device_type}')
"
```
## Usage

Once installed, vLLM will automatically detect the NPU platform if Ascend hardware is available:

```bash
# Run inference on NPU
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/model \
    --tensor-parallel-size 1 \
    --block-size 128
```
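For quick experiments without the HTTP server, the same automatic platform selection applies to vLLM's offline `LLM` API. The sketch below uses a placeholder model path and sampling settings; only the block size mirrors the server example above.

```python
# Offline inference sketch; the NPU platform is picked up automatically via the
# plugin's entry point when Ascend hardware is present.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/model",   # placeholder: any local or Hugging Face model path
    tensor_parallel_size=1,
    block_size=128,           # same block size as the server example
)
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Hello from Ascend NPU!"], params)
print(outputs[0].outputs[0].text)
```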
## Architecture

```
  vllm (v0.11.0)                       vllm-npu-plugin
┌──────────────────┐               ┌─────────────────────────┐
│ Platform Plugin  │──entry_point──│ register()              │
│ Discovery        │               │ → NPUPlatform           │
├──────────────────┤               ├─────────────────────────┤
│ AttentionBackend │◄──routing─────│ AscendAttentionBackend  │
│ Interface        │               │ ├─ npu_fusion_attention │
│                  │               │ └─ npu_incre_flash_attn │
├──────────────────┤               ├─────────────────────────┤
│ Worker Interface │◄──worker_cls──│ NPUWorker               │
│                  │               │ ├─ HCCL distributed     │
│                  │               │ └─ NPU memory mgmt      │
└──────────────────┘               └─────────────────────────┘
```
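In code terms, the platform class is what turns the entry point into the routing shown above: it tells vLLM which worker and attention backend classes to load. The sketch below is illustrative only, assuming vLLM's out-of-tree `Platform` interface (`PlatformEnum.OOT`, `check_and_update_config`, `get_attn_backend_cls`); the hooks actually overridden in `vllm_npu/platform.py` may differ.

```python
# Illustrative wiring sketch, not this repo's actual source.
from vllm.platforms.interface import Platform, PlatformEnum


class NPUPlatformSketch(Platform):
    _enum = PlatformEnum.OOT   # out-of-tree platform
    device_name = "npu"
    device_type = "npu"

    @classmethod
    def check_and_update_config(cls, vllm_config) -> None:
        # Point vLLM's engine at the plugin's worker implementation.
        vllm_config.parallel_config.worker_cls = "vllm_npu.worker.worker_v1.NPUWorker"

    @classmethod
    def get_attn_backend_cls(cls, *args, **kwargs) -> str:
        # Route attention to the Ascend backend (signature elided; it varies by vLLM version).
        return "vllm_npu.attention.attention_v1.AscendAttentionBackend"
```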
## Key API References

- `torch_npu.npu_fusion_attention` — Fused multi-head attention (prefill)
- `torch_npu.npu_incre_flash_attention` — Incremental flash attention (decode)
- `torch_npu._npu_reshape_and_cache` — KV cache update
- `torch_npu.npu_rms_norm` / `torch_npu.npu_add_rms_norm` — Layer normalization
- `torch_npu.npu_swiglu` — Fused SiLU + Mul activation
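To make the mapping to standard PyTorch concrete, here is a hedged comparison for the fused SwiGLU kernel. It assumes `npu_swiglu(x)` splits its input in half along the last dimension and computes `silu(a) * b`; run it on an NPU host to confirm the semantics of your `torch_npu` build.

```python
# Compare torch_npu.npu_swiglu against a plain PyTorch reference (SiLU + Mul).
# Assumption: npu_swiglu(x) splits x in half on the last dim and returns silu(a) * b.
import torch
import torch.nn.functional as F
import torch_npu

x = torch.randn(4, 2048, dtype=torch.float16, device="npu")
a, b = x.chunk(2, dim=-1)

reference = F.silu(a) * b           # eager PyTorch semantics
fused = torch_npu.npu_swiglu(x)     # fused NPU kernel (assumed signature)

print("max abs diff:", (reference - fused).abs().max().item())
```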