vllm-npu-plugin

Ascend NPU platform plugin for vLLM v0.11.0.

Overview

This package registers as an out-of-tree vLLM platform plugin via the vllm.platform_plugins entry-point group, enabling vLLM to run on Huawei Ascend NPU devices.
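Concretely, the entry point declared in the package metadata resolves to a register() function that tells vLLM which Platform class to load. Below is a minimal sketch of that hook; the exact module layout and guard logic in this repo may differ.

# Declared in pyproject.toml:
#   [project.entry-points."vllm.platform_plugins"]
#   npu = "vllm_npu:register"

# vllm_npu/__init__.py (illustrative sketch)
def register() -> str | None:
    """vLLM calls every registered platform plugin at startup; returning the
    fully qualified class name selects this platform, returning None opts out
    (e.g. when no Ascend runtime is present)."""
    try:
        import torch_npu  # noqa: F401  # fails fast if the NPU stack is missing
    except ImportError:
        return None
    return "vllm_npu.platform.NPUPlatform"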

Components

Module                                Description
vllm_npu/platform.py                  NPUPlatform — device management, attention backend routing, config adaptation
vllm_npu/distributed/communicator.py  NPUCommunicator — HCCL-based distributed communication
vllm_npu/attention/attention_v1.py    AscendAttentionBackend — FlashAttention NPU kernels (prefill + decode)
vllm_npu/worker/worker_v1.py          NPUWorker — NPU device initialization and memory profiling
vllm_npu/ops/                         NPU-optimized ops (SiLU+Mul, RMS norm, rotary embedding)

Prerequisites

  • Hardware: Huawei Ascend 910B/910C or compatible NPU
  • Software:
    • CANN (Compute Architecture for Neural Networks) 8.0+
    • torch_npu matching your PyTorch version
    • vLLM v0.11.0 (installed from source)
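
A quick way to confirm the NPU software stack itself is usable before involving vLLM (assumes a standard torch_npu installation, which exposes the torch.npu namespace):

python -c "
import torch
import torch_npu  # registers the 'npu' device backend with PyTorch
print('torch:', torch.__version__)
print('torch_npu:', torch_npu.__version__)
print('NPU available:', torch.npu.is_available())
print('NPU count:', torch.npu.device_count())
"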

Installation

# 1. Install vLLM v0.11.0 from source, with the feat/ascend-npu-adapt-v0.11.0 branch checked out
cd /path/to/vllm
pip install -e .

# 2. Install this plugin
cd /path/to/vllm_npu_plugin
pip install -e .

Verification

# Verify plugin is discoverable
python -c "
from vllm.plugins import load_plugins_by_group
plugins = load_plugins_by_group('vllm.platform_plugins')
print('Discovered plugins:', list(plugins.keys()))
assert 'npu' in plugins, 'NPU plugin not found!'
print('NPU plugin registered successfully!')
"

# Verify platform detection (requires NPU hardware)
python -c "
from vllm.platforms import current_platform
print(f'Current platform: {current_platform}')
print(f'Device type: {current_platform.device_type}')
"

Usage

Once installed, vLLM will automatically detect the NPU platform if Ascend hardware is available:

# Run inference on NPU
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/model \
    --tensor-parallel-size 1 \
    --block-size 128
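
The offline Python API works the same way; a minimal example follows (the model path and sampling settings are placeholders, and block_size is forwarded to the engine just like the CLI flag above):

# Offline inference on NPU via the Python API
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/model", tensor_parallel_size=1, block_size=128)
outputs = llm.generate(
    ["Explain what an NPU is in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=64),
)
print(outputs[0].outputs[0].text)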

Architecture

vllm (v0.11.0)                         vllm-npu-plugin
┌──────────────────┐                 ┌──────────────────────────┐
│ Platform Plugin  │──entry_point────│ register()               │
│ Discovery        │                 │   → NPUPlatform          │
├──────────────────┤                 ├──────────────────────────┤
│ AttentionBackend │◄────routing─────│ AscendAttentionBackend   │
│ Interface        │                 │  ├─ npu_fusion_attention │
│                  │                 │  └─ npu_incre_flash_attn │
├──────────────────┤                 ├──────────────────────────┤
│ Worker Interface │◄───worker_cls───│ NPUWorker                │
│                  │                 │  ├─ HCCL distributed     │
│                  │                 │  └─ NPU memory mgmt      │
└──────────────────┘                 └──────────────────────────┘
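
In code, the routing shown above amounts to the platform class advertising, by fully qualified name, which attention backend and worker vLLM should load. The sketch below follows the shape of vLLM's Platform interface (method signatures abbreviated); the actual bodies in vllm_npu/platform.py may differ.

# vllm_npu/platform.py (simplified, illustrative sketch)
from vllm.platforms import Platform, PlatformEnum

class NPUPlatform(Platform):
    _enum = PlatformEnum.OOT        # out-of-tree platform
    device_name = "npu"
    device_type = "npu"
    dispatch_key = "PrivateUse1"    # dispatch key used by torch_npu

    @classmethod
    def get_attn_backend_cls(cls, *args, **kwargs) -> str:
        # Route vLLM's attention interface to the Ascend backend.
        return "vllm_npu.attention.attention_v1.AscendAttentionBackend"

    @classmethod
    def check_and_update_config(cls, vllm_config) -> None:
        # Point the engine at the NPU worker (HCCL setup happens inside it).
        vllm_config.parallel_config.worker_cls = "vllm_npu.worker.worker_v1.NPUWorker"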

Key API References

  • torch_npu.npu_fusion_attention — Fused multi-head attention (prefill)
  • torch_npu.npu_incre_flash_attention — Incremental flash attention (decode)
  • torch_npu._npu_reshape_and_cache — KV cache update
  • torch_npu.npu_rms_norm / npu_add_rms_norm — RMS normalization (plain and fused with residual add)
  • torch_npu.npu_swiglu — Fused SiLU + Mul activation
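
For orientation, the wrappers in vllm_npu/ops/ call these kernels roughly as follows. This is a sketch based on common torch_npu usage, not the exact code in this repo; the tuple unpacking of npu_rms_norm is the detail most worth double-checking.

import torch
import torch_npu

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # npu_rms_norm returns a tuple; the first element is the normalized output.
    return torch_npu.npu_rms_norm(x, weight, eps)[0]

def silu_and_mul(x: torch.Tensor) -> torch.Tensor:
    # npu_swiglu fuses SiLU(gate) * up over the last dimension in one kernel.
    return torch_npu.npu_swiglu(x)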