vllm-npu-plugin

Ascend NPU platform plugin for vLLM v0.11.0.

Overview

This package registers as an out-of-tree vLLM platform plugin via the vllm.platform_plugins entry-point group, enabling vLLM to run on Huawei Ascend NPU devices.
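Concretely, the entry point declared in the package metadata resolves to a register() function that tells vLLM which Platform class to load. Below is a minimal sketch of that hook; the exact module layout and guard logic in this repo may differ.

# Declared in pyproject.toml:
#   [project.entry-points."vllm.platform_plugins"]
#   npu = "vllm_npu:register"

# vllm_npu/__init__.py (illustrative sketch)
def register() -> str | None:
    """vLLM calls every registered platform plugin at startup; returning the
    fully qualified class name selects this platform, returning None opts out
    (e.g. when no Ascend runtime is present)."""
    try:
        import torch_npu  # noqa: F401  # fails fast if the NPU stack is missing
    except ImportError:
        return None
    return "vllm_npu.platform.NPUPlatform"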

Components

Module                                Description
vllm_npu/platform.py                  NPUPlatform — device management, attention backend routing, config adaptation
vllm_npu/distributed/communicator.py  NPUCommunicator — HCCL-based distributed communication
vllm_npu/attention/attention_v1.py    AscendAttentionBackend — FlashAttention NPU kernels (prefill + decode)
vllm_npu/worker/worker_v1.py          NPUWorker — NPU device initialization and memory profiling
vllm_npu/ops/                         NPU-optimized ops (SiLU+Mul, RMS norm, rotary embedding)

Prerequisites

  • Hardware: Huawei Ascend 910B/910C or compatible NPU
  • Software:
    • CANN (Compute Architecture for Neural Networks) 8.0+
    • torch_npu matching your PyTorch version
    • vLLM v0.11.0 (installed from source)
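
A quick way to confirm the NPU software stack itself is usable before involving vLLM (assumes a standard torch_npu installation, which exposes the torch.npu namespace):

python -c "
import torch
import torch_npu  # registers the 'npu' device backend with PyTorch
print('torch:', torch.__version__)
print('torch_npu:', torch_npu.__version__)
print('NPU available:', torch.npu.is_available())
print('NPU count:', torch.npu.device_count())
"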

Installation

# 1. Install vLLM v0.11.0 from source, with the feat/ascend-npu-adapt-v0.11.0 branch checked out
cd /path/to/vllm
pip install -e .

# 2. Install this plugin
cd /path/to/vllm_npu_plugin
pip install -e .

Verification

# Verify plugin is discoverable
python -c "
from vllm.plugins import load_plugins_by_group
plugins = load_plugins_by_group('vllm.platform_plugins')
print('Discovered plugins:', list(plugins.keys()))
assert 'npu' in plugins, 'NPU plugin not found!'
print('NPU plugin registered successfully!')
"

# Verify platform detection (requires NPU hardware)
python -c "
from vllm.platforms import current_platform
print(f'Current platform: {current_platform}')
print(f'Device type: {current_platform.device_type}')
"

Usage

Once installed, vLLM will automatically detect the NPU platform if Ascend hardware is available:

# Run inference on NPU
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/model \
    --tensor-parallel-size 1 \
    --block-size 128
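
The offline Python API works the same way; a minimal example follows (the model path and sampling settings are placeholders, and block_size is forwarded to the engine just like the CLI flag above):

# Offline inference on NPU via the Python API
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/model", tensor_parallel_size=1, block_size=128)
outputs = llm.generate(
    ["Explain what an NPU is in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=64),
)
print(outputs[0].outputs[0].text)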

Architecture

vllm (v0.11.0)                         vllm-npu-plugin
┌──────────────────┐                 ┌──────────────────────────┐
│ Platform Plugin  │──entry_point────│ register()               │
│ Discovery        │                 │   → NPUPlatform          │
├──────────────────┤                 ├──────────────────────────┤
│ AttentionBackend │◄────routing─────│ AscendAttentionBackend   │
│ Interface        │                 │  ├─ npu_fusion_attention │
│                  │                 │  └─ npu_incre_flash_attn │
├──────────────────┤                 ├──────────────────────────┤
│ Worker Interface │◄───worker_cls───│ NPUWorker                │
│                  │                 │  ├─ HCCL distributed     │
│                  │                 │  └─ NPU memory mgmt      │
└──────────────────┘                 └──────────────────────────┘
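
In code, the routing shown above amounts to the platform class advertising, by fully qualified name, which attention backend and worker vLLM should load. The sketch below follows the shape of vLLM's Platform interface (method signatures abbreviated); the actual bodies in vllm_npu/platform.py may differ.

# vllm_npu/platform.py (simplified, illustrative sketch)
from vllm.platforms import Platform, PlatformEnum

class NPUPlatform(Platform):
    _enum = PlatformEnum.OOT        # out-of-tree platform
    device_name = "npu"
    device_type = "npu"
    dispatch_key = "PrivateUse1"    # dispatch key used by torch_npu

    @classmethod
    def get_attn_backend_cls(cls, *args, **kwargs) -> str:
        # Route vLLM's attention interface to the Ascend backend.
        return "vllm_npu.attention.attention_v1.AscendAttentionBackend"

    @classmethod
    def check_and_update_config(cls, vllm_config) -> None:
        # Point the engine at the NPU worker (HCCL setup happens inside it).
        vllm_config.parallel_config.worker_cls = "vllm_npu.worker.worker_v1.NPUWorker"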

Key API References

  • torch_npu.npu_fusion_attention — Fused multi-head attention (prefill)
  • torch_npu.npu_incre_flash_attention — Incremental flash attention (decode)
  • torch_npu._npu_reshape_and_cache — KV cache update
  • torch_npu.npu_rms_norm / npu_add_rms_norm — RMS normalization (plain and fused with residual add)
  • torch_npu.npu_swiglu — Fused SiLU + Mul activation
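
For orientation, the wrappers in vllm_npu/ops/ call these kernels roughly as follows. This is a sketch based on common torch_npu usage, not the exact code in this repo; the tuple unpacking of npu_rms_norm is the detail most worth double-checking.

import torch
import torch_npu

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # npu_rms_norm returns a tuple; the first element is the normalized output.
    return torch_npu.npu_rms_norm(x, weight, eps)[0]

def silu_and_mul(x: torch.Tensor) -> torch.Tensor:
    # npu_swiglu fuses SiLU(gate) * up over the last dimension in one kernel.
    return torch_npu.npu_swiglu(x)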