# vllm-npu-plugin

Ascend NPU platform plugin for vLLM v0.11.0.

## Overview

This package registers as an out-of-tree vLLM platform plugin via the `vllm.platform_plugins` entry-point group, enabling vLLM to run on Huawei Ascend NPU devices.

### Components

| Module | Description |
|---|---|
| `vllm_npu/platform.py` | `NPUPlatform` — device management, attention backend routing, config adaptation |
| `vllm_npu/distributed/communicator.py` | `NPUCommunicator` — HCCL-based distributed communication |
| `vllm_npu/attention/attention_v1.py` | `AscendAttentionBackend` — FlashAttention NPU kernels (prefill + decode) |
| `vllm_npu/worker/worker_v1.py` | `NPUWorker` — NPU device initialization and memory profiling |
| `vllm_npu/ops/` | NPU-optimized ops (SiLU+Mul, RMS norm, rotary embedding) |

## Prerequisites

- **Hardware**: Huawei Ascend 910B/910C or compatible NPU
- **Software**:
  - CANN (Compute Architecture for Neural Networks) 8.0+
  - `torch_npu` matching your PyTorch version
  - vLLM v0.11.0 (installed from source)

## Installation

```bash
# 1. Ensure vLLM v0.11.0 is installed from the feat/ascend-npu-adapt-v0.11.0 branch
cd /path/to/vllm
pip install -e .

# 2. Install this plugin
cd /path/to/vllm_npu_plugin
pip install -e .
```

## Verification

```bash
# Verify the plugin is discoverable
python -c "
from vllm.plugins import load_plugins_by_group
plugins = load_plugins_by_group('vllm.platform_plugins')
print('Discovered plugins:', list(plugins.keys()))
assert 'npu' in plugins, 'NPU plugin not found!'
print('NPU plugin registered successfully!')
"

# Verify platform detection (requires NPU hardware)
python -c "
from vllm.platforms import current_platform
print(f'Current platform: {current_platform}')
print(f'Device type: {current_platform.device_type}')
"
```

## Usage

Once installed, vLLM automatically detects the NPU platform when Ascend hardware is available:

```bash
# Run inference on NPU
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/model \
    --tensor-parallel-size 1 \
    --block-size 128
```

## Architecture

```
   vllm (v0.11.0)                      vllm-npu-plugin
┌──────────────────┐               ┌─────────────────────────┐
│ Platform Plugin  │──entry_point──│ register()              │
│ Discovery        │               │  → NPUPlatform          │
├──────────────────┤               ├─────────────────────────┤
│ AttentionBackend │◄──routing─────│ AscendAttentionBackend  │
│ Interface        │               │ ├─ npu_fusion_attention │
│                  │               │ └─ npu_incre_flash_attn │
├──────────────────┤               ├─────────────────────────┤
│ Worker Interface │◄──worker_cls──│ NPUWorker               │
│                  │               │ ├─ HCCL distributed     │
│                  │               │ └─ NPU memory mgmt      │
└──────────────────┘               └─────────────────────────┘
```

## Key API References

- **`torch_npu.npu_fusion_attention`** — Fused multi-head attention (prefill)
- **`torch_npu.npu_incre_flash_attention`** — Incremental flash attention (decode)
- **`torch_npu._npu_reshape_and_cache`** — KV cache update
- **`torch_npu.npu_rms_norm`** / `npu_add_rms_norm` — Layer normalization
- **`torch_npu.npu_swiglu`** — Fused SiLU + Mul activation
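
For reference, the `register()` hook in the architecture diagram is the function exposed under the `vllm.platform_plugins` entry-point group described in the Overview. The sketch below is illustrative only: the module layout (`vllm_npu/__init__.py`) and the exact qualified name it returns are assumptions based on the component table, not a copy of the plugin's source.

```python
# Illustrative sketch of the plugin entry point (assumed layout: vllm_npu/__init__.py).
# Packaging metadata would point the "npu" plugin name at this function, e.g.:
#
#   [project.entry-points."vllm.platform_plugins"]
#   npu = "vllm_npu:register"
#
# vLLM calls each function registered in this group; a platform plugin returns
# the fully qualified name of its Platform class (or None to opt out).

def register() -> str:
    # Returning a string rather than the class itself means torch_npu does not
    # need to be importable during plugin discovery.
    return "vllm_npu.platform.NPUPlatform"
```

The Verification snippet above relies on exactly this mapping: `load_plugins_by_group('vllm.platform_plugins')` returns the discovered entry points keyed by plugin name, which is why the assertion checks for `'npu'`.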
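
The API-server command in the Usage section can also be exercised through vLLM's offline `LLM` API. The snippet below is a minimal sketch mirroring the same flags; the model path and prompt are placeholders, and the sampling parameters are arbitrary.

```python
# Minimal offline-inference sketch on the NPU platform.
# "/path/to/model" is a placeholder; block_size mirrors --block-size 128 above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/model",
    tensor_parallel_size=1,
    block_size=128,
)

sampling = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Describe the Huawei Ascend NPU in one sentence."], sampling)
for output in outputs:
    print(output.outputs[0].text)
```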