# vllm-npu-plugin

Ascend NPU platform plugin for vLLM v0.11.0.

## Overview

This package registers as an out-of-tree vLLM platform plugin via the `vllm.platform_plugins` entry-point group, enabling vLLM to run on Huawei Ascend NPU devices.

### Components

| Module | Description |
|---|---|
| `vllm_npu/platform.py` | `NPUPlatform` — device management, attention backend routing, config adaptation |
| `vllm_npu/distributed/communicator.py` | `NPUCommunicator` — HCCL-based distributed communication |
| `vllm_npu/attention/attention_v1.py` | `AscendAttentionBackend` — FlashAttention NPU kernels (prefill + decode) |
| `vllm_npu/worker/worker_v1.py` | `NPUWorker` — NPU device initialization and memory profiling |
| `vllm_npu/ops/` | NPU-optimized ops (SiLU+Mul, RMS norm, rotary embedding) |

## Prerequisites

- **Hardware**: Huawei Ascend 910B/910C or compatible NPU
- **Software**:
  - CANN (Compute Architecture for Neural Networks) 8.0+
  - `torch_npu` matching your PyTorch version
  - vLLM v0.11.0 (installed from source)

## Installation

```bash
# 1. Ensure vLLM v0.11.0 is installed from the feat/ascend-npu-adapt-v0.11.0 branch
cd /path/to/vllm
pip install -e .

# 2. Install this plugin
cd /path/to/vllm_npu_plugin
pip install -e .
```

## Verification

```bash
# Verify the plugin is discoverable
python -c "
from vllm.plugins import load_plugins_by_group
plugins = load_plugins_by_group('vllm.platform_plugins')
print('Discovered plugins:', list(plugins.keys()))
assert 'npu' in plugins, 'NPU plugin not found!'
print('NPU plugin registered successfully!')
"

# Verify platform detection (requires NPU hardware)
python -c "
from vllm.platforms import current_platform
print(f'Current platform: {current_platform}')
print(f'Device type: {current_platform.device_type}')
"
```

## Usage

Once installed, vLLM automatically detects the NPU platform when Ascend hardware is available:

```bash
# Run inference on NPU
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/model \
    --tensor-parallel-size 1 \
    --block-size 128
```

## Architecture

```
   vllm (v0.11.0)                      vllm-npu-plugin
┌──────────────────┐               ┌─────────────────────────┐
│ Platform Plugin  │──entry_point──│ register()              │
│ Discovery        │               │  → NPUPlatform          │
├──────────────────┤               ├─────────────────────────┤
│ AttentionBackend │◄──routing─────│ AscendAttentionBackend  │
│ Interface        │               │ ├─ npu_fusion_attention │
│                  │               │ └─ npu_incre_flash_attn │
├──────────────────┤               ├─────────────────────────┤
│ Worker Interface │◄──worker_cls──│ NPUWorker               │
│                  │               │ ├─ HCCL distributed     │
│                  │               │ └─ NPU memory mgmt      │
└──────────────────┘               └─────────────────────────┘
```

## Key API References

- **`torch_npu.npu_fusion_attention`** — Fused multi-head attention (prefill)
- **`torch_npu.npu_incre_flash_attention`** — Incremental flash attention (decode)
- **`torch_npu._npu_reshape_and_cache`** — KV cache update
- **`torch_npu.npu_rms_norm`** / `npu_add_rms_norm` — Layer normalization
- **`torch_npu.npu_swiglu`** — Fused SiLU + Mul activation
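
For reference, the `register()` hook in the architecture diagram is the function exposed under the `vllm.platform_plugins` entry-point group described in the Overview. The sketch below is illustrative only: the module layout (`vllm_npu/__init__.py`) and the exact qualified name it returns are assumptions based on the component table, not a copy of the plugin's source.

```python
# Illustrative sketch of the plugin entry point (assumed layout: vllm_npu/__init__.py).
# Packaging metadata would point the "npu" plugin name at this function, e.g.:
#
#   [project.entry-points."vllm.platform_plugins"]
#   npu = "vllm_npu:register"
#
# vLLM calls each function registered in this group; a platform plugin returns
# the fully qualified name of its Platform class (or None to opt out).

def register() -> str:
    # Returning a string rather than the class itself means torch_npu does not
    # need to be importable during plugin discovery.
    return "vllm_npu.platform.NPUPlatform"
```

The Verification snippet above relies on exactly this mapping: `load_plugins_by_group('vllm.platform_plugins')` returns the discovered entry points keyed by plugin name, which is why the assertion checks for `'npu'`.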
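
The API-server command in the Usage section can also be exercised through vLLM's offline `LLM` API. The snippet below is a minimal sketch mirroring the same flags; the model path and prompt are placeholders, and the sampling parameters are arbitrary.

```python
# Minimal offline-inference sketch on the NPU platform.
# "/path/to/model" is a placeholder; block_size mirrors --block-size 128 above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/model",
    tensor_parallel_size=1,
    block_size=128,
)

sampling = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Describe the Huawei Ascend NPU in one sentence."], sampling)
for output in outputs:
    print(output.outputs[0].text)
```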