title	Heterogeneous PD Disaggregation on CUDA+Gaudi Multi‑Node System

Heterogeneous PD Disaggregation on CUDA+Gaudi Multi‑Node System

Overview

PD Disaggregation splits model execution into prefill and decode stages. This separation allows for heterogeneous deployments, with different devices for prefill and decode stages.

Prerequisites

Docker containers must run with --privileged --net=host --ipc=host to expose InfiniBand devices and UCX transports.
Tested images:
- CUDA node: nvidia/cuda:12.8.1-cudnn-devel-ubuntu24.04
- Gaudi node: vault.habana.ai/gaudi-docker/1.22.2/ubuntu24.04/habanalabs/pytorch-installer-2.7.1
Each node requires vLLM; install vLLM Gaudi only on the Gaudi node.

Installation

The following installation is supported with vLLM and vLLM Gaudi versions:

vLLM commit hash / branch	vLLM-Gaudi commit hash / branch
`f21f5ea38c6fa0e824bc00d5762d17e049199cd3`	`e26948506df2ec861fd5280f0c1eefdb97a5d956`

CUDA Node Installation

# Install prerequisites
apt install -y python3.12 python3-pip git

# Install default NIXL (comes with UCX)
pip3 install nixl[cu12]

# Install vLLM from source
git clone https://github.com/vllm-project/vllm
cd vllm
git checkout $vllm_commit
pip3 install -e .

Gaudi Node Installation

# Install NIXL with UCX support
curl -O https://raw.githubusercontent.com/intel-staging/ucx/refs/heads/intel_gaudi_gdr_enabling_0/setup_nixl_ucx.sh
chmod +x setup_nixl_ucx.sh
./setup_nixl_ucx.sh

# Necessary environment variables for NIXL
export LD_LIBRARY_PATH=/tmp/ucx_install/lib:/opt/nvidia/nvda_nixl/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
export UCX_MEMTYPE_CACHE=0

# Install vLLM from source
git clone https://github.com/vllm-project/vllm
cd vllm && git checkout $vllm_commit
pip install -r <(sed '/^torch/d' requirements/build.txt)
VLLM_TARGET_DEVICE=empty pip install --no-build-isolation -e .

# Install vLLM Gaudi from source
git clone https://github.com/vllm-project/vllm-gaudi
cd vllm-gaudi && git checkout $vllm_gaudi_commit
pip install -e .

Examples

For each of the example cases below, the following parameters are generalized

Parameter	Purpose	Scope
`prefill_kv_layout`	KV cache layout for prefill (e.g., NHD or HND)	prefill
`prefill_block_size`	Block size for prefill	prefill
`prefill_port`	Prefill <-> proxy communication port	prefill/proxy
`decode_kv_layout`	KV cache layout for decode (e.g., NHD or HND)	decode
`decode_block_size`	Block size for decode	decode
`decode_port`	Decode <-> proxy communication port	decode/proxy
`port`	External port exposed by the proxy	proxy
`model`	Define model served, must be synced (i.e. `Qwen/Qwen3-0.6B`)	prefill/decode/proxy

The following toy proxy script is also valid for all examples and should be run on the server sending requests:

PREFILL_HOST="$cuda_ip"
PREFILL_PORT="$prefill_port"
DECODE_HOST="$gaudi_ip"
DECODE_PORT="$decode_port"
PROXY_PORT="$port"

python examples/nixl/toy_proxy_server.py \
  --port "$PROXY_PORT" \
  --prefiller-hosts "$PREFILL_HOST" \
  --prefiller-ports "$PREFILL_PORT" \
  --decoder-hosts "$DECODE_HOST" \
  --decoder-ports "$DECODE_PORT"

Case 1: CUDA Prefill + Gaudi Decode

CUDA launch script:

VLLM_KV_CACHE_LAYOUT="$prefill_kv_layout"
HOST="$cuda_ip"
PORT="$prefill_port"
BLOCK="$prefill_block_size"
MODEL="$model"

export VLLM_KV_CACHE_LAYOUT
export UCX_TLS=ib,rc,cuda_copy
export UCX_MEMTYPE_CACHE=0
export LD_LIBRARY_PATH="/tmp/ucx_install/lib:/opt/nvidia/nvda_nixl/lib/x86_64-linux-gnu:${LD_LIBRARY_PATH}"
export VLLM_NIXL_SIDE_CHANNEL_HOST=$HOST
export VLLM_NIXL_DEVICE_TO_DEVICE=1
export VLLM_USE_V1=1

vllm serve "$MODEL" \
  --port "$PORT" \
  --gpu-memory-utilization 0.8 \
  --block-size "$BLOCK" \
  --kv-transfer-config '{
  "kv_connector": "NixlConnector",
  "kv_role": "kv_both",
  "kv_buffer_device": "cuda",
  "kv_connector_extra_config": {"enforce_handshake_compat": false}
}'

Gaudi launch script:

VLLM_KV_CACHE_LAYOUT="$decode_kv_layout"
HOST="$gaudi_ip"
PORT="$decode_port"
BLOCK="$decode_block_size"
MODEL="$model"

export VLLM_KV_CACHE_LAYOUT
export UCX_TLS=ib,rc,gaudi_gdr
export UCX_MEMTYPE_CACHE=0
export LD_LIBRARY_PATH="/tmp/ucx_install/lib:/opt/nvidia/nvda_nixl/lib/x86_64-linux-gnu:${LD_LIBRARY_PATH}"
export VLLM_NIXL_SIDE_CHANNEL_HOST=$HOST
export VLLM_NIXL_DEVICE_TO_DEVICE=1
export VLLM_USE_V1=1

vllm serve "$MODEL" \
  --port "$PORT" \
  --gpu-memory-utilization 0.8 \
  --block-size "$BLOCK" \
  --kv-transfer-config '{
  "kv_connector": "NixlConnector",
  "kv_role": "kv_both",
  "kv_buffer_device": "hpu",
  "enable_permute_local_kv": "True",
  "kv_connector_extra_config": {"enforce_handshake_compat": false}
}'

Note

The installation process detailed in the installation section only supports either heterogeneous block size OR heterogeneous kv layout. #30275 implements support for heterogeneous block size AND kv layout. To enable this, set the vLLM branch to dev/decode_KV_post_process or cherry-pick the commits on both nodes before installing via pip install.

Case 2: Gaudi Prefill + CUDA Decode

Important

This case is enabled with #30448. To enable this example, set the vLLM branch to dev/prefill_KV_process or cherry-pick the commits on both nodes before installing via pip install.

Note

When using agreed_block_size, set it to the GCD of the prefill and decode --block-size values so both sides share a compatible KV block size.

CUDA launch script:

VLLM_KV_CACHE_LAYOUT="$decode_kv_layout"
HOST="$cuda_ip"
PORT="$decode_port"
BLOCK="$decode_block_size"
MODEL="$model"

export VLLM_KV_CACHE_LAYOUT
export UCX_TLS=ib,rc,cuda_copy
export UCX_MEMTYPE_CACHE=0
export LD_LIBRARY_PATH="/tmp/ucx_install/lib:/opt/nvidia/nvda_nixl/lib/x86_64-linux-gnu:${LD_LIBRARY_PATH}"
export VLLM_NIXL_SIDE_CHANNEL_HOST=$HOST
export VLLM_NIXL_DEVICE_TO_DEVICE=1
export VLLM_USE_V1=1

vllm serve "$MODEL" \
  --port "$PORT" \
  --gpu-memory-utilization 0.8 \
  --block-size "$BLOCK" \
  --kv-transfer-config '{
  "kv_connector": "NixlConnector",
  "kv_role": "kv_both",
  "kv_buffer_device": "cuda",
  "kv_connector_extra_config": {"enforce_handshake_compat": false, "agreed_block_size": 16}
}'

Gaudi launch script:

VLLM_KV_CACHE_LAYOUT="$prefill_kv_layout"
HOST="$gaudi_ip"
PORT="$prefill_port"
BLOCK="$prefill_block_size"
MODEL="$model"

export VLLM_KV_CACHE_LAYOUT
export UCX_TLS=ib,rc,gaudi_gdr
export UCX_MEMTYPE_CACHE=0
export LD_LIBRARY_PATH="/tmp/ucx_install/lib:/opt/nvidia/nvda_nixl/lib/x86_64-linux-gnu:${LD_LIBRARY_PATH}"
export VLLM_NIXL_SIDE_CHANNEL_HOST=$HOST
export VLLM_NIXL_DEVICE_TO_DEVICE=1
export VLLM_USE_V1=1

vllm serve "$MODEL" \
  --port "$PORT" \
  --gpu-memory-utilization 0.8 \
  --block-size "$BLOCK" \
  --no-enable-prefix-caching \
  --kv-transfer-config '{
  "kv_connector": "NixlConnector",
  "kv_role": "kv_both",
  "kv_buffer_device": "hpu",
  "enable_permute_local_kv": "True",
  "kv_connector_extra_config": {"enforce_handshake_compat": false, "agreed_block_size": 16}
}'

Validation

Run the request below from the proxy host to confirm end-to-end connectivity and kv cache integrity for any example:

MODEL="$model"
PROXY_PORT="$port"

curl http://localhost:$PROXY_PORT/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$MODEL"'",
    "prompt": "Mark Elliot Zuckerberg is an American businessman who co-founded the social media service Facebook and its parent company Meta Platforms, of which he is the chairman, chief executive officer, and controlling shareholder. Zuckerberg has been the subject of multiple lawsuits regarding the creation and ownership of the website as well as issues such as user privacy. Born in White Plains, New York, Zuckerberg briefly attended Harvard College, where he launched Facebook in February 2004 with his roommates Eduardo Saverin, Andrew McCollum, Dustin Moskovitz and Chris Hughes. Zuckerberg took the company public in May 2012 with majority shares. He became the worlds youngest self-made billionaire[a] in 2008, at age 23, and has consistently ranked among the worlds wealthiest individuals. According to Forbes, Zuckerbergs estimated net worth stood at US$221.2 billion as of May 2025, making him the second-richest individual in the world.",
    "max_tokens": 100,
    "temperature": 0
  }'

Example output:

{
  "id": "cmpl-7039b3dc-49b9-4943-84a1-6889d2a5c2cf",
  "object": "text_completion",
  "created": 1766014426,
  "model": "Qwen/Qwen3-0.6B",
  "choices": [
    {
      "index": 0,
      "text": " He has been the subject of multiple lawsuits regarding the creation and ownership of the website as well as issues such as user privacy. He has also been the subject of criticism for his actions in the past, including the use of his company's resources to support the education of his children, and for his actions in the past, including the use of his company's resources to support the education of his children, and for his actions in the past, including the use of his company's resources to support the education",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null,
      "prompt_logprobs": null,
      "prompt_token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 199,
    "total_tokens": 299,
    "completion_tokens": 100,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Heterogeneous PD Disaggregation on CUDA+Gaudi Multi‑Node System

Overview

Prerequisites

Installation

CUDA Node Installation

Gaudi Node Installation

Examples

Case 1: CUDA Prefill + Gaudi Decode

Case 2: Gaudi Prefill + CUDA Decode

Validation

FilesExpand file tree

heterogeneous_pd.md

Latest commit

History

heterogeneous_pd.md

File metadata and controls

Heterogeneous PD Disaggregation on CUDA+Gaudi Multi‑Node System

Overview

Prerequisites

Installation

CUDA Node Installation

Gaudi Node Installation

Examples

Case 1: CUDA Prefill + Gaudi Decode

Case 2: Gaudi Prefill + CUDA Decode

Validation