
Conversation

@ghandoura

GAUDISW-242243

  • Multi-stage build: gaudi-base → gaudi-pytorch → vllm-final

Build arguments:

  • SYNAPSE_VERSION: Habana Synapse AI version (default: 1.22.1)
  • PT_VERSION: PyTorch version (default: 2.7.1)
  • VLLM_GAUDI_COMMIT: vllm-gaudi git commit/tag (default: main)
  • VLLM_PROJECT_COMMIT: vllm upstream commit (auto-detected if empty)
  • TORCH_TYPE: PyTorch type - 'upstream' or 'fork' (default: upstream)

Usage:
docker build --build-arg SYNAPSE_VERSION=1.23.0 -t vllm-gaudi:1.23.0 .
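Put together, the stages and arguments above suggest a layout roughly like this (a minimal sketch; the base image tag and stage contents are assumptions, not the actual Dockerfile):

```dockerfile
# Sketch of the three-stage layout; package steps are illustrative only.
ARG SYNAPSE_VERSION=1.22.1
ARG PT_VERSION=2.7.1
ARG VLLM_GAUDI_COMMIT=main
ARG VLLM_PROJECT_COMMIT=""
ARG TORCH_TYPE=upstream

FROM registry.access.redhat.com/ubi9/ubi:9.6 AS gaudi-base
ARG SYNAPSE_VERSION
# ... system packages and Habana Synapse drivers for ${SYNAPSE_VERSION} ...

FROM gaudi-base AS gaudi-pytorch
ARG PT_VERSION
ARG TORCH_TYPE
# ... install PyTorch ${PT_VERSION} (upstream or Habana fork per ${TORCH_TYPE}) ...

FROM gaudi-pytorch AS vllm-final
ARG VLLM_GAUDI_COMMIT
ARG VLLM_PROJECT_COMMIT
# ... clone and install vllm and vllm-gaudi at the configured commits ...
```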

Copilot AI left a comment

Pull request overview

This PR adds a new multi-stage Dockerfile for building vLLM with Habana Gaudi support on RHEL 9.6 UBI. The Dockerfile enables building vLLM container images with configurable Synapse AI and PyTorch versions for Habana's Gaudi accelerators.

Key Changes:

  • Implements a three-stage build process: gaudi-base (system setup + Habana drivers), gaudi-pytorch (PyTorch installation), and vllm-final (vLLM installation and configuration)
  • Adds flexible build arguments for version management (SYNAPSE_VERSION, PT_VERSION, VLLM_GAUDI_COMMIT, VLLM_PROJECT_COMMIT, TORCH_TYPE)
  • Configures OpenShift-compatible non-root user setup for enterprise deployments
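OpenShift typically runs containers with an arbitrary UID that belongs to group 0, so a compatible image creates a non-root user and makes writable paths group-owned by root. A hypothetical fragment illustrating the pattern (the user name, UID, and paths are assumptions):

```dockerfile
# Illustrative OpenShift-friendly user setup; names, UID, and paths are assumptions.
RUN useradd -u 1001 -g 0 -m vllm && \
    mkdir -p /workspace && \
    chown -R 1001:0 /workspace && \
    # give group 0 the same permissions as the owner, for arbitrary-UID runtimes
    chmod -R g=u /workspace
USER 1001
WORKDIR /workspace
```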


ARG BASE_NAME=rhel9.6
ARG PT_VERSION=2.7.1
# can be upstream or fork
ARG TORCH_TYPE='upstream'

Copilot AI Dec 4, 2025


Remove quotes around the default value. In Dockerfile ARG declarations, quotes become part of the value itself. Change to ARG TORCH_TYPE=upstream to avoid potential issues when this value is compared or used in conditions.

Suggested change
ARG TORCH_TYPE='upstream'
ARG TORCH_TYPE=upstream

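To illustrate the concern behind this suggestion: if the single quotes did survive into the value (parser behavior varies), a plain string comparison against the unquoted word would no longer match. A small shell demonstration (the assignment simulates quotes ending up inside the value; it is not the Dockerfile's actual behavior):

```shell
# Illustrative only: a value literally containing quote characters
# fails an exact string comparison against the bare word.
TORCH_TYPE="'upstream'"   # simulates quotes surviving into the value
if [ "$TORCH_TYPE" = "upstream" ]; then
  echo "matched"
else
  echo "no match: value is $TORCH_TYPE"
fi
```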
git clone --depth 1 https://github.com/vllm-project/vllm-gaudi.git $VLLM_PATH2 && \
cd $VLLM_PATH2 && \
if [ -z "${VLLM_PROJECT_COMMIT}" ]; then \
VLLM_PROJECT_COMMIT=$(git show "origin/vllm/last-good-commit-for-vllm-gaudi:VLLM_STABLE_COMMIT" 2>/dev/null || echo "main") && \

Copilot AI Dec 4, 2025


The silent fallback to 'main' is unhelpful when the git show command fails. Consider emitting a warning with echo >&2 "Warning: Could not fetch last-good-commit, using main branch" before the fallback to help with debugging.

Suggested change
VLLM_PROJECT_COMMIT=$(git show "origin/vllm/last-good-commit-for-vllm-gaudi:VLLM_STABLE_COMMIT" 2>/dev/null || echo "main") && \
VLLM_PROJECT_COMMIT=$(git show "origin/vllm/last-good-commit-for-vllm-gaudi:VLLM_STABLE_COMMIT" 2>/dev/null || { echo >&2 "Warning: Could not fetch last-good-commit, using main branch"; echo "main"; }) && \

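The suggested pattern can be exercised in isolation; here `false` stands in for the failing `git show` (a sketch, not the Dockerfile's actual command). The warning goes to stderr while only the default value reaches stdout and the variable:

```shell
# Demo of the fallback: warn on stderr, emit the default on stdout.
VLLM_PROJECT_COMMIT=$(false 2>/dev/null || { echo >&2 "Warning: Could not fetch last-good-commit, using main branch"; echo "main"; })
echo "$VLLM_PROJECT_COMMIT"
```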
# Clone and install vLLM
RUN set -e && \
mkdir -p $VLLM_PATH2 && \
git clone --depth 1 https://github.com/vllm-project/vllm-gaudi.git $VLLM_PATH2 && \

Copilot AI Dec 4, 2025


The shallow clone with --depth 1 may prevent checking out specific commits if VLLM_GAUDI_COMMIT references an older commit. Consider removing --depth 1 or adding a full fetch before the checkout at line 179 so that arbitrary commits can be checked out.

Suggested change
git clone --depth 1 https://github.com/vllm-project/vllm-gaudi.git $VLLM_PATH2 && \
git clone https://github.com/vllm-project/vllm-gaudi.git $VLLM_PATH2 && \

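A middle ground, assuming the server allows fetching a commit by its full SHA (GitHub generally does), is to keep the shallow clone but fetch the requested commit explicitly before checkout. A hypothetical fragment reusing the Dockerfile's variable names:

```dockerfile
# Sketch: keep --depth 1 but fetch the pinned commit explicitly so it
# can be checked out even if it is not the branch tip.
# Assumes VLLM_GAUDI_COMMIT is a full SHA or a ref the server will serve.
RUN git clone --depth 1 https://github.com/vllm-project/vllm-gaudi.git $VLLM_PATH2 && \
    cd $VLLM_PATH2 && \
    git fetch --depth 1 origin "$VLLM_GAUDI_COMMIT" && \
    git checkout FETCH_HEAD
```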
echo "Clone vllm-project/vllm and use configured or last good commit hash" && \
git clone https://github.com/vllm-project/vllm.git $VLLM_PATH && \
cd $VLLM_PATH && \
git remote add upstream https://github.com/vllm-project/vllm.git && \

Copilot AI Dec 4, 2025


The HTTP configuration values (524288000 bytes buffer, 1000 bytes/sec speed limit, 600 second timeout) lack explanation for why these specific values are needed. Add a comment explaining these are set to handle large repository fetches over slow connections.

Suggested change
git remote add upstream https://github.com/vllm-project/vllm.git && \
git remote add upstream https://github.com/vllm-project/vllm.git && \
# The following git HTTP configuration values are set to handle large repository fetches over slow connections.

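For context, the three values called out above correspond to standard git HTTP settings. A sketch of how they would be documented inline (assuming these are the config keys the Dockerfile sets):

```dockerfile
# Illustrative mapping of the values mentioned above to git config keys:
#   http.postBuffer    524288000 -> 500 MB buffer for large HTTP POSTs
#   http.lowSpeedLimit 1000      -> abort if transfer drops below 1000 bytes/s
#   http.lowSpeedTime  600       -> ...for longer than 600 seconds
RUN git config --global http.postBuffer 524288000 && \
    git config --global http.lowSpeedLimit 1000 && \
    git config --global http.lowSpeedTime 600
```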
@ghandoura
Author

Tested with Synapse 1.22.1 and both upstream and fork PyTorch.

@github-actions

github-actions bot commented Dec 4, 2025

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.
