[CI] Add Docker source rebuild environment#3192

Draft
ethanwee1 wants to merge 6 commits into develop from docker-full-clone-build-env

ethanwee1 commented Apr 29, 2026

Updates the portable PyTorch Docker workflow so the generated images can be used both for the existing wheel-based environment and for local/source rebuild workflows.

Changes

  • Builds images from a full PyTorch checkout with complete history and recursive submodules, then copies the source into the image at /tmp/pytorch.
  • Keeps the existing ROCm/TheRock wheel installation path intact while adding source-rebuild defaults such as ccache, GCC/G++, CMake compiler launchers, and PYTORCH_ROCM_ARCH.
  • Derives PYTORCH_ROCM_ARCH from the selected AMDGPU family using TheRock's build_tools/github_actions/expand_amdgpu_families.py metadata instead of maintaining a local mapping in the workflow.
  • Adds a Trivy critical-vulnerability scan before image push for both scheduled and manual Docker builds.
  • Updates Docker Hub authentication to use DOCKER_USERNAME and DOCKER_PAT secrets.
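
As a rough illustration of the source-rebuild defaults listed above, the environment inside the image might look something like the sketch below. The exact variable set and values used by the workflow are not shown in this PR description, so `CCACHE_DIR` and the `gfx942` fallback are assumptions chosen for illustration; `CMAKE_C_COMPILER_LAUNCHER` and `CMAKE_CXX_COMPILER_LAUNCHER` are standard CMake environment variables.

```shell
#!/usr/bin/env bash
# Sketch of source-rebuild defaults; values are illustrative assumptions,
# not the exact contents of the image.

# Use the GNU toolchain for host compilation.
export CC=gcc
export CXX=g++

# Route C/C++ compiles through ccache. CMake reads these environment
# variables to initialise CMAKE_<LANG>_COMPILER_LAUNCHER.
export CMAKE_C_COMPILER_LAUNCHER=ccache
export CMAKE_CXX_COMPILER_LAUNCHER=ccache

# Assumed cache location; a persistent volume lets rebuilds reuse the cache.
export CCACHE_DIR="${CCACHE_DIR:-$HOME/.cache/ccache}"

# Target GPU architecture. In the image this comes from a Docker build arg;
# the gfx942 fallback here is only a placeholder.
export PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH:-gfx942}"

echo "CC=$CC CXX=$CXX launcher=$CMAKE_CXX_COMPILER_LAUNCHER arch=$PYTORCH_ROCM_ARCH"
```

With these exported, a rebuild from the copied checkout at /tmp/pytorch would pick up ccache and the selected architecture without extra flags.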

Testing

Latest dispatched Docker workflow run:
https://github.com/ROCm/pytorch/actions/runs/25394893743

Docker workflow runs on the PR branch:
https://github.com/ROCm/pytorch/actions/workflows/build_portable_linux_pytorch_dockers.yml?query=branch%3Adocker-full-clone-build-env

Parity log timeout detection

Adds parity report coverage for log-only timeout failures that appear as inline pytest KeyboardInterrupt retries, such as test_nn.py::TestNN::test_RNN_cpu_vs_cudnn_with_dropout, and keeps logs available for detection even when raw .txt log upload is disabled.

Validation run against the same input SHA dcf14ce8d73b7b43d23cc265d79215cbe2727774:
https://github.com/ROCm/pytorch/actions/runs/25449927177

Keep the existing torch/ROCm wheel installation flow, but prepare the
Docker image for source rebuilds by cloning PyTorch with full history and
submodules, installing ccache, and exporting the compiler/cache env vars
needed by PyTorch builds.

Derive PYTORCH_ROCM_ARCH from the selected AMDGPU family and pass it as a
Docker build arg (e.g. gfx94X-dcgpu -> gfx942, gfx950-dcgpu -> gfx950).
Move the copied PyTorch checkout from /workspace/pytorch to /tmp/pytorch
inside the Docker image while keeping the existing wheel installation flow.
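
The family-to-arch derivation described in the commit message above can be sketched as a small shell function. This is a reconstruction from the two example mappings and the partial `case` hunk shown in the review below, not the workflow's actual code; the function name `family_to_arch` is hypothetical, and the `gfx94X` branch is inferred from the `gfx94X-dcgpu -> gfx942` example. A later commit in this PR replaces this ad hoc mapping with TheRock's helper.

```shell
#!/usr/bin/env bash
# family_to_arch: hypothetical helper that maps an AMDGPU family name to a
# concrete PYTORCH_ROCM_ARCH value.
family_to_arch() {
  local gfx="$1"
  case "$gfx" in
    gfx94X-*) echo "gfx942" ;;      # family alias; inferred from the commit message example
    *-*)      echo "${gfx%%-*}" ;;  # strip everything from the first "-" (e.g. "-dcgpu")
    *)        echo "$gfx" ;;        # no suffix: treat as a concrete arch already
  esac
}

family_to_arch gfx94X-dcgpu   # prints gfx942
family_to_arch gfx950-dcgpu   # prints gfx950
```

The explicit `gfx94X` branch is needed because plain suffix stripping would yield `gfx94X`, which is a family alias rather than a buildable target.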

rocm-repo-management-api Bot commented Apr 29, 2026

Jenkins build for 20b7b328ac54f21c077e39d8aff1f821a45a40a4 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

*-*) PYTORCH_ROCM_ARCH="${GFX%%-*}" ;;
*) PYTORCH_ROCM_ARCH="${GFX}" ;;
esac
echo "pytorch_rocm_arch=${PYTORCH_ROCM_ARCH}" >> $GITHUB_OUTPUT
Collaborator

Let's see if we can use https://github.com/ROCm/TheRock/blob/main/build_tools/github_actions/expand_amdgpu_families.py to get the most up-to-date mapping at any point in time.

Use TheRock's expand_amdgpu_families.py helper to derive the concrete
PYTORCH_ROCM_ARCH build targets from the selected AMDGPU family instead
of maintaining ad hoc shell mappings in the Docker workflow.

The workflow checks out ROCm/TheRock shallowly and invokes the helper for
both scheduled matrix builds and manual dispatch builds.

rocm-repo-management-api Bot commented Apr 29, 2026

Jenkins build for 25d8dde95844bd7b4158a365d79731e32494e9ea commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

Run the Trivy critical-vulnerability scan before pushing portable PyTorch Docker images so the TheRock image workflow matches the existing security gate.
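
A scan-before-push gate of this kind could be sketched as follows. The workflow's real step is not shown in this conversation, so this is only an assumed shape: `scan_then_push`, `SCAN_CMD`, and `PUSH_CMD` are hypothetical names introduced for illustration, with the overrides there so the gate can be dry-run without Trivy or Docker installed. The default scan uses Trivy's `--severity` and `--exit-code` options.

```shell
#!/usr/bin/env bash
# scan_then_push: hypothetical gate that pushes an image only if the scan passes.
# SCAN_CMD and PUSH_CMD are illustrative override hooks for dry runs.
scan_then_push() {
  local image="$1"
  # --exit-code 1 makes trivy return non-zero when a finding matches --severity.
  # The command variables are left unquoted on purpose so they split into words.
  if ${SCAN_CMD:-trivy image --severity CRITICAL --exit-code 1} "$image"; then
    ${PUSH_CMD:-docker push} "$image"
  else
    echo "critical vulnerabilities found; not pushing $image" >&2
    return 1
  fi
}

# Dry run with stand-ins: the "scan" always passes and the "push" just echoes.
SCAN_CMD=true PUSH_CMD=echo scan_then_push example/image:tag
```

Failing the step on a non-zero scan result keeps a vulnerable image from reaching the registry, at the cost of blocking the push until the finding is fixed or triaged.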

rocm-repo-management-api Bot commented May 5, 2026

Jenkins build for 97ed6041c75f50aa6a18b940a6287e8495d6c641 commit finished as NOT_BUILT
Links: Pipeline Overview / Build artifacts / Test Results

Update the portable Docker workflow to use the standard DOCKER_USERNAME and DOCKER_PAT secrets for Docker Hub authentication.

rocm-repo-management-api Bot commented May 5, 2026

Jenkins build for 97ed6041c75f50aa6a18b940a6287e8495d6c641 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

Detect log-only timeout failures that appear as inline pytest KeyboardInterrupt retries, and always download logs for parity scanning even when raw log upload is disabled.

rocm-repo-management-api Bot commented May 6, 2026

Jenkins build for 8c2fac5c8f0d99527a80b6ea57d5569ea3ab84ba commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

2 participants