[CI] Add Docker source rebuild environment #3192
Keep the existing torch/ROCm wheel installation flow, but prepare the Docker image for source rebuilds by cloning PyTorch with full history and submodules, installing ccache, and exporting the compiler/cache env vars needed by PyTorch builds. Derive PYTORCH_ROCM_ARCH from the selected AMDGPU family and pass it as a Docker build arg (e.g. gfx94X-dcgpu -> gfx942, gfx950-dcgpu -> gfx950).
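The family-to-arch mapping described above could be sketched as a small shell function. This is only an illustration: the function name derive_pytorch_rocm_arch is hypothetical, and the only grounded mappings are the two examples from the commit message (gfx94X-dcgpu -> gfx942, gfx950-dcgpu -> gfx950). The explicit gfx94X branch is an assumption about how the wildcard family might be handled.

```shell
# Hypothetical helper: map an AMDGPU family name to a PYTORCH_ROCM_ARCH value.
# Only the two example mappings from the commit message are grounded.
derive_pytorch_rocm_arch() {
  local gfx="$1"
  case "$gfx" in
    gfx94X-*) printf '%s\n' "gfx942" ;;      # assumed special case for the wildcard family
    *-*)      printf '%s\n' "${gfx%%-*}" ;;  # drop everything from the first "-" onward
    *)        printf '%s\n' "$gfx" ;;        # already a concrete target, pass through
  esac
}
```

For example, `derive_pytorch_rocm_arch gfx950-dcgpu` prints `gfx950`. A hand-written table like this goes stale as new families appear, which is the drawback of keeping the mapping in the workflow itself.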
Move the copied PyTorch checkout from /workspace/pytorch to /tmp/pytorch inside the Docker image while keeping the existing wheel installation flow.
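As a rough sketch of the compiler and cache environment these commits mention, a source rebuild inside the image might start from something like the following. The function name setup_pytorch_build_env, the default compilers, and the /tmp/ccache location are assumptions for illustration; the PR's actual Dockerfile values are not shown in this conversation.

```shell
# Hypothetical sketch of a source-build environment; all values are assumptions.
setup_pytorch_build_env() {
  # Require the target architecture, e.g. gfx942.
  export PYTORCH_ROCM_ARCH="${1:?usage: setup_pytorch_build_env <arch>}"
  # Default to GCC/G++ unless the caller already chose compilers.
  export CC="${CC:-gcc}"
  export CXX="${CXX:-g++}"
  # CMake 3.17+ reads these to initialize its compiler launcher settings.
  export CMAKE_C_COMPILER_LAUNCHER="ccache"
  export CMAKE_CXX_COMPILER_LAUNCHER="ccache"
  # ccache stores its cache under CCACHE_DIR (assumed location).
  export CCACHE_DIR="${CCACHE_DIR:-/tmp/ccache}"
}
```

Calling `setup_pytorch_build_env gfx942` before starting a build in /tmp/pytorch would then let repeated rebuilds reuse cached object files.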
Jenkins build for 20b7b328ac54f21c077e39d8aff1f821a45a40a4 commit finished as FAILURE
*-*) PYTORCH_ROCM_ARCH="${GFX%%-*}" ;;
*) PYTORCH_ROCM_ARCH="${GFX}" ;;
esac
echo "pytorch_rocm_arch=${PYTORCH_ROCM_ARCH}" >> $GITHUB_OUTPUT
Let's see if we can use https://github.com/ROCm/TheRock/blob/main/build_tools/github_actions/expand_amdgpu_families.py to get the most up-to-date mapping at any point in time.
Use TheRock's expand_amdgpu_families.py helper to derive the concrete PYTORCH_ROCM_ARCH build targets from the selected AMDGPU family instead of maintaining ad hoc shell mappings in the Docker workflow. The workflow checks out ROCm/TheRock shallowly and invokes the helper for both scheduled matrix builds and manual dispatch builds.
Jenkins build for 25d8dde95844bd7b4158a365d79731e32494e9ea commit finished as FAILURE
Run the Trivy critical-vulnerability scan before pushing portable PyTorch Docker images so the TheRock image workflow matches the existing security gate.
Jenkins build for 97ed6041c75f50aa6a18b940a6287e8495d6c641 commit finished as NOT_BUILT
Update the portable Docker workflow to use the standard DOCKER_USERNAME and DOCKER_PAT secrets for Docker Hub authentication.
Jenkins build for 97ed6041c75f50aa6a18b940a6287e8495d6c641 commit finished as FAILURE
Detect log-only timeout failures that appear as inline pytest KeyboardInterrupt retries, and always download logs for parity scanning even when raw log upload is disabled.
Jenkins build for 8c2fac5c8f0d99527a80b6ea57d5569ea3ab84ba commit finished as FAILURE
Updates the portable PyTorch Docker workflow so the generated images can be used both for the existing wheel-based environment and for local/source rebuild workflows.
Changes
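- Clones PyTorch with full history and submodules into /tmp/pytorch.
- Exports the source-build environment for ccache, GCC/G++, CMake compiler launchers, and PYTORCH_ROCM_ARCH.
- Derives PYTORCH_ROCM_ARCH from the selected AMDGPU family using TheRock's build_tools/github_actions/expand_amdgpu_families.py metadata instead of maintaining a local mapping in the workflow.
- Uses the standard DOCKER_USERNAME and DOCKER_PAT secrets for Docker Hub authentication.
Testing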
Latest dispatched Docker workflow run:
https://github.com/ROCm/pytorch/actions/runs/25394893743
Docker workflow runs on the PR branch:
https://github.com/ROCm/pytorch/actions/workflows/build_portable_linux_pytorch_dockers.yml?query=branch%3Adocker-full-clone-build-env
Parity log timeout detection
Adds parity report coverage for log-only timeout failures that appear as inline pytest KeyboardInterrupt retries, such as test_nn.py::TestNN::test_RNN_cpu_vs_cudnn_with_dropout, and keeps logs available for detection even when raw .txt log upload is disabled.
Validation run against the same input SHA dcf14ce8d73b7b43d23cc265d79215cbe2727774:
https://github.com/ROCm/pytorch/actions/runs/25449927177
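To illustrate the idea behind the log-only detection, a scan for interrupt markers in downloaded logs might look roughly like the sketch below. It is not the parity report's actual logic: the function name find_interrupt_logs, the .txt file layout, and the plain "KeyboardInterrupt" match string are all assumptions made for the example.

```shell
# Hypothetical scanner: list downloaded .txt logs that mention KeyboardInterrupt.
# The real parity tooling may match a more specific pattern than this.
find_interrupt_logs() {
  local log_dir="$1"
  # -r: recurse, -l: print only names of matching files, -F: fixed-string match.
  # "|| true" keeps the exit status zero when no log matches.
  grep -rlF --include='*.txt' 'KeyboardInterrupt' "$log_dir" || true
}
```

Running `find_interrupt_logs ./logs` would print one path per log file containing the marker, which a later step could cross-check against tests that have no structured result.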