fix(provisioner): consolidate containerd v1/v2 templates into one #800

Open: ArangoGutierrez wants to merge 7 commits into NVIDIA:main from ArangoGutierrez:fix/containerd-template-consolidation

@ArangoGutierrez
Collaborator

Problem

The v2 binary-download template (containerdV2Template) was Debian-only and is broken in practice: see NVIDIA/gpu-operator#2396, where Tariq reported that holodeck cannot provision an environment with containerRuntime.version: 2.2.3. The original December 2025 rationale for v2 ("v1 didn't work for containerd 2.x") is obsolete, because the containerd.io apt/dnf package now ships 2.x and works on the debian, amazon, and rhel families.

Approach

Single template path. Both v1.x and v2.x render through the OS-aware containerd.io package template. This PR drops containerdV2Template, containerdV2Tmpl, the MajorVersion field on the Containerd struct, and the major-version dispatch branch in Execute(). The bare-major "2" -> "2.0.0" coercion is preserved for backward compatibility (apt rejects containerd.io=2-1). The Git and Latest source paths are not touched.
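For readers unfamiliar with the bare-major coercion, here is a minimal sketch of the idea. `normalizeVersion` is a hypothetical helper written for illustration; it is not the function in pkg/provisioner/templates/containerd.go, and the real implementation may handle more input shapes.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeVersion is a hypothetical helper (an assumption, not project code).
// It expands a bare major version such as "2" to "2.0.0" so that a package
// pin of the form containerd.io=<version>-1 is well-formed. Versions that
// already contain a dot are returned unchanged.
func normalizeVersion(v string) string {
	v = strings.TrimSpace(v)
	if v == "" || strings.Contains(v, ".") {
		return v
	}
	return v + ".0.0"
}

func main() {
	for _, in := range []string{"2", "2.2.3", "1.7.27"} {
		fmt.Printf("%q -> %q\n", in, normalizeVersion(in))
	}
}
```

Keeping the coercion in one small function means the unified template only ever sees a full version string, whichever major version was requested.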

Commits

| SHA | Type | Subject |
|---|---|---|
| 094884e1 | test | Assert v2.x uses the unified containerd.io template (RED) |
| f6cea29f | test | Remove obsolete contracts for v2 binary-download path |
| 697a785a | fix | Route v2.x through containerd.io package (GREEN) |
| 1841754f | test | Drop MajorVersion assertions |
| 5e7afc10 | refactor | Remove dead V2 template and MajorVersion field |
| 7727dcb7 | test | Cover all OS-family branches of the unified template |
| 7f276a39 | test(e2e) | Add containerd consolidation 2x2 matrix manifests |

Testing

Unit tests, vet, and golangci-lint are clean.

E2E matrix on g4dn.xlarge in us-west-1:

| # | OS | Requested | Installed | Path | Service | ctr version |
|---|---|---|---|---|---|---|
| 1 | Ubuntu 22.04 | 1.7.27 | containerd.io 1.7.27 | apt | active | Client/Server 1.7.27 |
| 2 | Ubuntu 22.04 | 2.2.3 | containerd.io v2.2.3 | apt | active | Client/Server v2.2.3 |
| 3 | Amazon Linux 2023 | 1.7.27 | 2.2.3+unknown (fallback) | dnf | active | Client/Server 2.2.3+unknown |
| 4 | Amazon Linux 2023 | 2.2.3 | 2.2.3+unknown | dnf | active | Client/Server 2.2.3+unknown |

Verbatim verification output captured at the time of the run:

Cell 1: Ubuntu + 1.7.27

=== containerd --version ===
containerd containerd.io 1.7.27 05044ec0a9a75232cad458027ca83437aae3f4da
=== systemctl is-active containerd ===
active
=== sudo ctr version ===
Client:
  Version:  1.7.27
  Revision: 05044ec0a9a75232cad458027ca83437aae3f4da
  Go version: go1.23.7

Server:
  Version:  1.7.27
  Revision: 05044ec0a9a75232cad458027ca83437aae3f4da

Cell 2: Ubuntu + 2.2.3 (the fix)

=== containerd --version ===
containerd containerd.io v2.2.3 77c84241c7cbdd9b4eca2591793e3d4f4317c590
=== systemctl is-active containerd ===
active
=== sudo ctr version ===
Client:
  Version:  v2.2.3
  Revision: 77c84241c7cbdd9b4eca2591793e3d4f4317c590
  Go version: go1.25.9

Server:
  Version:  v2.2.3
  Revision: 77c84241c7cbdd9b4eca2591793e3d4f4317c590

Cell 3: Amazon Linux 2023 + 1.7.27 (fell back to latest)

=== containerd --version ===
containerd github.com/containerd/containerd/v2 2.2.3+unknown
=== systemctl is-active containerd ===
active
=== sudo ctr version ===
Client:
  Version:  2.2.3+unknown
  Go version: go1.25.9 X:nodwarf5
Server:
  Version:  2.2.3+unknown

Cell 4: Amazon Linux 2023 + 2.2.3 (previously untested combo)

=== /etc/os-release ===
NAME="Amazon Linux"
VERSION="2023"
ID="amzn"
=== containerd --version ===
containerd github.com/containerd/containerd/v2 2.2.3+unknown
=== systemctl is-active containerd ===
active
=== sudo ctr version ===
Client:
  Version:  2.2.3+unknown
Server:
  Version:  2.2.3+unknown

Notable

  • Cell 3 fell back from 1.7.27 to 2.2.3 because the AL2023 dnf repo only carries 2.x. This is the existing v1-template "fall back to latest if version not found" semantics, preserved by this PR. If a user pins 1.7.27 on AL2023, they currently get 2.x silently; making pinning strict is out of scope and tracked separately.
  • Cell 4 (AL2023 + containerd 2.x) had not been provisioning-tested before this PR — the old V2 template was Debian-only.
  • AL2023 cells require auth.username: ec2-user in the Environment manifest; included in the e2e fixtures here.
  • containerd 2.x emits a bin_dir/bin_dirs deprecation warning on every ctr invocation. Unrelated to this PR; tracked separately.
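The "fall back to latest if version not found" behaviour noted for Cell 3 can be restated as a small Go sketch. This is an illustration only: the real logic lives in the provisioning template, and `resolveVersion`, its parameters, and the assumed repo contents are hypothetical.

```go
package main

import "fmt"

// resolveVersion is a hypothetical illustration of fallback-to-latest.
// available is assumed to be sorted oldest to newest. The boolean result
// reports whether the exact requested version was found.
func resolveVersion(requested string, available []string) (string, bool) {
	if len(available) == 0 {
		return "", false
	}
	for _, v := range available {
		if v == requested {
			return v, true
		}
	}
	// Pin not found: fall back silently to the newest available version.
	return available[len(available)-1], false
}

func main() {
	al2023 := []string{"2.2.3"} // assumed repo contents, mirroring Cell 3
	got, exact := resolveVersion("1.7.27", al2023)
	fmt.Printf("installed=%s exact=%t\n", got, exact)
}
```

A strict-pinning variant would turn the `exact == false` case into an error instead of installing a different version; as stated above, that change is out of scope for this PR.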

Refs

  • Closes the dispatch bug surfaced in NVIDIA/gpu-operator#2396
  • Spec: docs/plans/2026-04-29-containerd-template-consolidation-design.md
  • Plan: docs/plans/2026-04-29-containerd-template-consolidation-plan.md

@coveralls

coveralls commented May 2, 2026

Coverage Report for CI Build 25261481429

Coverage decreased (-0.03%) to 47.742%

Details

  • Coverage decreased (-0.03%) from the base build.
  • Patch coverage: 1 uncovered change across 1 file (4 of 5 lines covered, 80.0%).
  • 1 coverage regression across 1 file.

Uncovered Changes

| File | Changed | Covered | % |
|---|---|---|---|
| pkg/provisioner/templates/containerd.go | 5 | 4 | 80.0% |

Coverage Regressions

1 previously-covered line in 1 file lost coverage.

| File | Lines Losing Coverage | Coverage |
|---|---|---|
| pkg/provisioner/templates/containerd.go | 1 | 84.06% |

Coverage Stats

Relevant Lines: 11026
Covered Lines: 5264
Line Coverage: 47.74%
Coverage Strength: 0.53 hits per line

Collaborator Author

@ArangoGutierrez ArangoGutierrez left a comment

Self-review: 4-cell E2E matrix captured in PR body (Cells 1, 2 verified containerd.io apt path on Ubuntu; Cells 3, 4 verified dnf path on Amazon Linux 2023). go test, go vet, and golangci-lint are clean. DCO + GPG signatures on all 7 commits. Awaiting QA promotion to ready-for-review.

@ArangoGutierrez ArangoGutierrez marked this pull request as ready for review May 2, 2026 19:13
…io template

Adds TestContainerd_Execute_Version2x. Fails today because Execute()
dispatches v2.x to the binary-download containerdV2Template. Will pass
once the dispatch is collapsed into a single template path.

Refs: NVIDIA/gpu-operator#2396
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

Drops TestContainerd_Execute_Version2 and the v2-specific branches in
TestContainerd_Execute_CommonElements. Those assertions describe the
binary-download path (RUNC_VERSION pin, github.com tarball, explicit
modprobe/sysctl) which is being removed in the next commit.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

The v2 binary-download template was Debian-only and is broken in
practice (NVIDIA/gpu-operator#2396). The containerd.io apt/dnf package
now ships 2.x and works on debian, amazon, and rhel families. Both
v1.x and v2.x now render through the unified package template.

Fixes: NVIDIA/gpu-operator#2396
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

The MajorVersion struct field is being removed. These assertions are
the last consumers.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

…rsion

Removes containerdV2Template, the containerdV2Tmpl pre-compiled var,
and the MajorVersion field on the Containerd struct. The bare-major
"2" -> "2.0.0" coercion is preserved for backward compatibility.
Git and latest source paths are unchanged.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

Adds assertions for /etc/apt/keyrings/docker.gpg (debian),
docker-ce.repo (rhel), and "Unsupported OS family" (default arm),
satisfying the design doc's mutation invariant: deleting any branch
of the unified template must cause a unit test to fail.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Cells: (Ubuntu, Amazon Linux 2023) x (containerd 1.7.27, 2.2.3).
Cell 4 (Amazon Linux + 2.2.3) is previously untested -- the v2
binary-download path was Debian-only.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@ArangoGutierrez ArangoGutierrez force-pushed the fix/containerd-template-consolidation branch from 7f276a3 to 82b7e0d Compare May 2, 2026 20:45
@ArangoGutierrez
Collaborator Author

Please review @tariq1890 / @cdesiniotis

Contributor

@tariq1890 tariq1890 left a comment

Thanks very much @ArangoGutierrez !
