
fix(validator): restore local image pull policy and optimize GPU CI#505

Merged
mchmarny merged 2 commits into NVIDIA:main from yuanchen8911:fix/validator-image-pull-policy
Apr 8, 2026



@yuanchen8911 yuanchen8911 commented Apr 8, 2026

Summary

Three fixes for GPU test reliability, addressing failures that have occurred since April 2:

  1. Restore local image pull policy: PR #444 (feat(validator): add --node-selector and --toleration flags for validation workload scheduling) accidentally removed the ko.local/kind.local prefix check from imagePullPolicy(), causing validator Jobs to use PullAlways for locally-loaded images → ImagePullBackOff → 100% failure rate.

  2. Optimize image loading — Reduce smoke-test base image from cuda:runtime (~1.8GB) to cuda:base (~250MB), load only onto worker nodes (skip control-plane to avoid kind load hangs on 3-node clusters), and add timeout + retry with warning logs.

  3. Optimize pod-autoscaling conformance check — Remove scale-down behavioral test and reduce polling timeouts. Worst-case runtime reduced from ~9min to ~3min, helping GPU CI tests stay within their job-level timeout budgets.

Root Cause

PR #444 refactored imagePullPolicy() and removed the local image prefix check:

Before #444 (working):

if strings.HasPrefix(img, "ko.local") || ... {
    return corev1.PullIfNotPresent  // local images
}

After #444 (broken):

if strings.HasSuffix(d.entry.Image, ":latest") {
    return corev1.PullAlways  // ko.local:latest hits this branch
}

Result: ko.local/aicr-validators/conformance:latest → PullAlways → no registry exists → ImagePullBackOff → 5-min Job timeout → all validators report status=other.

Fix Details

1. imagePullPolicy restoration

Restore the local image prefix check. Side-loaded images (ko.local, kind.local) now use PullNever, whereas the pre-#444 check shown above returned PullIfNotPresent. All other images (including localhost:5001 used by the e2e registry flow) follow standard policy.
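The ordering of the checks is what matters here, so a minimal sketch may help. It is not the code from this PR: `PullPolicy` is a local stand-in for `corev1.PullPolicy` so the example compiles without the Kubernetes modules, the free function `imagePullPolicy(image string)` and the `localPrefixes` list are hypothetical, and the "standard policy" branch (`:latest` gives `PullAlways`, anything else `PullIfNotPresent`) is an assumption based on the post-#444 snippet above.

```go
package main

import (
	"fmt"
	"strings"
)

// PullPolicy mirrors the string values of corev1.PullPolicy so this sketch
// builds without the Kubernetes API module (assumption for illustration).
type PullPolicy string

const (
	PullAlways       PullPolicy = "Always"
	PullNever        PullPolicy = "Never"
	PullIfNotPresent PullPolicy = "IfNotPresent"
)

// localPrefixes lists image prefixes that are side-loaded onto nodes and
// have no registry behind them (hypothetical name).
var localPrefixes = []string{"ko.local", "kind.local"}

// imagePullPolicy is a hypothetical stand-in for the validator's method.
func imagePullPolicy(image string) PullPolicy {
	// The local check must run first: ko.local/...:latest would otherwise
	// fall into the :latest branch and be pulled from a nonexistent registry.
	for _, p := range localPrefixes {
		if strings.HasPrefix(image, p) {
			return PullNever
		}
	}
	// Assumed "standard policy" for everything else.
	if strings.HasSuffix(image, ":latest") {
		return PullAlways
	}
	return PullIfNotPresent
}

func main() {
	images := []string{
		"ko.local/aicr-validators/conformance:latest",
		"kind.local/aicr-validators/conformance:latest",
		"localhost:5001/aicr-validators/conformance:latest",
		"localhost:5001/aicr-validators/conformance:v0.1.0",
	}
	for _, img := range images {
		fmt.Printf("%-52s %s\n", img, imagePullPolicy(img))
	}
}
```

With `PullNever`, a missing side-loaded image fails immediately instead of waiting on a pull that can never succeed, which should surface a broken `kind load` step sooner.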

2. Image loading optimization

| Change | Before | After |
| --- | --- | --- |
| Smoke-test base image | cuda:runtime (~1.8GB) | cuda:base (~250MB) |
| kind load target | All nodes (incl. control-plane) | Worker nodes only |
| kind load timeout | None (hangs indefinitely) | 5min + retry with ::warning:: |
| 3-node cluster transfer | ~5.4GB (1.8GB × 3) | ~500MB (250MB × 2 workers) |

3. pod-autoscaling optimization

| Parameter | Before | After | Savings |
| --- | --- | --- | --- |
| Custom metrics poll | 12 attempts (2 min) | 6 attempts (1 min) | 1 min |
| HPA scaling intent timeout | 3 min | 1 min | 2 min |
| Deployment scale-up timeout | 2 min | 1 min | 1 min |
| Scale-down behavioral test | Full test (2 min) | Removed | 2 min |
| Worst-case total | ~9 min | ~3 min | ~6 min |

Scale-up alone proves the full metrics pipeline (DCGM → Prometheus → adapter → HPA → Deployment controller). Scale-down tests a different path (HPA downscale stabilization) that is not a CNCF conformance requirement.

Evidence

GPU test pass/fail history on main:

| Date | GPU Tests |
| --- | --- |
| Mar 29 – Apr 2 | All passing |
| Apr 2 (after #444 merged) | 100% failure (ImagePullBackOff) |
| Apr 3 – Apr 7 | 0% pass rate |

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Component(s) Affected

  • Validator (pkg/validator)
  • Other: CI build action (.github/actions/aicr-build), conformance validator (validators/conformance)

Testing

go test -race -v ./pkg/validator/job/... -run TestImagePullPolicy  # 6 cases pass
go test -race ./pkg/defaults/...  # pass
golangci-lint run ./validators/conformance/... ./pkg/defaults/... ./pkg/validator/job/...  # 0 issues

Image loading and conformance optimizations can only be fully validated on GPU CI runners.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@yuanchen8911 yuanchen8911 added bug Something isn't working area/ci area/validator labels Apr 8, 2026
@yuanchen8911 yuanchen8911 requested a review from a team as a code owner April 8, 2026 00:19
@github-actions github-actions bot added size/S and removed area/ci labels Apr 8, 2026
@yuanchen8911 yuanchen8911 force-pushed the fix/validator-image-pull-policy branch from b32ea38 to 66cd159 Compare April 8, 2026 00:21
@yuanchen8911 yuanchen8911 requested a review from a team as a code owner April 8, 2026 00:21
@yuanchen8911 yuanchen8911 changed the title fix(validator): restore local image pull policy for ko.local/kind.local fix(validator): restore local image pull policy and optimize image loading Apr 8, 2026
@yuanchen8911 yuanchen8911 force-pushed the fix/validator-image-pull-policy branch from 66cd159 to 5c3c181 Compare April 8, 2026 00:25
@yuanchen8911 yuanchen8911 requested review from mchmarny and xdu31 April 8, 2026 00:25
dims previously approved these changes Apr 8, 2026

@dims dims left a comment


LGTM

@yuanchen8911 yuanchen8911 force-pushed the fix/validator-image-pull-policy branch 2 times, most recently from 270546b to f5a9013 Compare April 8, 2026 00:42
@yuanchen8911 yuanchen8911 requested a review from dims April 8, 2026 00:48
@yuanchen8911 yuanchen8911 force-pushed the fix/validator-image-pull-policy branch from f5a9013 to 218753c Compare April 8, 2026 03:02
@github-actions github-actions bot added size/M and removed size/S labels Apr 8, 2026
@yuanchen8911 yuanchen8911 force-pushed the fix/validator-image-pull-policy branch 4 times, most recently from f0b14d5 to 555d737 Compare April 8, 2026 05:01
@yuanchen8911 yuanchen8911 changed the title fix(validator): restore local image pull policy and optimize image loading fix(validator): restore local image pull policy and optimize GPU CI Apr 8, 2026
@yuanchen8911 yuanchen8911 force-pushed the fix/validator-image-pull-policy branch 2 times, most recently from 08ce907 to 101cdba Compare April 8, 2026 05:10
@yuanchen8911 yuanchen8911 force-pushed the fix/validator-image-pull-policy branch 2 times, most recently from 8a0af34 to c258bf2 Compare April 8, 2026 05:16
…ading

Two fixes for GPU test reliability:

1. Restore the local image prefix check in imagePullPolicy() that was
   accidentally removed in NVIDIA#444. Without this check, ko.local images
   tagged :latest use PullAlways, kubelet enters ImagePullBackOff (no
   registry exists for ko.local), and validator Jobs time out.
   Side-loaded images (ko.local, kind.local) now use PullNever.

2. Optimize kind load: reduce smoke-test base image from cuda:runtime
   (~1.8GB) to cuda:base (~250MB) — only nvidia-smi is needed. Add
   timeout (5min) + retry with warning logs to prevent indefinite hangs.

Fixes GPU Inference, Conformance, and Training test failures since
April 2.

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
@mchmarny mchmarny enabled auto-merge (squash) April 8, 2026 12:38
@mchmarny mchmarny merged commit 49d287f into NVIDIA:main Apr 8, 2026
16 checks passed
4 participants