fix(validator): restore local image pull policy and optimize GPU CI #505
Merged
mchmarny merged 2 commits into NVIDIA:main on Apr 8, 2026
…ading

Two fixes for GPU test reliability:

1. Restore the local image prefix check in imagePullPolicy() that was accidentally removed in NVIDIA#444. Without this check, ko.local images tagged :latest use PullAlways, kubelet enters ImagePullBackOff (no registry exists for ko.local), and validator Jobs time out. Side-loaded images (ko.local, kind.local) now use PullNever.
2. Optimize kind load: reduce the smoke-test base image from cuda:runtime (~1.8GB) to cuda:base (~250MB), since only nvidia-smi is needed. Add a timeout (5min) and retry with warning logs to prevent indefinite hangs.

Fixes GPU Inference, Conformance, and Training test failures since April 2.

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
mchmarny approved these changes on Apr 8, 2026
Summary
Three fixes for GPU test reliability, addressing failures that have occurred since April 2:
1. Restore local image pull policy: PR #444 (feat(validator): add --node-selector and --toleration flags for validation workload scheduling) accidentally removed the `ko.local`/`kind.local` prefix check from `imagePullPolicy()`, causing validator Jobs to use `PullAlways` for locally-loaded images → `ImagePullBackOff` → 100% failure rate.
2. Optimize image loading: reduce the smoke-test base image from `cuda:runtime` (~1.8GB) to `cuda:base` (~250MB), load only onto worker nodes (skipping the control-plane to avoid `kind load` hangs on 3-node clusters), and add a timeout plus retry with warning logs.
3. Optimize pod-autoscaling conformance check: remove the scale-down behavioral test and reduce polling timeouts. Worst-case runtime drops from ~9min to ~3min, helping GPU CI tests stay within their job-level timeout budgets.
Root Cause
PR #444 refactored `imagePullPolicy()` and removed the local image prefix check:

Before #444 (working):

After #444 (broken):

Result: `ko.local/aicr-validators/conformance:latest` → `PullAlways` → no registry exists → `ImagePullBackOff` → 5-min Job timeout → all validators report `status=other`.

Fix Details
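As a rough illustration of a pull policy function with the local prefix check in place, consider the sketch below. It is not the project's code: the `PullPolicy` type mirrors the string values of `corev1.PullPolicy` only so the example stays self-contained, the prefix list and the helper `usesLatestTag` are assumptions, and "standard policy" is taken here to mean the usual Kubernetes defaulting (`Always` for `:latest` or untagged images, `IfNotPresent` otherwise).

```go
package main

import (
	"fmt"
	"strings"
)

// PullPolicy mirrors the string values of corev1.PullPolicy so the sketch
// compiles without the Kubernetes API module.
type PullPolicy string

const (
	PullAlways       PullPolicy = "Always"
	PullIfNotPresent PullPolicy = "IfNotPresent"
	PullNever        PullPolicy = "Never"
)

// localImagePrefixes is an assumed list of registries that exist only as
// side-loaded images and can never be pulled.
var localImagePrefixes = []string{"ko.local/", "kind.local/"}

// imagePullPolicy returns Never for side-loaded images and otherwise falls
// back to tag-based defaulting.
func imagePullPolicy(image string) PullPolicy {
	for _, prefix := range localImagePrefixes {
		if strings.HasPrefix(image, prefix) {
			return PullNever
		}
	}
	if usesLatestTag(image) {
		return PullAlways
	}
	return PullIfNotPresent
}

// usesLatestTag reports whether the reference is tagged :latest or has no
// tag at all. A colon before the last slash is a registry port, not a tag.
func usesLatestTag(image string) bool {
	if strings.Contains(image, "@") {
		return false // pinned by digest
	}
	lastSlash := strings.LastIndex(image, "/")
	lastColon := strings.LastIndex(image, ":")
	if lastColon <= lastSlash {
		return true // no explicit tag
	}
	return image[lastColon+1:] == "latest"
}

func main() {
	for _, img := range []string{
		"ko.local/aicr-validators/conformance:latest",
		"localhost:5001/aicr-validators/conformance:latest",
		"registry.example.com/tool:v1.2.3", // hypothetical image
	} {
		fmt.Printf("%-52s %s\n", img, imagePullPolicy(img))
	}
}
```

The prefix check has to run before the tag-based branch. Otherwise a `ko.local` image tagged `:latest` falls through to `Always`, which is the failure described above.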
1. imagePullPolicy restoration

Restore the local image prefix check. Side-loaded images (`ko.local`, `kind.local`) use `PullNever`. All other images (including `localhost:5001` used by the e2e registry flow) follow standard policy.

2. Image loading optimization

- Smoke-test base image: `cuda:runtime` (~1.8GB) replaced by `cuda:base` (~250MB)
- `kind load` target: worker nodes only, skipping the control-plane
- `kind load` timeout: 5 min, with retry and `::warning::` logs

3. pod-autoscaling optimization
Scale-up alone proves the full metrics pipeline (DCGM → Prometheus → adapter → HPA → Deployment controller). Scale-down tests a different path (HPA downscale stabilization) that is not a CNCF conformance requirement.
Evidence
GPU test pass/fail history on main:
Type of Change
Component(s) Affected
- `pkg/validator`
- `.github/actions/aicr-build`
- conformance validator (`validators/conformance`)

Testing
Image loading and conformance optimizations can only be fully validated on GPU CI runners.
Checklist
- `make test` (with `-race`)
- `make lint`
- `git commit -S`