diff --git a/docs/designs/026-preflight-checks.md b/docs/designs/026-preflight-checks.md
new file mode 100644
index 000000000..5006da405
--- /dev/null
+++ b/docs/designs/026-preflight-checks.md
@@ -0,0 +1,769 @@
# ADR-026: Feature — Preflight Checks

## Context

GPU failures during training waste compute time. Running diagnostics before the workload starts catches bad GPUs early.

Gang-wide NCCL tests require discovering all pods in a gang. Kubernetes 1.35 introduced `spec.workloadRef` as a native gang identifier, but users may also use Volcano, Kueue, or other schedulers with their own mechanisms.

### Distinction from Health Monitors

NVSentinel already has health monitors (GPU Health Monitor, Syslog Health Monitor) that detect GPU issues. Preflight checks are different:

|            | Health Monitors        | Preflight Checks              |
|------------|------------------------|-------------------------------|
| When       | Continuous             | Once at pod start             |
| Check type | Passive                | Active diagnostics            |
| Detects    | Failures as they occur | Latent issues before starting |
| Purpose    | Reactive remediation   | Prevent bad starts            |

Preflight asks "is this GPU healthy enough to start?" Health monitors ask "did this GPU fail while running?"

## Decision

Implement a mutating admission webhook that injects preflight check init containers into GPU pods in configured namespaces.

### Key points

- Injection trigger: GPU resources (extended resources or DRA claims) + namespace
- Gang discovery: Pluggable (supports `workloadRef`; can be extended to Volcano, Kueue, etc.)
- Resource detection: Configurable lists for extended resource names and DRA device classes

## Architecture

### Components

```
preflight/
└── controller/              # Webhook + gang controller (controller-runtime)
    ├── Dockerfile
    ├── main.go
    └── pkg/
        ├── webhook/         # Admission handler
        ├── injection/       # Pod mutation, DRA detection
        ├── gang/            # Gang discovery implementations
        └── coordination/    # ConfigMap management

preflight-checks/
├── dcgm-diag/
│   ├── Dockerfile
│   ├── main.go
│   └── pkg/
│
├── nccl-loopback/
│   ├── Dockerfile
│   ├── nccl-topologies/
│   ├── main.go
│   └── pkg/
│
└── nccl-allreduce/
    ├── Dockerfile
    ├── nccl-topologies/
    ├── main.go
    └── pkg/
```

### Overall flow

```mermaid
stateDiagram-v2
    [*] --> PodCreated: User creates GPU pod

    state "Webhook Injection" as Webhook {
        PodCreated --> CheckGPU: Admission webhook triggered
        CheckGPU --> Inject: GPU resources detected
        CheckGPU --> Skip: No GPU resources
        Skip --> [*]: Pod starts normally
        Inject --> PodScheduled: Init containers injected
    }

    state "Init Container Execution" as InitExec {
        PodScheduled --> DCGMDiag: Run dcgm-diag

        state "DCGM Diag" as DCGMDiag {
            [*] --> GetGPUUUIDs: nvidia-smi query
            GetGPUUUIDs --> RemoteDiag: dcgmi diag via hostengine
            RemoteDiag --> DCGMPass: All tests pass
            RemoteDiag --> DCGMFail: Test failure
        }

        DCGMPass --> NCCLLoopback: Next check
        DCGMFail --> ReportFailure: HealthEvent

        state "NCCL Loopback" as NCCLLoopback {
            [*] --> RunLoopback: all_reduce_perf -g N
            RunLoopback --> CheckBW: Measure bandwidth
            CheckBW --> LoopbackPass: BW >= threshold
            CheckBW --> LoopbackFail: BW < threshold
        }

        LoopbackPass --> GangCheck: Check if gang-wide enabled
        LoopbackFail --> ReportFailure

        GangCheck --> NCCLAllReduce: nccl-allreduce enabled
        GangCheck --> AllPassed: Single-node only
    }

    state "Gang Coordination" as GangCoord {
        NCCLAllReduce --> WaitPeers: Poll ConfigMap
        WaitPeers --> 
PeersReady: All peers registered + WaitPeers --> GangTimeout: Timeout (10 min) + GangTimeout --> ReportTimeout: isFatal=false + + state "NCCL All-Reduce" as AllReduce { + PeersReady --> PyTorchInit: TCP bootstrap to master + PyTorchInit --> RunAllReduce: dist.all_reduce() + RunAllReduce --> AllReducePass: BW >= threshold + RunAllReduce --> AllReduceFail: BW < threshold or error + } + + AllReducePass --> AllPassed + AllReduceFail --> ReportFailure + } + + state "Failure Handling" as FailHandle { + ReportFailure --> SendHealthEvent: gRPC to Platform Connector + ReportTimeout --> SendHealthEvent + SendHealthEvent --> PlatformConnector: HealthEvent published + PlatformConnector --> FaultQuarantine: Cordon node + FaultQuarantine --> NodeDrainer: Drain workloads + NodeDrainer --> FaultRemediation: Based on recommendedAction + FaultRemediation --> [*]: Node remediated or escalate + } + + AllPassed --> MainContainerStart: Init success (exit 0) + MainContainerStart --> [*]: Workload runs +``` + +### MutatingWebhookConfiguration (sketch) + +```yaml +apiVersion: admissionregistration.k8s.io/v1 +kind: MutatingWebhookConfiguration +metadata: + name: preflight-injector +webhooks: + - name: preflight.nvsentinel.nvidia.com + clientConfig: + service: + name: preflight-injector + namespace: nvsentinel + path: /mutate-pod + rules: + - apiGroups: [""] + apiVersions: ["v1"] + resources: ["pods"] + operations: ["CREATE"] + namespaceSelector: + matchExpressions: + - key: kubernetes.io/metadata.name + operator: In + values: [] # Populated from Helm values + - key: kubernetes.io/metadata.name + operator: NotIn + values: [] # Excluded namespaces (systemNamespaces, nvsentinel, etc.) + failurePolicy: Fail + sideEffects: None + admissionReviewVersions: ["v1"] +``` + +## Resource detection and injection + +### Detection logic + +1. Extended resources (device plugins): check `resources.limits`/`resources.requests` for configured names (e.g. `nvidia.com/gpu`) +2. 
DRA: check `spec.resourceClaims`, resolve each claim/template, and match `deviceClassName` against the configured list

### Injected init containers (sketch)

One init container per enabled check will be prepended to the pod's init containers:

```yaml
initContainers:
  - name: preflight-dcgm-diag
    image: ghcr.io/nvidia/nvsentinel/preflight-dcgm-diag:v1
    env:
      - name: DCGM_DIAG_LEVEL
        value: "1"
      - name: DCGM_HOSTENGINE_ADDR
        value: "dcgm-hostengine.nvsentinel.svc:5555"
      - name: PLATFORM_CONNECTOR_SOCKET
        value: "unix:///var/run/nvsentinel.sock"
    resources:
      limits:
        nvidia.com/gpu: 8            # Max across all containers
    volumeMounts:
      - name: platform-connector-socket
        mountPath: /var/run

  - name: preflight-nccl-allreduce
    image: ghcr.io/nvidia/nvsentinel/preflight-nccl-allreduce:v1
    env:
      - name: NCCL_ALLREDUCE_THRESHOLD_GBPS
        value: "5.0"
      - name: GANG_TIMEOUT
        value: "600s"
      - name: MY_POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
    resources:
      limits:
        nvidia.com/gpu: 8
        nvidia.com/mlnxnics: 4
    volumeMounts:
      - name: platform-connector-socket
        mountPath: /var/run
      - name: preflight-gang-config   # ConfigMap: peers, master_addr, rank, world_size
        mountPath: /etc/preflight
```

### Resource handling

- GPUs (extended resources): inject the max count across all containers
- Network devices (extended resources): inject the max count across all containers for configured names
- DRA: inject all referenced GPU/network claims into the init container

## Check types

| Check            | Scope       | Coordination |
|------------------|-------------|--------------|
| `dcgm-diag`      | Single node | None         |
| `nccl-loopback`  | Single node | None         |
| `nccl-allreduce` | Gang-wide   | ConfigMap    |

Third-party checks follow the same pattern: separate image, configured in Helm.

### DCGM Diag

Runs DCGM diagnostics on allocated GPUs via a remote DCGM hostengine Service.

**How it works:**
1. Init container gets GPU UUIDs: `nvidia-smi --query-gpu=uuid --format=csv,noheader`
2. Calls DCGM hostengine via the Service: `dcgmi diag -r $DCGM_DIAG_LEVEL --host $DCGM_HOSTENGINE_ADDR -i <gpu-uuids>`
3. Parses results, maps failures to HealthEvents

**Requirements:**
- DCGM hostengine DaemonSet running (privileged, with GPU access)
- DCGM Service exposing hostengine (port 5555)
- NetworkPolicy allowing init container → DCGM Service

**Diag levels:**
- Level 1 (~30s): Quick hardware validation (memory, PCIe bandwidth)
- Level 2 (~2-3min): Extended tests (stress, targeted diagnostics)

The init container remains unprivileged; the hostengine performs the diagnostics.

### NCCL Loopback

Tests intra-node GPU-to-GPU communication (NVLink/PCIe paths) without the network.

**How it works:**
1. Init container runs `all_reduce_perf` (from nccl-tests) with all allocated GPUs
2. Command: `all_reduce_perf -b 8 -e 256M -f 2 -g <num-gpus>`
3. Validates that bandwidth meets the threshold set in Helm values
4. No coordination needed — single node only

**What it catches:**
- NVLink failures between GPUs
- PCIe bandwidth degradation
- GPU memory errors during collective ops

**Requirements:**
- GPU allocation (device plugin)
- `nccl-tests` binary in the checker image

**Example output parsing:**
```
# nccl-tests output format:
#   size     count  type  redop   time  algbw  busbw
      8M   2097152  float   sum   1.23   6.50  12.19
```
The checker validates `busbw` (bus bandwidth) against the configured threshold.

### NCCL All-Reduce (Gang-Wide)

Tests cross-node GPU collective communication over RDMA/InfiniBand.

**How it works:**
1. 
**Gang formation**: All pods register in a shared ConfigMap (see the Gang Coordination section)
2. **Wait for peers**: Each init container polls the ConfigMap (mounted volume) until all peers are registered
3. **Bootstrap via TCP**: Rank 0's IP comes from the ConfigMap; PyTorch/NCCL handles the handshake
4. **Run test**: Each init container runs the PyTorch all-reduce independently; NCCL coordinates internally

**Test script (PyTorch-based, no MPI needed):**
```python
import os
import time

import torch
import torch.distributed as dist

# Rank/world info derived from the mounted ConfigMap (see Gang Coordination)
rank = int(os.environ['MY_RANK'])
world_size = int(os.environ['WORLD_SIZE'])
master_addr = os.environ['MASTER_ADDR']  # Rank 0's IP from ConfigMap

# PyTorch handles NCCL bootstrap via TCP
dist.init_process_group(
    backend='nccl',
    init_method=f'tcp://{master_addr}:29500',
    rank=rank,
    world_size=world_size
)

# Run all-reduce test, measure bandwidth
tensor = torch.ones(256 * 1024 * 1024, device='cuda')  # 1 GiB of float32
dist.all_reduce(tensor)        # warm-up iteration
torch.cuda.synchronize()
start = time.monotonic()
dist.all_reduce(tensor)
torch.cuda.synchronize()
elapsed = time.monotonic() - start

# Bus bandwidth for ring all-reduce (as reported by nccl-tests): algbw * 2*(n-1)/n
size_bytes = tensor.numel() * tensor.element_size()
busbw_gbps = (size_bytes / elapsed / 1e9) * 2 * (world_size - 1) / world_size
if busbw_gbps < float(os.environ['NCCL_ALLREDUCE_THRESHOLD_GBPS']):
    raise SystemExit(1)  # check failed
```

Each init container runs independently.

**What it catches:**
- InfiniBand/RDMA link failures
- Network topology misconfigurations
- Cross-node NVLink (when present)
- NCCL algorithm/protocol issues

**Requirements:**
- Gang discovery (`workloadRef`, Volcano, or Kueue)
- Network device allocation (InfiniBand NICs)
- NCCL topology file (auto-detected or user-provided)

**Timeout handling:**
- `GANG_TIMEOUT` sets the max wait for all peers to register
- If the timeout expires before the gang forms → exit with `isFatal: false` (not a hardware issue)

### Third-Party Checks

Third-party checks follow the same pattern as built-in checks. Register them in Helm:

```yaml
preflight-injector:
  checks:
    - name: dcgm-diag
      image: ghcr.io/nvidia/nvsentinel/preflight-dcgm-diag:v1
    - name: nccl-loopback
      image: ghcr.io/nvidia/nvsentinel/preflight-nccl-loopback:v1
    - name: bandwidth-check          # third-party
      image: myregistry/bandwidth-check:v1
```

**Check contract:**
- Exit codes: `0` (passed), `1` (check failed), `2` (config error)
- Report failures via gRPC to the Platform Connector:
  - Unix socket: `unix:///var/run/nvsentinel.sock`
  - RPC: `HealthEventOccurredV1` (proto: `data-models/protobufs/health_event.proto`)
  - Set `isFatal`, `recommendedAction`, `errorCode` in the HealthEvent
- Webhook mounts: GPU devices, Platform Connector socket, network devices

### Configuration

Configured at deployment time via Helm values. No per-workload annotations.

### Gang Discovery

Gang discovery is pluggable. Given one pod, return all pods in the gang.

**Interface:**
```go
type GangDiscoverer interface {
    DiscoverGang(pod *corev1.Pod) ([]PeerInfo, error)
}

type PeerInfo struct {
    PodName  string
    PodIP    string
    NodeName string
}
```

**Implementations:**

| Scheduler       | Discovery chain                                                           |
|-----------------|---------------------------------------------------------------------------|
| K8s 1.35 native | Pod → `spec.workloadRef` → list pods with same ref                        |
| Volcano         | Pod → `volcano.sh/pod-group` annotation → list pods with same annotation  |
| Kueue           | Pod → `kueue.x-k8s.io/workload-name` label → list pods with same label    |
| Label-based     | Pod → configurable labels → list pods with same labels                    |

The controller selects the implementation based on Helm config. If no gang identifier is found, the pod is treated as a singleton (gang-wide tests are skipped). A sketch of the label-based variant follows.
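For illustration, a minimal sketch of the label-based discoverer against the interface above, assuming a controller-runtime client; the label key comes from Helm values, and the identifiers here are illustrative rather than the final implementation:

```go
package gang

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// LabelDiscoverer groups pods that share a configurable gang-id label.
type LabelDiscoverer struct {
	Client      client.Client
	GangIDLabel string // e.g. "app.kubernetes.io/gang-id" from Helm values
}

func (d *LabelDiscoverer) DiscoverGang(pod *corev1.Pod) ([]PeerInfo, error) {
	gangID, ok := pod.Labels[d.GangIDLabel]
	if !ok {
		return nil, nil // no gang identifier: treat the pod as a singleton
	}
	var pods corev1.PodList
	if err := d.Client.List(context.TODO(), &pods,
		client.InNamespace(pod.Namespace),
		client.MatchingLabels{d.GangIDLabel: gangID},
	); err != nil {
		return nil, fmt.Errorf("listing gang peers: %w", err)
	}
	peers := make([]PeerInfo, 0, len(pods.Items))
	for _, p := range pods.Items {
		peers = append(peers, PeerInfo{
			PodName:  p.Name,
			PodIP:    p.Status.PodIP, // may be empty until the peer is running
			NodeName: p.Spec.NodeName,
		})
	}
	return peers, nil
}
```

The `workloadRef`, Volcano, and Kueue variants differ only in how the gang identifier is read from the pod; the list-and-collect step is the same.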
### Gang Coordination

For gang-wide checks like `nccl-allreduce`, the preflight controller maintains a ConfigMap. The webhook mounts it as a volume; init containers read it from the filesystem.

```mermaid
sequenceDiagram
    participant C as Preflight Controller
    participant K as Kubelet
    participant P0 as Pod 0 Init
    participant P1 as Pod 1 Init

    C->>C: Create ConfigMap (expected=2, master_addr=10.0.1.5)
    C->>C: Update ConfigMap: add pod-0:10.0.1.5
    C->>C: Update ConfigMap: add pod-1:10.0.1.6

    K->>P0: Sync ConfigMap to volume
    K->>P1: Sync ConfigMap to volume

    P0->>P0: Read /etc/preflight/peers until len == expected
    P1->>P1: Read /etc/preflight/peers until len == expected

    Note over P0,P1: Determine rank by sorting pod names

    P0->>P0: PyTorch init (rank=0, listens on :29500)
    P1->>P0: PyTorch init (rank=1, connects to master_addr:29500)
    P0->>P1: NCCL all_reduce over RDMA
```

**Flow:**
1. Controller creates/updates ConfigMap `preflight-<gang-id>` with `expected_count`, `peers`, `master_addr`
2. Webhook mounts the ConfigMap as a volume at `/etc/preflight/`
3. Init containers poll the filesystem until all peers are registered (kubelet syncs ConfigMap volumes periodically, typically within ~1 min)
4. Each pod determines its rank by sorting pod names alphabetically
5. PyTorch connects to `master_addr` for NCCL bootstrap (TCP), then NCCL uses RDMA

**ConfigMap structure:**
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: preflight-myworkload-group1
data:
  expected_count: "2"
  master_addr: "10.0.1.5"   # Rank 0's IP for PyTorch TCP bootstrap
  peers: |
    pod-0:10.0.1.5
    pod-1:10.0.1.6
```

**Gang coordination timeout:** 10 minutes. If the gang doesn't form, the init container fails with `isFatal: false` (not a hardware issue).

### RBAC

**Controller ClusterRole:**
```yaml
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "create", "patch"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
  # Additional rules based on gang discoverer:
  #   workloadRef: scheduling.k8s.io/workloads (get)
  #   Volcano: scheduling.volcano.sh/podgroups (get)
  #   Kueue: kueue.x-k8s.io/workloads (get)
```

The controller only touches ConfigMaps with the `preflight-` prefix (enforced in code).

### DRA Integration

For pods using Dynamic Resource Allocation (DRA), the webhook copies resource claim references to the init container.

**Device claim detection:**
The webhook checks the pod's `spec.resourceClaims`, retrieves each ResourceClaim or ResourceClaimTemplate, and matches `deviceClassName` against the configured lists for GPUs and network devices:

```yaml
# Helm values
preflight-injector:
  gpuDetection:
    # Extended resources (current, no DRA)
    resourceNames:
      - "nvidia.com/gpu"

    # DRA device classes (requires operator configuration)
    deviceClasses:
      - "gpu.nvidia.com"
      - "nvidia.com/gpu"
      # Operators add their DeviceClass names here
```

**Init container injection with DRA:**
```yaml
apiVersion: v1
kind: Pod
spec:
  # Pod-level claims
  resourceClaims:
    - name: gpu-claim
      resourceClaimName: training-gpus
    - name: rdma-claim
      resourceClaimName: training-rdma

  initContainers:
    - name: nvsentinel-preflight
      resources:
        claims:
          - name: gpu-claim    # References the same GPU claim
          - name: rdma-claim   # References the same network claim

  containers:
    - name: main
      resources:
        claims:
          - name: gpu-claim
          - name: rdma-claim
```

**Detection logic:**
1. Check if the pod uses extended resources (`nvidia.com/gpu`, `nvidia.com/mlnxnics`) → inject with max counts across all containers
2. 
Check if the pod has DRA claims with a matching `deviceClassName` → inject with all unique GPU and network claim references
3. If neither → skip injection

Network devices (InfiniBand, RDMA) can be exposed via DRA claims or extended resources. The webhook uses the same detection pattern for both.

DRA device class names are not standardized. Operators configure `gpuDetection.deviceClasses` and `networkDetection.deviceClasses` to match their cluster's DeviceClass names.

### Network Resources for NCCL Tests

NCCL tests require access to RDMA/InfiniBand devices for efficient GPU-to-GPU communication.

**Network device exposure methods:**

1. **Extended resources (device plugins):**
   - Example: `nvidia.com/mlnxnics` (common on GPU+IB clusters)
   - Resource names are cluster-specific; configure `networkDetection.resourceNames` accordingly

2. **DRA claims:**
   - Network devices can also be exposed via DRA claims (DeviceClass names are cluster-specific)
   - The webhook matches each claim's `deviceClassName` against `networkDetection.deviceClasses`

**Webhook behavior for NCCL checks:**
If `nccl-loopback` or `nccl-allreduce` is enabled, the webhook:
1. Copies all network device resources (extended resources using the max count, or DRA claim references)
2. Scans all container env vars and copies those matching `ncclEnvPatterns` (glob patterns from Helm config)
3. Copies volume mounts referenced by `NCCL_TOPO_FILE` (if present)

**NCCL topology file handling:**
The init container image includes common topology files for major cloud platforms:
```
/opt/nvsentinel/nccl-topologies/
├── azure-ndv4.xml
├── azure-ndv5.xml
├── aws-p5.48xlarge.xml
├── gcp-a3-mega.xml
└── oci-bm-gpu-a100.xml
```

**Topology selection priority:**
1. **User-provided**: The webhook checks if any container has an `NCCL_TOPO_FILE` env var with a corresponding volume mount at that path → copy that volume mount to the init container
2. **Auto-detect**: If no `NCCL_TOPO_FILE` with a matching volume mount is found, the init container reads the node label `node.kubernetes.io/instance-type` and maps it to a built-in topology file via Helm config
3. **Fallback**: If the instance type is unknown or not in the mapping, `NCCL_TOPO_FILE` is left unset (NCCL auto-detects the topology)

If the pod has no network device resources, NCCL tests are skipped (DCGM diag still runs).

### Failure Behavior

Init container exit codes:
- `0`: All checks passed
- `1`: Check failed
- `2`: Configuration error

On failure:
- Pod stays in `Init:Error` state
- **HealthEvent created** via the Platform Connector (same as health monitors)
- Kubernetes Event created with failure details
- Metrics incremented (`preflight_check_failures_total`)

The HealthEvent feeds into the existing NVSentinel workflow (quarantine, correlation, etc.).
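To make the reporting path concrete, a minimal sketch of the failure path in a check binary. The socket path, the `HealthEventOccurredV1` RPC, and the `isFatal`/`recommendedAction`/`errorCode` fields come from the check contract above; the generated package path, client constructor, and field casing are illustrative assumptions, not the actual generated API:

```go
package main

import (
	"context"
	"os"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	pb "github.com/NVIDIA/NVSentinel/data-models/protobufs" // assumed import path
)

// reportAndExit publishes a HealthEvent to the Platform Connector over the
// mounted Unix socket, then exits with the contract's "check failed" code.
func reportAndExit(errorCode string, fatal bool, action string) {
	conn, err := grpc.Dial("unix:///var/run/nvsentinel.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		os.Exit(2) // configuration error: Platform Connector socket unreachable
	}
	defer conn.Close()

	client := pb.NewPlatformConnectorClient(conn) // assumed generated constructor
	_, _ = client.HealthEventOccurredV1(context.TODO(), &pb.HealthEvent{
		IsFatal:           fatal, // assumed field names, per health_event.proto
		RecommendedAction: action,
		ErrorCode:         errorCode,
	})
	os.Exit(1) // check failed; the pod stays in Init:Error
}
```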
### Error to Recommended Action Mapping

**DCGM Diag**:

| Test   | Result | Recommended Action |
|--------|--------|--------------------|
| Memory | `FAIL` | `CONTACT_SUPPORT`  |
| PCIe   | `FAIL` | `CONTACT_SUPPORT`  |
| NVLink | `FAIL` | `CONTACT_SUPPORT`  |
| Stress | `FAIL` | `RUN_DCGMEUD`      |
| Any    | `WARN` | `NONE`             |

**NCCL Checks**:

| Error                 | Recommended Action |
|-----------------------|--------------------|
| `NCCL_SYSTEM_ERROR`   | `CONTACT_SUPPORT`  |
| `NCCL_INTERNAL_ERROR` | `RUN_DCGMEUD`      |
| `NCCL_INVALID_USAGE`  | `NONE`             |
| `NCCL_TIMEOUT`        | `NONE`             |
| `NCCL_REMOTE_ERROR`   | `CONTACT_SUPPORT`  |

**isFatal determination**:
- DCGM diag `FAIL` → `isFatal: true`
- DCGM diag `WARN` → `isFatal: false`
- NCCL hardware errors (`SYSTEM_ERROR`, `INTERNAL_ERROR`, `REMOTE_ERROR`) → `isFatal: true`
- NCCL timeout/config errors → `isFatal: false`

### Helm Values

```yaml
preflight-injector:
  enabled: false   # Opt-in

  checks:
    - name: dcgm-diag
      image: ghcr.io/nvidia/nvsentinel/preflight-dcgm-diag:v1
    - name: nccl-loopback
      image: ghcr.io/nvidia/nvsentinel/preflight-nccl-loopback:v1
    # - name: nccl-allreduce
    #   image: ghcr.io/nvidia/nvsentinel/preflight-nccl-allreduce:v1

  # DCGM configuration
  dcgm:
    hostengineAddr: "dcgm-hostengine.nvsentinel.svc:5555"   # DCGM Service address
    diagLevel: 1   # 1 (quick, ~30s) or 2 (extended, ~2-3min)

  # NCCL test configuration
  nccl:
    loopbackThresholdGBps: 10.0   # Min bus bandwidth for loopback pass
    allreduceThresholdGBps: 5.0   # Min bus bandwidth for all-reduce pass

  checkTimeout: "300s"   # Per-check timeout
  gangTimeout: "600s"    # Gang coordination timeout

  # Gang discovery configuration
  gangDiscovery:
    # Options: workloadRef, volcano, kueue, labels
    method: "workloadRef"
    # For label-based discovery:
    # labels:
    #   gangIdLabel: "app.kubernetes.io/gang-id"
    #   gangSizeLabel: "app.kubernetes.io/gang-size"

  # GPU detection configuration
  gpuDetection:
    # Extended resources (current approach)
    resourceNames:
      - "nvidia.com/gpu"

    # DRA device classes (add your cluster's DeviceClass names)
    deviceClasses: []
    # Example:
    # - "gpu.nvidia.com"
    # - "nvidia.com/gpu"

  # Network device resources (for NCCL tests)
  networkDetection:
    # Extended resources (cluster-specific, configure for your environment)
    resourceNames:
      - "nvidia.com/mlnxnics"   # Mellanox/NVIDIA InfiniBand NICs
      # Add other network device plugin resources used in your cluster

    # DRA device classes (if using DRA for network devices)
    deviceClasses: []
    # Example:
    # - "rdma.nvidia.com"
    # - "infiniband.mellanox.com"

  # NCCL environment variable patterns to copy (glob patterns)
  # Webhook scans container env vars, copies those matching any pattern
  ncclEnvPatterns:
    - "NCCL_*"   # Matches NCCL_TOPO_FILE, NCCL_IB_*, etc.
    - "UCX_*"    # Matches UCX_TLS, UCX_NET_DEVICES, etc.
    - "OMPI_*"   # Matches OMPI_MCA_*, etc.
  # NCCL topology auto-detection (if user doesn't provide topology file)
  ncclTopology:
    # Node label to detect instance type
    instanceTypeLabel: "node.kubernetes.io/instance-type"
    # Map instance types to built-in topology files
    instanceTypeMapping:
      "Standard_ND96isr_H100_v5": "azure-ndv5.xml"
      "Standard_ND96amsr_A100_v4": "azure-ndv4.xml"
      "p5.48xlarge": "aws-p5.48xlarge.xml"
      "a3-megagpu-8g": "gcp-a3-mega.xml"
    # Fallback: use NCCL auto-detection if instance type unknown
    enableFallback: true

  # Namespaces where preflight checks apply
  namespaces:
    - training

  # Namespaces to exclude (system namespaces). Recommended to reuse node-drainer `systemNamespaces`.
  excludeNamespaces:
    - nvsentinel
    - kube-system
    - kube-public
    - kube-node-lease

  webhook:
    failurePolicy: Fail   # or Ignore
```

All GPU pods in the listed namespaces get the configured checks.

### Metrics

**Check containers** (exposed via pushgateway or scraped from pod annotations):

| Metric                              | Type      | Labels                        |
|-------------------------------------|-----------|-------------------------------|
| `preflight_check_total`             | Counter   | `check`, `result`             |
| `preflight_check_duration_seconds`  | Histogram | `check`                       |
| `preflight_check_failures_total`    | Counter   | `check`, `node`, `error_code` |
| `preflight_gang_wait_seconds`       | Histogram | `workload`                    |
| `preflight_config_errors_total`     | Counter   | `error`                       |

**preflight-injector** (standard Prometheus endpoint):

| Metric                              | Type      | Labels   |
|-------------------------------------|-----------|----------|
| `preflight_injection_total`         | Counter   | `result` |
| `preflight_webhook_latency_seconds` | Histogram | -        |

A minimal emission sketch for the check metrics appears in the appendix at the end of this document.

## Rationale

- Mutating webhook for transparent injection
- Non-privileged init containers (DCGM diag runs via remote hostengine)
- Namespace selector opt-in
- Deployment-level config (no per-workload changes)

## Consequences

### Positive
- Catches GPU failures before the workload starts
- Works with any workload controller
- Unprivileged init container (uses DCGM hostengine)
- Built-in NCCL topology files for major cloud platforms

### Negative
- Adds 30-60s pod startup latency (DCGM diag level 1)
- Requires DCGM hostengine DaemonSet for diag checks
- Webhook downtime blocks pod creation (if `failurePolicy: Fail`)
- NCCL tests require network device plugins (InfiniBand/RDMA) to be configured
- Gang-wide NCCL tests require a gang identifier: K8s 1.35+ (`workloadRef`), Volcano, or Kueue

### Mitigations
- **Latency**: Use DCGM level 1 (~30s) vs level 2 (~2-3min); skip expensive checks for non-critical workloads
- **DCGM dependency**: Most GPU clusters already run DCGM for monitoring; expose it as a Service
- **Webhook availability**: HA deployment (replicas, PDB); `failurePolicy: Ignore` for graceful degradation
- **Network resources**: NCCL tests are skipped if network devices are unavailable; DCGM diag runs regardless
- **Gang identifier**: NCCL loopback (single-node) needs no gang identifier; gang tests are opt-in

## Alternatives Considered

### Kyverno Policy
Rejected: External dependency.

### User-managed init containers
Rejected: No enforcement. Users forget.

### Custom CRD wrapper
Rejected: Requires changing how workloads are deployed.

## Out of Scope

- **Repeated failure handling**: Health Event Analyzer handles pattern detection. Preflight emits events.
- **Automatic DRA DeviceClass discovery**: Requires operator configuration. Device class names are not standardized.
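## Appendix: Metrics Emission (Sketch)

A minimal sketch of how a check container could register and push the check metrics listed under Metrics, assuming Prometheus `client_golang` and a Pushgateway; the Pushgateway address and the `NODE_NAME` env var are illustrative assumptions:

```go
package main

import (
	"os"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

// Counter matching the preflight_check_failures_total row in the Metrics tables.
var checkFailures = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "preflight_check_failures_total",
		Help: "Preflight check failures by check, node, and error code.",
	},
	[]string{"check", "node", "error_code"},
)

// pushFailure increments the failure counter and pushes it to a Pushgateway
// (address assumed; checks could instead expose an endpoint scraped from pod
// annotations, as noted in the Metrics section).
func pushFailure(check, errorCode string) error {
	checkFailures.WithLabelValues(check, os.Getenv("NODE_NAME"), errorCode).Inc()
	return push.New("http://pushgateway.nvsentinel.svc:9091", "preflight").
		Collector(checkFailures).
		Grouping("pod", os.Getenv("MY_POD_NAME")).
		Push()
}
```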
+ +## References + +- K8s 1.35 Workload API: https://kubernetes.io/blog/2025/12/29/kubernetes-v1-35-introducing-workload-aware-scheduling/ +- GitHub Issue: https://github.com/NVIDIA/NVSentinel/issues/658 +