From f3ce511028b95e99eddb93361b5b16c075bb3602 Mon Sep 17 00:00:00 2001 From: Ajay Mishra Date: Tue, 6 Jan 2026 15:10:05 +0530 Subject: [PATCH 01/11] docs: add design doc for preflight check Signed-off-by: Ajay Mishra --- docs/designs/026-preflight-checks.md | 332 +++++++++++++++++++++++++++ 1 file changed, 332 insertions(+) create mode 100644 docs/designs/026-preflight-checks.md diff --git a/docs/designs/026-preflight-checks.md b/docs/designs/026-preflight-checks.md new file mode 100644 index 000000000..1f6cb9167 --- /dev/null +++ b/docs/designs/026-preflight-checks.md @@ -0,0 +1,332 @@ +# ADR-026: Feature — Preflight Checks via Init Container Injection + +## Context + +Kubernetes 1.35 introduced `spec.workloadRef` for gang scheduling. GPU failures during distributed training waste compute time. Running diagnostics before the workload starts catches bad GPUs early. + +### Distinction from Health Monitors + +NVSentinel already has health monitors (GPU Health Monitor, Syslog Health Monitor) that detect GPU issues. This is different: + +| | Health Monitors | Preflight Checks | +|-|-----------------|------------------| +| When | Continuous (DaemonSet) | Once at pod start (init container) | +| Check type | Passive (health watches, syslog parsing) | Active diagnostics (DCGM diag) | +| Detects | Failures as they occur (XID errors, ECC, thermal) | Latent issues before starting | +| NCCL tests | No | Yes | +| Purpose | Reactive remediation | Prevent bad starts | + +Preflight asks "is this GPU healthy enough to start?" Health monitors ask "did this GPU fail while running?" + +## Decision + +Implement a MutatingAdmissionWebhook that injects preflight check init containers into pods that have `spec.workloadRef`. + +## Implementation + +### Component Structure + +``` +preflight-injector/ +├── main.go +├── go.mod +├── go.sum +├── Makefile +├── Tiltfile +└── pkg/ + ├── config/ + │ ├── config.go + │ └── config_test.go + ├── webhook/ + │ └── v1alpha1/ + │ ├── handler.go # Admission handler + │ └── handler_test.go + ├── injection/ + │ ├── injector.go # Init container construction + │ └── injector_test.go + ├── coordination/ + │ ├── discovery.go # Peer discovery via workloadRef + │ └── configmap.go # NCCL ID ConfigMap management + └── metrics/ + └── metrics.go +``` + +### Webhook Flow + +```mermaid +flowchart TD + A[Pod CREATE request] --> B{Has GPU resource?} + B -->|No| C[Allow - no mutation] + B -->|Yes| D[Inject init container] + D --> E[Return JSON patch] +``` + +Namespace filtering handled by `namespaceSelector` in webhook config. Checks configured at deployment time. + +### MutatingWebhookConfiguration + +```yaml +apiVersion: admissionregistration.k8s.io/v1 +kind: MutatingWebhookConfiguration +metadata: + name: preflight-injector +webhooks: + - name: preflight.nvsentinel.nvidia.com + clientConfig: + service: + name: preflight-injector + namespace: nvsentinel + path: /mutate-pod + rules: + - apiGroups: [""] + apiVersions: ["v1"] + resources: ["pods"] + operations: ["CREATE"] + namespaceSelector: + matchExpressions: + - key: kubernetes.io/metadata.name + operator: In + values: [] # Populated from Helm values + failurePolicy: Fail + sideEffects: None + admissionReviewVersions: ["v1"] +``` + +Namespace list populated from Helm values. 
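**Mutation logic (sketch):** a minimal illustration of the flow above, assuming the admission handler delegates to `pkg/injection`. The function names (`needsPreflight`, `buildPatch`) and the hard-coded resource name are illustrative, not existing code.

```go
// Illustrative sketch of pkg/injection: decide whether to mutate and build the
// JSON patch returned in the AdmissionReview response.
package injection

import (
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
)

// gpuResource is hard-coded here for brevity; the webhook reads it from config.
const gpuResource = corev1.ResourceName("nvidia.com/gpu")

// needsPreflight reports whether any container in the pod requests GPUs.
func needsPreflight(pod *corev1.Pod) bool {
	for _, c := range pod.Spec.Containers {
		if _, ok := c.Resources.Limits[gpuResource]; ok {
			return true
		}
		if _, ok := c.Resources.Requests[gpuResource]; ok {
			return true
		}
	}
	return false
}

// patchOp is a single RFC 6902 operation.
type patchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value"`
}

// buildPatch appends the preflight init container to the pod spec.
func buildPatch(pod *corev1.Pod, preflight corev1.Container) ([]byte, error) {
	op := patchOp{Op: "add", Path: "/spec/initContainers/-", Value: preflight}
	if len(pod.Spec.InitContainers) == 0 {
		// The array does not exist yet, so create it in one operation.
		op = patchOp{Op: "add", Path: "/spec/initContainers", Value: []corev1.Container{preflight}}
	}
	return json.Marshal([]patchOp{op})
}
```

The `/spec/initContainers/-` pointer appends after any init containers the user already defined.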
+ +### Init Container Spec + +```yaml +initContainers: + - name: nvsentinel-preflight + image: ghcr.io/nvidia/nvsentinel/preflight-checker:v1 + env: + - name: PREFLIGHT_CHECKS + value: "dcgm-diag,nccl-loopback" + - name: DCGM_DIAG_LEVEL + value: "1" + - name: CHECK_TIMEOUT + value: "300s" + - name: GANG_TIMEOUT + value: "600s" + resources: + limits: + nvidia.com/gpu: 8 # Copied from main container + securityContext: + privileged: true + volumeMounts: + - name: dcgm-socket + mountPath: /var/run/nvidia + - name: platform-connector-socket + mountPath: /var/run/nvsentinel +``` + +**GPU resource handling:** Webhook copies `nvidia.com/gpu` from main container to init container (GPU allocation is per-pod). + +### Check Types + +| Check | Scope | Coordination | +|-------|-------|--------------| +| `dcgm-diag` | Single node | None | +| `nccl-loopback` | Single node | None | +| `nccl-allreduce` | Gang-wide | ConfigMap | +| `plugin:` | Varies | Varies | + +### Plugin Interface (Third-Party Checks) + +Plugins are separate init containers. Webhook injects one container per plugin. + +**Registration:** +```yaml +preflight-injector: + plugins: + - name: bandwidth-check + image: myregistry/bandwidth-check:v1 + timeout: "60s" +``` + +**Injected init containers:** +```yaml +initContainers: + # Built-in checks + - name: nvsentinel-preflight + image: ghcr.io/nvidia/nvsentinel/preflight-checker:v1 + ... + + # Plugin (separate container) + - name: preflight-bandwidth-check + image: myregistry/bandwidth-check:v1 + env: + - name: CHECK_TIMEOUT + value: "60s" + - name: NODE_NAME + valueFrom: + fieldRef: + fieldPath: spec.nodeName +``` + +**Plugin contract:** +- Exit `0` on success, non-zero on failure +- Write HealthEvent to Platform Connector socket (same as built-in checks) +- Plugin sets `isFatal`, `recommendedAction` in HealthEvent +- Platform Connector overrides can modify if operator disagrees (existing feature) +- Webhook mounts same volumes (GPU, DCGM socket, Platform Connector socket) + +### Configuration + +Configured at deployment time via Helm values. No per-workload annotations. + +### Gang Coordination + +For gang-wide checks like `nccl-allreduce`, pods discover peers using `workloadRef`: + +```mermaid +sequenceDiagram + participant R0 as Rank 0 Init + participant R1 as Rank 1 Init + participant API as Kube API + participant CM as ConfigMap + + R0->>API: List pods with same workloadRef + R1->>API: List pods with same workloadRef + + Note over R0,R1: Determine rank by sorting pod names + + R0->>CM: Create ConfigMap with NCCL unique ID + R1->>CM: Poll until ConfigMap exists + R1->>CM: Read NCCL unique ID + + R0->>R1: nccl.init() (barrier inside NCCL) + R0->>R1: nccl.all_reduce() +``` + +**Peer discovery via workloadRef:** +- Init container lists pods where `workloadRef.name` and `workloadRef.podGroup` match +- Gets peer IPs directly from pod list +- Determines rank by sorting pod names alphabetically + +**NCCL ID sharing:** +- Rank 0 creates ConfigMap named `preflight-{workload}-{podgroup}` +- Other ranks poll until ConfigMap exists (10 min timeout) +- ConfigMap has owner reference to Workload for cleanup + +**Webhook just injects the init container.** No Service or other resources needed. + +**Gang coordination timeout:** 10 minutes. If gang doesn't form, init fails with `isFatal: false` (not a hardware issue). 
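**Peer discovery and rank (sketch):** assuming the pod list has already been filtered by matching `workloadRef.name` and `workloadRef.podGroup` (the typed client code for the 1.35 Workload API is omitted), rank assignment reduces to a sort. `Peer` and `rankOf` are illustrative names.

```go
// Illustrative sketch of pkg/coordination/discovery.go: assign ranks once the
// gang's pods are known. Listing pods by workloadRef.name/podGroup is elided.
package coordination

import (
	"fmt"
	"sort"

	corev1 "k8s.io/api/core/v1"
)

// Peer is one member of the gang, as needed by the NCCL all-reduce check.
type Peer struct {
	Name string
	IP   string
}

// rankOf sorts the gang alphabetically by pod name (as described above) and
// returns this pod's rank plus the full peer list.
func rankOf(selfPodName string, gangPods []corev1.Pod) (int, []Peer, error) {
	peers := make([]Peer, 0, len(gangPods))
	for _, p := range gangPods {
		peers = append(peers, Peer{Name: p.Name, IP: p.Status.PodIP})
	}
	sort.Slice(peers, func(i, j int) bool { return peers[i].Name < peers[j].Name })

	for rank, p := range peers {
		if p.Name == selfPodName {
			// Rank 0 creates the NCCL ID ConfigMap; the other ranks poll for it.
			return rank, peers, nil
		}
	}
	return -1, nil, fmt.Errorf("pod %q not in its own gang listing", selfPodName)
}
```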
+ +### Failure Behavior + +Init container exit codes: +- `0`: All checks passed +- `1`: Check failed, pod should not start +- `2`: Configuration error + +On failure: +- Pod stays in `Init:Error` state +- **HealthEvent created** via Platform Connector (same as health monitors) +- Kubernetes Event created with failure details +- Metrics incremented (`preflight_check_failures_total`) + +HealthEvent feeds into existing NVSentinel workflow (quarantine, correlation, etc). + +### Error to Recommended Action Mapping + +**DCGM Diag** : + +| Test | Result | Recommended Action | +|------|--------|-------------------| +| Memory | `FAIL` | `CONTACT_SUPPORT` | +| PCIe | `FAIL` | `CONTACT_SUPPORT` | +| NVLink | `FAIL` | `CONTACT_SUPPORT` | +| Stress | `FAIL` | `RUN_DCGMEUD` | +| Any | `WARN` | `NONE` | + +**NCCL Checks**: + +| Error | Recommended Action | +|-------|-------------------| +| `NCCL_SYSTEM_ERROR` | `CONTACT_SUPPORT` | +| `NCCL_INTERNAL_ERROR` | `RUN_DCGMEUD` | +| `NCCL_INVALID_USAGE` | `NONE` | +| `NCCL_TIMEOUT` | `NONE` | +| `NCCL_REMOTE_ERROR` | `CONTACT_SUPPORT` | + +**isFatal determination**: +- DCGM diag `FAIL` → `isFatal: true` +- DCGM diag `WARN` → `isFatal: false` +- NCCL hardware errors (`SYSTEM_ERROR`, `INTERNAL_ERROR`, `REMOTE_ERROR`) → `isFatal: true` +- NCCL timeout/config errors → `isFatal: false` + +### Helm Values + +```yaml +preflight-injector: + enabled: false # Opt-in + + checks: + - dcgm-diag + - nccl-loopback + # - nccl-allreduce # Enable for gang workloads + + dcgmDiagLevel: 1 # 1 (quick, ~30s) or 2 (medium, ~2-3min) + checkTimeout: "300s" # Per-check timeout + gangTimeout: "600s" # Gang coordination timeout + + # Namespaces where preflight checks apply + namespaces: + - training + + webhook: + failurePolicy: Fail # or Ignore + + image: + repository: ghcr.io/nvidia/nvsentinel/preflight-checker + tag: v1 +``` + +All GPU pods in listed namespaces get the configured checks. + +## Rationale + +- Mutating webhook requires no external dependencies +- Init containers are native Kubernetes +- Opt-in via namespace selector +- Deployment-level config, no user workload changes + +## Consequences + +### Positive +- Catches GPU failures before workload starts +- Works with any workload controller +- No user workload changes + +### Negative +- Adds 30-60s pod startup latency (DCGM diag) +- Requires privileged init container +- Webhook downtime blocks pod creation + +### Mitigations +- `failurePolicy: Ignore` if latency unacceptable +- Timeout configuration +- HA deployment (replicas, PDB) + +## Alternatives Considered + +### Kyverno Policy +Rejected: External dependency. + +### User-managed init containers +Rejected: No enforcement. Users forget. + +### Custom CRD wrapper +Rejected: Requires changing how workloads are deployed. + +## Out of Scope + +- **Repeated failure handling**: Health Event Analyzer handles pattern detection on HealthEvents. Preflight just emits events. 
+ +## References + +- K8s 1.35 Workload API: https://kubernetes.io/blog/2025/12/29/kubernetes-v1-35-introducing-workload-aware-scheduling/ +- GitHub Issue: https://github.com/NVIDIA/NVSentinel/issues/658 + From 6606d6ba6c0c94261a01dd7fbdb2bc0b4cb753be Mon Sep 17 00:00:00 2001 From: Ajay Mishra Date: Tue, 6 Jan 2026 15:23:13 +0530 Subject: [PATCH 02/11] docs: few changes Signed-off-by: Ajay Mishra --- docs/designs/026-preflight-checks.md | 26 +++++++++++++++----------- 1 file changed, 15 insertions(+), 11 deletions(-) diff --git a/docs/designs/026-preflight-checks.md b/docs/designs/026-preflight-checks.md index 1f6cb9167..ee0562681 100644 --- a/docs/designs/026-preflight-checks.md +++ b/docs/designs/026-preflight-checks.md @@ -2,7 +2,9 @@ ## Context -Kubernetes 1.35 introduced `spec.workloadRef` for gang scheduling. GPU failures during distributed training waste compute time. Running diagnostics before the workload starts catches bad GPUs early. +GPU failures during training waste compute time. Running diagnostics before the workload starts catches bad GPUs early. + +Kubernetes 1.35 introduced `spec.workloadRef` for gang scheduling, which enables gang-wide checks (NCCL all-reduce across multiple pods). ### Distinction from Health Monitors @@ -20,7 +22,10 @@ Preflight asks "is this GPU healthy enough to start?" Health monitors ask "did t ## Decision -Implement a MutatingAdmissionWebhook that injects preflight check init containers into pods that have `spec.workloadRef`. +Implement a MutatingAdmissionWebhook that injects preflight check init containers into GPU pods (pods requesting `nvidia.com/gpu`) in configured namespaces. + +- Injection trigger: GPU resource request + namespace +- Gang coordination (NCCL all-reduce): Uses `workloadRef` if present, skipped otherwise ## Implementation @@ -166,10 +171,10 @@ initContainers: ``` **Plugin contract:** -- Exit `0` on success, non-zero on failure +- Exit codes: `0` (passed), `1` (check failed), `2` (config error) - Write HealthEvent to Platform Connector socket (same as built-in checks) - Plugin sets `isFatal`, `recommendedAction` in HealthEvent -- Platform Connector overrides can modify if operator disagrees (existing feature) +- Platform Connector overrides can modify values - Webhook mounts same volumes (GPU, DCGM socket, Platform Connector socket) ### Configuration @@ -210,7 +215,7 @@ sequenceDiagram - Other ranks poll until ConfigMap exists (10 min timeout) - ConfigMap has owner reference to Workload for cleanup -**Webhook just injects the init container.** No Service or other resources needed. +Webhook injects the init container. No Service or other resources created. **Gang coordination timeout:** 10 minutes. If gang doesn't form, init fails with `isFatal: false` (not a hardware issue). @@ -288,17 +293,16 @@ All GPU pods in listed namespaces get the configured checks. ## Rationale -- Mutating webhook requires no external dependencies -- Init containers are native Kubernetes -- Opt-in via namespace selector -- Deployment-level config, no user workload changes +- Mutating webhook, no external dependencies +- Init containers +- Namespace selector opt-in +- Deployment-level config ## Consequences ### Positive - Catches GPU failures before workload starts - Works with any workload controller -- No user workload changes ### Negative - Adds 30-60s pod startup latency (DCGM diag) @@ -323,7 +327,7 @@ Rejected: Requires changing how workloads are deployed. 
## Out of Scope -- **Repeated failure handling**: Health Event Analyzer handles pattern detection on HealthEvents. Preflight just emits events. +- **Repeated failure handling**: Health Event Analyzer handles pattern detection. Preflight emits events. ## References From cbc9501923eda94b09eb39474103c55c50eff919 Mon Sep 17 00:00:00 2001 From: Ajay Mishra Date: Tue, 6 Jan 2026 15:24:40 +0530 Subject: [PATCH 03/11] docs: few changes Signed-off-by: Ajay Mishra --- docs/designs/026-preflight-checks.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/designs/026-preflight-checks.md b/docs/designs/026-preflight-checks.md index ee0562681..aee53489a 100644 --- a/docs/designs/026-preflight-checks.md +++ b/docs/designs/026-preflight-checks.md @@ -4,7 +4,7 @@ GPU failures during training waste compute time. Running diagnostics before the workload starts catches bad GPUs early. -Kubernetes 1.35 introduced `spec.workloadRef` for gang scheduling, which enables gang-wide checks (NCCL all-reduce across multiple pods). +Kubernetes 1.35 introduced `spec.workloadRef` for gang scheduling. Preflight can use `workloadRef` to discover peer pods and run gang-wide checks (NCCL all-reduce). ### Distinction from Health Monitors From 26c804a21dcdb7e3fc1d1a51fd0684878ee6e89e Mon Sep 17 00:00:00 2001 From: Ajay Mishra Date: Tue, 6 Jan 2026 15:44:20 +0530 Subject: [PATCH 04/11] chore: minor changes Signed-off-by: Ajay Mishra --- docs/designs/026-preflight-checks.md | 83 ++++++++++++++++++++-------- 1 file changed, 61 insertions(+), 22 deletions(-) diff --git a/docs/designs/026-preflight-checks.md b/docs/designs/026-preflight-checks.md index aee53489a..2cc3870d3 100644 --- a/docs/designs/026-preflight-checks.md +++ b/docs/designs/026-preflight-checks.md @@ -32,28 +32,48 @@ Implement a MutatingAdmissionWebhook that injects preflight check init container ### Component Structure ``` -preflight-injector/ -├── main.go -├── go.mod -├── go.sum -├── Makefile -├── Tiltfile -└── pkg/ - ├── config/ - │ ├── config.go - │ └── config_test.go - ├── webhook/ - │ └── v1alpha1/ - │ ├── handler.go # Admission handler - │ └── handler_test.go - ├── injection/ - │ ├── injector.go # Init container construction - │ └── injector_test.go - ├── coordination/ - │ ├── discovery.go # Peer discovery via workloadRef - │ └── configmap.go # NCCL ID ConfigMap management - └── metrics/ - └── metrics.go +preflight/ +├── injector/ # Webhook (Deployment) +│ ├── main.go +│ ├── go.mod +│ ├── Makefile +│ ├── Tiltfile +│ └── pkg/ +│ ├── config/ +│ │ └── config.go +│ ├── webhook/ +│ │ └── v1alpha1/ +│ │ ├── handler.go +│ │ └── handler_test.go +│ ├── injection/ +│ │ ├── injector.go +│ │ └── injector_test.go +│ └── metrics/ +│ └── metrics.go +│ +├── checker/ # Init container image +│ ├── main.go +│ ├── go.mod +│ ├── Makefile +│ ├── Tiltfile +│ └── pkg/ +│ ├── runner/ +│ │ └── runner.go +│ ├── checks/ +│ │ ├── dcgm/ +│ │ │ └── diag.go # dcgmi diag -r 1/2 +│ │ └── nccl/ +│ │ ├── loopback.go +│ │ └── allreduce.go +│ ├── coordination/ +│ │ ├── discovery.go # Peer discovery via workloadRef +│ │ └── configmap.go # NCCL ID sharing +│ ├── reporting/ +│ │ └── healthevents.go +│ └── metrics/ +│ └── metrics.go +│ +└── Makefile # Builds both ``` ### Webhook Flow @@ -291,6 +311,25 @@ preflight-injector: All GPU pods in listed namespaces get the configured checks. 
+### Metrics + +**preflight/checker** (exposed via pushgateway or scraped from pod annotations): + +| Metric | Type | Labels | +|--------|------|--------| +| `preflight_check_total` | Counter | `check`, `result` | +| `preflight_check_duration_seconds` | Histogram | `check` | +| `preflight_check_failures_total` | Counter | `check`, `node`, `error_code` | +| `preflight_gang_wait_seconds` | Histogram | `workload` | +| `preflight_config_errors_total` | Counter | `error` | + +**preflight/injector** (standard Prometheus endpoint): + +| Metric | Type | Labels | +|--------|------|--------| +| `preflight_injection_total` | Counter | `result` | +| `preflight_webhook_latency_seconds` | Histogram | - | + ## Rationale - Mutating webhook, no external dependencies From 9de7c2a573b083feb0bd47807fb7f327382147c8 Mon Sep 17 00:00:00 2001 From: Ajay Mishra Date: Tue, 6 Jan 2026 15:46:20 +0530 Subject: [PATCH 05/11] chore: minor changes Signed-off-by: Ajay Mishra --- docs/designs/026-preflight-checks.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/designs/026-preflight-checks.md b/docs/designs/026-preflight-checks.md index 2cc3870d3..d7d9647b2 100644 --- a/docs/designs/026-preflight-checks.md +++ b/docs/designs/026-preflight-checks.md @@ -344,14 +344,14 @@ All GPU pods in listed namespaces get the configured checks. - Works with any workload controller ### Negative -- Adds 30-60s pod startup latency (DCGM diag) -- Requires privileged init container -- Webhook downtime blocks pod creation +- Adds 30-60s pod startup latency (DCGM diag level 1) +- Requires privileged init container for DCGM +- Webhook downtime blocks pod creation (if `failurePolicy: Fail`) ### Mitigations -- `failurePolicy: Ignore` if latency unacceptable -- Timeout configuration -- HA deployment (replicas, PDB) +- **Latency**: Use DCGM level 1 (~30s) instead of level 2 (~2-3min); skip expensive checks for non-critical workloads +- **Privileged**: Required for hardware access; limit to specific namespaces +- **Webhook availability**: HA deployment (replicas, PDB); `failurePolicy: Ignore` allows pods through if webhook is down ## Alternatives Considered From 2a06cc37aee1cb2a8a43eea54b419a21ea9fe079 Mon Sep 17 00:00:00 2001 From: Ajay Mishra Date: Thu, 15 Jan 2026 09:17:25 +0530 Subject: [PATCH 06/11] chore: address review comments Signed-off-by: Ajay Mishra --- docs/designs/026-preflight-checks.md | 443 ++++++++++++++++++++++----- 1 file changed, 361 insertions(+), 82 deletions(-) diff --git a/docs/designs/026-preflight-checks.md b/docs/designs/026-preflight-checks.md index d7d9647b2..a4f7995d2 100644 --- a/docs/designs/026-preflight-checks.md +++ b/docs/designs/026-preflight-checks.md @@ -22,73 +22,53 @@ Preflight asks "is this GPU healthy enough to start?" Health monitors ask "did t ## Decision -Implement a MutatingAdmissionWebhook that injects preflight check init containers into GPU pods (pods requesting `nvidia.com/gpu`) in configured namespaces. +Implement a MutatingAdmissionWebhook that injects preflight check init containers into GPU pods in configured namespaces. 
-- Injection trigger: GPU resource request + namespace -- Gang coordination (NCCL all-reduce): Uses `workloadRef` if present, skipped otherwise +### Key points -## Implementation +- Injection trigger: GPU resources (extended resources or DRA claims) + namespace +- Gang coordination: Uses `workloadRef` for gang-wide checks when present +- Resource detection: Configurable lists for extended resource names and DRA device classes -### Component Structure +## Architecture + +### Components ``` preflight/ ├── injector/ # Webhook (Deployment) -│ ├── main.go -│ ├── go.mod -│ ├── Makefile -│ ├── Tiltfile -│ └── pkg/ -│ ├── config/ -│ │ └── config.go -│ ├── webhook/ -│ │ └── v1alpha1/ -│ │ ├── handler.go -│ │ └── handler_test.go -│ ├── injection/ -│ │ ├── injector.go -│ │ └── injector_test.go -│ └── metrics/ -│ └── metrics.go -│ -├── checker/ # Init container image -│ ├── main.go -│ ├── go.mod -│ ├── Makefile -│ ├── Tiltfile │ └── pkg/ -│ ├── runner/ -│ │ └── runner.go -│ ├── checks/ -│ │ ├── dcgm/ -│ │ │ └── diag.go # dcgmi diag -r 1/2 -│ │ └── nccl/ -│ │ ├── loopback.go -│ │ └── allreduce.go -│ ├── coordination/ -│ │ ├── discovery.go # Peer discovery via workloadRef -│ │ └── configmap.go # NCCL ID sharing -│ ├── reporting/ -│ │ └── healthevents.go -│ └── metrics/ -│ └── metrics.go +│ ├── webhook/ # Admission handler +│ └── injection/ # Pod mutation + DRA detection │ -└── Makefile # Builds both +└── checker/ # Init container image + ├── nccl-topologies/ # Built-in topology files + └── pkg/ + ├── checks/ # dcgm + nccl + ├── coordination/ # gang registration + NCCL ID + └── reporting/ # HealthEvent reporting ``` -### Webhook Flow +### Webhook flow ```mermaid flowchart TD - A[Pod CREATE request] --> B{Has GPU resource?} - B -->|No| C[Allow - no mutation] - B -->|Yes| D[Inject init container] + A[Pod CREATE request] --> B{GPU resources?} + B -->|No| C[Allow] + B -->|Yes| D[Inject init containers] D --> E[Return JSON patch] ``` -Namespace filtering handled by `namespaceSelector` in webhook config. Checks configured at deployment time. +Namespace filtering handled by `namespaceSelector` in webhook config. -### MutatingWebhookConfiguration +### Namespace model + +- NVSentinel Helm chart is installed in `nvsentinel` namespace (webhook Deployment runs there). +- Webhook mutates Pods in *other* namespaces based on `namespaceSelector` (and skips system namespaces). +- The injected init containers run in the workload namespace. +- Any Kubernetes API access needed by the init container (gang coordination ConfigMap + Workload reads) must be granted in the workload namespace (namespace-scoped Role/RoleBinding). This is created by the Helm chart in the opted-in namespaces. + +### MutatingWebhookConfiguration (sketch) ```yaml apiVersion: admissionregistration.k8s.io/v1 @@ -112,14 +92,22 @@ webhooks: - key: kubernetes.io/metadata.name operator: In values: [] # Populated from Helm values + - key: kubernetes.io/metadata.name + operator: NotIn + values: [] # Excluded namespaces (systemNamespaces, nvsentinel, etc.) failurePolicy: Fail sideEffects: None admissionReviewVersions: ["v1"] ``` -Namespace list populated from Helm values. +## Resource detection and injection -### Init Container Spec +### Detection logic + +1. Extended resources (device plugins): check `resources.limits`/`resources.requests` for configured names (e.g. `nvidia.com/gpu`) +2. 
DRA: check `spec.resourceClaims`, resolve claim/template, match `deviceClassName` against configured list + +### Init container spec (sketch) ```yaml initContainers: @@ -134,21 +122,36 @@ initContainers: value: "300s" - name: GANG_TIMEOUT value: "600s" + - name: PLATFORM_CONNECTOR_SOCKET + value: "unix:///var/run/nvsentinel.sock" + - name: MY_POD_NAME + valueFrom: + fieldRef: + fieldPath: metadata.name + - name: MY_POD_IP + valueFrom: + fieldRef: + fieldPath: status.podIP resources: limits: - nvidia.com/gpu: 8 # Copied from main container + nvidia.com/gpu: 8 # Max across all containers + nvidia.com/mlnxnics: 4 # Max across all containers (if NCCL enabled) securityContext: - privileged: true + privileged: true # DCGM diag volumeMounts: - name: dcgm-socket mountPath: /var/run/nvidia - name: platform-connector-socket - mountPath: /var/run/nvsentinel + mountPath: /var/run ``` -**GPU resource handling:** Webhook copies `nvidia.com/gpu` from main container to init container (GPU allocation is per-pod). +### Resource handling + +- GPUs / extended resources: inject max across all containers +- Network / extended resources: inject max across all containers for configured names +- DRA: inject all referenced GPU/network claims into init container -### Check Types +## Check types | Check | Scope | Coordination | |-------|-------|--------------| @@ -192,10 +195,12 @@ initContainers: **Plugin contract:** - Exit codes: `0` (passed), `1` (check failed), `2` (config error) -- Write HealthEvent to Platform Connector socket (same as built-in checks) -- Plugin sets `isFatal`, `recommendedAction` in HealthEvent -- Platform Connector overrides can modify values -- Webhook mounts same volumes (GPU, DCGM socket, Platform Connector socket) +- Report failures via gRPC to Platform Connector: + - Unix socket: `unix:///var/run/nvsentinel.sock` (matches global `socketPath`) + - Use `HealthEventOccurredV1` RPC (service `PlatformConnector`, proto `data-models/protobufs/health_event.proto`) + - Plugin sets `isFatal`, `recommendedAction`, `errorCode` in HealthEvent + - Platform Connector overrides can modify these values via CEL rules +- Webhook mounts required volumes: GPU devices, DCGM socket, Platform Connector socket ### Configuration @@ -203,47 +208,237 @@ Configured at deployment time via Helm values. No per-workload annotations. 
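**Detection sketch:** the extended-resource branch and the "max across all containers" rule above. DRA claim matching is omitted, function names are illustrative, and the resource names come from the Helm `gpuDetection` values.

```go
// Illustrative sketch of the extended-resource branch in pkg/injection: the init
// container gets the maximum count seen across all containers in the pod.
package injection

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// maxExtendedResource returns the largest limit/request for name across the
// pod's containers (zero if no container asks for it).
func maxExtendedResource(pod *corev1.Pod, name corev1.ResourceName) resource.Quantity {
	var maxQty resource.Quantity
	for _, c := range pod.Spec.Containers {
		for _, rl := range []corev1.ResourceList{c.Resources.Limits, c.Resources.Requests} {
			if q, ok := rl[name]; ok && q.Cmp(maxQty) > 0 {
				maxQty = q
			}
		}
	}
	return maxQty
}

// shouldInject reports whether any configured GPU resource name appears in the pod.
// DRA detection (spec.resourceClaims + deviceClassName matching) would be a second branch.
func shouldInject(pod *corev1.Pod, gpuResourceNames []string) bool {
	for _, n := range gpuResourceNames {
		if q := maxExtendedResource(pod, corev1.ResourceName(n)); !q.IsZero() {
			return true
		}
	}
	return false
}
```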
### Gang Coordination -For gang-wide checks like `nccl-allreduce`, pods discover peers using `workloadRef`: +For gang-wide checks like `nccl-allreduce`, pods discover peers via ConfigMap registration: ```mermaid sequenceDiagram - participant R0 as Rank 0 Init - participant R1 as Rank 1 Init + participant W as Webhook + participant P0 as Pod 0 Init + participant P1 as Pod 1 Init participant API as Kube API participant CM as ConfigMap - R0->>API: List pods with same workloadRef - R1->>API: List pods with same workloadRef + Note over W: First pod in gang + W->>API: Create ConfigMap (expected=2, peers="") + + P0->>API: Patch ConfigMap: add pod-0:10.0.1.5 + P1->>API: Patch ConfigMap: add pod-1:10.0.1.6 - Note over R0,R1: Determine rank by sorting pod names + P0->>API: Poll until len(peers) == expected + P1->>API: Poll until len(peers) == expected - R0->>CM: Create ConfigMap with NCCL unique ID - R1->>CM: Poll until ConfigMap exists - R1->>CM: Read NCCL unique ID + Note over P0,P1: Determine rank by sorting pod names - R0->>R1: nccl.init() (barrier inside NCCL) - R0->>R1: nccl.all_reduce() + P0->>CM: Update with NCCL unique ID + P1->>CM: Read NCCL unique ID + + P0->>P1: nccl.init() (barrier inside NCCL) + P0->>P1: nccl.all_reduce() ``` -**Peer discovery via workloadRef:** -- Init container lists pods where `workloadRef.name` and `workloadRef.podGroup` match -- Gets peer IPs directly from pod list +**Peer registration (no pod listing):** +- Webhook idempotently creates ConfigMap named `preflight--` with `expected_count` +- Each init container patches ConfigMap to add its IP +- Init containers poll until all peers register - Determines rank by sorting pod names alphabetically -**NCCL ID sharing:** -- Rank 0 creates ConfigMap named `preflight-{workload}-{podgroup}` -- Other ranks poll until ConfigMap exists (10 min timeout) -- ConfigMap has owner reference to Workload for cleanup +**ConfigMap structure:** +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: preflight-myworkload-group1 + ownerReferences: + - apiVersion: scheduling.k8s.io/v1alpha1 + kind: Workload + name: myworkload +data: + expected_count: "2" + peers: | + pod-0:10.0.1.5 + pod-1:10.0.1.6 + nccl_unique_id: "base64..." # Added by rank 0 +``` -Webhook injects the init container. No Service or other resources created. +**Security:** Init containers have minimal RBAC (get/patch ConfigMap, get Workload). No pod list permission. **Gang coordination timeout:** 10 minutes. If gang doesn't form, init fails with `isFatal: false` (not a hardware issue). +### RBAC (gang coordination) + +Use a namespace-scoped Role for coordination. Kubernetes RBAC does not support label-based restrictions for ConfigMaps, so the checker enforces scope in code (expected ConfigMap name + required labels/ownerRef). + +```yaml +rules: + - apiGroups: ["scheduling.k8s.io"] + resources: ["workloads"] + verbs: ["get"] + - apiGroups: [""] + resources: ["configmaps"] + verbs: ["get", "create", "patch"] +``` + +Checker only reads/writes the coordination ConfigMap `preflight--` in its own namespace. + +### DRA Integration + +For pods using Dynamic Resource Allocation (DRA), the webhook copies resource claim references to the init container. 
+ +**Device claim detection:** +Webhook checks pod's `spec.resourceClaims`, retrieves each ResourceClaim or ResourceClaimTemplate, and matches `deviceClassName` against configurable lists for GPUs and network devices: + +```yaml +# Helm values +preflight-injector: + gpuDetection: + # Extended resources (current, no DRA) + resourceNames: + - "nvidia.com/gpu" + + # DRA device classes (requires operator configuration) + deviceClasses: + - "gpu.nvidia.com" + - "nvidia.com/gpu" + # Operators add their DeviceClass names here +``` + +**Init container injection with DRA:** +```yaml +apiVersion: v1 +kind: Pod +spec: + # Pod-level claims + resourceClaims: + - name: gpu-claim + resourceClaimName: training-gpus + - name: rdma-claim + resourceClaimName: training-rdma + + initContainers: + - name: nvsentinel-preflight + resources: + claims: + - name: gpu-claim # References same GPU claim + - name: rdma-claim # References same network claim + + containers: + - name: main + resources: + claims: + - name: gpu-claim + - name: rdma-claim +``` + +**Multiple containers with GPUs:** +```yaml +# Extended resources example +containers: + - name: trainer + resources: + limits: + nvidia.com/gpu: 4 + nvidia.com/mlnxnics: 2 + - name: validator + resources: + limits: + nvidia.com/gpu: 8 + nvidia.com/mlnxnics: 4 + +# Init container gets max(4, 8) = 8 GPUs, max(2, 4) = 4 NICs +initContainers: + - name: nvsentinel-preflight + resources: + limits: + nvidia.com/gpu: 8 + nvidia.com/mlnxnics: 4 +``` + +**Detection logic:** +1. Check if pod uses extended resources (`nvidia.com/gpu`, `nvidia.com/mlnxnics`) → inject with max counts across all containers +2. Check if pod has DRA claims with matching `deviceClassName` → inject with all unique GPU and network claim references +3. If neither → skip injection + +Network devices (InfiniBand, RDMA) can be exposed via DRA claims or extended resources. Webhook uses same detection pattern for both. + +DRA device class names are not standardized. Operators configure `gpuDetection.deviceClasses` and `networkDetection.deviceClasses` to match cluster DeviceClass names. + +### Network Resources for NCCL Tests + +NCCL tests require access to RDMA/InfiniBand devices for efficient GPU-to-GPU communication. + +**Network device exposure methods:** + +1. **Extended resources (device plugins):** + - Example: `nvidia.com/mlnxnics` (common on GPU+IB clusters) + - Resource names are cluster-specific; configure `networkDetection.resourceNames` accordingly + +2. **DRA claims:** + - Network devices can also be exposed via DRA claims (DeviceClass names are cluster-specific) + - Webhook matches claim `deviceClassName` against `networkDetection.deviceClasses` + +**Webhook behavior for NCCL checks:** +If `nccl-loopback` or `nccl-allreduce` is enabled, webhook: +1. Copies all network device resources (extended resources using max count, or DRA claim references) +2. Scans all container env vars, copies those matching `ncclEnvPatterns` (glob patterns from Helm config) +3. 
Copies volume mounts referenced by `NCCL_TOPO_FILE` (if present) + +**Example: How env vars are copied** + +Main container has: +```yaml +env: + - name: NCCL_TOPO_FILE + value: /etc/nccl/topo.xml + - name: NCCL_IB_PCI_RELAXED_ORDERING + value: "1" + - name: NCCL_SOCKET_IFNAME + value: eth0 + - name: MY_APP_CONFIG + value: /app/config.yaml + - name: OMPI_MCA_btl + value: openib +``` + +Webhook with `ncclEnvPatterns: ["NCCL_*", "OMPI_*"]` copies to init container: +```yaml +env: + - name: NCCL_TOPO_FILE # Matches NCCL_* + value: /etc/nccl/topo.xml + - name: NCCL_IB_PCI_RELAXED_ORDERING # Matches NCCL_* + value: "1" + - name: NCCL_SOCKET_IFNAME # Matches NCCL_* + value: eth0 + - name: OMPI_MCA_btl # Matches OMPI_* + value: openib + # MY_APP_CONFIG NOT copied (doesn't match patterns) +volumeMounts: + - name: nccl-topology # Copied because NCCL_TOPO_FILE references it + mountPath: /etc/nccl +``` + +**NCCL topology file handling:** +The init container image includes common topology files for major cloud platforms: +``` +/opt/nvsentinel/nccl-topologies/ +├── azure-ndv4.xml +├── azure-ndv5.xml +├── aws-p5.48xlarge.xml +├── gcp-a3-mega.xml +└── oci-bm-gpu-a100.xml +``` + +**Topology selection priority:** +1. **User-provided**: Webhook checks if any container has `NCCL_TOPO_FILE` env var with a corresponding volume mount at that path → copy that volume mount to init container +2. **Auto-detect**: If no `NCCL_TOPO_FILE` + volume mount found, init container reads node label `node.kubernetes.io/instance-type`, maps to built-in topology file via Helm config +3. **Fallback**: If instance type unknown or not in mapping, don't set `NCCL_TOPO_FILE` (NCCL auto-detects topology) + +If pod has no network device resources, NCCL tests are skipped (DCGM diag runs). + ### Failure Behavior Init container exit codes: - `0`: All checks passed -- `1`: Check failed, pod should not start +- `1`: Check failed - `2`: Configuration error On failure: @@ -282,6 +477,33 @@ HealthEvent feeds into existing NVSentinel workflow (quarantine, correlation, et - NCCL hardware errors (`SYSTEM_ERROR`, `INTERNAL_ERROR`, `REMOTE_ERROR`) → `isFatal: true` - NCCL timeout/config errors → `isFatal: false` +### Integration with Node Drainer + +Preflight failures quarantine nodes without draining. Rationale: +- Workload never started → no pods to evict +- Draining would disrupt other gang members waiting for coordination +- Quarantine prevents new scheduling while remediation happens + +**Platform Connector override:** +```yaml +pipeline: + overrides: + - match: + agent: "preflight-checker" + override: + drainOverrides: + skip: true +``` + +**Flow:** +1. Preflight fails → HealthEvent with `isFatal: true` +2. Platform Connector applies override → `drainOverrides.skip: true` +3. Node drainer sees `skip: true` → quarantines node (taint), skips drain +4. Fault Remediation runs based on `recommendedAction` (EUD, support ticket, etc.) +5. Remediation succeeds → taint removed → node back in rotation + +Gang members on other nodes timeout after `gangTimeout`, fail with `isFatal: false` (coordination failure, not hardware), no quarantine. 
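**Mapping sketch:** the tables and `isFatal` rules above, encoded as the checker would fill the HealthEvent before sending it to the Platform Connector. The `Verdict` struct and string values are illustrative; the authoritative types live in `data-models/protobufs/health_event.proto`.

```go
// Illustrative sketch of the mapping tables above; the Verdict fields mirror the
// isFatal / recommendedAction HealthEvent fields.
package reporting

// Verdict captures what the checker reports for a failed check.
type Verdict struct {
	IsFatal           bool
	RecommendedAction string
}

// dcgmVerdict applies the DCGM diag table: WARN is never fatal, Stress failures
// ask for EUD, all other failures go to support.
func dcgmVerdict(test, result string) Verdict {
	switch {
	case result == "WARN":
		return Verdict{IsFatal: false, RecommendedAction: "NONE"}
	case result == "FAIL" && test == "Stress":
		return Verdict{IsFatal: true, RecommendedAction: "RUN_DCGMEUD"}
	case result == "FAIL":
		return Verdict{IsFatal: true, RecommendedAction: "CONTACT_SUPPORT"}
	default:
		return Verdict{IsFatal: false, RecommendedAction: "NONE"}
	}
}

// ncclVerdict applies the NCCL table: hardware errors are fatal, timeouts and
// usage errors are not.
func ncclVerdict(errCode string) Verdict {
	switch errCode {
	case "NCCL_SYSTEM_ERROR", "NCCL_REMOTE_ERROR":
		return Verdict{IsFatal: true, RecommendedAction: "CONTACT_SUPPORT"}
	case "NCCL_INTERNAL_ERROR":
		return Verdict{IsFatal: true, RecommendedAction: "RUN_DCGMEUD"}
	default: // NCCL_TIMEOUT, NCCL_INVALID_USAGE, anything unrecognised
		return Verdict{IsFatal: false, RecommendedAction: "NONE"}
	}
}
```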
+ ### Helm Values ```yaml @@ -297,9 +519,62 @@ preflight-injector: checkTimeout: "300s" # Per-check timeout gangTimeout: "600s" # Gang coordination timeout + # GPU detection configuration + gpuDetection: + # Extended resources (current approach) + resourceNames: + - "nvidia.com/gpu" + + # DRA device classes (add your cluster's DeviceClass names) + deviceClasses: [] + # Example: + # - "gpu.nvidia.com" + # - "nvidia.com/gpu" + + # Network device resources (for NCCL tests) + networkDetection: + # Extended resources + resourceNames: + - "nvidia.com/mlnxnics" + - "rdma/hca" + # Add other network device plugin resources used in your cluster + + # DRA device classes (if using DRA for network devices) + deviceClasses: [] + # Example: + # - "rdma.nvidia.com" + # - "infiniband.mellanox.com" + + # NCCL environment variable patterns to copy (glob patterns) + # Webhook scans container env vars, copies those matching any pattern + ncclEnvPatterns: + - "NCCL_*" # Matches NCCL_TOPO_FILE, NCCL_IB_*, etc. + - "UCX_*" # Matches UCX_TLS, UCX_NET_DEVICES, etc. + - "OMPI_*" # Matches OMPI_MCA_*, etc. + + # NCCL topology auto-detection (if user doesn't provide topology file) + ncclTopology: + # Node label to detect instance type + instanceTypeLabel: "node.kubernetes.io/instance-type" + # Map instance types to built-in topology files + instanceTypeMapping: + "Standard_ND96isr_H100_v5": "azure-ndv5.xml" + "Standard_ND96amsr_A100_v4": "azure-ndv4.xml" + "p5.48xlarge": "aws-p5.48xlarge.xml" + "a3-megagpu-8g": "gcp-a3-mega.xml" + # Fallback: use NCCL auto-detection if instance type unknown + enableFallback: true + # Namespaces where preflight checks apply namespaces: - training + + # Namespaces to exclude (system namespaces). Recommended to reuse node-drainer `systemNamespaces`. + excludeNamespaces: + - nvsentinel + - kube-system + - kube-public + - kube-node-lease webhook: failurePolicy: Fail # or Ignore @@ -342,16 +617,19 @@ All GPU pods in listed namespaces get the configured checks. ### Positive - Catches GPU failures before workload starts - Works with any workload controller +- Built-in NCCL topology files for major cloud platforms ### Negative - Adds 30-60s pod startup latency (DCGM diag level 1) - Requires privileged init container for DCGM - Webhook downtime blocks pod creation (if `failurePolicy: Fail`) +- NCCL tests require network device plugins (InfiniBand/RDMA) to be configured ### Mitigations -- **Latency**: Use DCGM level 1 (~30s) instead of level 2 (~2-3min); skip expensive checks for non-critical workloads +- **Latency**: Use DCGM level 1 (~30s) vs level 2 (~2-3min); skip expensive checks for non-critical workloads - **Privileged**: Required for hardware access; limit to specific namespaces -- **Webhook availability**: HA deployment (replicas, PDB); `failurePolicy: Ignore` allows pods through if webhook is down +- **Webhook availability**: HA deployment (replicas, PDB); `failurePolicy: Ignore` for graceful degradation +- **Network resources**: NCCL tests skipped if network devices unavailable; DCGM diag runs regardless ## Alternatives Considered @@ -367,6 +645,7 @@ Rejected: Requires changing how workloads are deployed. ## Out of Scope - **Repeated failure handling**: Health Event Analyzer handles pattern detection. Preflight emits events. +- **Automatic DRA DeviceClass discovery**: Requires operator configuration. Device class names are not standardized. 
## References From d8052558690f774737fc624a17bf70ae10dbeb56 Mon Sep 17 00:00:00 2001 From: Ajay Mishra Date: Fri, 16 Jan 2026 13:25:42 +0530 Subject: [PATCH 07/11] chore: address review comments Signed-off-by: Ajay Mishra --- docs/designs/026-preflight-checks.md | 535 +++++++++++++++------------ 1 file changed, 293 insertions(+), 242 deletions(-) diff --git a/docs/designs/026-preflight-checks.md b/docs/designs/026-preflight-checks.md index a4f7995d2..7a2a24caa 100644 --- a/docs/designs/026-preflight-checks.md +++ b/docs/designs/026-preflight-checks.md @@ -1,22 +1,21 @@ -# ADR-026: Feature — Preflight Checks via Init Container Injection +# ADR-026: Feature — Preflight Checks ## Context GPU failures during training waste compute time. Running diagnostics before the workload starts catches bad GPUs early. -Kubernetes 1.35 introduced `spec.workloadRef` for gang scheduling. Preflight can use `workloadRef` to discover peer pods and run gang-wide checks (NCCL all-reduce). +Gang-wide NCCL tests require discovering all pods in a gang. Kubernetes 1.35 introduced `spec.workloadRef` as a native gang identifier, but users may also use Volcano, Kueue, or other schedulers with their own mechanisms. ### Distinction from Health Monitors NVSentinel already has health monitors (GPU Health Monitor, Syslog Health Monitor) that detect GPU issues. This is different: -| | Health Monitors | Preflight Checks | -|-|-----------------|------------------| -| When | Continuous (DaemonSet) | Once at pod start (init container) | -| Check type | Passive (health watches, syslog parsing) | Active diagnostics (DCGM diag) | -| Detects | Failures as they occur (XID errors, ECC, thermal) | Latent issues before starting | -| NCCL tests | No | Yes | -| Purpose | Reactive remediation | Prevent bad starts | +| | Health Monitors | Preflight Checks | +|------------|------------------------|-------------------------------| +| When | Continuous | Once at pod start | +| Check type | Passive | Active diagnostics | +| Detects | Failures as they occur | Latent issues before starting | +| Purpose | Reactive remediation | Prevent bad starts | Preflight asks "is this GPU healthy enough to start?" Health monitors ask "did this GPU fail while running?" @@ -27,26 +26,43 @@ Implement a MutatingAdmissionWebhook that injects preflight check init container ### Key points - Injection trigger: GPU resources (extended resources or DRA claims) + namespace -- Gang coordination: Uses `workloadRef` for gang-wide checks when present +- Gang discovery: Pluggable (supports `workloadRef`; can be extended to Volcano, Kueue .etc.) - Resource detection: Configurable lists for extended resource names and DRA device classes ## Architecture ### Components +Each check is a separate image. Webhook injects one init container per enabled check. 
+ ``` preflight/ -├── injector/ # Webhook (Deployment) +├── injector/ +│ └── pkg/ +│ ├── webhook/ +│ └── injection/ +│ +├── controller/ +│ └── pkg/ +│ ├── gang/ +│ └── coordination/ +│ +├── dcgm-diag/ +│ ├── Dockerfile +│ ├── main.go +│ └── pkg/ +│ +├── nccl-loopback/ +│ ├── Dockerfile +│ ├── nccl-topologies/ +│ ├── main.go │ └── pkg/ -│ ├── webhook/ # Admission handler -│ └── injection/ # Pod mutation + DRA detection │ -└── checker/ # Init container image - ├── nccl-topologies/ # Built-in topology files +└── nccl-allreduce/ + ├── Dockerfile + ├── nccl-topologies/ + ├── main.go └── pkg/ - ├── checks/ # dcgm + nccl - ├── coordination/ # gang registration + NCCL ID - └── reporting/ # HealthEvent reporting ``` ### Webhook flow @@ -61,13 +77,6 @@ flowchart TD Namespace filtering handled by `namespaceSelector` in webhook config. -### Namespace model - -- NVSentinel Helm chart is installed in `nvsentinel` namespace (webhook Deployment runs there). -- Webhook mutates Pods in *other* namespaces based on `namespaceSelector` (and skips system namespaces). -- The injected init containers run in the workload namespace. -- Any Kubernetes API access needed by the init container (gang coordination ConfigMap + Workload reads) must be granted in the workload namespace (namespace-scoped Role/RoleBinding). This is created by the Helm chart in the opted-in namespaces. - ### MutatingWebhookConfiguration (sketch) ```yaml @@ -107,40 +116,40 @@ webhooks: 1. Extended resources (device plugins): check `resources.limits`/`resources.requests` for configured names (e.g. `nvidia.com/gpu`) 2. DRA: check `spec.resourceClaims`, resolve claim/template, match `deviceClassName` against configured list -### Init container spec (sketch) +### Injected init containers (sketch) + +One init container per enabled check: ```yaml initContainers: - - name: nvsentinel-preflight - image: ghcr.io/nvidia/nvsentinel/preflight-checker:v1 + - name: preflight-dcgm-diag + image: ghcr.io/nvidia/nvsentinel/preflight-dcgm-diag:v1 env: - - name: PREFLIGHT_CHECKS - value: "dcgm-diag,nccl-loopback" - name: DCGM_DIAG_LEVEL value: "1" - - name: CHECK_TIMEOUT - value: "300s" - - name: GANG_TIMEOUT - value: "600s" + - name: DCGM_HOSTENGINE_ADDR + value: "dcgm-hostengine.nvsentinel.svc:5555" - name: PLATFORM_CONNECTOR_SOCKET value: "unix:///var/run/nvsentinel.sock" - - name: MY_POD_NAME - valueFrom: - fieldRef: - fieldPath: metadata.name - - name: MY_POD_IP - valueFrom: - fieldRef: - fieldPath: status.podIP resources: limits: - nvidia.com/gpu: 8 # Max across all containers - nvidia.com/mlnxnics: 4 # Max across all containers (if NCCL enabled) - securityContext: - privileged: true # DCGM diag + nvidia.com/gpu: 8 # Max across all containers + volumeMounts: + - name: platform-connector-socket + mountPath: /var/run + + - name: preflight-nccl-loopback + image: ghcr.io/nvidia/nvsentinel/preflight-nccl-loopback:v1 + env: + - name: NCCL_LOOPBACK_THRESHOLD_GBPS + value: "10.0" + - name: PLATFORM_CONNECTOR_SOCKET + value: "unix:///var/run/nvsentinel.sock" + resources: + limits: + nvidia.com/gpu: 8 + nvidia.com/mlnxnics: 4 volumeMounts: - - name: dcgm-socket - mountPath: /var/run/nvidia - name: platform-connector-socket mountPath: /var/run ``` @@ -153,94 +162,184 @@ initContainers: ## Check types -| Check | Scope | Coordination | -|-------|-------|--------------| -| `dcgm-diag` | Single node | None | -| `nccl-loopback` | Single node | None | -| `nccl-allreduce` | Gang-wide | ConfigMap | -| `plugin:` | Varies | Varies | +| Check | Scope | Coordination | 
+|------------------|-------------|--------------| +| `dcgm-diag` | Single node | None | +| `nccl-loopback` | Single node | None | +| `nccl-allreduce` | Gang-wide | ConfigMap | -### Plugin Interface (Third-Party Checks) +Third-party checks follow the same pattern: separate image, configured in Helm. -Plugins are separate init containers. Webhook injects one container per plugin. +### DCGM Diag -**Registration:** -```yaml -preflight-injector: - plugins: - - name: bandwidth-check - image: myregistry/bandwidth-check:v1 - timeout: "60s" +Runs DCGM diagnostics on allocated GPUs via remote DCGM hostengine Service. + +**How it works:** +1. Init container gets GPU UUIDs: `nvidia-smi --query-gpu=uuid --format=csv,noheader` +2. Calls DCGM hostengine via Service: `dcgmi diag -r --host $DCGM_HOSTENGINE_ADDR -i ` +3. Parses results, maps failures to HealthEvents + +**Requirements:** +- DCGM hostengine DaemonSet running (privileged, with GPU access) +- DCGM Service exposing hostengine (port 5555) +- NetworkPolicy allowing init container → DCGM Service + +**Diag levels:** +- Level 1 (~30s): Quick hardware validation (memory, PCIe bandwidth) +- Level 2 (~2-3min): Extended tests (stress, targeted diagnostics) + +Init container remains unprivileged; hostengine performs diagnostics. + +### NCCL Loopback + +Tests intra-node GPU-to-GPU communication (NVLink/PCIe paths) without network. + +**How it works:** +1. Init container runs `all_reduce_perf` (from nccl-tests) with all allocated GPUs +2. Command: `all_reduce_perf -b 8 -e 256M -f 2 -g ` +3. Validates bandwidth meets threshold set in Helm values +4. No coordination needed — single node only + +**What it catches:** +- NVLink failures between GPUs +- PCIe bandwidth degradation +- GPU memory errors during collective ops + +**Requirements:** +- GPU allocation (device plugin) +- `nccl-tests` binary in checker image + +**Example output parsing:** +``` +# nccl-tests output format: +# size count type redop time algbw busbw + 8M 2097152 float sum 1.23 6.50 12.19 +``` +Checker validates `busbw` (bus bandwidth) against configured threshold. + +### NCCL All-Reduce (Gang-Wide) + +Tests cross-node GPU collective communication over RDMA/InfiniBand. + +**How it works:** +1. **Gang formation**: All pods register in shared ConfigMap (see Gang Coordination section) +2. **Rank assignment**: Sort pod names alphabetically → rank 0, 1, 2, ... +3. **NCCL bootstrap**: Controller generates NCCL unique ID, writes to ConfigMap +4. **Run test**: Each pod reads ConfigMap and runs `all_reduce_perf` independently + +**Command:** +```bash +NCCL_COMM_ID= \ +NCCL_NRANKS=$WORLD_SIZE \ +NCCL_RANK=$MY_RANK \ +all_reduce_perf -b 8 -e 256M -f 2 -g $GPUS_PER_NODE ``` -**Injected init containers:** +Each init container runs independently. NCCL handles cross-node coordination via the shared `NCCL_COMM_ID`. + +**What it catches:** +- InfiniBand/RDMA link failures +- Network topology misconfigurations +- Cross-node NVLink (when present) +- NCCL algorithm/protocol issues + +**Requirements:** +- `workloadRef` for gang discovery (K8s 1.35+) +- Network device allocation (InfiniBand NICs) +- NCCL topology file (auto-detected or user-provided) +- ConfigMap RBAC for coordination + +**Timeout handling:** +- `GANG_TIMEOUT` sets max wait for all peers to register +- If timeout expires before gang forms → exit with `isFatal: false` (not a hardware issue) + +### Third-Party Checks + +Third-party checks follow the same pattern as built-in checks. 
Register in Helm: + ```yaml -initContainers: - # Built-in checks - - name: nvsentinel-preflight - image: ghcr.io/nvidia/nvsentinel/preflight-checker:v1 - ... - - # Plugin (separate container) - - name: preflight-bandwidth-check - image: myregistry/bandwidth-check:v1 - env: - - name: CHECK_TIMEOUT - value: "60s" - - name: NODE_NAME - valueFrom: - fieldRef: - fieldPath: spec.nodeName +preflight-injector: + checks: + - name: dcgm-diag + image: ghcr.io/nvidia/nvsentinel/preflight-dcgm-diag:v1 + - name: nccl-loopback + image: ghcr.io/nvidia/nvsentinel/preflight-nccl-loopback:v1 + - name: bandwidth-check # third-party + image: myregistry/bandwidth-check:v1 ``` -**Plugin contract:** +**Check contract:** - Exit codes: `0` (passed), `1` (check failed), `2` (config error) - Report failures via gRPC to Platform Connector: - - Unix socket: `unix:///var/run/nvsentinel.sock` (matches global `socketPath`) - - Use `HealthEventOccurredV1` RPC (service `PlatformConnector`, proto `data-models/protobufs/health_event.proto`) - - Plugin sets `isFatal`, `recommendedAction`, `errorCode` in HealthEvent - - Platform Connector overrides can modify these values via CEL rules -- Webhook mounts required volumes: GPU devices, DCGM socket, Platform Connector socket + - Unix socket: `unix:///var/run/nvsentinel.sock` + - RPC: `HealthEventOccurredV1` (proto: `data-models/protobufs/health_event.proto`) + - Set `isFatal`, `recommendedAction`, `errorCode` in HealthEvent +- Webhook mounts: GPU devices, Platform Connector socket, network devices ### Configuration Configured at deployment time via Helm values. No per-workload annotations. +### Gang Discovery + +Gang discovery is pluggable. Given one pod, return all pods in the gang. + +**Interface:** +```go +type GangDiscoverer interface { + DiscoverGang(pod *corev1.Pod) ([]PeerInfo, error) +} + +type PeerInfo struct { + PodName string + PodIP string + NodeName string +} +``` + +**Implementations:** + +| Scheduler | Discovery chain | +|-----------------|--------------------------------------------------------------------------| +| K8s 1.35 native | Pod → `spec.workloadRef` → list pods with same ref | +| Volcano | Pod → `volcano.sh/pod-group` annotation → list pods with same annotation | +| Kueue | Pod → `kueue.x-k8s.io/workload-name` label → list pods with same label | +| Label-based | Pod → configurable labels → list pods with same labels | + +Controller selects implementation based on Helm config. If no gang identifier found, pod is treated as singleton (skip gang-wide tests). + ### Gang Coordination -For gang-wide checks like `nccl-allreduce`, pods discover peers via ConfigMap registration: +For gang-wide checks like `nccl-allreduce`, the preflight controller maintains a ConfigMap with peer registration and NCCL bootstrap data. Pods only read it. 
```mermaid sequenceDiagram - participant W as Webhook + participant C as Preflight Controller participant P0 as Pod 0 Init participant P1 as Pod 1 Init participant API as Kube API participant CM as ConfigMap - Note over W: First pod in gang - W->>API: Create ConfigMap (expected=2, peers="") - - P0->>API: Patch ConfigMap: add pod-0:10.0.1.5 - P1->>API: Patch ConfigMap: add pod-1:10.0.1.6 - - P0->>API: Poll until len(peers) == expected - P1->>API: Poll until len(peers) == expected - + C->>API: Create/Update ConfigMap (expected=2, peers="") + C->>API: Update ConfigMap: add pod-0:10.0.1.5 + C->>API: Update ConfigMap: add pod-1:10.0.1.6 + C->>API: Update ConfigMap: set nccl_unique_id + + P0->>API: Read ConfigMap until len(peers) == expected + P1->>API: Read ConfigMap until len(peers) == expected + Note over P0,P1: Determine rank by sorting pod names - - P0->>CM: Update with NCCL unique ID - P1->>CM: Read NCCL unique ID - + P0->>P1: nccl.init() (barrier inside NCCL) P0->>P1: nccl.all_reduce() ``` -**Peer registration (no pod listing):** -- Webhook idempotently creates ConfigMap named `preflight--` with `expected_count` -- Each init container patches ConfigMap to add its IP -- Init containers poll until all peers register -- Determines rank by sorting pod names alphabetically +**Peer registration (controller-managed):** +- Preflight controller creates/updates ConfigMap `preflight-` with `expected_count` +- `gangID` derived from gang discoverer (e.g., `workload-name/pod-group`, `volcano-pg-name`, `kueue-workload-name`) +- Controller watches pods in the gang and updates `peers` and `nccl_unique_id` +- Init containers read/poll ConfigMap until all peers are registered +- Each pod determines rank by sorting pod names alphabetically **ConfigMap structure:** ```yaml @@ -257,35 +356,50 @@ data: peers: | pod-0:10.0.1.5 pod-1:10.0.1.6 - nccl_unique_id: "base64..." # Added by rank 0 + nccl_unique_id: "base64..." # Added by controller ``` -**Security:** Init containers have minimal RBAC (get/patch ConfigMap, get Workload). No pod list permission. +**Security:** Init containers read the ConfigMap only. Controller owns write access. **Gang coordination timeout:** 10 minutes. If gang doesn't form, init fails with `isFatal: false` (not a hardware issue). -### RBAC (gang coordination) +### RBAC -Use a namespace-scoped Role for coordination. Kubernetes RBAC does not support label-based restrictions for ConfigMaps, so the checker enforces scope in code (expected ConfigMap name + required labels/ownerRef). +Controller needs write access; init containers only need read. Both use ClusterRole since they operate across workload namespaces. +**Controller ClusterRole:** ```yaml rules: - - apiGroups: ["scheduling.k8s.io"] - resources: ["workloads"] - verbs: ["get"] - apiGroups: [""] resources: ["configmaps"] verbs: ["get", "create", "patch"] + - apiGroups: [""] + resources: ["pods"] + verbs: ["get", "list", "watch"] + # Additional rules based on gang discoverer: + # workloadRef: scheduling.k8s.io/workloads (get) + # Volcano: scheduling.volcano.sh/podgroups (get) + # Kueue: kueue.x-k8s.io/workloads (get) ``` -Checker only reads/writes the coordination ConfigMap `preflight--` in its own namespace. +Controller only touches ConfigMaps with `preflight-` prefix (enforced by code). + +**Init container ClusterRole:** +```yaml +rules: + - apiGroups: [""] + resources: ["configmaps"] + verbs: ["get"] +``` + +Init containers poll ConfigMap until all peers are registered. 
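**Checker wait loop (sketch):** how an init container could consume the controller-managed ConfigMap with only the `get` permission above. Package name and the 5s poll interval are assumptions.

```go
// Illustrative sketch of the checker-side wait loop (e.g. in the nccl-allreduce image).
package allreduce

import (
	"context"
	"fmt"
	"sort"
	"strconv"
	"strings"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Gang is what the NCCL test needs before it can start.
type Gang struct {
	Rank         int
	Peers        []string // "podName:podIP" entries, sorted by pod name
	NCCLUniqueID string
}

// waitForGang polls the controller-managed ConfigMap until all peers and the NCCL
// unique ID are present. ctx should carry the GANG_TIMEOUT deadline.
func waitForGang(ctx context.Context, cs kubernetes.Interface, ns, cmName, selfPod string) (*Gang, error) {
	for {
		cm, err := cs.CoreV1().ConfigMaps(ns).Get(ctx, cmName, metav1.GetOptions{})
		if err == nil {
			expected, _ := strconv.Atoi(cm.Data["expected_count"])
			peers := strings.Fields(cm.Data["peers"]) // one "name:ip" entry per line
			if expected > 0 && len(peers) == expected && cm.Data["nccl_unique_id"] != "" {
				sort.Strings(peers) // rank = position of this pod after sorting by name
				for rank, p := range peers {
					if strings.HasPrefix(p, selfPod+":") {
						return &Gang{Rank: rank, Peers: peers, NCCLUniqueID: cm.Data["nccl_unique_id"]}, nil
					}
				}
			}
		}
		select {
		case <-ctx.Done():
			// Gang never formed: exit with isFatal=false, this is not a hardware fault.
			return nil, fmt.Errorf("gang %q did not form: %w", cmName, ctx.Err())
		case <-time.After(5 * time.Second):
		}
	}
}
```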
### DRA Integration For pods using Dynamic Resource Allocation (DRA), the webhook copies resource claim references to the init container. **Device claim detection:** -Webhook checks pod's `spec.resourceClaims`, retrieves each ResourceClaim or ResourceClaimTemplate, and matches `deviceClassName` against configurable lists for GPUs and network devices: +Webhook checks pod's `spec.resourceClaims`, retrieves each ResourceClaim or ResourceClaimTemplate, and matches `deviceClassName` against configured lists for GPUs and network devices: ```yaml # Helm values @@ -329,30 +443,6 @@ spec: - name: rdma-claim ``` -**Multiple containers with GPUs:** -```yaml -# Extended resources example -containers: - - name: trainer - resources: - limits: - nvidia.com/gpu: 4 - nvidia.com/mlnxnics: 2 - - name: validator - resources: - limits: - nvidia.com/gpu: 8 - nvidia.com/mlnxnics: 4 - -# Init container gets max(4, 8) = 8 GPUs, max(2, 4) = 4 NICs -initContainers: - - name: nvsentinel-preflight - resources: - limits: - nvidia.com/gpu: 8 - nvidia.com/mlnxnics: 4 -``` - **Detection logic:** 1. Check if pod uses extended resources (`nvidia.com/gpu`, `nvidia.com/mlnxnics`) → inject with max counts across all containers 2. Check if pod has DRA claims with matching `deviceClassName` → inject with all unique GPU and network claim references @@ -382,40 +472,6 @@ If `nccl-loopback` or `nccl-allreduce` is enabled, webhook: 2. Scans all container env vars, copies those matching `ncclEnvPatterns` (glob patterns from Helm config) 3. Copies volume mounts referenced by `NCCL_TOPO_FILE` (if present) -**Example: How env vars are copied** - -Main container has: -```yaml -env: - - name: NCCL_TOPO_FILE - value: /etc/nccl/topo.xml - - name: NCCL_IB_PCI_RELAXED_ORDERING - value: "1" - - name: NCCL_SOCKET_IFNAME - value: eth0 - - name: MY_APP_CONFIG - value: /app/config.yaml - - name: OMPI_MCA_btl - value: openib -``` - -Webhook with `ncclEnvPatterns: ["NCCL_*", "OMPI_*"]` copies to init container: -```yaml -env: - - name: NCCL_TOPO_FILE # Matches NCCL_* - value: /etc/nccl/topo.xml - - name: NCCL_IB_PCI_RELAXED_ORDERING # Matches NCCL_* - value: "1" - - name: NCCL_SOCKET_IFNAME # Matches NCCL_* - value: eth0 - - name: OMPI_MCA_btl # Matches OMPI_* - value: openib - # MY_APP_CONFIG NOT copied (doesn't match patterns) -volumeMounts: - - name: nccl-topology # Copied because NCCL_TOPO_FILE references it - mountPath: /etc/nccl -``` - **NCCL topology file handling:** The init container image includes common topology files for major cloud platforms: ``` @@ -453,23 +509,24 @@ HealthEvent feeds into existing NVSentinel workflow (quarantine, correlation, et **DCGM Diag** : -| Test | Result | Recommended Action | -|------|--------|-------------------| -| Memory | `FAIL` | `CONTACT_SUPPORT` | -| PCIe | `FAIL` | `CONTACT_SUPPORT` | -| NVLink | `FAIL` | `CONTACT_SUPPORT` | -| Stress | `FAIL` | `RUN_DCGMEUD` | -| Any | `WARN` | `NONE` | +| Test | Result | Recommended Action | +|--------|--------|--------------------| +| Memory | `FAIL` | `CONTACT_SUPPORT` | +| PCIe | `FAIL` | `CONTACT_SUPPORT` | +| NVLink | `FAIL` | `CONTACT_SUPPORT` | +| Stress | `FAIL` | `RUN_DCGMEUD` | +| Any | `WARN` | `NONE` | + **NCCL Checks**: -| Error | Recommended Action | -|-------|-------------------| -| `NCCL_SYSTEM_ERROR` | `CONTACT_SUPPORT` | -| `NCCL_INTERNAL_ERROR` | `RUN_DCGMEUD` | -| `NCCL_INVALID_USAGE` | `NONE` | -| `NCCL_TIMEOUT` | `NONE` | -| `NCCL_REMOTE_ERROR` | `CONTACT_SUPPORT` | +| Error | Recommended Action | +|-----------------------|--------------------| 
+| `NCCL_SYSTEM_ERROR` | `CONTACT_SUPPORT` | +| `NCCL_INTERNAL_ERROR` | `RUN_DCGMEUD` | +| `NCCL_INVALID_USAGE` | `NONE` | +| `NCCL_TIMEOUT` | `NONE` | +| `NCCL_REMOTE_ERROR` | `CONTACT_SUPPORT` | **isFatal determination**: - DCGM diag `FAIL` → `isFatal: true` @@ -477,33 +534,6 @@ HealthEvent feeds into existing NVSentinel workflow (quarantine, correlation, et - NCCL hardware errors (`SYSTEM_ERROR`, `INTERNAL_ERROR`, `REMOTE_ERROR`) → `isFatal: true` - NCCL timeout/config errors → `isFatal: false` -### Integration with Node Drainer - -Preflight failures quarantine nodes without draining. Rationale: -- Workload never started → no pods to evict -- Draining would disrupt other gang members waiting for coordination -- Quarantine prevents new scheduling while remediation happens - -**Platform Connector override:** -```yaml -pipeline: - overrides: - - match: - agent: "preflight-checker" - override: - drainOverrides: - skip: true -``` - -**Flow:** -1. Preflight fails → HealthEvent with `isFatal: true` -2. Platform Connector applies override → `drainOverrides.skip: true` -3. Node drainer sees `skip: true` → quarantines node (taint), skips drain -4. Fault Remediation runs based on `recommendedAction` (EUD, support ticket, etc.) -5. Remediation succeeds → taint removed → node back in rotation - -Gang members on other nodes timeout after `gangTimeout`, fail with `isFatal: false` (coordination failure, not hardware), no quarantine. - ### Helm Values ```yaml @@ -511,14 +541,35 @@ preflight-injector: enabled: false # Opt-in checks: - - dcgm-diag - - nccl-loopback - # - nccl-allreduce # Enable for gang workloads + - name: dcgm-diag + image: ghcr.io/nvidia/nvsentinel/preflight-dcgm-diag:v1 + - name: nccl-loopback + image: ghcr.io/nvidia/nvsentinel/preflight-nccl-loopback:v1 + # - name: nccl-allreduce + # image: ghcr.io/nvidia/nvsentinel/preflight-nccl-allreduce:v1 + + # DCGM configuration + dcgm: + hostengineAddr: "dcgm-hostengine.nvsentinel.svc:5555" # DCGM Service address + diagLevel: 1 # 1 (quick, ~30s) or 2 (extended, ~2-3min) + + # NCCL test configuration + nccl: + loopbackThresholdGBps: 10.0 # Min bus bandwidth for loopback pass + allreduceThresholdGBps: 5.0 # Min bus bandwidth for all-reduce pass - dcgmDiagLevel: 1 # 1 (quick, ~30s) or 2 (medium, ~2-3min) checkTimeout: "300s" # Per-check timeout gangTimeout: "600s" # Gang coordination timeout + # Gang discovery configuration + gangDiscovery: + # Options: workloadRef, volcano, kueue, labels + method: "workloadRef" + # For label-based discovery: + # labels: + # gangIdLabel: "app.kubernetes.io/gang-id" + # gangSizeLabel: "app.kubernetes.io/gang-size" + # GPU detection configuration gpuDetection: # Extended resources (current approach) @@ -533,10 +584,9 @@ preflight-injector: # Network device resources (for NCCL tests) networkDetection: - # Extended resources + # Extended resources (cluster-specific, configure for your environment) resourceNames: - - "nvidia.com/mlnxnics" - - "rdma/hca" + - "nvidia.com/mlnxnics" # Mellanox/NVIDIA InfiniBand NICs # Add other network device plugin resources used in your cluster # DRA device classes (if using DRA for network devices) @@ -578,58 +628,59 @@ preflight-injector: webhook: failurePolicy: Fail # or Ignore - - image: - repository: ghcr.io/nvidia/nvsentinel/preflight-checker - tag: v1 ``` All GPU pods in listed namespaces get the configured checks. 
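As a reference point, a minimal sketch of how the injector might load these values into a typed config at startup. Field names mirror the YAML keys above where they are shown; the `Load` helper, the use of `gopkg.in/yaml.v3`, and the exact sub-keys for GPU detection are assumptions rather than settled implementation details.

```go
// Sketch only: typed view of the Helm values consumed by the injector.
package config

import (
	"fmt"
	"os"
	"time"

	"gopkg.in/yaml.v3"
)

type CheckSpec struct {
	Name  string `yaml:"name"`
	Image string `yaml:"image"`
}

type DCGM struct {
	HostengineAddr string `yaml:"hostengineAddr"`
	DiagLevel      int    `yaml:"diagLevel"`
}

type NCCL struct {
	LoopbackThresholdGBps  float64 `yaml:"loopbackThresholdGBps"`
	AllreduceThresholdGBps float64 `yaml:"allreduceThresholdGBps"`
}

type GangDiscovery struct {
	Method string `yaml:"method"` // workloadRef, volcano, kueue, labels
	Labels struct {
		GangIDLabel   string `yaml:"gangIdLabel"`
		GangSizeLabel string `yaml:"gangSizeLabel"`
	} `yaml:"labels"`
}

type Detection struct {
	ResourceNames []string `yaml:"resourceNames"`
	DeviceClasses []string `yaml:"deviceClasses"`
}

type Config struct {
	Enabled          bool          `yaml:"enabled"`
	Checks           []CheckSpec   `yaml:"checks"`
	DCGM             DCGM          `yaml:"dcgm"`
	NCCL             NCCL          `yaml:"nccl"`
	CheckTimeout     string        `yaml:"checkTimeout"` // e.g. "300s"
	GangTimeout      string        `yaml:"gangTimeout"`  // e.g. "600s"
	GangDiscovery    GangDiscovery `yaml:"gangDiscovery"`
	GPUDetection     Detection     `yaml:"gpuDetection"`
	NetworkDetection Detection     `yaml:"networkDetection"`
}

// Load reads the rendered values file and validates the duration fields.
func Load(path string) (*Config, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var cfg Config
	if err := yaml.Unmarshal(raw, &cfg); err != nil {
		return nil, fmt.Errorf("parse %s: %w", path, err)
	}
	for name, d := range map[string]string{"checkTimeout": cfg.CheckTimeout, "gangTimeout": cfg.GangTimeout} {
		if _, err := time.ParseDuration(d); err != nil {
			return nil, fmt.Errorf("%s: %w", name, err)
		}
	}
	return &cfg, nil
}
```

Timeouts stay as strings and are validated with `time.ParseDuration`, so the Helm values can keep the `"300s"` / `"600s"` form used above.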
### Metrics -**preflight/checker** (exposed via pushgateway or scraped from pod annotations): +**Check containers** (exposed via pushgateway or scraped from pod annotations): + +| Metric | Type | Labels | +|------------------------------------|-----------|-------------------------------| +| `preflight_check_total` | Counter | `check`, `result` | +| `preflight_check_duration_seconds` | Histogram | `check` | +| `preflight_check_failures_total` | Counter | `check`, `node`, `error_code` | +| `preflight_gang_wait_seconds` | Histogram | `workload` | +| `preflight_config_errors_total` | Counter | `error` | -| Metric | Type | Labels | -|--------|------|--------| -| `preflight_check_total` | Counter | `check`, `result` | -| `preflight_check_duration_seconds` | Histogram | `check` | -| `preflight_check_failures_total` | Counter | `check`, `node`, `error_code` | -| `preflight_gang_wait_seconds` | Histogram | `workload` | -| `preflight_config_errors_total` | Counter | `error` | **preflight/injector** (standard Prometheus endpoint): -| Metric | Type | Labels | -|--------|------|--------| -| `preflight_injection_total` | Counter | `result` | -| `preflight_webhook_latency_seconds` | Histogram | - | +| Metric | Type | Labels | +|-------------------------------------|-----------|----------| +| `preflight_injection_total` | Counter | `result` | +| `preflight_webhook_latency_seconds` | Histogram | - | + ## Rationale -- Mutating webhook, no external dependencies -- Init containers +- Mutating webhook for transparent injection +- Non-privileged init containers (DCGM diag runs via remote hostengine) - Namespace selector opt-in -- Deployment-level config +- Deployment-level config (no per-workload changes) ## Consequences ### Positive - Catches GPU failures before workload starts - Works with any workload controller +- Unprivileged init container (uses DCGM hostengine) - Built-in NCCL topology files for major cloud platforms ### Negative - Adds 30-60s pod startup latency (DCGM diag level 1) -- Requires privileged init container for DCGM +- Requires DCGM hostengine DaemonSet for diag checks - Webhook downtime blocks pod creation (if `failurePolicy: Fail`) - NCCL tests require network device plugins (InfiniBand/RDMA) to be configured +- Gang-wide NCCL tests require K8s 1.35+ (`workloadRef`) ### Mitigations - **Latency**: Use DCGM level 1 (~30s) vs level 2 (~2-3min); skip expensive checks for non-critical workloads -- **Privileged**: Required for hardware access; limit to specific namespaces +- **DCGM dependency**: Most GPU clusters already run DCGM for monitoring; expose as Service - **Webhook availability**: HA deployment (replicas, PDB); `failurePolicy: Ignore` for graceful degradation - **Network resources**: NCCL tests skipped if network devices unavailable; DCGM diag runs regardless +- **K8s version**: NCCL loopback (single-node) works without `workloadRef`; gang tests are opt-in ## Alternatives Considered From bead7f41538b226eed785e93304055d431d13862 Mon Sep 17 00:00:00 2001 From: Ajay Mishra Date: Fri, 16 Jan 2026 17:45:17 +0530 Subject: [PATCH 08/11] chore: address review comments Signed-off-by: Ajay Mishra --- docs/designs/026-preflight-checks.md | 94 +++++++++++++--------------- 1 file changed, 42 insertions(+), 52 deletions(-) diff --git a/docs/designs/026-preflight-checks.md b/docs/designs/026-preflight-checks.md index 7a2a24caa..de9de858f 100644 --- a/docs/designs/026-preflight-checks.md +++ b/docs/designs/026-preflight-checks.md @@ -33,20 +33,18 @@ Implement a MutatingAdmissionWebhook that injects 
preflight check init container ### Components -Each check is a separate image. Webhook injects one init container per enabled check. - ``` preflight/ -├── injector/ -│ └── pkg/ -│ ├── webhook/ -│ └── injection/ -│ -├── controller/ -│ └── pkg/ -│ ├── gang/ -│ └── coordination/ -│ +└── controller/ # Webhook + gang controller (controller-runtime) + ├── Dockerfile + ├── main.go + └── pkg/ + ├── webhook/ # Admission handler + ├── injection/ # Pod mutation, DRA detection + ├── gang/ # Gang discovery implementations + └── coordination/ # ConfigMap management + +preflight-checks/ ├── dcgm-diag/ │ ├── Dockerfile │ ├── main.go @@ -138,13 +136,17 @@ initContainers: - name: platform-connector-socket mountPath: /var/run - - name: preflight-nccl-loopback - image: ghcr.io/nvidia/nvsentinel/preflight-nccl-loopback:v1 + - name: preflight-nccl-allreduce + image: ghcr.io/nvidia/nvsentinel/preflight-nccl-allreduce:v1 env: - - name: NCCL_LOOPBACK_THRESHOLD_GBPS - value: "10.0" - - name: PLATFORM_CONNECTOR_SOCKET - value: "unix:///var/run/nvsentinel.sock" + - name: NCCL_ALLREDUCE_THRESHOLD_GBPS + value: "5.0" + - name: GANG_TIMEOUT + value: "600s" + - name: MY_POD_NAME + valueFrom: + fieldRef: + fieldPath: metadata.name resources: limits: nvidia.com/gpu: 8 @@ -152,6 +154,8 @@ initContainers: volumeMounts: - name: platform-connector-socket mountPath: /var/run + - name: preflight-gang-config # ConfigMap mounted as volume + mountPath: /etc/preflight ``` ### Resource handling @@ -310,36 +314,36 @@ Controller selects implementation based on Helm config. If no gang identifier fo ### Gang Coordination -For gang-wide checks like `nccl-allreduce`, the preflight controller maintains a ConfigMap with peer registration and NCCL bootstrap data. Pods only read it. +For gang-wide checks like `nccl-allreduce`, the preflight controller maintains a ConfigMap. Webhook mounts it as a volume; init containers read from filesystem. ```mermaid sequenceDiagram participant C as Preflight Controller + participant K as Kubelet participant P0 as Pod 0 Init participant P1 as Pod 1 Init - participant API as Kube API - participant CM as ConfigMap - C->>API: Create/Update ConfigMap (expected=2, peers="") - C->>API: Update ConfigMap: add pod-0:10.0.1.5 - C->>API: Update ConfigMap: add pod-1:10.0.1.6 - C->>API: Update ConfigMap: set nccl_unique_id + C->>C: Create ConfigMap (expected=2) + C->>C: Update ConfigMap: add pod-0:10.0.1.5 + C->>C: Update ConfigMap: add pod-1:10.0.1.6 + C->>C: Update ConfigMap: set nccl_unique_id + + K->>P0: Sync ConfigMap to volume + K->>P1: Sync ConfigMap to volume - P0->>API: Read ConfigMap until len(peers) == expected - P1->>API: Read ConfigMap until len(peers) == expected + P0->>P0: Read /etc/preflight/peers until len == expected + P1->>P1: Read /etc/preflight/peers until len == expected Note over P0,P1: Determine rank by sorting pod names - P0->>P1: nccl.init() (barrier inside NCCL) - P0->>P1: nccl.all_reduce() + P0->>P1: nccl.init() + nccl.all_reduce() ``` -**Peer registration (controller-managed):** -- Preflight controller creates/updates ConfigMap `preflight-` with `expected_count` -- `gangID` derived from gang discoverer (e.g., `workload-name/pod-group`, `volcano-pg-name`, `kueue-workload-name`) -- Controller watches pods in the gang and updates `peers` and `nccl_unique_id` -- Init containers read/poll ConfigMap until all peers are registered -- Each pod determines rank by sorting pod names alphabetically +**Flow:** +1. 
Controller creates/updates ConfigMap `preflight-` with `expected_count`, `peers`, `nccl_unique_id` +2. Webhook mounts ConfigMap as volume at `/etc/preflight/` +3. Init containers poll filesystem until all peers registered (kubelet syncs ~1 min) +4. Each pod determines rank by sorting pod names alphabetically **ConfigMap structure:** ```yaml @@ -347,25 +351,21 @@ apiVersion: v1 kind: ConfigMap metadata: name: preflight-myworkload-group1 - ownerReferences: - - apiVersion: scheduling.k8s.io/v1alpha1 - kind: Workload - name: myworkload data: expected_count: "2" peers: | pod-0:10.0.1.5 pod-1:10.0.1.6 - nccl_unique_id: "base64..." # Added by controller + nccl_unique_id: "base64..." ``` -**Security:** Init containers read the ConfigMap only. Controller owns write access. +**Benefits:** Init containers need no RBAC — just read files. **Gang coordination timeout:** 10 minutes. If gang doesn't form, init fails with `isFatal: false` (not a hardware issue). ### RBAC -Controller needs write access; init containers only need read. Both use ClusterRole since they operate across workload namespaces. +Only the controller needs RBAC. Init containers read from mounted volume (no API access). **Controller ClusterRole:** ```yaml @@ -384,16 +384,6 @@ rules: Controller only touches ConfigMaps with `preflight-` prefix (enforced by code). -**Init container ClusterRole:** -```yaml -rules: - - apiGroups: [""] - resources: ["configmaps"] - verbs: ["get"] -``` - -Init containers poll ConfigMap until all peers are registered. - ### DRA Integration For pods using Dynamic Resource Allocation (DRA), the webhook copies resource claim references to the init container. From 63ae2ec34227e20ebe611950b72608267f41f720 Mon Sep 17 00:00:00 2001 From: Ajay Mishra Date: Mon, 19 Jan 2026 10:11:31 +0530 Subject: [PATCH 09/11] chore: address review comments Signed-off-by: Ajay Mishra --- docs/designs/026-preflight-checks.md | 62 +++++++++++++++++----------- 1 file changed, 38 insertions(+), 24 deletions(-) diff --git a/docs/designs/026-preflight-checks.md b/docs/designs/026-preflight-checks.md index de9de858f..eb921a07d 100644 --- a/docs/designs/026-preflight-checks.md +++ b/docs/designs/026-preflight-checks.md @@ -116,7 +116,7 @@ webhooks: ### Injected init containers (sketch) -One init container per enabled check: +One init container per enabled check with be prepended to the pod's init containers: ```yaml initContainers: @@ -154,7 +154,7 @@ initContainers: volumeMounts: - name: platform-connector-socket mountPath: /var/run - - name: preflight-gang-config # ConfigMap mounted as volume + - name: preflight-gang-config # ConfigMap: peers, master_addr, rank, world_size mountPath: /etc/preflight ``` @@ -227,19 +227,36 @@ Tests cross-node GPU collective communication over RDMA/InfiniBand. **How it works:** 1. **Gang formation**: All pods register in shared ConfigMap (see Gang Coordination section) -2. **Rank assignment**: Sort pod names alphabetically → rank 0, 1, 2, ... -3. **NCCL bootstrap**: Controller generates NCCL unique ID, writes to ConfigMap -4. **Run test**: Each pod reads ConfigMap and runs `all_reduce_perf` independently - -**Command:** -```bash -NCCL_COMM_ID= \ -NCCL_NRANKS=$WORLD_SIZE \ -NCCL_RANK=$MY_RANK \ -all_reduce_perf -b 8 -e 256M -f 2 -g $GPUS_PER_NODE +2. **Wait for peers**: Each init container polls ConfigMap (mounted volume) until all peers registered +3. **Bootstrap via TCP**: Rank 0's IP from ConfigMap; PyTorch/NCCL handles handshake +4. 
**Run test**: Each init container runs PyTorch all-reduce independently; NCCL coordinates internally + +**Test script (PyTorch-based, no MPI needed):** +```python +import torch +import torch.distributed as dist +import os + +# Read from mounted ConfigMap +rank = int(os.environ['MY_RANK']) +world_size = int(os.environ['WORLD_SIZE']) +master_addr = os.environ['MASTER_ADDR'] # Rank 0's IP from ConfigMap + +# PyTorch handles NCCL bootstrap via TCP +dist.init_process_group( + backend='nccl', + init_method=f'tcp://{master_addr}:29500', + rank=rank, + world_size=world_size +) + +# Run all-reduce test, measure bandwidth +tensor = torch.ones(256 * 1024 * 1024, device='cuda') # 1GB +dist.all_reduce(tensor) +# ... measure time, calculate bandwidth, compare to threshold ``` -Each init container runs independently. NCCL handles cross-node coordination via the shared `NCCL_COMM_ID`. +Each init container runs independently. **What it catches:** - InfiniBand/RDMA link failures @@ -248,10 +265,9 @@ Each init container runs independently. NCCL handles cross-node coordination via - NCCL algorithm/protocol issues **Requirements:** -- `workloadRef` for gang discovery (K8s 1.35+) +- Gang discovery (`workloadRef`, Volcano, or Kueue) - Network device allocation (InfiniBand NICs) - NCCL topology file (auto-detected or user-provided) -- ConfigMap RBAC for coordination **Timeout handling:** - `GANG_TIMEOUT` sets max wait for all peers to register @@ -323,10 +339,9 @@ sequenceDiagram participant P0 as Pod 0 Init participant P1 as Pod 1 Init - C->>C: Create ConfigMap (expected=2) + C->>C: Create ConfigMap (expected=2, master_addr=10.0.1.5) C->>C: Update ConfigMap: add pod-0:10.0.1.5 C->>C: Update ConfigMap: add pod-1:10.0.1.6 - C->>C: Update ConfigMap: set nccl_unique_id K->>P0: Sync ConfigMap to volume K->>P1: Sync ConfigMap to volume @@ -336,14 +351,17 @@ sequenceDiagram Note over P0,P1: Determine rank by sorting pod names - P0->>P1: nccl.init() + nccl.all_reduce() + P0->>P0: PyTorch init (rank=0, listens on :29500) + P1->>P0: PyTorch init (rank=1, connects to master_addr:29500) + P0->>P1: NCCL all_reduce over RDMA ``` **Flow:** -1. Controller creates/updates ConfigMap `preflight-` with `expected_count`, `peers`, `nccl_unique_id` +1. Controller creates/updates ConfigMap `preflight-` with `expected_count`, `peers`, `master_addr` 2. Webhook mounts ConfigMap as volume at `/etc/preflight/` 3. Init containers poll filesystem until all peers registered (kubelet syncs ~1 min) 4. Each pod determines rank by sorting pod names alphabetically +5. PyTorch connects to `master_addr` for NCCL bootstrap (TCP), then NCCL uses RDMA **ConfigMap structure:** ```yaml @@ -353,20 +371,16 @@ metadata: name: preflight-myworkload-group1 data: expected_count: "2" + master_addr: "10.0.1.5" # Rank 0's IP for PyTorch TCP bootstrap peers: | pod-0:10.0.1.5 pod-1:10.0.1.6 - nccl_unique_id: "base64..." ``` -**Benefits:** Init containers need no RBAC — just read files. - **Gang coordination timeout:** 10 minutes. If gang doesn't form, init fails with `isFatal: false` (not a hardware issue). ### RBAC -Only the controller needs RBAC. Init containers read from mounted volume (no API access). 
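For illustration, a minimal sketch of the init-container side of this flow: it reads only the projected files under `/etc/preflight`, so it needs no API access at all. The key names, mount path, and `MY_POD_NAME` env var match the examples in this document; the helper names and poll interval are assumptions.

```go
// Sketch only: gang wait performed by the init container, using nothing
// but the projected ConfigMap files (no Kubernetes API access).
package main

import (
	"fmt"
	"os"
	"sort"
	"strconv"
	"strings"
	"time"
)

const mountDir = "/etc/preflight"

func readKey(key string) string {
	b, err := os.ReadFile(mountDir + "/" + key)
	if err != nil {
		return "" // key not projected yet
	}
	return strings.TrimSpace(string(b))
}

// waitForPeers polls the projected files until the controller has registered
// every gang member; kubelet refreshes the volume roughly once a minute.
func waitForPeers(timeout time.Duration) ([]string, error) {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		expected, _ := strconv.Atoi(readKey("expected_count"))
		peers := strings.Fields(readKey("peers")) // lines like "pod-0:10.0.1.5"
		if expected > 0 && len(peers) == expected {
			return peers, nil
		}
		time.Sleep(10 * time.Second)
	}
	return nil, fmt.Errorf("gang did not form within %s", timeout)
}

func main() {
	peers, err := waitForPeers(10 * time.Minute)
	if err != nil {
		// Coordination failure, not hardware: reported upstream with isFatal=false.
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Rank = position of this pod's name in the alphabetically sorted peer list.
	names := make([]string, 0, len(peers))
	for _, p := range peers {
		names = append(names, strings.SplitN(p, ":", 2)[0])
	}
	sort.Strings(names)
	rank := sort.SearchStrings(names, os.Getenv("MY_POD_NAME"))

	fmt.Printf("rank=%d world_size=%d master_addr=%s\n",
		rank, len(names), readKey("master_addr"))
}
```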
- **Controller ClusterRole:** ```yaml rules: From 98664db4614dc26b6577003619599f009e1034e5 Mon Sep 17 00:00:00 2001 From: Ajay Mishra Date: Mon, 19 Jan 2026 10:18:25 +0530 Subject: [PATCH 10/11] chore: add overall flow diagram Signed-off-by: Ajay Mishra --- docs/designs/026-preflight-checks.md | 76 +++++++++++++++++++++++++--- 1 file changed, 68 insertions(+), 8 deletions(-) diff --git a/docs/designs/026-preflight-checks.md b/docs/designs/026-preflight-checks.md index eb921a07d..5006da405 100644 --- a/docs/designs/026-preflight-checks.md +++ b/docs/designs/026-preflight-checks.md @@ -63,18 +63,78 @@ preflight-checks/ └── pkg/ ``` -### Webhook flow +### Overall flow ```mermaid -flowchart TD - A[Pod CREATE request] --> B{GPU resources?} - B -->|No| C[Allow] - B -->|Yes| D[Inject init containers] - D --> E[Return JSON patch] +stateDiagram-v2 + [*] --> PodCreated: User creates GPU pod + + state "Webhook Injection" as Webhook { + PodCreated --> CheckGPU: Admission webhook triggered + CheckGPU --> Inject: GPU resources detected + CheckGPU --> Skip: No GPU resources + Skip --> [*]: Pod starts normally + Inject --> PodScheduled: Init containers injected + } + + state "Init Container Execution" as InitExec { + PodScheduled --> DCGMDiag: Run dcgm-diag + + state "DCGM Diag" as DCGMDiag { + [*] --> GetGPUUUIDs: nvidia-smi query + GetGPUUUIDs --> RemoteDiag: dcgmi diag via hostengine + RemoteDiag --> DCGMPass: All tests pass + RemoteDiag --> DCGMFail: Test failure + } + + DCGMPass --> NCCLLoopback: Next check + DCGMFail --> ReportFailure: HealthEvent + + state "NCCL Loopback" as NCCLLoopback { + [*] --> RunLoopback: all_reduce_perf -g N + RunLoopback --> CheckBW: Measure bandwidth + CheckBW --> LoopbackPass: BW >= threshold + CheckBW --> LoopbackFail: BW < threshold + } + + LoopbackPass --> GangCheck: Check if gang-wide enabled + LoopbackFail --> ReportFailure + + GangCheck --> NCCLAllReduce: nccl-allreduce enabled + GangCheck --> AllPassed: Single-node only + } + + state "Gang Coordination" as GangCoord { + NCCLAllReduce --> WaitPeers: Poll ConfigMap + WaitPeers --> PeersReady: All peers registered + WaitPeers --> GangTimeout: Timeout (10 min) + GangTimeout --> ReportTimeout: isFatal=false + + state "NCCL All-Reduce" as AllReduce { + PeersReady --> PyTorchInit: TCP bootstrap to master + PyTorchInit --> RunAllReduce: dist.all_reduce() + RunAllReduce --> AllReducePass: BW >= threshold + RunAllReduce --> AllReduceFail: BW < threshold or error + } + + AllReducePass --> AllPassed + AllReduceFail --> ReportFailure + } + + state "Failure Handling" as FailHandle { + ReportFailure --> SendHealthEvent: gRPC to Platform Connector + ReportTimeout --> SendHealthEvent + SendHealthEvent --> PlatformConnector: HealthEvent published + PlatformConnector --> FaultQuarantine: Cordon node + FaultQuarantine --> NodeDrainer: Drain workloads + NodeDrainer --> FaultRemediation: Based on recommendedAction + FaultRemediation --> [*]: Node remediated or escalate + } + + AllPassed --> MainContainerStart: Init success (exit 0) + MainContainerStart --> [*]: Workload runs ``` -Namespace filtering handled by `namespaceSelector` in webhook config. 
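To ground the injection phase of the flow above, a rough sketch of the admission handler using controller-runtime's admission package. Decoder wiring differs between controller-runtime versions, and `buildInitContainers` is a hypothetical stand-in for the real image selection, GPU/claim copying, and NCCL env passthrough described elsewhere in this document.

```go
// Sketch only: admission handler that skips non-GPU pods and prepends the
// preflight init containers otherwise.
package webhook

import (
	"context"
	"encoding/json"
	"net/http"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

type PodInjector struct {
	Decoder          admission.Decoder
	GPUResourceNames []string // e.g. "nvidia.com/gpu", from Helm values
}

func (in *PodInjector) Handle(ctx context.Context, req admission.Request) admission.Response {
	pod := &corev1.Pod{}
	if err := in.Decoder.Decode(req, pod); err != nil {
		return admission.Errored(http.StatusBadRequest, err)
	}

	if !in.requestsGPU(pod) {
		return admission.Allowed("no GPU resources or DRA claims; preflight not injected")
	}

	// Prepend so preflight runs before any user-supplied init containers.
	pod.Spec.InitContainers = append(buildInitContainers(pod), pod.Spec.InitContainers...)

	patched, err := json.Marshal(pod)
	if err != nil {
		return admission.Errored(http.StatusInternalServerError, err)
	}
	return admission.PatchResponseFromRaw(req.Object.Raw, patched)
}

func (in *PodInjector) requestsGPU(pod *corev1.Pod) bool {
	if len(pod.Spec.ResourceClaims) > 0 {
		return true // DRA path: a fuller version matches claims against configured device classes
	}
	for _, c := range pod.Spec.Containers {
		for _, name := range in.GPUResourceNames {
			if _, ok := c.Resources.Limits[corev1.ResourceName(name)]; ok {
				return true
			}
		}
	}
	return false
}

// buildInitContainers is a placeholder; the real version is driven by the
// enabled checks and detection logic described in this document.
func buildInitContainers(pod *corev1.Pod) []corev1.Container {
	return []corev1.Container{{
		Name:  "preflight-dcgm-diag",
		Image: "ghcr.io/nvidia/nvsentinel/preflight-dcgm-diag:v1",
	}}
}
```

Returning `admission.Allowed` without a patch leaves non-GPU pods completely untouched.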
- ### MutatingWebhookConfiguration (sketch) ```yaml From 0066b88671c22910584bc812ead48e71cf0a47ba Mon Sep 17 00:00:00 2001 From: Ajay Mishra Date: Mon, 19 Jan 2026 17:06:13 +0530 Subject: [PATCH 11/11] chore: added reason why pytorch nccl preferred and test result Signed-off-by: Ajay Mishra --- docs/designs/026-preflight-checks.md | 91 ++++++++++++++++++++-------- 1 file changed, 67 insertions(+), 24 deletions(-) diff --git a/docs/designs/026-preflight-checks.md b/docs/designs/026-preflight-checks.md index 5006da405..50dbe13aa 100644 --- a/docs/designs/026-preflight-checks.md +++ b/docs/designs/026-preflight-checks.md @@ -285,38 +285,81 @@ Checker validates `busbw` (bus bandwidth) against configured threshold. Tests cross-node GPU collective communication over RDMA/InfiniBand. +**Why PyTorch over MPI:** +- MPI-based tests require `pods/exec` to spawn processes on peer pods +- `pods/exec` is high privilege — allows executing commands in any pod in the namespace +- PyTorch's `torchrun` handles coordination via TCP without cross-pod exec +- Each init container runs independently; NCCL uses RDMA for actual data transfer + **How it works:** 1. **Gang formation**: All pods register in shared ConfigMap (see Gang Coordination section) 2. **Wait for peers**: Each init container polls ConfigMap (mounted volume) until all peers registered -3. **Bootstrap via TCP**: Rank 0's IP from ConfigMap; PyTorch/NCCL handles handshake -4. **Run test**: Each init container runs PyTorch all-reduce independently; NCCL coordinates internally +3. **torchrun bootstrap**: Each pod runs `torchrun` connecting to master (rank 0) via TCP +4. **Single communicator**: All GPUs form one NCCL communicator (e.g., 2 nodes × 8 GPUs = 16 ranks) +5. **Run test**: `dist.all_reduce()` runs across all ranks; NCCL uses RDMA -**Test script (PyTorch-based, no MPI needed):** +**Test script (PyTorch-based):** ```python -import torch +#!/usr/bin/env python3 +""" +NCCL All-Reduce benchmark - single communicator spanning all GPUs. +Env vars set by torchrun: RANK, LOCAL_RANK, WORLD_SIZE +""" +import os, time, torch import torch.distributed as dist -import os - -# Read from mounted ConfigMap -rank = int(os.environ['MY_RANK']) -world_size = int(os.environ['WORLD_SIZE']) -master_addr = os.environ['MASTER_ADDR'] # Rank 0's IP from ConfigMap - -# PyTorch handles NCCL bootstrap via TCP -dist.init_process_group( - backend='nccl', - init_method=f'tcp://{master_addr}:29500', - rank=rank, - world_size=world_size -) - -# Run all-reduce test, measure bandwidth -tensor = torch.ones(256 * 1024 * 1024, device='cuda') # 1GB -dist.all_reduce(tensor) -# ... 
measure time, calculate bandwidth, compare to threshold + +def benchmark_allreduce(size_bytes, iters=20, warmup=5): + local_rank = int(os.environ.get("LOCAL_RANK", 0)) + tensor = torch.randn(size_bytes // 4, dtype=torch.float32, + device=f"cuda:{local_rank}") + + for _ in range(warmup): + dist.all_reduce(tensor, op=dist.ReduceOp.SUM) + torch.cuda.synchronize() + + start = time.perf_counter() + for _ in range(iters): + dist.all_reduce(tensor, op=dist.ReduceOp.SUM) + torch.cuda.synchronize() + elapsed = time.perf_counter() - start + + world_size = dist.get_world_size() + algo_bw = (size_bytes * iters) / elapsed / 1e9 + bus_bw = algo_bw * (2 * (world_size - 1) / world_size) + return bus_bw + +def main(): + dist.init_process_group(backend="nccl") + torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0))) + + bus_bw = benchmark_allreduce(4 * 1024**3) # 4GB + threshold = float(os.environ.get("BW_THRESHOLD_GBPS", "100")) + + if dist.get_rank() == 0 and bus_bw < threshold: + # Report failure to Platform Connector via gRPC + ... + + dist.destroy_process_group() + +if __name__ == "__main__": + main() +``` + +**Invocation (per pod):** +```bash +torchrun --nnodes=$NNODES --nproc_per_node=$GPUS_PER_NODE \ + --node_rank=$MY_RANK --master_addr=$MASTER_ADDR --master_port=29500 \ + /scripts/bench.py ``` -Each init container runs independently. +Each pod runs `torchrun` independently. No MPI, no `pods/exec`, no special RBAC. + +**Benchmark results (Azure NDv4, A100):** + +| Nodes | MPI-based (GB/s) | PyTorch (GB/s) | +|-------|------------------|----------------| +| 2 | 164 | 169 | +| 3 | 160 | 168 | **What it catches:** - InfiniBand/RDMA link failures