From f3ce511028b95e99eddb93361b5b16c075bb3602 Mon Sep 17 00:00:00 2001 From: Ajay Mishra Date: Tue, 6 Jan 2026 15:10:05 +0530 Subject: [PATCH 01/11] docs: add design doc for preflight check Signed-off-by: Ajay Mishra --- docs/designs/026-preflight-checks.md | 332 +++++++++++++++++++++++++++ 1 file changed, 332 insertions(+) create mode 100644 docs/designs/026-preflight-checks.md diff --git a/docs/designs/026-preflight-checks.md b/docs/designs/026-preflight-checks.md new file mode 100644 index 000000000..1f6cb9167 --- /dev/null +++ b/docs/designs/026-preflight-checks.md @@ -0,0 +1,332 @@ +# ADR-026: Feature — Preflight Checks via Init Container Injection + +## Context + +Kubernetes 1.35 introduced `spec.workloadRef` for gang scheduling. GPU failures during distributed training waste compute time. Running diagnostics before the workload starts catches bad GPUs early. + +### Distinction from Health Monitors + +NVSentinel already has health monitors (GPU Health Monitor, Syslog Health Monitor) that detect GPU issues. This is different: + +| | Health Monitors | Preflight Checks | +|-|-----------------|------------------| +| When | Continuous (DaemonSet) | Once at pod start (init container) | +| Check type | Passive (health watches, syslog parsing) | Active diagnostics (DCGM diag) | +| Detects | Failures as they occur (XID errors, ECC, thermal) | Latent issues before starting | +| NCCL tests | No | Yes | +| Purpose | Reactive remediation | Prevent bad starts | + +Preflight asks "is this GPU healthy enough to start?" Health monitors ask "did this GPU fail while running?" + +## Decision + +Implement a MutatingAdmissionWebhook that injects preflight check init containers into pods that have `spec.workloadRef`. + +## Implementation + +### Component Structure + +``` +preflight-injector/ +├── main.go +├── go.mod +├── go.sum +├── Makefile +├── Tiltfile +└── pkg/ + ├── config/ + │ ├── config.go + │ └── config_test.go + ├── webhook/ + │ └── v1alpha1/ + │ ├── handler.go # Admission handler + │ └── handler_test.go + ├── injection/ + │ ├── injector.go # Init container construction + │ └── injector_test.go + ├── coordination/ + │ ├── discovery.go # Peer discovery via workloadRef + │ └── configmap.go # NCCL ID ConfigMap management + └── metrics/ + └── metrics.go +``` + +### Webhook Flow + +```mermaid +flowchart TD + A[Pod CREATE request] --> B{Has GPU resource?} + B -->|No| C[Allow - no mutation] + B -->|Yes| D[Inject init container] + D --> E[Return JSON patch] +``` + +Namespace filtering handled by `namespaceSelector` in webhook config. Checks configured at deployment time. + +### MutatingWebhookConfiguration + +```yaml +apiVersion: admissionregistration.k8s.io/v1 +kind: MutatingWebhookConfiguration +metadata: + name: preflight-injector +webhooks: + - name: preflight.nvsentinel.nvidia.com + clientConfig: + service: + name: preflight-injector + namespace: nvsentinel + path: /mutate-pod + rules: + - apiGroups: [""] + apiVersions: ["v1"] + resources: ["pods"] + operations: ["CREATE"] + namespaceSelector: + matchExpressions: + - key: kubernetes.io/metadata.name + operator: In + values: [] # Populated from Helm values + failurePolicy: Fail + sideEffects: None + admissionReviewVersions: ["v1"] +``` + +Namespace list populated from Helm values. 
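**Mutation logic (sketch):** a minimal illustration of the flow above, assuming the admission handler delegates to `pkg/injection`. The function names (`needsPreflight`, `buildPatch`) and the hard-coded resource name are illustrative, not existing code.

```go
// Illustrative sketch of pkg/injection: decide whether to mutate and build the
// JSON patch returned in the AdmissionReview response.
package injection

import (
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
)

// gpuResource is hard-coded here for brevity; the webhook reads it from config.
const gpuResource = corev1.ResourceName("nvidia.com/gpu")

// needsPreflight reports whether any container in the pod requests GPUs.
func needsPreflight(pod *corev1.Pod) bool {
	for _, c := range pod.Spec.Containers {
		if _, ok := c.Resources.Limits[gpuResource]; ok {
			return true
		}
		if _, ok := c.Resources.Requests[gpuResource]; ok {
			return true
		}
	}
	return false
}

// patchOp is a single RFC 6902 operation.
type patchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value"`
}

// buildPatch appends the preflight init container to the pod spec.
func buildPatch(pod *corev1.Pod, preflight corev1.Container) ([]byte, error) {
	op := patchOp{Op: "add", Path: "/spec/initContainers/-", Value: preflight}
	if len(pod.Spec.InitContainers) == 0 {
		// The array does not exist yet, so create it in one operation.
		op = patchOp{Op: "add", Path: "/spec/initContainers", Value: []corev1.Container{preflight}}
	}
	return json.Marshal([]patchOp{op})
}
```

The `/spec/initContainers/-` pointer appends after any init containers the user already defined.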
+ +### Init Container Spec + +```yaml +initContainers: + - name: nvsentinel-preflight + image: ghcr.io/nvidia/nvsentinel/preflight-checker:v1 + env: + - name: PREFLIGHT_CHECKS + value: "dcgm-diag,nccl-loopback" + - name: DCGM_DIAG_LEVEL + value: "1" + - name: CHECK_TIMEOUT + value: "300s" + - name: GANG_TIMEOUT + value: "600s" + resources: + limits: + nvidia.com/gpu: 8 # Copied from main container + securityContext: + privileged: true + volumeMounts: + - name: dcgm-socket + mountPath: /var/run/nvidia + - name: platform-connector-socket + mountPath: /var/run/nvsentinel +``` + +**GPU resource handling:** Webhook copies `nvidia.com/gpu` from main container to init container (GPU allocation is per-pod). + +### Check Types + +| Check | Scope | Coordination | +|-------|-------|--------------| +| `dcgm-diag` | Single node | None | +| `nccl-loopback` | Single node | None | +| `nccl-allreduce` | Gang-wide | ConfigMap | +| `plugin:` | Varies | Varies | + +### Plugin Interface (Third-Party Checks) + +Plugins are separate init containers. Webhook injects one container per plugin. + +**Registration:** +```yaml +preflight-injector: + plugins: + - name: bandwidth-check + image: myregistry/bandwidth-check:v1 + timeout: "60s" +``` + +**Injected init containers:** +```yaml +initContainers: + # Built-in checks + - name: nvsentinel-preflight + image: ghcr.io/nvidia/nvsentinel/preflight-checker:v1 + ... + + # Plugin (separate container) + - name: preflight-bandwidth-check + image: myregistry/bandwidth-check:v1 + env: + - name: CHECK_TIMEOUT + value: "60s" + - name: NODE_NAME + valueFrom: + fieldRef: + fieldPath: spec.nodeName +``` + +**Plugin contract:** +- Exit `0` on success, non-zero on failure +- Write HealthEvent to Platform Connector socket (same as built-in checks) +- Plugin sets `isFatal`, `recommendedAction` in HealthEvent +- Platform Connector overrides can modify if operator disagrees (existing feature) +- Webhook mounts same volumes (GPU, DCGM socket, Platform Connector socket) + +### Configuration + +Configured at deployment time via Helm values. No per-workload annotations. + +### Gang Coordination + +For gang-wide checks like `nccl-allreduce`, pods discover peers using `workloadRef`: + +```mermaid +sequenceDiagram + participant R0 as Rank 0 Init + participant R1 as Rank 1 Init + participant API as Kube API + participant CM as ConfigMap + + R0->>API: List pods with same workloadRef + R1->>API: List pods with same workloadRef + + Note over R0,R1: Determine rank by sorting pod names + + R0->>CM: Create ConfigMap with NCCL unique ID + R1->>CM: Poll until ConfigMap exists + R1->>CM: Read NCCL unique ID + + R0->>R1: nccl.init() (barrier inside NCCL) + R0->>R1: nccl.all_reduce() +``` + +**Peer discovery via workloadRef:** +- Init container lists pods where `workloadRef.name` and `workloadRef.podGroup` match +- Gets peer IPs directly from pod list +- Determines rank by sorting pod names alphabetically + +**NCCL ID sharing:** +- Rank 0 creates ConfigMap named `preflight-{workload}-{podgroup}` +- Other ranks poll until ConfigMap exists (10 min timeout) +- ConfigMap has owner reference to Workload for cleanup + +**Webhook just injects the init container.** No Service or other resources needed. + +**Gang coordination timeout:** 10 minutes. If gang doesn't form, init fails with `isFatal: false` (not a hardware issue). 
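**Peer discovery and rank (sketch):** assuming the pod list has already been filtered by matching `workloadRef.name` and `workloadRef.podGroup` (the typed client code for the 1.35 Workload API is omitted), rank assignment reduces to a sort. `Peer` and `rankOf` are illustrative names.

```go
// Illustrative sketch of pkg/coordination/discovery.go: assign ranks once the
// gang's pods are known. Listing pods by workloadRef.name/podGroup is elided.
package coordination

import (
	"fmt"
	"sort"

	corev1 "k8s.io/api/core/v1"
)

// Peer is one member of the gang, as needed by the NCCL all-reduce check.
type Peer struct {
	Name string
	IP   string
}

// rankOf sorts the gang alphabetically by pod name (as described above) and
// returns this pod's rank plus the full peer list.
func rankOf(selfPodName string, gangPods []corev1.Pod) (int, []Peer, error) {
	peers := make([]Peer, 0, len(gangPods))
	for _, p := range gangPods {
		peers = append(peers, Peer{Name: p.Name, IP: p.Status.PodIP})
	}
	sort.Slice(peers, func(i, j int) bool { return peers[i].Name < peers[j].Name })

	for rank, p := range peers {
		if p.Name == selfPodName {
			// Rank 0 creates the NCCL ID ConfigMap; the other ranks poll for it.
			return rank, peers, nil
		}
	}
	return -1, nil, fmt.Errorf("pod %q not in its own gang listing", selfPodName)
}
```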
+ +### Failure Behavior + +Init container exit codes: +- `0`: All checks passed +- `1`: Check failed, pod should not start +- `2`: Configuration error + +On failure: +- Pod stays in `Init:Error` state +- **HealthEvent created** via Platform Connector (same as health monitors) +- Kubernetes Event created with failure details +- Metrics incremented (`preflight_check_failures_total`) + +HealthEvent feeds into existing NVSentinel workflow (quarantine, correlation, etc). + +### Error to Recommended Action Mapping + +**DCGM Diag** : + +| Test | Result | Recommended Action | +|------|--------|-------------------| +| Memory | `FAIL` | `CONTACT_SUPPORT` | +| PCIe | `FAIL` | `CONTACT_SUPPORT` | +| NVLink | `FAIL` | `CONTACT_SUPPORT` | +| Stress | `FAIL` | `RUN_DCGMEUD` | +| Any | `WARN` | `NONE` | + +**NCCL Checks**: + +| Error | Recommended Action | +|-------|-------------------| +| `NCCL_SYSTEM_ERROR` | `CONTACT_SUPPORT` | +| `NCCL_INTERNAL_ERROR` | `RUN_DCGMEUD` | +| `NCCL_INVALID_USAGE` | `NONE` | +| `NCCL_TIMEOUT` | `NONE` | +| `NCCL_REMOTE_ERROR` | `CONTACT_SUPPORT` | + +**isFatal determination**: +- DCGM diag `FAIL` → `isFatal: true` +- DCGM diag `WARN` → `isFatal: false` +- NCCL hardware errors (`SYSTEM_ERROR`, `INTERNAL_ERROR`, `REMOTE_ERROR`) → `isFatal: true` +- NCCL timeout/config errors → `isFatal: false` + +### Helm Values + +```yaml +preflight-injector: + enabled: false # Opt-in + + checks: + - dcgm-diag + - nccl-loopback + # - nccl-allreduce # Enable for gang workloads + + dcgmDiagLevel: 1 # 1 (quick, ~30s) or 2 (medium, ~2-3min) + checkTimeout: "300s" # Per-check timeout + gangTimeout: "600s" # Gang coordination timeout + + # Namespaces where preflight checks apply + namespaces: + - training + + webhook: + failurePolicy: Fail # or Ignore + + image: + repository: ghcr.io/nvidia/nvsentinel/preflight-checker + tag: v1 +``` + +All GPU pods in listed namespaces get the configured checks. + +## Rationale + +- Mutating webhook requires no external dependencies +- Init containers are native Kubernetes +- Opt-in via namespace selector +- Deployment-level config, no user workload changes + +## Consequences + +### Positive +- Catches GPU failures before workload starts +- Works with any workload controller +- No user workload changes + +### Negative +- Adds 30-60s pod startup latency (DCGM diag) +- Requires privileged init container +- Webhook downtime blocks pod creation + +### Mitigations +- `failurePolicy: Ignore` if latency unacceptable +- Timeout configuration +- HA deployment (replicas, PDB) + +## Alternatives Considered + +### Kyverno Policy +Rejected: External dependency. + +### User-managed init containers +Rejected: No enforcement. Users forget. + +### Custom CRD wrapper +Rejected: Requires changing how workloads are deployed. + +## Out of Scope + +- **Repeated failure handling**: Health Event Analyzer handles pattern detection on HealthEvents. Preflight just emits events. 
+ +## References + +- K8s 1.35 Workload API: https://kubernetes.io/blog/2025/12/29/kubernetes-v1-35-introducing-workload-aware-scheduling/ +- GitHub Issue: https://github.com/NVIDIA/NVSentinel/issues/658 + From 6606d6ba6c0c94261a01dd7fbdb2bc0b4cb753be Mon Sep 17 00:00:00 2001 From: Ajay Mishra Date: Tue, 6 Jan 2026 15:23:13 +0530 Subject: [PATCH 02/11] docs: few changes Signed-off-by: Ajay Mishra --- docs/designs/026-preflight-checks.md | 26 +++++++++++++++----------- 1 file changed, 15 insertions(+), 11 deletions(-) diff --git a/docs/designs/026-preflight-checks.md b/docs/designs/026-preflight-checks.md index 1f6cb9167..ee0562681 100644 --- a/docs/designs/026-preflight-checks.md +++ b/docs/designs/026-preflight-checks.md @@ -2,7 +2,9 @@ ## Context -Kubernetes 1.35 introduced `spec.workloadRef` for gang scheduling. GPU failures during distributed training waste compute time. Running diagnostics before the workload starts catches bad GPUs early. +GPU failures during training waste compute time. Running diagnostics before the workload starts catches bad GPUs early. + +Kubernetes 1.35 introduced `spec.workloadRef` for gang scheduling, which enables gang-wide checks (NCCL all-reduce across multiple pods). ### Distinction from Health Monitors @@ -20,7 +22,10 @@ Preflight asks "is this GPU healthy enough to start?" Health monitors ask "did t ## Decision -Implement a MutatingAdmissionWebhook that injects preflight check init containers into pods that have `spec.workloadRef`. +Implement a MutatingAdmissionWebhook that injects preflight check init containers into GPU pods (pods requesting `nvidia.com/gpu`) in configured namespaces. + +- Injection trigger: GPU resource request + namespace +- Gang coordination (NCCL all-reduce): Uses `workloadRef` if present, skipped otherwise ## Implementation @@ -166,10 +171,10 @@ initContainers: ``` **Plugin contract:** -- Exit `0` on success, non-zero on failure +- Exit codes: `0` (passed), `1` (check failed), `2` (config error) - Write HealthEvent to Platform Connector socket (same as built-in checks) - Plugin sets `isFatal`, `recommendedAction` in HealthEvent -- Platform Connector overrides can modify if operator disagrees (existing feature) +- Platform Connector overrides can modify values - Webhook mounts same volumes (GPU, DCGM socket, Platform Connector socket) ### Configuration @@ -210,7 +215,7 @@ sequenceDiagram - Other ranks poll until ConfigMap exists (10 min timeout) - ConfigMap has owner reference to Workload for cleanup -**Webhook just injects the init container.** No Service or other resources needed. +Webhook injects the init container. No Service or other resources created. **Gang coordination timeout:** 10 minutes. If gang doesn't form, init fails with `isFatal: false` (not a hardware issue). @@ -288,17 +293,16 @@ All GPU pods in listed namespaces get the configured checks. ## Rationale -- Mutating webhook requires no external dependencies -- Init containers are native Kubernetes -- Opt-in via namespace selector -- Deployment-level config, no user workload changes +- Mutating webhook, no external dependencies +- Init containers +- Namespace selector opt-in +- Deployment-level config ## Consequences ### Positive - Catches GPU failures before workload starts - Works with any workload controller -- No user workload changes ### Negative - Adds 30-60s pod startup latency (DCGM diag) @@ -323,7 +327,7 @@ Rejected: Requires changing how workloads are deployed. 
## Out of Scope -- **Repeated failure handling**: Health Event Analyzer handles pattern detection on HealthEvents. Preflight just emits events. +- **Repeated failure handling**: Health Event Analyzer handles pattern detection. Preflight emits events. ## References From cbc9501923eda94b09eb39474103c55c50eff919 Mon Sep 17 00:00:00 2001 From: Ajay Mishra Date: Tue, 6 Jan 2026 15:24:40 +0530 Subject: [PATCH 03/11] docs: few changes Signed-off-by: Ajay Mishra --- docs/designs/026-preflight-checks.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/designs/026-preflight-checks.md b/docs/designs/026-preflight-checks.md index ee0562681..aee53489a 100644 --- a/docs/designs/026-preflight-checks.md +++ b/docs/designs/026-preflight-checks.md @@ -4,7 +4,7 @@ GPU failures during training waste compute time. Running diagnostics before the workload starts catches bad GPUs early. -Kubernetes 1.35 introduced `spec.workloadRef` for gang scheduling, which enables gang-wide checks (NCCL all-reduce across multiple pods). +Kubernetes 1.35 introduced `spec.workloadRef` for gang scheduling. Preflight can use `workloadRef` to discover peer pods and run gang-wide checks (NCCL all-reduce). ### Distinction from Health Monitors From 26c804a21dcdb7e3fc1d1a51fd0684878ee6e89e Mon Sep 17 00:00:00 2001 From: Ajay Mishra Date: Tue, 6 Jan 2026 15:44:20 +0530 Subject: [PATCH 04/11] chore: minor changes Signed-off-by: Ajay Mishra --- docs/designs/026-preflight-checks.md | 83 ++++++++++++++++++++-------- 1 file changed, 61 insertions(+), 22 deletions(-) diff --git a/docs/designs/026-preflight-checks.md b/docs/designs/026-preflight-checks.md index aee53489a..2cc3870d3 100644 --- a/docs/designs/026-preflight-checks.md +++ b/docs/designs/026-preflight-checks.md @@ -32,28 +32,48 @@ Implement a MutatingAdmissionWebhook that injects preflight check init container ### Component Structure ``` -preflight-injector/ -├── main.go -├── go.mod -├── go.sum -├── Makefile -├── Tiltfile -└── pkg/ - ├── config/ - │ ├── config.go - │ └── config_test.go - ├── webhook/ - │ └── v1alpha1/ - │ ├── handler.go # Admission handler - │ └── handler_test.go - ├── injection/ - │ ├── injector.go # Init container construction - │ └── injector_test.go - ├── coordination/ - │ ├── discovery.go # Peer discovery via workloadRef - │ └── configmap.go # NCCL ID ConfigMap management - └── metrics/ - └── metrics.go +preflight/ +├── injector/ # Webhook (Deployment) +│ ├── main.go +│ ├── go.mod +│ ├── Makefile +│ ├── Tiltfile +│ └── pkg/ +│ ├── config/ +│ │ └── config.go +│ ├── webhook/ +│ │ └── v1alpha1/ +│ │ ├── handler.go +│ │ └── handler_test.go +│ ├── injection/ +│ │ ├── injector.go +│ │ └── injector_test.go +│ └── metrics/ +│ └── metrics.go +│ +├── checker/ # Init container image +│ ├── main.go +│ ├── go.mod +│ ├── Makefile +│ ├── Tiltfile +│ └── pkg/ +│ ├── runner/ +│ │ └── runner.go +│ ├── checks/ +│ │ ├── dcgm/ +│ │ │ └── diag.go # dcgmi diag -r 1/2 +│ │ └── nccl/ +│ │ ├── loopback.go +│ │ └── allreduce.go +│ ├── coordination/ +│ │ ├── discovery.go # Peer discovery via workloadRef +│ │ └── configmap.go # NCCL ID sharing +│ ├── reporting/ +│ │ └── healthevents.go +│ └── metrics/ +│ └── metrics.go +│ +└── Makefile # Builds both ``` ### Webhook Flow @@ -291,6 +311,25 @@ preflight-injector: All GPU pods in listed namespaces get the configured checks. 
+### Metrics + +**preflight/checker** (exposed via pushgateway or scraped from pod annotations): + +| Metric | Type | Labels | +|--------|------|--------| +| `preflight_check_total` | Counter | `check`, `result` | +| `preflight_check_duration_seconds` | Histogram | `check` | +| `preflight_check_failures_total` | Counter | `check`, `node`, `error_code` | +| `preflight_gang_wait_seconds` | Histogram | `workload` | +| `preflight_config_errors_total` | Counter | `error` | + +**preflight/injector** (standard Prometheus endpoint): + +| Metric | Type | Labels | +|--------|------|--------| +| `preflight_injection_total` | Counter | `result` | +| `preflight_webhook_latency_seconds` | Histogram | - | + ## Rationale - Mutating webhook, no external dependencies From 9de7c2a573b083feb0bd47807fb7f327382147c8 Mon Sep 17 00:00:00 2001 From: Ajay Mishra Date: Tue, 6 Jan 2026 15:46:20 +0530 Subject: [PATCH 05/11] chore: minor changes Signed-off-by: Ajay Mishra --- docs/designs/026-preflight-checks.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/designs/026-preflight-checks.md b/docs/designs/026-preflight-checks.md index 2cc3870d3..d7d9647b2 100644 --- a/docs/designs/026-preflight-checks.md +++ b/docs/designs/026-preflight-checks.md @@ -344,14 +344,14 @@ All GPU pods in listed namespaces get the configured checks. - Works with any workload controller ### Negative -- Adds 30-60s pod startup latency (DCGM diag) -- Requires privileged init container -- Webhook downtime blocks pod creation +- Adds 30-60s pod startup latency (DCGM diag level 1) +- Requires privileged init container for DCGM +- Webhook downtime blocks pod creation (if `failurePolicy: Fail`) ### Mitigations -- `failurePolicy: Ignore` if latency unacceptable -- Timeout configuration -- HA deployment (replicas, PDB) +- **Latency**: Use DCGM level 1 (~30s) instead of level 2 (~2-3min); skip expensive checks for non-critical workloads +- **Privileged**: Required for hardware access; limit to specific namespaces +- **Webhook availability**: HA deployment (replicas, PDB); `failurePolicy: Ignore` allows pods through if webhook is down ## Alternatives Considered From 2a06cc37aee1cb2a8a43eea54b419a21ea9fe079 Mon Sep 17 00:00:00 2001 From: Ajay Mishra Date: Thu, 15 Jan 2026 09:17:25 +0530 Subject: [PATCH 06/11] chore: address review comments Signed-off-by: Ajay Mishra --- docs/designs/026-preflight-checks.md | 443 ++++++++++++++++++++++----- 1 file changed, 361 insertions(+), 82 deletions(-) diff --git a/docs/designs/026-preflight-checks.md b/docs/designs/026-preflight-checks.md index d7d9647b2..a4f7995d2 100644 --- a/docs/designs/026-preflight-checks.md +++ b/docs/designs/026-preflight-checks.md @@ -22,73 +22,53 @@ Preflight asks "is this GPU healthy enough to start?" Health monitors ask "did t ## Decision -Implement a MutatingAdmissionWebhook that injects preflight check init containers into GPU pods (pods requesting `nvidia.com/gpu`) in configured namespaces. +Implement a MutatingAdmissionWebhook that injects preflight check init containers into GPU pods in configured namespaces. 
-- Injection trigger: GPU resource request + namespace -- Gang coordination (NCCL all-reduce): Uses `workloadRef` if present, skipped otherwise +### Key points -## Implementation +- Injection trigger: GPU resources (extended resources or DRA claims) + namespace +- Gang coordination: Uses `workloadRef` for gang-wide checks when present +- Resource detection: Configurable lists for extended resource names and DRA device classes -### Component Structure +## Architecture + +### Components ``` preflight/ ├── injector/ # Webhook (Deployment) -│ ├── main.go -│ ├── go.mod -│ ├── Makefile -│ ├── Tiltfile -│ └── pkg/ -│ ├── config/ -│ │ └── config.go -│ ├── webhook/ -│ │ └── v1alpha1/ -│ │ ├── handler.go -│ │ └── handler_test.go -│ ├── injection/ -│ │ ├── injector.go -│ │ └── injector_test.go -│ └── metrics/ -│ └── metrics.go -│ -├── checker/ # Init container image -│ ├── main.go -│ ├── go.mod -│ ├── Makefile -│ ├── Tiltfile │ └── pkg/ -│ ├── runner/ -│ │ └── runner.go -│ ├── checks/ -│ │ ├── dcgm/ -│ │ │ └── diag.go # dcgmi diag -r 1/2 -│ │ └── nccl/ -│ │ ├── loopback.go -│ │ └── allreduce.go -│ ├── coordination/ -│ │ ├── discovery.go # Peer discovery via workloadRef -│ │ └── configmap.go # NCCL ID sharing -│ ├── reporting/ -│ │ └── healthevents.go -│ └── metrics/ -│ └── metrics.go +│ ├── webhook/ # Admission handler +│ └── injection/ # Pod mutation + DRA detection │ -└── Makefile # Builds both +└── checker/ # Init container image + ├── nccl-topologies/ # Built-in topology files + └── pkg/ + ├── checks/ # dcgm + nccl + ├── coordination/ # gang registration + NCCL ID + └── reporting/ # HealthEvent reporting ``` -### Webhook Flow +### Webhook flow ```mermaid flowchart TD - A[Pod CREATE request] --> B{Has GPU resource?} - B -->|No| C[Allow - no mutation] - B -->|Yes| D[Inject init container] + A[Pod CREATE request] --> B{GPU resources?} + B -->|No| C[Allow] + B -->|Yes| D[Inject init containers] D --> E[Return JSON patch] ``` -Namespace filtering handled by `namespaceSelector` in webhook config. Checks configured at deployment time. +Namespace filtering handled by `namespaceSelector` in webhook config. -### MutatingWebhookConfiguration +### Namespace model + +- NVSentinel Helm chart is installed in `nvsentinel` namespace (webhook Deployment runs there). +- Webhook mutates Pods in *other* namespaces based on `namespaceSelector` (and skips system namespaces). +- The injected init containers run in the workload namespace. +- Any Kubernetes API access needed by the init container (gang coordination ConfigMap + Workload reads) must be granted in the workload namespace (namespace-scoped Role/RoleBinding). This is created by the Helm chart in the opted-in namespaces. + +### MutatingWebhookConfiguration (sketch) ```yaml apiVersion: admissionregistration.k8s.io/v1 @@ -112,14 +92,22 @@ webhooks: - key: kubernetes.io/metadata.name operator: In values: [] # Populated from Helm values + - key: kubernetes.io/metadata.name + operator: NotIn + values: [] # Excluded namespaces (systemNamespaces, nvsentinel, etc.) failurePolicy: Fail sideEffects: None admissionReviewVersions: ["v1"] ``` -Namespace list populated from Helm values. +## Resource detection and injection -### Init Container Spec +### Detection logic + +1. Extended resources (device plugins): check `resources.limits`/`resources.requests` for configured names (e.g. `nvidia.com/gpu`) +2. 
DRA: check `spec.resourceClaims`, resolve claim/template, match `deviceClassName` against configured list + +### Init container spec (sketch) ```yaml initContainers: @@ -134,21 +122,36 @@ initContainers: value: "300s" - name: GANG_TIMEOUT value: "600s" + - name: PLATFORM_CONNECTOR_SOCKET + value: "unix:///var/run/nvsentinel.sock" + - name: MY_POD_NAME + valueFrom: + fieldRef: + fieldPath: metadata.name + - name: MY_POD_IP + valueFrom: + fieldRef: + fieldPath: status.podIP resources: limits: - nvidia.com/gpu: 8 # Copied from main container + nvidia.com/gpu: 8 # Max across all containers + nvidia.com/mlnxnics: 4 # Max across all containers (if NCCL enabled) securityContext: - privileged: true + privileged: true # DCGM diag volumeMounts: - name: dcgm-socket mountPath: /var/run/nvidia - name: platform-connector-socket - mountPath: /var/run/nvsentinel + mountPath: /var/run ``` -**GPU resource handling:** Webhook copies `nvidia.com/gpu` from main container to init container (GPU allocation is per-pod). +### Resource handling + +- GPUs / extended resources: inject max across all containers +- Network / extended resources: inject max across all containers for configured names +- DRA: inject all referenced GPU/network claims into init container -### Check Types +## Check types | Check | Scope | Coordination | |-------|-------|--------------| @@ -192,10 +195,12 @@ initContainers: **Plugin contract:** - Exit codes: `0` (passed), `1` (check failed), `2` (config error) -- Write HealthEvent to Platform Connector socket (same as built-in checks) -- Plugin sets `isFatal`, `recommendedAction` in HealthEvent -- Platform Connector overrides can modify values -- Webhook mounts same volumes (GPU, DCGM socket, Platform Connector socket) +- Report failures via gRPC to Platform Connector: + - Unix socket: `unix:///var/run/nvsentinel.sock` (matches global `socketPath`) + - Use `HealthEventOccurredV1` RPC (service `PlatformConnector`, proto `data-models/protobufs/health_event.proto`) + - Plugin sets `isFatal`, `recommendedAction`, `errorCode` in HealthEvent + - Platform Connector overrides can modify these values via CEL rules +- Webhook mounts required volumes: GPU devices, DCGM socket, Platform Connector socket ### Configuration @@ -203,47 +208,237 @@ Configured at deployment time via Helm values. No per-workload annotations. 
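**Detection sketch:** the extended-resource branch and the "max across all containers" rule above. DRA claim matching is omitted, function names are illustrative, and the resource names come from the Helm `gpuDetection` values.

```go
// Illustrative sketch of the extended-resource branch in pkg/injection: the init
// container gets the maximum count seen across all containers in the pod.
package injection

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// maxExtendedResource returns the largest limit/request for name across the
// pod's containers (zero if no container asks for it).
func maxExtendedResource(pod *corev1.Pod, name corev1.ResourceName) resource.Quantity {
	var maxQty resource.Quantity
	for _, c := range pod.Spec.Containers {
		for _, rl := range []corev1.ResourceList{c.Resources.Limits, c.Resources.Requests} {
			if q, ok := rl[name]; ok && q.Cmp(maxQty) > 0 {
				maxQty = q
			}
		}
	}
	return maxQty
}

// shouldInject reports whether any configured GPU resource name appears in the pod.
// DRA detection (spec.resourceClaims + deviceClassName matching) would be a second branch.
func shouldInject(pod *corev1.Pod, gpuResourceNames []string) bool {
	for _, n := range gpuResourceNames {
		if q := maxExtendedResource(pod, corev1.ResourceName(n)); !q.IsZero() {
			return true
		}
	}
	return false
}
```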
### Gang Coordination -For gang-wide checks like `nccl-allreduce`, pods discover peers using `workloadRef`: +For gang-wide checks like `nccl-allreduce`, pods discover peers via ConfigMap registration: ```mermaid sequenceDiagram - participant R0 as Rank 0 Init - participant R1 as Rank 1 Init + participant W as Webhook + participant P0 as Pod 0 Init + participant P1 as Pod 1 Init participant API as Kube API participant CM as ConfigMap - R0->>API: List pods with same workloadRef - R1->>API: List pods with same workloadRef + Note over W: First pod in gang + W->>API: Create ConfigMap (expected=2, peers="") + + P0->>API: Patch ConfigMap: add pod-0:10.0.1.5 + P1->>API: Patch ConfigMap: add pod-1:10.0.1.6 - Note over R0,R1: Determine rank by sorting pod names + P0->>API: Poll until len(peers) == expected + P1->>API: Poll until len(peers) == expected - R0->>CM: Create ConfigMap with NCCL unique ID - R1->>CM: Poll until ConfigMap exists - R1->>CM: Read NCCL unique ID + Note over P0,P1: Determine rank by sorting pod names - R0->>R1: nccl.init() (barrier inside NCCL) - R0->>R1: nccl.all_reduce() + P0->>CM: Update with NCCL unique ID + P1->>CM: Read NCCL unique ID + + P0->>P1: nccl.init() (barrier inside NCCL) + P0->>P1: nccl.all_reduce() ``` -**Peer discovery via workloadRef:** -- Init container lists pods where `workloadRef.name` and `workloadRef.podGroup` match -- Gets peer IPs directly from pod list +**Peer registration (no pod listing):** +- Webhook idempotently creates ConfigMap named `preflight--` with `expected_count` +- Each init container patches ConfigMap to add its IP +- Init containers poll until all peers register - Determines rank by sorting pod names alphabetically -**NCCL ID sharing:** -- Rank 0 creates ConfigMap named `preflight-{workload}-{podgroup}` -- Other ranks poll until ConfigMap exists (10 min timeout) -- ConfigMap has owner reference to Workload for cleanup +**ConfigMap structure:** +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: preflight-myworkload-group1 + ownerReferences: + - apiVersion: scheduling.k8s.io/v1alpha1 + kind: Workload + name: myworkload +data: + expected_count: "2" + peers: | + pod-0:10.0.1.5 + pod-1:10.0.1.6 + nccl_unique_id: "base64..." # Added by rank 0 +``` -Webhook injects the init container. No Service or other resources created. +**Security:** Init containers have minimal RBAC (get/patch ConfigMap, get Workload). No pod list permission. **Gang coordination timeout:** 10 minutes. If gang doesn't form, init fails with `isFatal: false` (not a hardware issue). +### RBAC (gang coordination) + +Use a namespace-scoped Role for coordination. Kubernetes RBAC does not support label-based restrictions for ConfigMaps, so the checker enforces scope in code (expected ConfigMap name + required labels/ownerRef). + +```yaml +rules: + - apiGroups: ["scheduling.k8s.io"] + resources: ["workloads"] + verbs: ["get"] + - apiGroups: [""] + resources: ["configmaps"] + verbs: ["get", "create", "patch"] +``` + +Checker only reads/writes the coordination ConfigMap `preflight--` in its own namespace. + +### DRA Integration + +For pods using Dynamic Resource Allocation (DRA), the webhook copies resource claim references to the init container. 
+ +**Device claim detection:** +Webhook checks pod's `spec.resourceClaims`, retrieves each ResourceClaim or ResourceClaimTemplate, and matches `deviceClassName` against configurable lists for GPUs and network devices: + +```yaml +# Helm values +preflight-injector: + gpuDetection: + # Extended resources (current, no DRA) + resourceNames: + - "nvidia.com/gpu" + + # DRA device classes (requires operator configuration) + deviceClasses: + - "gpu.nvidia.com" + - "nvidia.com/gpu" + # Operators add their DeviceClass names here +``` + +**Init container injection with DRA:** +```yaml +apiVersion: v1 +kind: Pod +spec: + # Pod-level claims + resourceClaims: + - name: gpu-claim + resourceClaimName: training-gpus + - name: rdma-claim + resourceClaimName: training-rdma + + initContainers: + - name: nvsentinel-preflight + resources: + claims: + - name: gpu-claim # References same GPU claim + - name: rdma-claim # References same network claim + + containers: + - name: main + resources: + claims: + - name: gpu-claim + - name: rdma-claim +``` + +**Multiple containers with GPUs:** +```yaml +# Extended resources example +containers: + - name: trainer + resources: + limits: + nvidia.com/gpu: 4 + nvidia.com/mlnxnics: 2 + - name: validator + resources: + limits: + nvidia.com/gpu: 8 + nvidia.com/mlnxnics: 4 + +# Init container gets max(4, 8) = 8 GPUs, max(2, 4) = 4 NICs +initContainers: + - name: nvsentinel-preflight + resources: + limits: + nvidia.com/gpu: 8 + nvidia.com/mlnxnics: 4 +``` + +**Detection logic:** +1. Check if pod uses extended resources (`nvidia.com/gpu`, `nvidia.com/mlnxnics`) → inject with max counts across all containers +2. Check if pod has DRA claims with matching `deviceClassName` → inject with all unique GPU and network claim references +3. If neither → skip injection + +Network devices (InfiniBand, RDMA) can be exposed via DRA claims or extended resources. Webhook uses same detection pattern for both. + +DRA device class names are not standardized. Operators configure `gpuDetection.deviceClasses` and `networkDetection.deviceClasses` to match cluster DeviceClass names. + +### Network Resources for NCCL Tests + +NCCL tests require access to RDMA/InfiniBand devices for efficient GPU-to-GPU communication. + +**Network device exposure methods:** + +1. **Extended resources (device plugins):** + - Example: `nvidia.com/mlnxnics` (common on GPU+IB clusters) + - Resource names are cluster-specific; configure `networkDetection.resourceNames` accordingly + +2. **DRA claims:** + - Network devices can also be exposed via DRA claims (DeviceClass names are cluster-specific) + - Webhook matches claim `deviceClassName` against `networkDetection.deviceClasses` + +**Webhook behavior for NCCL checks:** +If `nccl-loopback` or `nccl-allreduce` is enabled, webhook: +1. Copies all network device resources (extended resources using max count, or DRA claim references) +2. Scans all container env vars, copies those matching `ncclEnvPatterns` (glob patterns from Helm config) +3. 
Copies volume mounts referenced by `NCCL_TOPO_FILE` (if present) + +**Example: How env vars are copied** + +Main container has: +```yaml +env: + - name: NCCL_TOPO_FILE + value: /etc/nccl/topo.xml + - name: NCCL_IB_PCI_RELAXED_ORDERING + value: "1" + - name: NCCL_SOCKET_IFNAME + value: eth0 + - name: MY_APP_CONFIG + value: /app/config.yaml + - name: OMPI_MCA_btl + value: openib +``` + +Webhook with `ncclEnvPatterns: ["NCCL_*", "OMPI_*"]` copies to init container: +```yaml +env: + - name: NCCL_TOPO_FILE # Matches NCCL_* + value: /etc/nccl/topo.xml + - name: NCCL_IB_PCI_RELAXED_ORDERING # Matches NCCL_* + value: "1" + - name: NCCL_SOCKET_IFNAME # Matches NCCL_* + value: eth0 + - name: OMPI_MCA_btl # Matches OMPI_* + value: openib + # MY_APP_CONFIG NOT copied (doesn't match patterns) +volumeMounts: + - name: nccl-topology # Copied because NCCL_TOPO_FILE references it + mountPath: /etc/nccl +``` + +**NCCL topology file handling:** +The init container image includes common topology files for major cloud platforms: +``` +/opt/nvsentinel/nccl-topologies/ +├── azure-ndv4.xml +├── azure-ndv5.xml +├── aws-p5.48xlarge.xml +├── gcp-a3-mega.xml +└── oci-bm-gpu-a100.xml +``` + +**Topology selection priority:** +1. **User-provided**: Webhook checks if any container has `NCCL_TOPO_FILE` env var with a corresponding volume mount at that path → copy that volume mount to init container +2. **Auto-detect**: If no `NCCL_TOPO_FILE` + volume mount found, init container reads node label `node.kubernetes.io/instance-type`, maps to built-in topology file via Helm config +3. **Fallback**: If instance type unknown or not in mapping, don't set `NCCL_TOPO_FILE` (NCCL auto-detects topology) + +If pod has no network device resources, NCCL tests are skipped (DCGM diag runs). + ### Failure Behavior Init container exit codes: - `0`: All checks passed -- `1`: Check failed, pod should not start +- `1`: Check failed - `2`: Configuration error On failure: @@ -282,6 +477,33 @@ HealthEvent feeds into existing NVSentinel workflow (quarantine, correlation, et - NCCL hardware errors (`SYSTEM_ERROR`, `INTERNAL_ERROR`, `REMOTE_ERROR`) → `isFatal: true` - NCCL timeout/config errors → `isFatal: false` +### Integration with Node Drainer + +Preflight failures quarantine nodes without draining. Rationale: +- Workload never started → no pods to evict +- Draining would disrupt other gang members waiting for coordination +- Quarantine prevents new scheduling while remediation happens + +**Platform Connector override:** +```yaml +pipeline: + overrides: + - match: + agent: "preflight-checker" + override: + drainOverrides: + skip: true +``` + +**Flow:** +1. Preflight fails → HealthEvent with `isFatal: true` +2. Platform Connector applies override → `drainOverrides.skip: true` +3. Node drainer sees `skip: true` → quarantines node (taint), skips drain +4. Fault Remediation runs based on `recommendedAction` (EUD, support ticket, etc.) +5. Remediation succeeds → taint removed → node back in rotation + +Gang members on other nodes timeout after `gangTimeout`, fail with `isFatal: false` (coordination failure, not hardware), no quarantine. 
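**Mapping sketch:** the tables and `isFatal` rules above, encoded as the checker would fill the HealthEvent before sending it to the Platform Connector. The `Verdict` struct and string values are illustrative; the authoritative types live in `data-models/protobufs/health_event.proto`.

```go
// Illustrative sketch of the mapping tables above; the Verdict fields mirror the
// isFatal / recommendedAction HealthEvent fields.
package reporting

// Verdict captures what the checker reports for a failed check.
type Verdict struct {
	IsFatal           bool
	RecommendedAction string
}

// dcgmVerdict applies the DCGM diag table: WARN is never fatal, Stress failures
// ask for EUD, all other failures go to support.
func dcgmVerdict(test, result string) Verdict {
	switch {
	case result == "WARN":
		return Verdict{IsFatal: false, RecommendedAction: "NONE"}
	case result == "FAIL" && test == "Stress":
		return Verdict{IsFatal: true, RecommendedAction: "RUN_DCGMEUD"}
	case result == "FAIL":
		return Verdict{IsFatal: true, RecommendedAction: "CONTACT_SUPPORT"}
	default:
		return Verdict{IsFatal: false, RecommendedAction: "NONE"}
	}
}

// ncclVerdict applies the NCCL table: hardware errors are fatal, timeouts and
// usage errors are not.
func ncclVerdict(errCode string) Verdict {
	switch errCode {
	case "NCCL_SYSTEM_ERROR", "NCCL_REMOTE_ERROR":
		return Verdict{IsFatal: true, RecommendedAction: "CONTACT_SUPPORT"}
	case "NCCL_INTERNAL_ERROR":
		return Verdict{IsFatal: true, RecommendedAction: "RUN_DCGMEUD"}
	default: // NCCL_TIMEOUT, NCCL_INVALID_USAGE, anything unrecognised
		return Verdict{IsFatal: false, RecommendedAction: "NONE"}
	}
}
```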
+ ### Helm Values ```yaml @@ -297,9 +519,62 @@ preflight-injector: checkTimeout: "300s" # Per-check timeout gangTimeout: "600s" # Gang coordination timeout + # GPU detection configuration + gpuDetection: + # Extended resources (current approach) + resourceNames: + - "nvidia.com/gpu" + + # DRA device classes (add your cluster's DeviceClass names) + deviceClasses: [] + # Example: + # - "gpu.nvidia.com" + # - "nvidia.com/gpu" + + # Network device resources (for NCCL tests) + networkDetection: + # Extended resources + resourceNames: + - "nvidia.com/mlnxnics" + - "rdma/hca" + # Add other network device plugin resources used in your cluster + + # DRA device classes (if using DRA for network devices) + deviceClasses: [] + # Example: + # - "rdma.nvidia.com" + # - "infiniband.mellanox.com" + + # NCCL environment variable patterns to copy (glob patterns) + # Webhook scans container env vars, copies those matching any pattern + ncclEnvPatterns: + - "NCCL_*" # Matches NCCL_TOPO_FILE, NCCL_IB_*, etc. + - "UCX_*" # Matches UCX_TLS, UCX_NET_DEVICES, etc. + - "OMPI_*" # Matches OMPI_MCA_*, etc. + + # NCCL topology auto-detection (if user doesn't provide topology file) + ncclTopology: + # Node label to detect instance type + instanceTypeLabel: "node.kubernetes.io/instance-type" + # Map instance types to built-in topology files + instanceTypeMapping: + "Standard_ND96isr_H100_v5": "azure-ndv5.xml" + "Standard_ND96amsr_A100_v4": "azure-ndv4.xml" + "p5.48xlarge": "aws-p5.48xlarge.xml" + "a3-megagpu-8g": "gcp-a3-mega.xml" + # Fallback: use NCCL auto-detection if instance type unknown + enableFallback: true + # Namespaces where preflight checks apply namespaces: - training + + # Namespaces to exclude (system namespaces). Recommended to reuse node-drainer `systemNamespaces`. + excludeNamespaces: + - nvsentinel + - kube-system + - kube-public + - kube-node-lease webhook: failurePolicy: Fail # or Ignore @@ -342,16 +617,19 @@ All GPU pods in listed namespaces get the configured checks. ### Positive - Catches GPU failures before workload starts - Works with any workload controller +- Built-in NCCL topology files for major cloud platforms ### Negative - Adds 30-60s pod startup latency (DCGM diag level 1) - Requires privileged init container for DCGM - Webhook downtime blocks pod creation (if `failurePolicy: Fail`) +- NCCL tests require network device plugins (InfiniBand/RDMA) to be configured ### Mitigations -- **Latency**: Use DCGM level 1 (~30s) instead of level 2 (~2-3min); skip expensive checks for non-critical workloads +- **Latency**: Use DCGM level 1 (~30s) vs level 2 (~2-3min); skip expensive checks for non-critical workloads - **Privileged**: Required for hardware access; limit to specific namespaces -- **Webhook availability**: HA deployment (replicas, PDB); `failurePolicy: Ignore` allows pods through if webhook is down +- **Webhook availability**: HA deployment (replicas, PDB); `failurePolicy: Ignore` for graceful degradation +- **Network resources**: NCCL tests skipped if network devices unavailable; DCGM diag runs regardless ## Alternatives Considered @@ -367,6 +645,7 @@ Rejected: Requires changing how workloads are deployed. ## Out of Scope - **Repeated failure handling**: Health Event Analyzer handles pattern detection. Preflight emits events. +- **Automatic DRA DeviceClass discovery**: Requires operator configuration. Device class names are not standardized. 
## References From d8052558690f774737fc624a17bf70ae10dbeb56 Mon Sep 17 00:00:00 2001 From: Ajay Mishra Date: Fri, 16 Jan 2026 13:25:42 +0530 Subject: [PATCH 07/11] chore: address review comments Signed-off-by: Ajay Mishra --- docs/designs/026-preflight-checks.md | 535 +++++++++++++++------------ 1 file changed, 293 insertions(+), 242 deletions(-) diff --git a/docs/designs/026-preflight-checks.md b/docs/designs/026-preflight-checks.md index a4f7995d2..7a2a24caa 100644 --- a/docs/designs/026-preflight-checks.md +++ b/docs/designs/026-preflight-checks.md @@ -1,22 +1,21 @@ -# ADR-026: Feature — Preflight Checks via Init Container Injection +# ADR-026: Feature — Preflight Checks ## Context GPU failures during training waste compute time. Running diagnostics before the workload starts catches bad GPUs early. -Kubernetes 1.35 introduced `spec.workloadRef` for gang scheduling. Preflight can use `workloadRef` to discover peer pods and run gang-wide checks (NCCL all-reduce). +Gang-wide NCCL tests require discovering all pods in a gang. Kubernetes 1.35 introduced `spec.workloadRef` as a native gang identifier, but users may also use Volcano, Kueue, or other schedulers with their own mechanisms. ### Distinction from Health Monitors NVSentinel already has health monitors (GPU Health Monitor, Syslog Health Monitor) that detect GPU issues. This is different: -| | Health Monitors | Preflight Checks | -|-|-----------------|------------------| -| When | Continuous (DaemonSet) | Once at pod start (init container) | -| Check type | Passive (health watches, syslog parsing) | Active diagnostics (DCGM diag) | -| Detects | Failures as they occur (XID errors, ECC, thermal) | Latent issues before starting | -| NCCL tests | No | Yes | -| Purpose | Reactive remediation | Prevent bad starts | +| | Health Monitors | Preflight Checks | +|------------|------------------------|-------------------------------| +| When | Continuous | Once at pod start | +| Check type | Passive | Active diagnostics | +| Detects | Failures as they occur | Latent issues before starting | +| Purpose | Reactive remediation | Prevent bad starts | Preflight asks "is this GPU healthy enough to start?" Health monitors ask "did this GPU fail while running?" @@ -27,26 +26,43 @@ Implement a MutatingAdmissionWebhook that injects preflight check init container ### Key points - Injection trigger: GPU resources (extended resources or DRA claims) + namespace -- Gang coordination: Uses `workloadRef` for gang-wide checks when present +- Gang discovery: Pluggable (supports `workloadRef`; can be extended to Volcano, Kueue .etc.) - Resource detection: Configurable lists for extended resource names and DRA device classes ## Architecture ### Components +Each check is a separate image. Webhook injects one init container per enabled check. 
+ ``` preflight/ -├── injector/ # Webhook (Deployment) +├── injector/ +│ └── pkg/ +│ ├── webhook/ +│ └── injection/ +│ +├── controller/ +│ └── pkg/ +│ ├── gang/ +│ └── coordination/ +│ +├── dcgm-diag/ +│ ├── Dockerfile +│ ├── main.go +│ └── pkg/ +│ +├── nccl-loopback/ +│ ├── Dockerfile +│ ├── nccl-topologies/ +│ ├── main.go │ └── pkg/ -│ ├── webhook/ # Admission handler -│ └── injection/ # Pod mutation + DRA detection │ -└── checker/ # Init container image - ├── nccl-topologies/ # Built-in topology files +└── nccl-allreduce/ + ├── Dockerfile + ├── nccl-topologies/ + ├── main.go └── pkg/ - ├── checks/ # dcgm + nccl - ├── coordination/ # gang registration + NCCL ID - └── reporting/ # HealthEvent reporting ``` ### Webhook flow @@ -61,13 +77,6 @@ flowchart TD Namespace filtering handled by `namespaceSelector` in webhook config. -### Namespace model - -- NVSentinel Helm chart is installed in `nvsentinel` namespace (webhook Deployment runs there). -- Webhook mutates Pods in *other* namespaces based on `namespaceSelector` (and skips system namespaces). -- The injected init containers run in the workload namespace. -- Any Kubernetes API access needed by the init container (gang coordination ConfigMap + Workload reads) must be granted in the workload namespace (namespace-scoped Role/RoleBinding). This is created by the Helm chart in the opted-in namespaces. - ### MutatingWebhookConfiguration (sketch) ```yaml @@ -107,40 +116,40 @@ webhooks: 1. Extended resources (device plugins): check `resources.limits`/`resources.requests` for configured names (e.g. `nvidia.com/gpu`) 2. DRA: check `spec.resourceClaims`, resolve claim/template, match `deviceClassName` against configured list -### Init container spec (sketch) +### Injected init containers (sketch) + +One init container per enabled check: ```yaml initContainers: - - name: nvsentinel-preflight - image: ghcr.io/nvidia/nvsentinel/preflight-checker:v1 + - name: preflight-dcgm-diag + image: ghcr.io/nvidia/nvsentinel/preflight-dcgm-diag:v1 env: - - name: PREFLIGHT_CHECKS - value: "dcgm-diag,nccl-loopback" - name: DCGM_DIAG_LEVEL value: "1" - - name: CHECK_TIMEOUT - value: "300s" - - name: GANG_TIMEOUT - value: "600s" + - name: DCGM_HOSTENGINE_ADDR + value: "dcgm-hostengine.nvsentinel.svc:5555" - name: PLATFORM_CONNECTOR_SOCKET value: "unix:///var/run/nvsentinel.sock" - - name: MY_POD_NAME - valueFrom: - fieldRef: - fieldPath: metadata.name - - name: MY_POD_IP - valueFrom: - fieldRef: - fieldPath: status.podIP resources: limits: - nvidia.com/gpu: 8 # Max across all containers - nvidia.com/mlnxnics: 4 # Max across all containers (if NCCL enabled) - securityContext: - privileged: true # DCGM diag + nvidia.com/gpu: 8 # Max across all containers + volumeMounts: + - name: platform-connector-socket + mountPath: /var/run + + - name: preflight-nccl-loopback + image: ghcr.io/nvidia/nvsentinel/preflight-nccl-loopback:v1 + env: + - name: NCCL_LOOPBACK_THRESHOLD_GBPS + value: "10.0" + - name: PLATFORM_CONNECTOR_SOCKET + value: "unix:///var/run/nvsentinel.sock" + resources: + limits: + nvidia.com/gpu: 8 + nvidia.com/mlnxnics: 4 volumeMounts: - - name: dcgm-socket - mountPath: /var/run/nvidia - name: platform-connector-socket mountPath: /var/run ``` @@ -153,94 +162,184 @@ initContainers: ## Check types -| Check | Scope | Coordination | -|-------|-------|--------------| -| `dcgm-diag` | Single node | None | -| `nccl-loopback` | Single node | None | -| `nccl-allreduce` | Gang-wide | ConfigMap | -| `plugin:` | Varies | Varies | +| Check | Scope | Coordination | 
+|------------------|-------------|--------------| +| `dcgm-diag` | Single node | None | +| `nccl-loopback` | Single node | None | +| `nccl-allreduce` | Gang-wide | ConfigMap | -### Plugin Interface (Third-Party Checks) +Third-party checks follow the same pattern: separate image, configured in Helm. -Plugins are separate init containers. Webhook injects one container per plugin. +### DCGM Diag -**Registration:** -```yaml -preflight-injector: - plugins: - - name: bandwidth-check - image: myregistry/bandwidth-check:v1 - timeout: "60s" +Runs DCGM diagnostics on allocated GPUs via remote DCGM hostengine Service. + +**How it works:** +1. Init container gets GPU UUIDs: `nvidia-smi --query-gpu=uuid --format=csv,noheader` +2. Calls DCGM hostengine via Service: `dcgmi diag -r --host $DCGM_HOSTENGINE_ADDR -i ` +3. Parses results, maps failures to HealthEvents + +**Requirements:** +- DCGM hostengine DaemonSet running (privileged, with GPU access) +- DCGM Service exposing hostengine (port 5555) +- NetworkPolicy allowing init container → DCGM Service + +**Diag levels:** +- Level 1 (~30s): Quick hardware validation (memory, PCIe bandwidth) +- Level 2 (~2-3min): Extended tests (stress, targeted diagnostics) + +Init container remains unprivileged; hostengine performs diagnostics. + +### NCCL Loopback + +Tests intra-node GPU-to-GPU communication (NVLink/PCIe paths) without network. + +**How it works:** +1. Init container runs `all_reduce_perf` (from nccl-tests) with all allocated GPUs +2. Command: `all_reduce_perf -b 8 -e 256M -f 2 -g ` +3. Validates bandwidth meets threshold set in Helm values +4. No coordination needed — single node only + +**What it catches:** +- NVLink failures between GPUs +- PCIe bandwidth degradation +- GPU memory errors during collective ops + +**Requirements:** +- GPU allocation (device plugin) +- `nccl-tests` binary in checker image + +**Example output parsing:** +``` +# nccl-tests output format: +# size count type redop time algbw busbw + 8M 2097152 float sum 1.23 6.50 12.19 +``` +Checker validates `busbw` (bus bandwidth) against configured threshold. + +### NCCL All-Reduce (Gang-Wide) + +Tests cross-node GPU collective communication over RDMA/InfiniBand. + +**How it works:** +1. **Gang formation**: All pods register in shared ConfigMap (see Gang Coordination section) +2. **Rank assignment**: Sort pod names alphabetically → rank 0, 1, 2, ... +3. **NCCL bootstrap**: Controller generates NCCL unique ID, writes to ConfigMap +4. **Run test**: Each pod reads ConfigMap and runs `all_reduce_perf` independently + +**Command:** +```bash +NCCL_COMM_ID= \ +NCCL_NRANKS=$WORLD_SIZE \ +NCCL_RANK=$MY_RANK \ +all_reduce_perf -b 8 -e 256M -f 2 -g $GPUS_PER_NODE ``` -**Injected init containers:** +Each init container runs independently. NCCL handles cross-node coordination via the shared `NCCL_COMM_ID`. + +**What it catches:** +- InfiniBand/RDMA link failures +- Network topology misconfigurations +- Cross-node NVLink (when present) +- NCCL algorithm/protocol issues + +**Requirements:** +- `workloadRef` for gang discovery (K8s 1.35+) +- Network device allocation (InfiniBand NICs) +- NCCL topology file (auto-detected or user-provided) +- ConfigMap RBAC for coordination + +**Timeout handling:** +- `GANG_TIMEOUT` sets max wait for all peers to register +- If timeout expires before gang forms → exit with `isFatal: false` (not a hardware issue) + +### Third-Party Checks + +Third-party checks follow the same pattern as built-in checks. 
Register in Helm: + ```yaml -initContainers: - # Built-in checks - - name: nvsentinel-preflight - image: ghcr.io/nvidia/nvsentinel/preflight-checker:v1 - ... - - # Plugin (separate container) - - name: preflight-bandwidth-check - image: myregistry/bandwidth-check:v1 - env: - - name: CHECK_TIMEOUT - value: "60s" - - name: NODE_NAME - valueFrom: - fieldRef: - fieldPath: spec.nodeName +preflight-injector: + checks: + - name: dcgm-diag + image: ghcr.io/nvidia/nvsentinel/preflight-dcgm-diag:v1 + - name: nccl-loopback + image: ghcr.io/nvidia/nvsentinel/preflight-nccl-loopback:v1 + - name: bandwidth-check # third-party + image: myregistry/bandwidth-check:v1 ``` -**Plugin contract:** +**Check contract:** - Exit codes: `0` (passed), `1` (check failed), `2` (config error) - Report failures via gRPC to Platform Connector: - - Unix socket: `unix:///var/run/nvsentinel.sock` (matches global `socketPath`) - - Use `HealthEventOccurredV1` RPC (service `PlatformConnector`, proto `data-models/protobufs/health_event.proto`) - - Plugin sets `isFatal`, `recommendedAction`, `errorCode` in HealthEvent - - Platform Connector overrides can modify these values via CEL rules -- Webhook mounts required volumes: GPU devices, DCGM socket, Platform Connector socket + - Unix socket: `unix:///var/run/nvsentinel.sock` + - RPC: `HealthEventOccurredV1` (proto: `data-models/protobufs/health_event.proto`) + - Set `isFatal`, `recommendedAction`, `errorCode` in HealthEvent +- Webhook mounts: GPU devices, Platform Connector socket, network devices ### Configuration Configured at deployment time via Helm values. No per-workload annotations. +### Gang Discovery + +Gang discovery is pluggable. Given one pod, return all pods in the gang. + +**Interface:** +```go +type GangDiscoverer interface { + DiscoverGang(pod *corev1.Pod) ([]PeerInfo, error) +} + +type PeerInfo struct { + PodName string + PodIP string + NodeName string +} +``` + +**Implementations:** + +| Scheduler | Discovery chain | +|-----------------|--------------------------------------------------------------------------| +| K8s 1.35 native | Pod → `spec.workloadRef` → list pods with same ref | +| Volcano | Pod → `volcano.sh/pod-group` annotation → list pods with same annotation | +| Kueue | Pod → `kueue.x-k8s.io/workload-name` label → list pods with same label | +| Label-based | Pod → configurable labels → list pods with same labels | + +Controller selects implementation based on Helm config. If no gang identifier found, pod is treated as singleton (skip gang-wide tests). + ### Gang Coordination -For gang-wide checks like `nccl-allreduce`, pods discover peers via ConfigMap registration: +For gang-wide checks like `nccl-allreduce`, the preflight controller maintains a ConfigMap with peer registration and NCCL bootstrap data. Pods only read it. 
```mermaid sequenceDiagram - participant W as Webhook + participant C as Preflight Controller participant P0 as Pod 0 Init participant P1 as Pod 1 Init participant API as Kube API participant CM as ConfigMap - Note over W: First pod in gang - W->>API: Create ConfigMap (expected=2, peers="") - - P0->>API: Patch ConfigMap: add pod-0:10.0.1.5 - P1->>API: Patch ConfigMap: add pod-1:10.0.1.6 - - P0->>API: Poll until len(peers) == expected - P1->>API: Poll until len(peers) == expected - + C->>API: Create/Update ConfigMap (expected=2, peers="") + C->>API: Update ConfigMap: add pod-0:10.0.1.5 + C->>API: Update ConfigMap: add pod-1:10.0.1.6 + C->>API: Update ConfigMap: set nccl_unique_id + + P0->>API: Read ConfigMap until len(peers) == expected + P1->>API: Read ConfigMap until len(peers) == expected + Note over P0,P1: Determine rank by sorting pod names - - P0->>CM: Update with NCCL unique ID - P1->>CM: Read NCCL unique ID - + P0->>P1: nccl.init() (barrier inside NCCL) P0->>P1: nccl.all_reduce() ``` -**Peer registration (no pod listing):** -- Webhook idempotently creates ConfigMap named `preflight--` with `expected_count` -- Each init container patches ConfigMap to add its IP -- Init containers poll until all peers register -- Determines rank by sorting pod names alphabetically +**Peer registration (controller-managed):** +- Preflight controller creates/updates ConfigMap `preflight-` with `expected_count` +- `gangID` derived from gang discoverer (e.g., `workload-name/pod-group`, `volcano-pg-name`, `kueue-workload-name`) +- Controller watches pods in the gang and updates `peers` and `nccl_unique_id` +- Init containers read/poll ConfigMap until all peers are registered +- Each pod determines rank by sorting pod names alphabetically **ConfigMap structure:** ```yaml @@ -257,35 +356,50 @@ data: peers: | pod-0:10.0.1.5 pod-1:10.0.1.6 - nccl_unique_id: "base64..." # Added by rank 0 + nccl_unique_id: "base64..." # Added by controller ``` -**Security:** Init containers have minimal RBAC (get/patch ConfigMap, get Workload). No pod list permission. +**Security:** Init containers read the ConfigMap only. Controller owns write access. **Gang coordination timeout:** 10 minutes. If gang doesn't form, init fails with `isFatal: false` (not a hardware issue). -### RBAC (gang coordination) +### RBAC -Use a namespace-scoped Role for coordination. Kubernetes RBAC does not support label-based restrictions for ConfigMaps, so the checker enforces scope in code (expected ConfigMap name + required labels/ownerRef). +Controller needs write access; init containers only need read. Both use ClusterRole since they operate across workload namespaces. +**Controller ClusterRole:** ```yaml rules: - - apiGroups: ["scheduling.k8s.io"] - resources: ["workloads"] - verbs: ["get"] - apiGroups: [""] resources: ["configmaps"] verbs: ["get", "create", "patch"] + - apiGroups: [""] + resources: ["pods"] + verbs: ["get", "list", "watch"] + # Additional rules based on gang discoverer: + # workloadRef: scheduling.k8s.io/workloads (get) + # Volcano: scheduling.volcano.sh/podgroups (get) + # Kueue: kueue.x-k8s.io/workloads (get) ``` -Checker only reads/writes the coordination ConfigMap `preflight--` in its own namespace. +Controller only touches ConfigMaps with `preflight-` prefix (enforced by code). + +**Init container ClusterRole:** +```yaml +rules: + - apiGroups: [""] + resources: ["configmaps"] + verbs: ["get"] +``` + +Init containers poll ConfigMap until all peers are registered. 
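**Checker wait loop (sketch):** how an init container could consume the controller-managed ConfigMap with only the `get` permission above. Package name and the 5s poll interval are assumptions.

```go
// Illustrative sketch of the checker-side wait loop (e.g. in the nccl-allreduce image).
package allreduce

import (
	"context"
	"fmt"
	"sort"
	"strconv"
	"strings"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Gang is what the NCCL test needs before it can start.
type Gang struct {
	Rank         int
	Peers        []string // "podName:podIP" entries, sorted by pod name
	NCCLUniqueID string
}

// waitForGang polls the controller-managed ConfigMap until all peers and the NCCL
// unique ID are present. ctx should carry the GANG_TIMEOUT deadline.
func waitForGang(ctx context.Context, cs kubernetes.Interface, ns, cmName, selfPod string) (*Gang, error) {
	for {
		cm, err := cs.CoreV1().ConfigMaps(ns).Get(ctx, cmName, metav1.GetOptions{})
		if err == nil {
			expected, _ := strconv.Atoi(cm.Data["expected_count"])
			peers := strings.Fields(cm.Data["peers"]) // one "name:ip" entry per line
			if expected > 0 && len(peers) == expected && cm.Data["nccl_unique_id"] != "" {
				sort.Strings(peers) // rank = position of this pod after sorting by name
				for rank, p := range peers {
					if strings.HasPrefix(p, selfPod+":") {
						return &Gang{Rank: rank, Peers: peers, NCCLUniqueID: cm.Data["nccl_unique_id"]}, nil
					}
				}
			}
		}
		select {
		case <-ctx.Done():
			// Gang never formed: exit with isFatal=false, this is not a hardware fault.
			return nil, fmt.Errorf("gang %q did not form: %w", cmName, ctx.Err())
		case <-time.After(5 * time.Second):
		}
	}
}
```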
### DRA Integration For pods using Dynamic Resource Allocation (DRA), the webhook copies resource claim references to the init container. **Device claim detection:** -Webhook checks pod's `spec.resourceClaims`, retrieves each ResourceClaim or ResourceClaimTemplate, and matches `deviceClassName` against configurable lists for GPUs and network devices: +Webhook checks pod's `spec.resourceClaims`, retrieves each ResourceClaim or ResourceClaimTemplate, and matches `deviceClassName` against configured lists for GPUs and network devices: ```yaml # Helm values @@ -329,30 +443,6 @@ spec: - name: rdma-claim ``` -**Multiple containers with GPUs:** -```yaml -# Extended resources example -containers: - - name: trainer - resources: - limits: - nvidia.com/gpu: 4 - nvidia.com/mlnxnics: 2 - - name: validator - resources: - limits: - nvidia.com/gpu: 8 - nvidia.com/mlnxnics: 4 - -# Init container gets max(4, 8) = 8 GPUs, max(2, 4) = 4 NICs -initContainers: - - name: nvsentinel-preflight - resources: - limits: - nvidia.com/gpu: 8 - nvidia.com/mlnxnics: 4 -``` - **Detection logic:** 1. Check if pod uses extended resources (`nvidia.com/gpu`, `nvidia.com/mlnxnics`) → inject with max counts across all containers 2. Check if pod has DRA claims with matching `deviceClassName` → inject with all unique GPU and network claim references @@ -382,40 +472,6 @@ If `nccl-loopback` or `nccl-allreduce` is enabled, webhook: 2. Scans all container env vars, copies those matching `ncclEnvPatterns` (glob patterns from Helm config) 3. Copies volume mounts referenced by `NCCL_TOPO_FILE` (if present) -**Example: How env vars are copied** - -Main container has: -```yaml -env: - - name: NCCL_TOPO_FILE - value: /etc/nccl/topo.xml - - name: NCCL_IB_PCI_RELAXED_ORDERING - value: "1" - - name: NCCL_SOCKET_IFNAME - value: eth0 - - name: MY_APP_CONFIG - value: /app/config.yaml - - name: OMPI_MCA_btl - value: openib -``` - -Webhook with `ncclEnvPatterns: ["NCCL_*", "OMPI_*"]` copies to init container: -```yaml -env: - - name: NCCL_TOPO_FILE # Matches NCCL_* - value: /etc/nccl/topo.xml - - name: NCCL_IB_PCI_RELAXED_ORDERING # Matches NCCL_* - value: "1" - - name: NCCL_SOCKET_IFNAME # Matches NCCL_* - value: eth0 - - name: OMPI_MCA_btl # Matches OMPI_* - value: openib - # MY_APP_CONFIG NOT copied (doesn't match patterns) -volumeMounts: - - name: nccl-topology # Copied because NCCL_TOPO_FILE references it - mountPath: /etc/nccl -``` - **NCCL topology file handling:** The init container image includes common topology files for major cloud platforms: ``` @@ -453,23 +509,24 @@ HealthEvent feeds into existing NVSentinel workflow (quarantine, correlation, et **DCGM Diag** : -| Test | Result | Recommended Action | -|------|--------|-------------------| -| Memory | `FAIL` | `CONTACT_SUPPORT` | -| PCIe | `FAIL` | `CONTACT_SUPPORT` | -| NVLink | `FAIL` | `CONTACT_SUPPORT` | -| Stress | `FAIL` | `RUN_DCGMEUD` | -| Any | `WARN` | `NONE` | +| Test | Result | Recommended Action | +|--------|--------|--------------------| +| Memory | `FAIL` | `CONTACT_SUPPORT` | +| PCIe | `FAIL` | `CONTACT_SUPPORT` | +| NVLink | `FAIL` | `CONTACT_SUPPORT` | +| Stress | `FAIL` | `RUN_DCGMEUD` | +| Any | `WARN` | `NONE` | + **NCCL Checks**: -| Error | Recommended Action | -|-------|-------------------| -| `NCCL_SYSTEM_ERROR` | `CONTACT_SUPPORT` | -| `NCCL_INTERNAL_ERROR` | `RUN_DCGMEUD` | -| `NCCL_INVALID_USAGE` | `NONE` | -| `NCCL_TIMEOUT` | `NONE` | -| `NCCL_REMOTE_ERROR` | `CONTACT_SUPPORT` | +| Error | Recommended Action | +|-----------------------|--------------------| 
+| `NCCL_SYSTEM_ERROR` | `CONTACT_SUPPORT` | +| `NCCL_INTERNAL_ERROR` | `RUN_DCGMEUD` | +| `NCCL_INVALID_USAGE` | `NONE` | +| `NCCL_TIMEOUT` | `NONE` | +| `NCCL_REMOTE_ERROR` | `CONTACT_SUPPORT` | **isFatal determination**: - DCGM diag `FAIL` → `isFatal: true` @@ -477,33 +534,6 @@ HealthEvent feeds into existing NVSentinel workflow (quarantine, correlation, et - NCCL hardware errors (`SYSTEM_ERROR`, `INTERNAL_ERROR`, `REMOTE_ERROR`) → `isFatal: true` - NCCL timeout/config errors → `isFatal: false` -### Integration with Node Drainer - -Preflight failures quarantine nodes without draining. Rationale: -- Workload never started → no pods to evict -- Draining would disrupt other gang members waiting for coordination -- Quarantine prevents new scheduling while remediation happens - -**Platform Connector override:** -```yaml -pipeline: - overrides: - - match: - agent: "preflight-checker" - override: - drainOverrides: - skip: true -``` - -**Flow:** -1. Preflight fails → HealthEvent with `isFatal: true` -2. Platform Connector applies override → `drainOverrides.skip: true` -3. Node drainer sees `skip: true` → quarantines node (taint), skips drain -4. Fault Remediation runs based on `recommendedAction` (EUD, support ticket, etc.) -5. Remediation succeeds → taint removed → node back in rotation - -Gang members on other nodes timeout after `gangTimeout`, fail with `isFatal: false` (coordination failure, not hardware), no quarantine. - ### Helm Values ```yaml @@ -511,14 +541,35 @@ preflight-injector: enabled: false # Opt-in checks: - - dcgm-diag - - nccl-loopback - # - nccl-allreduce # Enable for gang workloads + - name: dcgm-diag + image: ghcr.io/nvidia/nvsentinel/preflight-dcgm-diag:v1 + - name: nccl-loopback + image: ghcr.io/nvidia/nvsentinel/preflight-nccl-loopback:v1 + # - name: nccl-allreduce + # image: ghcr.io/nvidia/nvsentinel/preflight-nccl-allreduce:v1 + + # DCGM configuration + dcgm: + hostengineAddr: "dcgm-hostengine.nvsentinel.svc:5555" # DCGM Service address + diagLevel: 1 # 1 (quick, ~30s) or 2 (extended, ~2-3min) + + # NCCL test configuration + nccl: + loopbackThresholdGBps: 10.0 # Min bus bandwidth for loopback pass + allreduceThresholdGBps: 5.0 # Min bus bandwidth for all-reduce pass - dcgmDiagLevel: 1 # 1 (quick, ~30s) or 2 (medium, ~2-3min) checkTimeout: "300s" # Per-check timeout gangTimeout: "600s" # Gang coordination timeout + # Gang discovery configuration + gangDiscovery: + # Options: workloadRef, volcano, kueue, labels + method: "workloadRef" + # For label-based discovery: + # labels: + # gangIdLabel: "app.kubernetes.io/gang-id" + # gangSizeLabel: "app.kubernetes.io/gang-size" + # GPU detection configuration gpuDetection: # Extended resources (current approach) @@ -533,10 +584,9 @@ preflight-injector: # Network device resources (for NCCL tests) networkDetection: - # Extended resources + # Extended resources (cluster-specific, configure for your environment) resourceNames: - - "nvidia.com/mlnxnics" - - "rdma/hca" + - "nvidia.com/mlnxnics" # Mellanox/NVIDIA InfiniBand NICs # Add other network device plugin resources used in your cluster # DRA device classes (if using DRA for network devices) @@ -578,58 +628,59 @@ preflight-injector: webhook: failurePolicy: Fail # or Ignore - - image: - repository: ghcr.io/nvidia/nvsentinel/preflight-checker - tag: v1 ``` All GPU pods in listed namespaces get the configured checks. 
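As a reference point, a minimal sketch of how the injector might load these values into a typed config at startup. Field names mirror the YAML keys above where they are shown; the `Load` helper, the use of `gopkg.in/yaml.v3`, and the exact sub-keys for GPU detection are assumptions rather than settled implementation details.

```go
// Sketch only: typed view of the Helm values consumed by the injector.
package config

import (
	"fmt"
	"os"
	"time"

	"gopkg.in/yaml.v3"
)

type CheckSpec struct {
	Name  string `yaml:"name"`
	Image string `yaml:"image"`
}

type DCGM struct {
	HostengineAddr string `yaml:"hostengineAddr"`
	DiagLevel      int    `yaml:"diagLevel"`
}

type NCCL struct {
	LoopbackThresholdGBps  float64 `yaml:"loopbackThresholdGBps"`
	AllreduceThresholdGBps float64 `yaml:"allreduceThresholdGBps"`
}

type GangDiscovery struct {
	Method string `yaml:"method"` // workloadRef, volcano, kueue, labels
	Labels struct {
		GangIDLabel   string `yaml:"gangIdLabel"`
		GangSizeLabel string `yaml:"gangSizeLabel"`
	} `yaml:"labels"`
}

type Detection struct {
	ResourceNames []string `yaml:"resourceNames"`
	DeviceClasses []string `yaml:"deviceClasses"`
}

type Config struct {
	Enabled          bool          `yaml:"enabled"`
	Checks           []CheckSpec   `yaml:"checks"`
	DCGM             DCGM          `yaml:"dcgm"`
	NCCL             NCCL          `yaml:"nccl"`
	CheckTimeout     string        `yaml:"checkTimeout"` // e.g. "300s"
	GangTimeout      string        `yaml:"gangTimeout"`  // e.g. "600s"
	GangDiscovery    GangDiscovery `yaml:"gangDiscovery"`
	GPUDetection     Detection     `yaml:"gpuDetection"`
	NetworkDetection Detection     `yaml:"networkDetection"`
}

// Load reads the rendered values file and validates the duration fields.
func Load(path string) (*Config, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var cfg Config
	if err := yaml.Unmarshal(raw, &cfg); err != nil {
		return nil, fmt.Errorf("parse %s: %w", path, err)
	}
	for name, d := range map[string]string{"checkTimeout": cfg.CheckTimeout, "gangTimeout": cfg.GangTimeout} {
		if _, err := time.ParseDuration(d); err != nil {
			return nil, fmt.Errorf("%s: %w", name, err)
		}
	}
	return &cfg, nil
}
```

Timeouts stay as strings and are validated with `time.ParseDuration`, so the Helm values can keep the `"300s"` / `"600s"` form used above.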
### Metrics -**preflight/checker** (exposed via pushgateway or scraped from pod annotations): +**Check containers** (exposed via pushgateway or scraped from pod annotations): + +| Metric | Type | Labels | +|------------------------------------|-----------|-------------------------------| +| `preflight_check_total` | Counter | `check`, `result` | +| `preflight_check_duration_seconds` | Histogram | `check` | +| `preflight_check_failures_total` | Counter | `check`, `node`, `error_code` | +| `preflight_gang_wait_seconds` | Histogram | `workload` | +| `preflight_config_errors_total` | Counter | `error` | -| Metric | Type | Labels | -|--------|------|--------| -| `preflight_check_total` | Counter | `check`, `result` | -| `preflight_check_duration_seconds` | Histogram | `check` | -| `preflight_check_failures_total` | Counter | `check`, `node`, `error_code` | -| `preflight_gang_wait_seconds` | Histogram | `workload` | -| `preflight_config_errors_total` | Counter | `error` | **preflight/injector** (standard Prometheus endpoint): -| Metric | Type | Labels | -|--------|------|--------| -| `preflight_injection_total` | Counter | `result` | -| `preflight_webhook_latency_seconds` | Histogram | - | +| Metric | Type | Labels | +|-------------------------------------|-----------|----------| +| `preflight_injection_total` | Counter | `result` | +| `preflight_webhook_latency_seconds` | Histogram | - | + ## Rationale -- Mutating webhook, no external dependencies -- Init containers +- Mutating webhook for transparent injection +- Non-privileged init containers (DCGM diag runs via remote hostengine) - Namespace selector opt-in -- Deployment-level config +- Deployment-level config (no per-workload changes) ## Consequences ### Positive - Catches GPU failures before workload starts - Works with any workload controller +- Unprivileged init container (uses DCGM hostengine) - Built-in NCCL topology files for major cloud platforms ### Negative - Adds 30-60s pod startup latency (DCGM diag level 1) -- Requires privileged init container for DCGM +- Requires DCGM hostengine DaemonSet for diag checks - Webhook downtime blocks pod creation (if `failurePolicy: Fail`) - NCCL tests require network device plugins (InfiniBand/RDMA) to be configured +- Gang-wide NCCL tests require K8s 1.35+ (`workloadRef`) ### Mitigations - **Latency**: Use DCGM level 1 (~30s) vs level 2 (~2-3min); skip expensive checks for non-critical workloads -- **Privileged**: Required for hardware access; limit to specific namespaces +- **DCGM dependency**: Most GPU clusters already run DCGM for monitoring; expose as Service - **Webhook availability**: HA deployment (replicas, PDB); `failurePolicy: Ignore` for graceful degradation - **Network resources**: NCCL tests skipped if network devices unavailable; DCGM diag runs regardless +- **K8s version**: NCCL loopback (single-node) works without `workloadRef`; gang tests are opt-in ## Alternatives Considered From bead7f41538b226eed785e93304055d431d13862 Mon Sep 17 00:00:00 2001 From: Ajay Mishra Date: Fri, 16 Jan 2026 17:45:17 +0530 Subject: [PATCH 08/11] chore: address review comments Signed-off-by: Ajay Mishra --- docs/designs/026-preflight-checks.md | 94 +++++++++++++--------------- 1 file changed, 42 insertions(+), 52 deletions(-) diff --git a/docs/designs/026-preflight-checks.md b/docs/designs/026-preflight-checks.md index 7a2a24caa..de9de858f 100644 --- a/docs/designs/026-preflight-checks.md +++ b/docs/designs/026-preflight-checks.md @@ -33,20 +33,18 @@ Implement a MutatingAdmissionWebhook that injects 
preflight check init container ### Components -Each check is a separate image. Webhook injects one init container per enabled check. - ``` preflight/ -├── injector/ -│ └── pkg/ -│ ├── webhook/ -│ └── injection/ -│ -├── controller/ -│ └── pkg/ -│ ├── gang/ -│ └── coordination/ -│ +└── controller/ # Webhook + gang controller (controller-runtime) + ├── Dockerfile + ├── main.go + └── pkg/ + ├── webhook/ # Admission handler + ├── injection/ # Pod mutation, DRA detection + ├── gang/ # Gang discovery implementations + └── coordination/ # ConfigMap management + +preflight-checks/ ├── dcgm-diag/ │ ├── Dockerfile │ ├── main.go @@ -138,13 +136,17 @@ initContainers: - name: platform-connector-socket mountPath: /var/run - - name: preflight-nccl-loopback - image: ghcr.io/nvidia/nvsentinel/preflight-nccl-loopback:v1 + - name: preflight-nccl-allreduce + image: ghcr.io/nvidia/nvsentinel/preflight-nccl-allreduce:v1 env: - - name: NCCL_LOOPBACK_THRESHOLD_GBPS - value: "10.0" - - name: PLATFORM_CONNECTOR_SOCKET - value: "unix:///var/run/nvsentinel.sock" + - name: NCCL_ALLREDUCE_THRESHOLD_GBPS + value: "5.0" + - name: GANG_TIMEOUT + value: "600s" + - name: MY_POD_NAME + valueFrom: + fieldRef: + fieldPath: metadata.name resources: limits: nvidia.com/gpu: 8 @@ -152,6 +154,8 @@ initContainers: volumeMounts: - name: platform-connector-socket mountPath: /var/run + - name: preflight-gang-config # ConfigMap mounted as volume + mountPath: /etc/preflight ``` ### Resource handling @@ -310,36 +314,36 @@ Controller selects implementation based on Helm config. If no gang identifier fo ### Gang Coordination -For gang-wide checks like `nccl-allreduce`, the preflight controller maintains a ConfigMap with peer registration and NCCL bootstrap data. Pods only read it. +For gang-wide checks like `nccl-allreduce`, the preflight controller maintains a ConfigMap. Webhook mounts it as a volume; init containers read from filesystem. ```mermaid sequenceDiagram participant C as Preflight Controller + participant K as Kubelet participant P0 as Pod 0 Init participant P1 as Pod 1 Init - participant API as Kube API - participant CM as ConfigMap - C->>API: Create/Update ConfigMap (expected=2, peers="") - C->>API: Update ConfigMap: add pod-0:10.0.1.5 - C->>API: Update ConfigMap: add pod-1:10.0.1.6 - C->>API: Update ConfigMap: set nccl_unique_id + C->>C: Create ConfigMap (expected=2) + C->>C: Update ConfigMap: add pod-0:10.0.1.5 + C->>C: Update ConfigMap: add pod-1:10.0.1.6 + C->>C: Update ConfigMap: set nccl_unique_id + + K->>P0: Sync ConfigMap to volume + K->>P1: Sync ConfigMap to volume - P0->>API: Read ConfigMap until len(peers) == expected - P1->>API: Read ConfigMap until len(peers) == expected + P0->>P0: Read /etc/preflight/peers until len == expected + P1->>P1: Read /etc/preflight/peers until len == expected Note over P0,P1: Determine rank by sorting pod names - P0->>P1: nccl.init() (barrier inside NCCL) - P0->>P1: nccl.all_reduce() + P0->>P1: nccl.init() + nccl.all_reduce() ``` -**Peer registration (controller-managed):** -- Preflight controller creates/updates ConfigMap `preflight-` with `expected_count` -- `gangID` derived from gang discoverer (e.g., `workload-name/pod-group`, `volcano-pg-name`, `kueue-workload-name`) -- Controller watches pods in the gang and updates `peers` and `nccl_unique_id` -- Init containers read/poll ConfigMap until all peers are registered -- Each pod determines rank by sorting pod names alphabetically +**Flow:** +1. 
Controller creates/updates ConfigMap `preflight-` with `expected_count`, `peers`, `nccl_unique_id` +2. Webhook mounts ConfigMap as volume at `/etc/preflight/` +3. Init containers poll filesystem until all peers registered (kubelet syncs ~1 min) +4. Each pod determines rank by sorting pod names alphabetically **ConfigMap structure:** ```yaml @@ -347,25 +351,21 @@ apiVersion: v1 kind: ConfigMap metadata: name: preflight-myworkload-group1 - ownerReferences: - - apiVersion: scheduling.k8s.io/v1alpha1 - kind: Workload - name: myworkload data: expected_count: "2" peers: | pod-0:10.0.1.5 pod-1:10.0.1.6 - nccl_unique_id: "base64..." # Added by controller + nccl_unique_id: "base64..." ``` -**Security:** Init containers read the ConfigMap only. Controller owns write access. +**Benefits:** Init containers need no RBAC — just read files. **Gang coordination timeout:** 10 minutes. If gang doesn't form, init fails with `isFatal: false` (not a hardware issue). ### RBAC -Controller needs write access; init containers only need read. Both use ClusterRole since they operate across workload namespaces. +Only the controller needs RBAC. Init containers read from mounted volume (no API access). **Controller ClusterRole:** ```yaml @@ -384,16 +384,6 @@ rules: Controller only touches ConfigMaps with `preflight-` prefix (enforced by code). -**Init container ClusterRole:** -```yaml -rules: - - apiGroups: [""] - resources: ["configmaps"] - verbs: ["get"] -``` - -Init containers poll ConfigMap until all peers are registered. - ### DRA Integration For pods using Dynamic Resource Allocation (DRA), the webhook copies resource claim references to the init container. From 63ae2ec34227e20ebe611950b72608267f41f720 Mon Sep 17 00:00:00 2001 From: Ajay Mishra Date: Mon, 19 Jan 2026 10:11:31 +0530 Subject: [PATCH 09/11] chore: address review comments Signed-off-by: Ajay Mishra --- docs/designs/026-preflight-checks.md | 62 +++++++++++++++++----------- 1 file changed, 38 insertions(+), 24 deletions(-) diff --git a/docs/designs/026-preflight-checks.md b/docs/designs/026-preflight-checks.md index de9de858f..eb921a07d 100644 --- a/docs/designs/026-preflight-checks.md +++ b/docs/designs/026-preflight-checks.md @@ -116,7 +116,7 @@ webhooks: ### Injected init containers (sketch) -One init container per enabled check: +One init container per enabled check with be prepended to the pod's init containers: ```yaml initContainers: @@ -154,7 +154,7 @@ initContainers: volumeMounts: - name: platform-connector-socket mountPath: /var/run - - name: preflight-gang-config # ConfigMap mounted as volume + - name: preflight-gang-config # ConfigMap: peers, master_addr, rank, world_size mountPath: /etc/preflight ``` @@ -227,19 +227,36 @@ Tests cross-node GPU collective communication over RDMA/InfiniBand. **How it works:** 1. **Gang formation**: All pods register in shared ConfigMap (see Gang Coordination section) -2. **Rank assignment**: Sort pod names alphabetically → rank 0, 1, 2, ... -3. **NCCL bootstrap**: Controller generates NCCL unique ID, writes to ConfigMap -4. **Run test**: Each pod reads ConfigMap and runs `all_reduce_perf` independently - -**Command:** -```bash -NCCL_COMM_ID= \ -NCCL_NRANKS=$WORLD_SIZE \ -NCCL_RANK=$MY_RANK \ -all_reduce_perf -b 8 -e 256M -f 2 -g $GPUS_PER_NODE +2. **Wait for peers**: Each init container polls ConfigMap (mounted volume) until all peers registered +3. **Bootstrap via TCP**: Rank 0's IP from ConfigMap; PyTorch/NCCL handles handshake +4. 
**Run test**: Each init container runs PyTorch all-reduce independently; NCCL coordinates internally + +**Test script (PyTorch-based, no MPI needed):** +```python +import torch +import torch.distributed as dist +import os + +# Read from mounted ConfigMap +rank = int(os.environ['MY_RANK']) +world_size = int(os.environ['WORLD_SIZE']) +master_addr = os.environ['MASTER_ADDR'] # Rank 0's IP from ConfigMap + +# PyTorch handles NCCL bootstrap via TCP +dist.init_process_group( + backend='nccl', + init_method=f'tcp://{master_addr}:29500', + rank=rank, + world_size=world_size +) + +# Run all-reduce test, measure bandwidth +tensor = torch.ones(256 * 1024 * 1024, device='cuda') # 1GB +dist.all_reduce(tensor) +# ... measure time, calculate bandwidth, compare to threshold ``` -Each init container runs independently. NCCL handles cross-node coordination via the shared `NCCL_COMM_ID`. +Each init container runs independently. **What it catches:** - InfiniBand/RDMA link failures @@ -248,10 +265,9 @@ Each init container runs independently. NCCL handles cross-node coordination via - NCCL algorithm/protocol issues **Requirements:** -- `workloadRef` for gang discovery (K8s 1.35+) +- Gang discovery (`workloadRef`, Volcano, or Kueue) - Network device allocation (InfiniBand NICs) - NCCL topology file (auto-detected or user-provided) -- ConfigMap RBAC for coordination **Timeout handling:** - `GANG_TIMEOUT` sets max wait for all peers to register @@ -323,10 +339,9 @@ sequenceDiagram participant P0 as Pod 0 Init participant P1 as Pod 1 Init - C->>C: Create ConfigMap (expected=2) + C->>C: Create ConfigMap (expected=2, master_addr=10.0.1.5) C->>C: Update ConfigMap: add pod-0:10.0.1.5 C->>C: Update ConfigMap: add pod-1:10.0.1.6 - C->>C: Update ConfigMap: set nccl_unique_id K->>P0: Sync ConfigMap to volume K->>P1: Sync ConfigMap to volume @@ -336,14 +351,17 @@ sequenceDiagram Note over P0,P1: Determine rank by sorting pod names - P0->>P1: nccl.init() + nccl.all_reduce() + P0->>P0: PyTorch init (rank=0, listens on :29500) + P1->>P0: PyTorch init (rank=1, connects to master_addr:29500) + P0->>P1: NCCL all_reduce over RDMA ``` **Flow:** -1. Controller creates/updates ConfigMap `preflight-` with `expected_count`, `peers`, `nccl_unique_id` +1. Controller creates/updates ConfigMap `preflight-` with `expected_count`, `peers`, `master_addr` 2. Webhook mounts ConfigMap as volume at `/etc/preflight/` 3. Init containers poll filesystem until all peers registered (kubelet syncs ~1 min) 4. Each pod determines rank by sorting pod names alphabetically +5. PyTorch connects to `master_addr` for NCCL bootstrap (TCP), then NCCL uses RDMA **ConfigMap structure:** ```yaml @@ -353,20 +371,16 @@ metadata: name: preflight-myworkload-group1 data: expected_count: "2" + master_addr: "10.0.1.5" # Rank 0's IP for PyTorch TCP bootstrap peers: | pod-0:10.0.1.5 pod-1:10.0.1.6 - nccl_unique_id: "base64..." ``` -**Benefits:** Init containers need no RBAC — just read files. - **Gang coordination timeout:** 10 minutes. If gang doesn't form, init fails with `isFatal: false` (not a hardware issue). ### RBAC -Only the controller needs RBAC. Init containers read from mounted volume (no API access). 
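For illustration, a minimal sketch of the init-container side of this flow: it reads only the projected files under `/etc/preflight`, so it needs no API access at all. The key names, mount path, and `MY_POD_NAME` env var match the examples in this document; the helper names and poll interval are assumptions.

```go
// Sketch only: gang wait performed by the init container, using nothing
// but the projected ConfigMap files (no Kubernetes API access).
package main

import (
	"fmt"
	"os"
	"sort"
	"strconv"
	"strings"
	"time"
)

const mountDir = "/etc/preflight"

func readKey(key string) string {
	b, err := os.ReadFile(mountDir + "/" + key)
	if err != nil {
		return "" // key not projected yet
	}
	return strings.TrimSpace(string(b))
}

// waitForPeers polls the projected files until the controller has registered
// every gang member; kubelet refreshes the volume roughly once a minute.
func waitForPeers(timeout time.Duration) ([]string, error) {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		expected, _ := strconv.Atoi(readKey("expected_count"))
		peers := strings.Fields(readKey("peers")) // lines like "pod-0:10.0.1.5"
		if expected > 0 && len(peers) == expected {
			return peers, nil
		}
		time.Sleep(10 * time.Second)
	}
	return nil, fmt.Errorf("gang did not form within %s", timeout)
}

func main() {
	peers, err := waitForPeers(10 * time.Minute)
	if err != nil {
		// Coordination failure, not hardware: reported upstream with isFatal=false.
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Rank = position of this pod's name in the alphabetically sorted peer list.
	names := make([]string, 0, len(peers))
	for _, p := range peers {
		names = append(names, strings.SplitN(p, ":", 2)[0])
	}
	sort.Strings(names)
	rank := sort.SearchStrings(names, os.Getenv("MY_POD_NAME"))

	fmt.Printf("rank=%d world_size=%d master_addr=%s\n",
		rank, len(names), readKey("master_addr"))
}
```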
- **Controller ClusterRole:** ```yaml rules: From 98664db4614dc26b6577003619599f009e1034e5 Mon Sep 17 00:00:00 2001 From: Ajay Mishra Date: Mon, 19 Jan 2026 10:18:25 +0530 Subject: [PATCH 10/11] chore: add overall flow diagram Signed-off-by: Ajay Mishra --- docs/designs/026-preflight-checks.md | 76 +++++++++++++++++++++++++--- 1 file changed, 68 insertions(+), 8 deletions(-) diff --git a/docs/designs/026-preflight-checks.md b/docs/designs/026-preflight-checks.md index eb921a07d..5006da405 100644 --- a/docs/designs/026-preflight-checks.md +++ b/docs/designs/026-preflight-checks.md @@ -63,18 +63,78 @@ preflight-checks/ └── pkg/ ``` -### Webhook flow +### Overall flow ```mermaid -flowchart TD - A[Pod CREATE request] --> B{GPU resources?} - B -->|No| C[Allow] - B -->|Yes| D[Inject init containers] - D --> E[Return JSON patch] +stateDiagram-v2 + [*] --> PodCreated: User creates GPU pod + + state "Webhook Injection" as Webhook { + PodCreated --> CheckGPU: Admission webhook triggered + CheckGPU --> Inject: GPU resources detected + CheckGPU --> Skip: No GPU resources + Skip --> [*]: Pod starts normally + Inject --> PodScheduled: Init containers injected + } + + state "Init Container Execution" as InitExec { + PodScheduled --> DCGMDiag: Run dcgm-diag + + state "DCGM Diag" as DCGMDiag { + [*] --> GetGPUUUIDs: nvidia-smi query + GetGPUUUIDs --> RemoteDiag: dcgmi diag via hostengine + RemoteDiag --> DCGMPass: All tests pass + RemoteDiag --> DCGMFail: Test failure + } + + DCGMPass --> NCCLLoopback: Next check + DCGMFail --> ReportFailure: HealthEvent + + state "NCCL Loopback" as NCCLLoopback { + [*] --> RunLoopback: all_reduce_perf -g N + RunLoopback --> CheckBW: Measure bandwidth + CheckBW --> LoopbackPass: BW >= threshold + CheckBW --> LoopbackFail: BW < threshold + } + + LoopbackPass --> GangCheck: Check if gang-wide enabled + LoopbackFail --> ReportFailure + + GangCheck --> NCCLAllReduce: nccl-allreduce enabled + GangCheck --> AllPassed: Single-node only + } + + state "Gang Coordination" as GangCoord { + NCCLAllReduce --> WaitPeers: Poll ConfigMap + WaitPeers --> PeersReady: All peers registered + WaitPeers --> GangTimeout: Timeout (10 min) + GangTimeout --> ReportTimeout: isFatal=false + + state "NCCL All-Reduce" as AllReduce { + PeersReady --> PyTorchInit: TCP bootstrap to master + PyTorchInit --> RunAllReduce: dist.all_reduce() + RunAllReduce --> AllReducePass: BW >= threshold + RunAllReduce --> AllReduceFail: BW < threshold or error + } + + AllReducePass --> AllPassed + AllReduceFail --> ReportFailure + } + + state "Failure Handling" as FailHandle { + ReportFailure --> SendHealthEvent: gRPC to Platform Connector + ReportTimeout --> SendHealthEvent + SendHealthEvent --> PlatformConnector: HealthEvent published + PlatformConnector --> FaultQuarantine: Cordon node + FaultQuarantine --> NodeDrainer: Drain workloads + NodeDrainer --> FaultRemediation: Based on recommendedAction + FaultRemediation --> [*]: Node remediated or escalate + } + + AllPassed --> MainContainerStart: Init success (exit 0) + MainContainerStart --> [*]: Workload runs ``` -Namespace filtering handled by `namespaceSelector` in webhook config. 
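To ground the injection phase of the flow above, a rough sketch of the admission handler using controller-runtime's admission package. Decoder wiring differs between controller-runtime versions, and `buildInitContainers` is a hypothetical stand-in for the real image selection, GPU/claim copying, and NCCL env passthrough described elsewhere in this document.

```go
// Sketch only: admission handler that skips non-GPU pods and prepends the
// preflight init containers otherwise.
package webhook

import (
	"context"
	"encoding/json"
	"net/http"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

type PodInjector struct {
	Decoder          admission.Decoder
	GPUResourceNames []string // e.g. "nvidia.com/gpu", from Helm values
}

func (in *PodInjector) Handle(ctx context.Context, req admission.Request) admission.Response {
	pod := &corev1.Pod{}
	if err := in.Decoder.Decode(req, pod); err != nil {
		return admission.Errored(http.StatusBadRequest, err)
	}

	if !in.requestsGPU(pod) {
		return admission.Allowed("no GPU resources or DRA claims; preflight not injected")
	}

	// Prepend so preflight runs before any user-supplied init containers.
	pod.Spec.InitContainers = append(buildInitContainers(pod), pod.Spec.InitContainers...)

	patched, err := json.Marshal(pod)
	if err != nil {
		return admission.Errored(http.StatusInternalServerError, err)
	}
	return admission.PatchResponseFromRaw(req.Object.Raw, patched)
}

func (in *PodInjector) requestsGPU(pod *corev1.Pod) bool {
	if len(pod.Spec.ResourceClaims) > 0 {
		return true // DRA path: a fuller version matches claims against configured device classes
	}
	for _, c := range pod.Spec.Containers {
		for _, name := range in.GPUResourceNames {
			if _, ok := c.Resources.Limits[corev1.ResourceName(name)]; ok {
				return true
			}
		}
	}
	return false
}

// buildInitContainers is a placeholder; the real version is driven by the
// enabled checks and detection logic described in this document.
func buildInitContainers(pod *corev1.Pod) []corev1.Container {
	return []corev1.Container{{
		Name:  "preflight-dcgm-diag",
		Image: "ghcr.io/nvidia/nvsentinel/preflight-dcgm-diag:v1",
	}}
}
```

Returning `admission.Allowed` without a patch leaves non-GPU pods completely untouched.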
- ### MutatingWebhookConfiguration (sketch) ```yaml From 0066b88671c22910584bc812ead48e71cf0a47ba Mon Sep 17 00:00:00 2001 From: Ajay Mishra Date: Mon, 19 Jan 2026 17:06:13 +0530 Subject: [PATCH 11/11] chore: added reason why pytorch nccl preferred and test result Signed-off-by: Ajay Mishra --- docs/designs/026-preflight-checks.md | 91 ++++++++++++++++++++-------- 1 file changed, 67 insertions(+), 24 deletions(-) diff --git a/docs/designs/026-preflight-checks.md b/docs/designs/026-preflight-checks.md index 5006da405..50dbe13aa 100644 --- a/docs/designs/026-preflight-checks.md +++ b/docs/designs/026-preflight-checks.md @@ -285,38 +285,81 @@ Checker validates `busbw` (bus bandwidth) against configured threshold. Tests cross-node GPU collective communication over RDMA/InfiniBand. +**Why PyTorch over MPI:** +- MPI-based tests require `pods/exec` to spawn processes on peer pods +- `pods/exec` is high privilege — allows executing commands in any pod in the namespace +- PyTorch's `torchrun` handles coordination via TCP without cross-pod exec +- Each init container runs independently; NCCL uses RDMA for actual data transfer + **How it works:** 1. **Gang formation**: All pods register in shared ConfigMap (see Gang Coordination section) 2. **Wait for peers**: Each init container polls ConfigMap (mounted volume) until all peers registered -3. **Bootstrap via TCP**: Rank 0's IP from ConfigMap; PyTorch/NCCL handles handshake -4. **Run test**: Each init container runs PyTorch all-reduce independently; NCCL coordinates internally +3. **torchrun bootstrap**: Each pod runs `torchrun` connecting to master (rank 0) via TCP +4. **Single communicator**: All GPUs form one NCCL communicator (e.g., 2 nodes × 8 GPUs = 16 ranks) +5. **Run test**: `dist.all_reduce()` runs across all ranks; NCCL uses RDMA -**Test script (PyTorch-based, no MPI needed):** +**Test script (PyTorch-based):** ```python -import torch +#!/usr/bin/env python3 +""" +NCCL All-Reduce benchmark - single communicator spanning all GPUs. +Env vars set by torchrun: RANK, LOCAL_RANK, WORLD_SIZE +""" +import os, time, torch import torch.distributed as dist -import os - -# Read from mounted ConfigMap -rank = int(os.environ['MY_RANK']) -world_size = int(os.environ['WORLD_SIZE']) -master_addr = os.environ['MASTER_ADDR'] # Rank 0's IP from ConfigMap - -# PyTorch handles NCCL bootstrap via TCP -dist.init_process_group( - backend='nccl', - init_method=f'tcp://{master_addr}:29500', - rank=rank, - world_size=world_size -) - -# Run all-reduce test, measure bandwidth -tensor = torch.ones(256 * 1024 * 1024, device='cuda') # 1GB -dist.all_reduce(tensor) -# ... 
measure time, calculate bandwidth, compare to threshold + +def benchmark_allreduce(size_bytes, iters=20, warmup=5): + local_rank = int(os.environ.get("LOCAL_RANK", 0)) + tensor = torch.randn(size_bytes // 4, dtype=torch.float32, + device=f"cuda:{local_rank}") + + for _ in range(warmup): + dist.all_reduce(tensor, op=dist.ReduceOp.SUM) + torch.cuda.synchronize() + + start = time.perf_counter() + for _ in range(iters): + dist.all_reduce(tensor, op=dist.ReduceOp.SUM) + torch.cuda.synchronize() + elapsed = time.perf_counter() - start + + world_size = dist.get_world_size() + algo_bw = (size_bytes * iters) / elapsed / 1e9 + bus_bw = algo_bw * (2 * (world_size - 1) / world_size) + return bus_bw + +def main(): + dist.init_process_group(backend="nccl") + torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0))) + + bus_bw = benchmark_allreduce(4 * 1024**3) # 4GB + threshold = float(os.environ.get("BW_THRESHOLD_GBPS", "100")) + + if dist.get_rank() == 0 and bus_bw < threshold: + # Report failure to Platform Connector via gRPC + ... + + dist.destroy_process_group() + +if __name__ == "__main__": + main() +``` + +**Invocation (per pod):** +```bash +torchrun --nnodes=$NNODES --nproc_per_node=$GPUS_PER_NODE \ + --node_rank=$MY_RANK --master_addr=$MASTER_ADDR --master_port=29500 \ + /scripts/bench.py ``` -Each init container runs independently. +Each pod runs `torchrun` independently. No MPI, no `pods/exec`, no special RBAC. + +**Benchmark results (Azure NDv4, A100):** + +| Nodes | MPI-based (GB/s) | PyTorch (GB/s) | +|-------|------------------|----------------| +| 2 | 164 | 169 | +| 3 | 160 | 168 | **What it catches:** - InfiniBand/RDMA link failures