
@tanishagoyal2 tanishagoyal2 commented Dec 22, 2025

Summary

Type of Change

  • πŸ› Bug fix
  • ✨ New feature
  • πŸ’₯ Breaking change
  • πŸ“š Documentation
  • πŸ”§ Refactoring
  • πŸ”¨ Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Testing

  1. Tested on dev cluster nvs-dgxc-k8s-oci-lhr-dev3 by installing NVSentinel from main
  2. Updated the gpu-health-monitor image with this branch's changes and ran the monitor with the STORE_ONLY strategy
  3. Injected a GPU InfoROM error
  4. Event was created with the STORE_ONLY strategy
     (screenshot: 2026-01-12 at 9:03:00 PM)
  5. Event exporter also exported the event with the STORE_ONLY strategy
     (screenshot: 2026-01-12 at 9:06:40 PM)
  6. Nodes were not cordoned and no node condition was applied

Summary by CodeRabbit

  • New Features

    • Add configurable GPU health event processing strategy (EXECUTE_REMEDIATION default or STORE_ONLY) via CLI and deployment values; strategy is embedded in emitted health events.
  • Tests

    • Updated/added tests to verify propagation of STORE_ONLY and EXECUTE_REMEDIATION across multiple health-event scenarios and connectivity transitions; new end-to-end coverage for store-only path.
  • Chores

    • Deployment manifests and test harness updated to accept and pass the new strategy flag.



coderabbitai bot commented Dec 22, 2025

πŸ“ Walkthrough


Adds a processingStrategy option (EXECUTE_REMEDIATION | STORE_ONLY), exposed through Helm values and DaemonSet args, plus a CLI flag with runtime validation. The strategy is propagated into PlatformConnectorEventProcessor and embedded in emitted HealthEvent messages; tests and Go test helpers are updated to exercise the store-only path.

Changes

Cohort / File(s) / Summary

  • Kubernetes templates (distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml, distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml):
    Inserted --processing-strategy container arg sourced from .Values.processingStrategy.
  • Helm values (distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml):
    Added processingStrategy (string) with default EXECUTE_REMEDIATION and documented options EXECUTE_REMEDIATION / STORE_ONLY.
  • Python CLI (health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py):
    Added --processing-strategy flag with runtime validation, parsed it to the proto enum, passed it into _init_event_processor, and adjusted the import alias for the proto module.
  • Event processor, Python (health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py):
    PlatformConnectorEventProcessor.__init__ accepts and stores processing_strategy; emitted HealthEvent messages include processingStrategy.
  • Python tests (health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py):
    Updated to construct the processor with processing_strategy and assert emitted events carry matching processingStrategy values.
  • Go integration tests (tests/gpu_health_monitor_test.go):
    Added container/daemonset name constants, context keys, and a new TestGpuHealthMonitorStoreOnlyEvents (a duplicate implementation is present).
  • Go test helpers & tests (tests/helpers/syslog-health-monitor.go, tests/syslog_health_monitor_test.go):
    SetUpSyslogHealthMonitor gains a setManagedByNVSentinel bool parameter; call sites updated to pass the flag (false, or true for the STORE_ONLY case), and some one-shot checks replaced by time-bounded assertions.

Sequence Diagram(s)

```mermaid
sequenceDiagram
  participant Helm as Helm (values.yaml)
  participant K8s as Kubernetes (DaemonSet)
  participant Pod as Container (gpu-health-monitor)
  participant CLI as CLI parser
  participant Proc as PlatformConnectorEventProcessor
  participant Sink as Platform Connector / Event Sink

  Helm->>K8s: render `processingStrategy` into DaemonSet args
  K8s->>Pod: start container with `--processing-strategy`
  Pod->>CLI: parse args and validate enum
  CLI->>Proc: construct processor with processing_strategy
  Proc->>Sink: emit HealthEvent { processingStrategy: ... }
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • lalitadithya
  • XRFXLP

Poem

🐰 I hopped through charts and CLI flags bright,
I tucked a strategy in pod-time light.
EXECUTE or STORE, I gently decide,
Each event now wears my tiny guide.
Hooray β€” the monitor hops with pride!

πŸš₯ Pre-merge checks | βœ… 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — docstring coverage is 38.30%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed — check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — the title accurately summarizes the main change: adding event handling strategy configuration to the GPU health monitor, which is reflected across all modified files.




πŸ“œ Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 80494cc and 9424707.

πŸ“’ Files selected for processing (2)
  • tests/gpu_health_monitor_test.go
  • tests/syslog_health_monitor_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/syslog_health_monitor_test.go
  • tests/gpu_health_monitor_test.go


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

πŸ”§ golangci-lint (2.5.0)

level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (5)
platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)

325-343: Improve the function comment and add a nil guard.

filterProcessableEvents already has a comment on line 325, but it should follow godoc conventions (a complete sentence beginning with the function name, documenting the return value). Note the function is unexported (lowercase), so the "function comments required for all exported Go functions" guideline does not strictly apply, but a godoc-style comment is still worthwhile here.

Also consider nil-safety: if healthEvents is nil, accessing healthEvents.Events will panic.

πŸ”Ž Suggested improvement
```diff
-// filterProcessableEvents filters out STORE_ONLY events that should not create node conditions or K8s events.
-func filterProcessableEvents(healthEvents *protos.HealthEvents) []*protos.HealthEvent {
+// filterProcessableEvents filters out STORE_ONLY events that should not create node conditions or K8s events.
+// It returns only events with processing strategy other than STORE_ONLY.
+func filterProcessableEvents(healthEvents *protos.HealthEvents) []*protos.HealthEvent {
+	if healthEvents == nil || len(healthEvents.Events) == 0 {
+		return nil
+	}
+
 	var processableEvents []*protos.HealthEvent

 	for _, healthEvent := range healthEvents.Events {
```
tests/gpu_health_monitor_test.go (1)

502-510: Teardown clears "Memory" error that was never injected.

The setup only injects an Inforom error (field 84), but the teardown clearCommands includes both Inforom and Memory errors. While this is harmless (clearing a non-existent error is a no-op), it adds unnecessary operations. Consider matching the teardown to what was actually injected.

πŸ”Ž Suggested simplification
```diff
 		clearCommands := []struct {
 			name      string
 			fieldID   string
 			value     string
 			condition string
 		}{
 			{"Inforom", "84", "1", "GpuInforomWatch"},
-			{"Memory", "395", "0", "GpuMemWatch"},
 		}
```
health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py (1)

125-132: Consider logging the strategy name instead of the integer value.

Line 132 logs processing_strategy_value, which is the integer enum value (e.g., 0 or 1). For better readability in logs, consider logging the original string name or using ProcessingStrategy.Name().

πŸ”Ž Suggested improvement
```diff
-    log.info(f"Event handling strategy configured to: {processing_strategy_value}")
+    log.info(f"Event handling strategy configured to: {processing_strategy}")
```

This logs the human-readable string (e.g., EXECUTE_REMEDIATION) instead of the integer value.

tests/helpers/kube.go (2)

2306-2307: Hardcoded sleep after rollout completion is fragile.

The 10-second sleep after waitForDaemonSetRollout adds unnecessary delay. The rollout wait already ensures pods are ready. If additional startup time is genuinely needed, consider polling for a specific readiness indicator instead.

πŸ”Ž Suggested improvement

Remove the hardcoded sleep or replace it with a condition-based wait if there's a specific startup behavior to await:

```diff
 	t.Logf("Waiting for daemonset %s/%s rollout to complete", NVSentinelNamespace, daemonsetName)
 	waitForDaemonSetRollout(ctx, t, client, daemonsetName)

-	t.Logf("Waiting 10 seconds for daemonset pods to start")
-	time.Sleep(10 * time.Second)
-
 	return originalDaemonSet, nil
```

2270-2272: Add godoc comments for exported functions.

Per coding guidelines, exported Go functions require function comments. The new exported functions UpdateDaemonSetProcessingStrategy, RestoreDaemonSet, and GetDaemonSetPodOnWorkerNode are missing godoc comments.

πŸ”Ž Suggested improvement
```diff
+// UpdateDaemonSetProcessingStrategy updates the specified container in a DaemonSet to use
+// STORE_ONLY processing strategy, waits for rollout completion, and returns the original DaemonSet.
 func UpdateDaemonSetProcessingStrategy(ctx context.Context, t *testing.T,
 	client klient.Client, daemonsetName string, containerName string) (*appsv1.DaemonSet, error) {

+// RestoreDaemonSet restores a DaemonSet's containers to their original state and waits for rollout.
 func RestoreDaemonSet(ctx context.Context, t *testing.T, client klient.Client,
 	originalDaemonSet *appsv1.DaemonSet, daemonsetName string,
 ) error {

+// GetDaemonSetPodOnWorkerNode returns a running, ready pod from the DaemonSet on a real worker node.
 func GetDaemonSetPodOnWorkerNode(ctx context.Context, t *testing.T, client klient.Client,
 	daemonsetName string, podNamePattern string) (*v1.Pod, error) {
```

As per coding guidelines, function comments are required for all exported Go functions.

Also applies to: 2312-2314, 2346-2348

πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 61f47cb and 8cd1107.

β›” Files ignored due to path filters (1)
  • data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go
πŸ“’ Files selected for processing (23)
  • data-models/protobufs/health_event.proto
  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml
  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml
  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml
  • event-exporter/pkg/transformer/cloudevents.go
  • event-exporter/pkg/transformer/cloudevents_test.go
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • fault-quarantine/pkg/initializer/init.go
  • health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
  • health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • platform-connectors/pkg/connectors/kubernetes/process_node_events.go
  • store-client/pkg/client/mongodb_pipeline_builder.go
  • store-client/pkg/client/pipeline_builder.go
  • store-client/pkg/client/pipeline_builder_test.go
  • store-client/pkg/client/postgresql_pipeline_builder.go
  • tests/event_exporter_test.go
  • tests/gpu_health_monitor_test.go
  • tests/helpers/healthevent.go
  • tests/helpers/kube.go
🧰 Additional context used
πŸ““ Path-based instructions (7)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • fault-quarantine/pkg/initializer/init.go
  • store-client/pkg/client/mongodb_pipeline_builder.go
  • store-client/pkg/client/postgresql_pipeline_builder.go
  • store-client/pkg/client/pipeline_builder.go
  • store-client/pkg/client/pipeline_builder_test.go
  • tests/helpers/healthevent.go
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • tests/event_exporter_test.go
  • platform-connectors/pkg/connectors/kubernetes/process_node_events.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • event-exporter/pkg/transformer/cloudevents_test.go
  • tests/helpers/kube.go
  • event-exporter/pkg/transformer/cloudevents.go
  • tests/gpu_health_monitor_test.go
data-models/protobufs/**/*.proto

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

data-models/protobufs/**/*.proto: Define Protocol Buffer messages in data-models/protobufs/ directory
Use semantic versioning for breaking changes in Protocol Buffer messages
Include comprehensive comments for all fields in Protocol Buffer messages

Files:

  • data-models/protobufs/health_event.proto
**/*_test.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • store-client/pkg/client/pipeline_builder_test.go
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • tests/event_exporter_test.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • event-exporter/pkg/transformer/cloudevents_test.go
  • tests/gpu_health_monitor_test.go
**/values.yaml

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/values.yaml: Document all values in Helm chart values.yaml with inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable

Files:

  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml
**/daemonset*.yaml

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

Explain DaemonSet variant selection logic in Helm chart documentation

Files:

  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml
  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml
distros/kubernetes/**/*daemonset*.yaml

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

distros/kubernetes/**/*daemonset*.yaml: Separate DaemonSets should be created for kata vs regular nodes using nodeAffinity based on kata.enabled label
Regular node DaemonSets should use /var/log volume mount for file-based logs
Kata node DaemonSets should use /run/log/journal and /var/log/journal volume mounts for systemd journal

Files:

  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml
  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml
**/*.py

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code

Files:

  • health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py
🧠 Learnings (6)
πŸ“š Learning: 2025-11-10T10:25:19.443Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.

Applied to files:

  • fault-quarantine/pkg/initializer/init.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • tests/event_exporter_test.go
  • tests/gpu_health_monitor_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • tests/event_exporter_test.go
  • tests/gpu_health_monitor_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `testify/assert` and `testify/require` for assertions in Go tests

Applied to files:

  • tests/event_exporter_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods

Applied to files:

  • tests/event_exporter_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/daemonset*.yaml : Explain DaemonSet variant selection logic in Helm chart documentation

Applied to files:

  • tests/helpers/kube.go
🧬 Code graph analysis (10)
store-client/pkg/client/mongodb_pipeline_builder.go (2)
store-client/pkg/datastore/types.go (4)
  • ToPipeline (161-163)
  • D (131-133)
  • E (126-128)
  • A (136-138)
data-models/pkg/protos/health_event.pb.go (1)
  • ProcessingStrategy_EXECUTE_REMEDIATION (46-46)
store-client/pkg/client/postgresql_pipeline_builder.go (2)
store-client/pkg/datastore/types.go (4)
  • ToPipeline (161-163)
  • D (131-133)
  • E (126-128)
  • A (136-138)
data-models/pkg/protos/health_event.pb.go (1)
  • ProcessingStrategy_EXECUTE_REMEDIATION (46-46)
store-client/pkg/client/pipeline_builder_test.go (3)
store-client/pkg/client/pipeline_builder.go (1)
  • PipelineBuilder (26-47)
store-client/pkg/client/mongodb_pipeline_builder.go (1)
  • NewMongoDBPipelineBuilder (29-31)
store-client/pkg/client/postgresql_pipeline_builder.go (1)
  • NewPostgreSQLPipelineBuilder (29-31)
tests/helpers/healthevent.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-17)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
tests/event_exporter_test.go (1)
tests/helpers/event_exporter.go (1)
  • ValidateCloudEvent (221-245)
platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)
data-models/pkg/protos/health_event.pb.go (11)
  • HealthEvents (156-162)
  • HealthEvents (175-175)
  • HealthEvents (190-192)
  • HealthEvent (260-280)
  • HealthEvent (293-293)
  • HealthEvent (308-310)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
  • ProcessingStrategy_STORE_ONLY (47-47)
event-exporter/pkg/transformer/cloudevents_test.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-17)
data-models/pkg/protos/health_event.pb.go (5)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
  • ProcessingStrategy_STORE_ONLY (47-47)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
event-exporter/pkg/transformer/cloudevents.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-17)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
tests/gpu_health_monitor_test.go (3)
tests/helpers/kube.go (10)
  • UpdateDaemonSetProcessingStrategy (2271-2310)
  • GetDaemonSetPodOnWorkerNode (2346-2390)
  • NVSentinelNamespace (64-64)
  • SetNodeManagedByNVSentinel (1365-1384)
  • ExecInPod (1539-1574)
  • EnsureNodeConditionNotPresent (1773-1794)
  • RestoreDaemonSet (2312-2344)
  • CheckNodeConditionExists (1259-1276)
  • EventuallyWaitTimeout (61-61)
  • WaitInterval (63-63)
tests/helpers/metadata.go (3)
  • CreateTestMetadata (59-108)
  • InjectMetadata (110-181)
  • DeleteMetadata (183-229)
tests/helpers/fault_quarantine.go (2)
  • AssertQuarantineState (315-382)
  • QuarantineAssertion (56-60)
πŸ”‡ Additional comments (44)
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (2)

1391-1506: Well-structured test suite for processing strategy.

The test suite properly validates the new ProcessingStrategy filtering behavior with good coverage:

  • STORE_ONLY events (fatal and non-fatal) correctly don't create node conditions or K8s events
  • EXECUTE_REMEDIATION events properly create node conditions
  • Mixed strategy scenarios validate selective processing

The table-driven approach and per-test isolation with local clientsets align well with the coding guidelines.


1508-1587: Test implementation looks correct.

The test harness correctly:

  • Creates isolated environments per test case
  • Filters standard K8s node conditions (NodeReady, NodeMemoryPressure, etc.) when counting NVSentinel conditions
  • Validates both the presence/absence of node conditions and K8s events

One minor observation: the description field is logged at the end but could be incorporated into the assertion failure messages for better debugging context, though the current approach is acceptable.

platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)

372-416: Implementation correctly integrates filtering.

The processHealthEvents function properly:

  1. Filters events using filterProcessableEvents first
  2. Uses only processableEvents for both node condition updates and K8s event creation
  3. Maintains the existing logic for fatal vs non-fatal event handling

One edge case to consider: if processableEvents is empty (all events are STORE_ONLY), the function returns early without errors, which appears to be the intended behavior.

event-exporter/pkg/transformer/cloudevents.go (1)

66-66: Correctly propagates processingStrategy to CloudEvent.

The addition of processingStrategy to the CloudEvent data payload is consistent with how other enum fields (e.g., recommendedAction on line 61) are handled, using .String() for serialization.
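For context, the .String() serialization pattern the comment refers to can be sketched with a stand-in Stringer. In the real code the String() method is generated by protoc; the type and payload names here are illustrative only:

```go
package main

import "fmt"

// ProcessingStrategy stands in for the generated proto enum; the real
// String() method comes from protoc-generated code.
type ProcessingStrategy int32

const (
	ExecuteRemediation ProcessingStrategy = 0
	StoreOnly          ProcessingStrategy = 1
)

// String returns the enum's wire name, mirroring the generated code's behavior.
func (s ProcessingStrategy) String() string {
	if s == StoreOnly {
		return "STORE_ONLY"
	}
	return "EXECUTE_REMEDIATION"
}

// buildPayload mimics embedding the enum as a string in the CloudEvent
// data map, the same way recommendedAction is handled.
func buildPayload(strategy ProcessingStrategy) map[string]string {
	return map[string]string{
		"processingStrategy": strategy.String(),
	}
}

func main() {
	fmt.Println(buildPayload(StoreOnly)["processingStrategy"]) // STORE_ONLY
}
```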

fault-quarantine/pkg/evaluator/rule_evaluator_test.go (1)

263-263: Test correctly updated for new processingStrategy field.

The expected map now includes processingStrategy: float64(0), which correctly reflects:

  1. The default enum value EXECUTE_REMEDIATION = 0 when no explicit value is set
  2. The JSON unmarshaling behavior where numbers become float64 in interface{}
store-client/pkg/client/pipeline_builder_test.go (1)

69-86: Test follows established patterns correctly.

The new test TestProcessableHealthEventInsertsPipeline is well-structured and consistent with the existing test patterns in this file:

  • Uses table-driven tests for both MongoDB and PostgreSQL implementations
  • Properly validates pipeline is non-nil, non-empty, and has exactly 1 stage
  • Uses require for critical assertions and assert for validations

As per coding guidelines, consider using a more descriptive test name format like TestPipelineBuilder_ProcessableHealthEventInserts to align with TestFunctionName_Scenario_ExpectedBehavior.

fault-quarantine/pkg/initializer/init.go (1)

66-66: LGTM! Pipeline switch correctly filters for processable events.

The change from BuildAllHealthEventInsertsPipeline() to BuildProcessableHealthEventInsertsPipeline() correctly ensures that fault-quarantine only processes health events with processingStrategy=EXECUTE_REMEDIATION, ignoring observability-only (STORE_ONLY) events.

distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml (1)

61-62: LGTM! Processing strategy argument properly configured.

The --processing-strategy argument is correctly added with the value sourced from .Values.processingStrategy and properly quoted. This aligns with the PR objectives to enable configurable event handling strategy.

distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml (1)

57-62: LGTM! Well-documented configuration option.

The processingStrategy configuration is clearly documented with:

  • Valid values (EXECUTE_REMEDIATION, STORE_ONLY)
  • Default value that maintains backward compatibility
  • Clear explanations of each mode's behavior

This follows the coding guidelines for Helm chart documentation.

distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml (1)

61-62: LGTM! Consistent with DCGM 3.x template.

The --processing-strategy argument is added consistently with the DCGM 3.x DaemonSet template, ensuring uniform behavior across DCGM versions.

store-client/pkg/client/postgresql_pipeline_builder.go (2)

19-19: LGTM! Import added for ProcessingStrategy enum.

The import of github.com/nvidia/nvsentinel/data-models/pkg/protos is necessary to reference the ProcessingStrategy_EXECUTE_REMEDIATION enum value used in the new pipeline.


119-132: LGTM! Pipeline correctly filters for processable events.

The new BuildProcessableHealthEventInsertsPipeline() method:

  • Follows the established pipeline pattern from BuildAllHealthEventInsertsPipeline()
  • Correctly filters INSERT operations where processingStrategy equals EXECUTE_REMEDIATION
  • Uses the appropriate int32 cast for the protobuf enum value

This enables PostgreSQL change streams to ignore observability-only events (STORE_ONLY).
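The shape of such a change-stream match stage can be sketched with plain maps. The field path and function name below are illustrative assumptions, not the actual store-client D/E helpers; the point is that the filter compares against the enum's int32 value:

```go
package main

import "fmt"

// ProcessingStrategy stands in for the generated proto enum.
type ProcessingStrategy int32

const ExecuteRemediation ProcessingStrategy = 0

// matchStage sketches the pipeline stage: only INSERT operations whose
// processingStrategy equals EXECUTE_REMEDIATION pass through to consumers
// such as fault-quarantine. The document path is a hypothetical example.
func matchStage() map[string]any {
	return map[string]any{
		"$match": map[string]any{
			"operationType": "insert",
			"fullDocument.healthevent.processingStrategy": int32(ExecuteRemediation),
		},
	}
}

func main() {
	stage := matchStage()["$match"].(map[string]any)
	fmt.Println(stage["fullDocument.healthevent.processingStrategy"]) // 0
}
```

Because STORE_ONLY events never match this stage, they stay in the datastore for observability but never wake the remediation consumers.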

event-exporter/pkg/transformer/cloudevents_test.go (2)

69-69: LGTM! Test correctly initializes ProcessingStrategy.

The test event properly sets ProcessingStrategy: pb.ProcessingStrategy_STORE_ONLY to validate that the new field is handled correctly during CloudEvent transformation.


106-108: LGTM! Test validates ProcessingStrategy propagation.

The assertion correctly verifies that the processingStrategy field appears in the CloudEvent payload with the expected string value "STORE_ONLY", ensuring proper transformation from the protobuf enum.

store-client/pkg/client/pipeline_builder.go (1)

35-38: LGTM! Interface extension well-documented.

The new BuildProcessableHealthEventInsertsPipeline() method:

  • Is clearly documented with its purpose and use case
  • Explains the filtering behavior (processingStrategy=EXECUTE_REMEDIATION)
  • References the consumer (fault-quarantine)
  • Follows the documentation pattern of other interface methods
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (2)

14-17: LGTM! ProcessingStrategy enum properly defined.

The new ProcessingStrategy enum class with EXECUTE_REMEDIATION and STORE_ONLY members is correctly structured according to Python protobuf stub conventions.


31-32: LGTM! HealthEvent ProcessingStrategy integration complete.

The processingStrategy field is properly integrated into the HealthEvent message with:

  • Module-level enum constants (lines 31-32)
  • Field in __slots__ (line 78)
  • Field number constant (line 104)
  • Type annotation (line 120)
  • __init__ parameter with correct typing (line 138)

This generated stub correctly reflects the protobuf schema changes.

Also applies to: 78-78, 104-104, 120-120, 138-138

tests/event_exporter_test.go (1)

25-26: Import addition looks correct.

The import of "tests/helpers" is properly added and used throughout the test file for helper functions.

data-models/protobufs/health_event.proto (2)

32-38: Well-designed enum with appropriate default value.

Using EXECUTE_REMEDIATION = 0 as the default is correct designβ€”existing clients that don't set the field will get the expected remediation behavior. The comments clearly explain each strategy's semantics, which aligns with the coding guidelines for comprehensive field documentation in Protocol Buffer messages.
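A short sketch of why the zero-value default matters. In proto3 an absent enum field reads back as its zero value, so a client that never sets the field behaves as if it had chosen EXECUTE_REMEDIATION (stand-in types below, not the generated code):

```go
package main

import "fmt"

// ProcessingStrategy stands in for the generated enum.
type ProcessingStrategy int32

const (
	ExecuteRemediation ProcessingStrategy = 0 // proto3 zero value = default
	StoreOnly          ProcessingStrategy = 1
)

// HealthEvent stands in for the protobuf message.
type HealthEvent struct {
	Strategy ProcessingStrategy
}

func main() {
	var e HealthEvent // field never set, as by a pre-upgrade client
	fmt.Println(e.Strategy == ExecuteRemediation) // true
}
```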


77-77: Field addition looks correct.

The new processingStrategy field at position 16 maintains backward compatibility and follows the existing field numbering sequence.

store-client/pkg/client/mongodb_pipeline_builder.go (2)

87-100: Pipeline implementation is correct.

The method follows existing patterns and correctly filters for EXECUTE_REMEDIATION events using the protobuf enum's integer value. This ensures only events intended for remediation are processed by downstream consumers.


19-19: Import addition is appropriate.

The protos import is required to reference ProcessingStrategy_EXECUTE_REMEDIATION constant.

tests/helpers/healthevent.go (2)

48-48: Field addition follows existing patterns.

The ProcessingStrategy field with omitempty JSON tag is consistent with other optional fields in the struct. The default zero value maps to EXECUTE_REMEDIATION, which is the expected default behavior.


153-156: Builder method follows established conventions.

The WithProcessingStrategy method maintains consistency with other builder methods in the file.
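The builder convention in question is a fluent setter that returns the receiver. A minimal Python analogue (names mirror `tests/helpers/healthevent.go`, but this is illustrative only, not the Go helper itself):

```python
# Minimal Python analogue of the Go builder pattern described above.
class HealthEventTemplate:
    def __init__(self, node_name: str) -> None:
        self.node_name = node_name
        self.processing_strategy = 0  # EXECUTE_REMEDIATION by default

    def with_processing_strategy(self, strategy: int) -> "HealthEventTemplate":
        self.processing_strategy = strategy
        return self  # returning self keeps calls chainable

event = HealthEventTemplate("worker-0").with_processing_strategy(1)
print(event.processing_strategy)  # 1
```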

tests/gpu_health_monitor_test.go (2)

414-463: Test structure and setup are well-organized.

The test properly:

  1. Updates the DaemonSet to use STORE_ONLY strategy
  2. Waits for rollout completion
  3. Injects a GPU error to trigger event generation
  4. Stores context values for teardown

The use of UpdateDaemonSetProcessingStrategy and GetDaemonSetPodOnWorkerNode helpers keeps the test readable.


465-481: Test assertions correctly validate STORE_ONLY behavior.

The test verifies that with STORE_ONLY processing strategy:

  • Node conditions are NOT applied
  • Node is NOT cordoned
  • No quarantine annotation is present

This validates the core feature that STORE_ONLY events are observed but don't modify cluster state.
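The branch the test exercises can be sketched as a consumer that always persists the event but only mutates cluster state when remediation is requested (function and action names here are hypothetical):

```python
# Hedged sketch of the store-vs-remediate branch validated by the test.
EXECUTE_REMEDIATION, STORE_ONLY = 0, 1

def process_event(strategy: int, actions: list[str]) -> None:
    actions.append("store")  # every event is persisted for observability
    if strategy == EXECUTE_REMEDIATION:
        actions.append("cordon")  # cluster state changes only on remediation

taken: list[str] = []
process_event(STORE_ONLY, taken)
print(taken)  # ['store'] -- no cordon, matching the assertions above
```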

health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py (5)

107-108: Processing strategy parameter correctly added to test initialization.

The test now passes platformconnector_pb2.STORE_ONLY to the processor, enabling verification of the new processing strategy flow.


301-302: Good assertion for processingStrategy propagation.

Verifying that nvlink_failure_event.processingStrategy == platformconnector_pb2.STORE_ONLY confirms the strategy is correctly propagated through the event pipeline to the generated HealthEvent.


523-524: Test uses EXECUTE_REMEDIATION for connectivity restored scenario.

This is appropriate: the connectivity restored test uses EXECUTE_REMEDIATION which tests the alternative processing path, providing coverage for both enum values.


549-549: Assertion validates EXECUTE_REMEDIATION propagation.

This confirms the test covers the EXECUTE_REMEDIATION strategy path in the restored connectivity scenario.


493-493: Comprehensive assertion for DCGM connectivity failure event.

The test verifies all expected fields including the new processingStrategy field, ensuring the complete event structure is validated.

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)

1-51: LGTM - Generated protobuf code.

This is an auto-generated file from the protobuf compiler. The changes correctly reflect the addition of the ProcessingStrategy enum and the processingStrategy field (field number 16) to the HealthEvent message, as indicated by the updated serialized descriptor and adjusted byte ranges.

health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py (3)

24-24: LGTM - Import alias consistent with existing usage.

The alias platformconnector_pb2 for health_event_pb2 maintains consistency with the existing import pattern in platform_connector.py.


74-80: LGTM - CLI option well-defined.

The new --processing-strategy option is appropriately configured with a sensible default (EXECUTE_REMEDIATION) and a clear help string describing the valid values.


28-51: LGTM - Parameter threading through event processor initialization.

The processing_strategy parameter is correctly added to _init_event_processor signature and passed through to PlatformConnectorEventProcessor.

tests/helpers/kube.go (4)

2208-2249: LGTM - DaemonSet rollout wait logic.

The rollout status checks correctly verify that all desired pods are scheduled, updated, and ready before considering the rollout complete.
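The completion condition reduces to a pure predicate over the DaemonSet status counters (the real helper inspects `appsv1.DaemonSetStatus` in Go; this is only a sketch of the check):

```python
# Pure-function sketch of the rollout-complete check described above.
def rollout_complete(desired: int, updated: int, ready: int) -> bool:
    # All desired pods must be scheduled, running the new template,
    # and passing readiness before the rollout counts as done.
    return desired > 0 and updated == desired and ready == desired

print(rollout_complete(3, 3, 3))  # True
print(rollout_complete(3, 3, 2))  # False: one pod not yet ready
```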


2251-2268: LGTM - Container argument update handles both formats.

The helper correctly handles both --processing-strategy=VALUE and --processing-strategy VALUE argument formats, and appends the argument if not present.
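An illustrative port of that update logic (the actual helper is Go code in `tests/helpers/kube.go`; this sketch shows the two argument shapes being handled):

```python
# Illustrative sketch: update a container arg whether it is written as
# "--flag=value" or as "--flag value", appending the flag when absent.
def set_arg(args: list[str], flag: str, value: str) -> list[str]:
    out = list(args)
    for i, a in enumerate(out):
        if a.startswith(flag + "="):
            out[i] = f"{flag}={value}"   # --flag=old -> --flag=new
            return out
        if a == flag and i + 1 < len(out):
            out[i + 1] = value           # --flag old -> --flag new
            return out
    out.append(f"{flag}={value}")        # flag absent: append it
    return out
```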


2312-2344: LGTM - RestoreDaemonSet correctly handles nil input.

The function appropriately handles the case where originalDaemonSet is nil by logging and returning early.


2346-2390: LGTM - GetDaemonSetPodOnWorkerNode with proper readiness checks.

The function correctly verifies the pod is not being deleted, is in Running phase, and passes all readiness conditions before returning.

health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py (5)

41-66: LGTM - Processing strategy parameter added to constructor.

The processing_strategy parameter is correctly typed with platformconnector_pb2.ProcessingStrategy and stored as an instance variable for use in event creation.


106-121: LGTM - Processing strategy propagated in connectivity restored events.

The processingStrategy field is correctly included in the HealthEvent for DCGM connectivity restoration.


206-223: LGTM - Processing strategy propagated in health failure events.

The processingStrategy field is correctly included when creating health events for GPU failures.


270-287: LGTM - Processing strategy propagated in health recovery events.

The processingStrategy field is correctly included when creating health events indicating GPU recovery.


366-381: LGTM - Processing strategy propagated in connectivity failure events.

The processingStrategy field is correctly included in the HealthEvent for DCGM connectivity failures.

@github-actions

Merging this branch will decrease overall coverage

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/fault-quarantine/pkg/evaluator 41.95% (-0.57%) 👎
github.com/nvidia/nvsentinel/fault-quarantine/pkg/initializer 0.00% (ø)
github.com/nvidia/nvsentinel/tests 0.00% (ø)
github.com/nvidia/nvsentinel/tests/helpers 0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/fault-quarantine/pkg/initializer/init.go 0.00% (ø) 280 0 280
github.com/nvidia/nvsentinel/tests/helpers/healthevent.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/tests/helpers/kube.go 0.00% (ø) 0 0 0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • github.com/nvidia/nvsentinel/tests/event_exporter_test.go
  • github.com/nvidia/nvsentinel/tests/gpu_health_monitor_test.go


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (5)
tests/helpers/event_exporter.go (1)

221-226: Add function documentation per coding guidelines.

The exported function ValidateCloudEvent lacks a godoc comment. As per coding guidelines, "Function comments required for all exported Go functions." Please add documentation explaining the function's purpose and parameters.

🔎 Suggested documentation
+// ValidateCloudEvent validates that a CloudEvent has the expected structure and content.
+// It checks the CloudEvent spec version, type, source, and validates the embedded HealthEvent
+// fields including node name, message, check name, error code, and processing strategy.
 func ValidateCloudEvent(
 	t *testing.T,
 	event map[string]any,
 	expectedNodeName, expectedMessage, expectedCheckName, expectedErrorCode string,
 	expectedProcessingStrategy string,
 ) {

Based on coding guidelines: "Function comments required for all exported Go functions"

tests/helpers/kube.go (4)

2257-2269: Unused variable originalDaemonSet.

The originalDaemonSet variable is assigned on line 2268 but never used. This appears to be dead code, possibly leftover from a previous implementation that intended to restore the original state.

🔎 Proposed fix
 	err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
 		daemonSet := &appsv1.DaemonSet{}
 		if err := client.Resources().Get(ctx, daemonsetName, NVSentinelNamespace, daemonSet); err != nil {
 			return err
 		}
 
-		if originalDaemonSet == nil {
-			originalDaemonSet = daemonSet.DeepCopy()
-		}
-
 		containers := daemonSet.Spec.Template.Spec.Containers

2288-2289: Consider removing redundant sleep.

The 10-second sleep appears redundant since waitForDaemonSetRollout already waits until all pods are ready (NumberReady == DesiredNumberScheduled). If there's a specific edge case requiring this delay (e.g., waiting for readiness probes to stabilize), consider documenting it; otherwise, this could be removed.


2294-2327: Inconsistent error handling pattern.

This function mixes two error handling approaches: it returns an error but also calls require.NoError on line 2319 which will fail the test immediately. This is inconsistent with UpdateDaemonSetArgs which only returns errors.

If the function returns an error, callers should handle it. Using require.NoError makes the return value meaningless since the test fails before returning.

🔎 Proposed fix - Option A: Return error consistently (like UpdateDaemonSetArgs)
 	err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
 		daemonSet := &appsv1.DaemonSet{}
 		if err := client.Resources().Get(ctx, daemonsetName, NVSentinelNamespace, daemonSet); err != nil {
 			return err
 		}
 
 		containers := daemonSet.Spec.Template.Spec.Containers
 
 		for i := range containers {
 			if containers[i].Name == containerName {
 				removeArgsFromContainer(&containers[i], args)
 				break
 			}
 		}
 
 		return client.Resources().Update(ctx, daemonSet)
 	})
-	require.NoError(t, err, "failed to remove args from daemonset %s/%s", NVSentinelNamespace, daemonsetName)
+	if err != nil {
+		return fmt.Errorf("failed to remove args from daemonset %s/%s: %w", NVSentinelNamespace, daemonsetName, err)
+	}

2409-2453: Parameter daemonsetName is only used in error message, not for pod selection.

The function signature accepts daemonsetName but it's only used in the error message on line 2449. The actual pod selection uses podNamePattern via GetPodOnWorkerNode. This could be misleading since the function name implies it validates the pod belongs to the specified DaemonSet.

Consider either:

  1. Removing the daemonsetName parameter if pattern matching is sufficient
  2. Adding validation that the pod is actually owned by the specified DaemonSet
🔎 Option 1: Remove unused parameter
-func GetDaemonSetPodOnWorkerNode(ctx context.Context, t *testing.T, client klient.Client,
-	daemonsetName string, podNamePattern string) (*v1.Pod, error) {
+func GetDaemonSetPodOnWorkerNode(ctx context.Context, t *testing.T, client klient.Client,
+	podNamePattern string) (*v1.Pod, error) {
 	t.Helper()
 
 	var resultPod *v1.Pod
 
 	require.Eventually(t, func() bool {
 		// Get the pod
 		pod, err := GetPodOnWorkerNode(ctx, t, client, NVSentinelNamespace, podNamePattern)
 		if err != nil {
 			t.Logf("Failed to get pod: %v", err)
 			return false
 		}
         ...
 	}, EventuallyWaitTimeout, WaitInterval, "daemonset pod from current rollout should be running and ready")
 
 	if resultPod == nil {
-		return nil, fmt.Errorf("failed to get ready pod for daemonset %s", daemonsetName)
+		return nil, fmt.Errorf("failed to get ready pod matching pattern %s", podNamePattern)
 	}
 
 	return resultPod, nil
 }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8cd1107 and 8398875.

📒 Files selected for processing (3)
  • tests/gpu_health_monitor_test.go
  • tests/helpers/event_exporter.go
  • tests/helpers/kube.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/gpu_health_monitor_test.go
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • tests/helpers/event_exporter.go
  • tests/helpers/kube.go
🧠 Learnings (1)
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/daemonset*.yaml : Explain DaemonSet variant selection logic in Helm chart documentation

Applied to files:

  • tests/helpers/kube.go
🔇 Additional comments (5)
tests/helpers/event_exporter.go (1)

245-245: LGTM! Validation follows the established pattern.

The validation of processingStrategy correctly follows the same pattern as other field validations in this function, using require.Equal consistently.

tests/helpers/kube.go (4)

2208-2249: LGTM!

The waitForDaemonSetRollout function correctly polls the DaemonSet status and follows the same patterns as the existing WaitForDeploymentRollout function. The rollout completion checks for DesiredNumberScheduled, UpdatedNumberScheduled, and NumberReady are appropriate.


2329-2357: LGTM!

The tryUpdateExistingArg helper correctly handles both --flag=value and --flag value argument styles. The slice manipulation for inserting a value after a flag is correct.


2359-2381: LGTM!

The function correctly sets container arguments, leveraging tryUpdateExistingArg to handle existing args and appending new ones as needed.


2383-2407: LGTM!

The function correctly handles removal of both --flag=value and --flag value style arguments, appropriately breaking after modification to avoid issues with slice iteration.

@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-in-gpu-monitor branch from 8398875 to 232c636 Compare December 25, 2025 12:17

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (8)
tests/helpers/healthevent.go (1)

48-48: Consider using the protobuf enum type for type safety.

The ProcessingStrategy field uses int instead of protos.ProcessingStrategy. While this provides flexibility in tests, it loses type safety and may allow invalid values.

Consider whether type safety is valuable here:

🔎 Option to use the enum type

If you prefer compile-time type checking:

+import "github.com/nvidia/nvsentinel/data-models/pkg/protos"
+
 type HealthEventTemplate struct {
 	Version             int                  `json:"version"`
 	Agent               string               `json:"agent"`
 	ComponentClass      string               `json:"componentClass,omitempty"`
 	CheckName           string               `json:"checkName"`
 	IsFatal             bool                 `json:"isFatal"`
 	IsHealthy           bool                 `json:"isHealthy"`
 	Message             string               `json:"message"`
 	RecommendedAction   int                  `json:"recommendedAction,omitempty"`
 	ErrorCode           []string             `json:"errorCode,omitempty"`
 	EntitiesImpacted    []EntityImpacted     `json:"entitiesImpacted,omitempty"`
 	Metadata            map[string]string    `json:"metadata,omitempty"`
 	QuarantineOverrides *QuarantineOverrides `json:"quarantineOverrides,omitempty"`
 	NodeName            string               `json:"nodeName"`
-	ProcessingStrategy  int                  `json:"processingStrategy,omitempty"`
+	ProcessingStrategy  protos.ProcessingStrategy `json:"processingStrategy,omitempty"`
 }

Then update the builder:

-func (h *HealthEventTemplate) WithProcessingStrategy(strategy int) *HealthEventTemplate {
+func (h *HealthEventTemplate) WithProcessingStrategy(strategy protos.ProcessingStrategy) *HealthEventTemplate {
 	h.ProcessingStrategy = strategy
 	return h
 }
tests/helpers/kube.go (1)

2312-2313: Consider if the 10-second sleep is necessary.

After waitForDaemonSetRollout completes, all pods are confirmed updated and ready. The additional 10-second sleep may be unnecessary unless there's a specific stabilization requirement not covered by the readiness checks.

If the sleep is for pod initialization beyond readiness, consider adding a comment explaining why. Otherwise, this delay might be removable:

 	t.Logf("Waiting for daemonset %s/%s rollout to complete", NVSentinelNamespace, daemonsetName)
 	waitForDaemonSetRollout(ctx, t, client, daemonsetName)
 
-	t.Logf("Waiting 10 seconds for daemonset pods to start")
-	time.Sleep(10 * time.Second)
-
 	return nil
 }
docs/postgresql-schema.sql (1)

106-109: Consider adding constraints for data integrity.

The processing_strategy column is nullable and has no constraints. Consider adding:

  1. A CHECK constraint to ensure only valid enum values are stored
  2. A NOT NULL constraint with a default value if every health event should have a strategy
🔎 Option to add constraints

If you want to enforce valid values at the database level:

     -- Metadata
     created_at TIMESTAMPTZ DEFAULT NOW(),
-    updated_at TIMESTAMPTZ DEFAULT NOW(),
+    updated_at TIMESTAMPTZ DEFAULT NOW() NOT NULL,
 
     -- Event handling strategy
-    processing_strategy VARCHAR(50)
+    processing_strategy VARCHAR(50) NOT NULL DEFAULT 'EXECUTE_REMEDIATION' 
+        CHECK (processing_strategy IN ('EXECUTE_REMEDIATION', 'STORE_ONLY'))
 );

This prevents invalid values and ensures consistency, but reduces flexibility if new enum values are added later without a migration.

tests/gpu_health_monitor_test.go (1)

488-489: Use defined constants instead of hardcoded strings.

Lines 488-489 use hardcoded strings "gpu-health-monitor-dcgm-4.x" and "gpu-health-monitor" instead of the constants GPUHealthMonitorDaemonSetName and GPUHealthMonitorContainerName defined at lines 42-43.

🔎 Suggested fix
-		err = helpers.RemoveDaemonSetArgs(ctx, t, client, "gpu-health-monitor-dcgm-4.x", "gpu-health-monitor", map[string]string{
+		err = helpers.RemoveDaemonSetArgs(ctx, t, client, GPUHealthMonitorDaemonSetName, GPUHealthMonitorContainerName, map[string]string{
 			"--processing-strategy": "EXECUTE_REMEDIATION"})
tests/platform-connector_test.go (2)

28-32: Remove unused struct fields.

ConfigMapBackup and TestNamespace fields are defined but never used in the test. Consider removing them to keep the code clean.

🔎 Suggested fix
 type PlatformConnectorTestContext struct {
 	NodeName        string
-	ConfigMapBackup []byte
-	TestNamespace   string
 }

98-101: Teardown sends healthy event but doesn't verify cleanup of STORE_ONLY events.

The teardown only sends a healthy event. Consider verifying that any state from the STORE_ONLY test cases is properly cleaned up, or add a comment explaining why no cleanup verification is needed.

health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py (1)

125-132: Good error handling for invalid processing strategy.

The validation correctly:

  1. Uses protobuf's Value() method to validate and convert the string
  2. Catches ValueError for invalid inputs
  3. Logs all valid options to help users correct their configuration
  4. Exits with code 1 on invalid input

One minor note: Line 132 logs processing_strategy_value which is the integer enum value. Consider logging the string name for better readability.

🔎 Optional: Log the strategy name for better readability
-    log.info(f"Event handling strategy configured to: {processing_strategy_value}")
+    log.info(f"Event handling strategy configured to: {processing_strategy}")
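The validation flow being praised can be sketched with a plain allow-list in place of protobuf's `Value()` lookup (only the two strategy names are taken from the source; the helper itself is hypothetical):

```python
# Sketch of the CLI validation flow described above, using an allow-list
# instead of the protobuf enum's Value() lookup.
import sys

VALID = {"EXECUTE_REMEDIATION": 0, "STORE_ONLY": 1}

def parse_strategy(name: str) -> int:
    try:
        return VALID[name]
    except KeyError:
        # Mirror the reviewed behavior: report valid options, then exit 1.
        print(f"invalid --processing-strategy {name!r}; "
              f"valid options: {sorted(VALID)}", file=sys.stderr)
        sys.exit(1)

print(parse_strategy("STORE_ONLY"))  # 1
```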
health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py (1)

125-135: Consider adding processingStrategy assertion to first test case.

The test_health_event_occurred test at lines 125-135 verifies event properties but doesn't assert on processingStrategy. While other tests cover this, adding an assertion here would ensure complete coverage.

🔎 Suggested addition
             if event.checkName == "GpuInforomWatch" and event.isHealthy == False:
                 assert event.errorCode[0] == "DCGM_FR_CORRUPT_INFOROM"
                 assert event.entitiesImpacted[0].entityValue == "0"
                 assert event.recommendedAction == platformconnector_pb2.RecommendedAction.COMPONENT_RESET
+                assert event.processingStrategy == platformconnector_pb2.STORE_ONLY
             else:
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8398875 and 232c636.

📒 Files selected for processing (17)
  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml
  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml
  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml
  • distros/kubernetes/nvsentinel/values-tilt-postgresql.yaml
  • docs/postgresql-schema.sql
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • fault-quarantine/pkg/initializer/init.go
  • health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py
  • tests/event_exporter_test.go
  • tests/fault_quarantine_test.go
  • tests/gpu_health_monitor_test.go
  • tests/helpers/event_exporter.go
  • tests/helpers/healthevent.go
  • tests/helpers/kube.go
  • tests/platform-connector_test.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • fault-quarantine/pkg/initializer/init.go
  • tests/event_exporter_test.go
  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml
  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml
🧰 Additional context used
📓 Path-based instructions (4)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • tests/helpers/event_exporter.go
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • tests/helpers/healthevent.go
  • tests/gpu_health_monitor_test.go
  • tests/helpers/kube.go
  • tests/fault_quarantine_test.go
  • tests/platform-connector_test.go
**/values.yaml

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/values.yaml: Document all values in Helm chart values.yaml with inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable

Files:

  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • tests/gpu_health_monitor_test.go
  • tests/fault_quarantine_test.go
  • tests/platform-connector_test.go
**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code

Files:

  • health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
🧠 Learnings (8)
📚 Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • tests/helpers/event_exporter.go
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • tests/platform-connector_test.go
  • health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • tests/gpu_health_monitor_test.go
  • tests/fault_quarantine_test.go
  • tests/platform-connector_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • tests/gpu_health_monitor_test.go
  • tests/fault_quarantine_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • tests/gpu_health_monitor_test.go
  • tests/fault_quarantine_test.go
  • tests/platform-connector_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on kata.enabled label

Applied to files:

  • tests/helpers/kube.go
📚 Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • tests/fault_quarantine_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

  • tests/platform-connector_test.go
📚 Learning: 2025-11-04T06:31:02.147Z
Learnt from: Gyan172004
Repo: NVIDIA/NVSentinel PR: 223
File: platform-connectors/pkg/nodemetadata/processor.go:0-0
Timestamp: 2025-11-04T06:31:02.147Z
Learning: In platform-connectors/pkg/nodemetadata/processor.go, the NewProcessor function does not perform a nil check on the config parameter because the caller is expected to guarantee a non-nil config is provided.

Applied to files:

  • tests/platform-connector_test.go
🧬 Code graph analysis (2)
tests/helpers/healthevent.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-17)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
tests/fault_quarantine_test.go (4)
tests/helpers/fault_quarantine.go (4)
  • QuarantineTestContext (51-54)
  • SetupQuarantineTest (107-112)
  • AssertQuarantineState (315-382)
  • QuarantineAssertion (56-60)
tests/helpers/kube.go (1)
  • SetNodeManagedByNVSentinel (1389-1408)
tests/helpers/healthevent.go (3)
  • NewHealthEvent (60-76)
  • SendHealthEvent (263-275)
  • SendHealthyEvent (277-287)
data-models/pkg/protos/health_event.pb.go (2)
  • ProcessingStrategy_STORE_ONLY (47-47)
  • ProcessingStrategy_EXECUTE_REMEDIATION (46-46)
🔇 Additional comments (28)
tests/helpers/kube.go (5)

387-409: LGTM!

The EnsureNodeEventNotPresent helper correctly uses require.Never to assert that a specific event type and reason combination never appears on a node during the test window. The implementation properly queries events and checks both type and reason.


2232-2273: LGTM!

The waitForDaemonSetRollout function correctly implements DaemonSet rollout verification by checking that all desired pods are scheduled, updated, and ready. The logic matches kubectl rollout status behavior and includes helpful progress logging.


2318-2351: LGTM!

The RemoveDaemonSetArgs function correctly removes specified arguments from a DaemonSet container and waits for the rollout to complete. Uses retry.RetryOnConflict appropriately without error wrapping.


2353-2431: LGTM!

The argument manipulation helpers (tryUpdateExistingArg, setArgsOnContainer, removeArgsFromContainer) correctly handle various command-line argument formats:

  • --flag=value style
  • --flag value style (separate entries)
  • --flag style (boolean flags)

The logic properly preserves argument order and handles edge cases like updating existing args vs adding new ones.


2433-2477: LGTM!

The GetDaemonSetPodOnWorkerNode helper correctly retrieves a ready DaemonSet pod with proper validation:

  • Verifies pod is not being deleted (DeletionTimestamp == nil)
  • Confirms pod is in Running phase
  • Checks pod readiness conditions
  • Uses require.Eventually for reliable polling

This defensive checking improves test stability by ensuring the pod from the current rollout is fully operational.

fault-quarantine/pkg/evaluator/rule_evaluator_test.go (1)

263-263: LGTM!

The test correctly updates the expected map to include the new processingStrategy field with value float64(0), matching the default enum value ProcessingStrategy_EXECUTE_REMEDIATION. Using float64 is correct for JSON unmarshaling behavior.

distros/kubernetes/nvsentinel/values-tilt-postgresql.yaml (1)

219-222: LGTM!

The schema addition matches the canonical source in docs/postgresql-schema.sql. The file header correctly documents the sync process using make update-helm-postgres-schema and make validate-postgres-schema.

tests/helpers/event_exporter.go (2)

245-245: LGTM!

The assertion correctly validates that the processingStrategy field in the CloudEvent data matches the expected value. The comparison works properly since the JSON-unmarshaled value will be a string.


221-226: All callers of ValidateCloudEvent have been properly updated. The function call at tests/event_exporter_test.go:85 correctly passes all 7 required parameters, including the new expectedProcessingStrategy parameter ("EXECUTE_REMEDIATION"). No outdated calls remain in the codebase.

tests/fault_quarantine_test.go (4)

233-250: LGTM!

The test setup correctly:

  • Uses the existing SetupQuarantineTest helper
  • Sets the node as managed by NVSentinel (required for quarantine logic)
  • Follows the established pattern from other tests in this file

252-268: LGTM!

The STORE_ONLY assessment correctly validates that events with ProcessingStrategy_STORE_ONLY:

  • Do NOT cause the node to be cordoned
  • Do NOT add quarantine annotations

This properly tests the observability-only behavior where events are stored but don't modify cluster state.


270-286: LGTM!

The EXECUTE_REMEDIATION assessment correctly validates that events with ProcessingStrategy_EXECUTE_REMEDIATION:

  • DO cause the node to be cordoned
  • DO add quarantine annotations

This properly tests the normal remediation behavior where the system takes corrective actions.


288-295: LGTM!

The teardown correctly:

  • Sends a healthy event to clear the quarantine state from the EXECUTE_REMEDIATION assessment
  • Uses TeardownQuarantineTest to restore the original configuration and clean up test resources

This ensures proper test isolation and cleanup.

distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml (1)

57-62: LGTM! Clear documentation and sensible default.

The processingStrategy configuration option is well-documented with:

  • Clear valid values: EXECUTE_REMEDIATION, STORE_ONLY
  • Sensible default: EXECUTE_REMEDIATION (maintains backward compatibility)
  • Clear behavior description for each mode

Both DaemonSet templates (daemonset-dcgm-3.x.yaml and daemonset-dcgm-4.x.yaml) correctly reference this value using {{ .Values.processingStrategy | quote }}.

tests/gpu_health_monitor_test.go (3)

34-48: LGTM on constants organization.

The constants are well-organized with exported names for reuse. The separation of DCGM-related constants from GPU health monitor constants improves readability.


413-462: Test setup and error injection logic is well-structured.

The test correctly:

  1. Configures the DaemonSet with STORE_ONLY strategy
  2. Waits for the pod to be ready
  3. Injects test metadata and sets the node label
  4. Injects a DCGM Inforom error to trigger the health monitor

The flow aligns with the PR objective of verifying STORE_ONLY events are stored without triggering remediation.


464-480: LGTM on assess phase.

The assertions correctly verify that:

  1. Node conditions are not applied when using STORE_ONLY strategy
  2. Node is not cordoned

This validates the expected behavior of the STORE_ONLY processing strategy.

tests/platform-connector_test.go (1)

53-96: Test thoroughly covers both processing strategies.

The assess phase correctly validates:

  1. STORE_ONLY events don't apply node conditions or emit events
  2. EXECUTE_REMEDIATION events do apply conditions and events

The test uses both fatal (ERRORCODE_79) and non-fatal (ERRORCODE_31) error codes to cover different scenarios.

health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py (2)

74-80: CLI option for processing strategy is well-designed.

The option:

  • Has a sensible default (EXECUTE_REMEDIATION) for backward compatibility
  • Provides clear help text describing valid values
  • Is marked as optional

37-51: Function signature updated correctly with type hint.

The processing_strategy parameter is properly typed with platformconnector_pb2.ProcessingStrategy and passed through to the PlatformConnectorEventProcessor constructor. As per coding guidelines, type hints are required for all functions in Python code.

health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py (3)

107-108: LGTM on test update for processing_strategy parameter.

The test correctly passes platformconnector_pb2.STORE_ONLY as the processing strategy to the event processor constructor.


301-302: Good assertion on processingStrategy propagation.

The test verifies that the processingStrategy field on the emitted HealthEvent matches the strategy configured in the processor (STORE_ONLY).


523-549: Test correctly verifies EXECUTE_REMEDIATION strategy propagation.

This test case uses EXECUTE_REMEDIATION strategy and verifies the restored event has the correct processingStrategy field. Good coverage of both strategy values across different test cases.

health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py (5)

51-66: Processing strategy parameter and storage are correctly implemented.

The constructor:

  1. Accepts the typed processing_strategy parameter
  2. Stores it as self._processing_strategy following Python naming conventions for protected attributes

As per coding guidelines, type hints are required for all functions in Python code, which is satisfied here.


106-121: processingStrategy correctly added to connectivity restored event.

The clear_dcgm_connectivity_failure method properly includes processingStrategy=self._processing_strategy in the HealthEvent message.


206-223: processingStrategy correctly added to health event for entity failures.

The HealthEvent created when entity failures are detected properly includes the processing strategy.


270-287: processingStrategy correctly added to healthy status events.

The HealthEvent created for healthy (PASS) status properly includes the processing strategy.


366-381: processingStrategy correctly added to DCGM connectivity failure event.

The dcgm_connectivity_failed method properly includes processingStrategy=self._processing_strategy in the HealthEvent message. All four HealthEvent creation sites now consistently include the processing strategy.

@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-in-gpu-monitor branch from 232c636 to 9c0336d Compare January 12, 2026 04:49
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can't be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py (1)

42-66: Add return type hints to all methods and parameter type hints to error_code parameter.

Per the coding guidelines requiring "Type hints required for all functions in Python code," several methods in the class are missing return type annotations:

  • clear_dcgm_connectivity_failure()
  • health_event_occurred()
  • get_recommended_action_from_dcgm_error_map() (also missing type hint for error_code parameter)
  • send_health_event_with_retries()
  • dcgm_connectivity_failed()

The processing_strategy parameter integration itself is correctβ€”all instantiations provide the parameter, and it's properly used in HealthEvent creations. However, the class must be updated to fully comply with type hint requirements.

🤖 Fix all issues with AI agents
In @docs/designs/025-processing-strategy-for-health-checks.md:
- Around line 598-608: The pipeline example uses string values for
healthevent.processingstrategy; change those comparisons to use the integer
constant used in the implementation by replacing "EXECUTE_REMEDIATION" with
int32(protos.ProcessingStrategy_EXECUTE_REMEDIATION) (and ensure any other
processingstrategy comparisons follow the same pattern), keeping the rest of the
pipeline (including the $exists false branch) intact; update the pipeline
variable and any example snippets referencing healthevent.processingstrategy
accordingly.

In
@platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go:
- Around line 1620-1626: The test accesses events.Items[0] without ensuring
events.Items is non-empty which can panic; update the test in the block using
tc.expectKubernetesEvents to first assert or require that len(events.Items) > 0
(or use assert.NotEmpty/require.NotEmpty) before reading events.Items[0], then
compare events.Items[0].Type to tc.expectedEventType; similarly, if you assert
non-empty use the appropriate testing helper (require.* if you want to stop on
failure) so the subsequent access is safe.
- Around line 1606-1612: The test currently accesses nvsentinelConditions[0]
when tc.expectNodeConditions is true, which can panic if the slice is empty;
update the assertion to first assert that len(nvsentinelConditions) > 0 (e.g.,
assert.Greater(t, len(nvsentinelConditions), 0, ...)) before comparing
string(nvsentinelConditions[0].Type) to tc.expectedConditionType so the test
fails cleanly rather than panicking; ensure messages reference
nvsentinelConditions and tc.expectedConditionType for clarity.
🧹 Nitpick comments (10)
tests/helpers/kube.go (4)

387-409: Missing function comment for exported function.

Per coding guidelines, all exported Go functions require function comments. Also, the log message on line 405 could be more precise by including the eventReason.

Suggested improvement
+// EnsureNodeEventNotPresent asserts that a node does NOT have an event with the specified type and reason
+// within the NeverWaitTimeout period.
 func EnsureNodeEventNotPresent(ctx context.Context, t *testing.T,
 	c klient.Client, nodeName string, eventType, eventReason string) {
 	t.Helper()
 	// ... existing code ...
-		t.Logf("node %s does not have event %v", nodeName, eventType)
+		t.Logf("node %s does not have event type=%s reason=%s", nodeName, eventType, eventReason)

2275-2316: Unused variable and hardcoded sleep.

  1. originalDaemonSet (line 2281) is assigned but never used; it appears to be dead code.
  2. The hardcoded time.Sleep(10 * time.Second) on line 2313 is a code smell. If rollout is complete, pods should already be ready. Consider removing this or documenting why it's necessary.
  3. Missing function comment for exported function (per coding guidelines).
Suggested fix
+// UpdateDaemonSetArgs updates the specified container's arguments in a DaemonSet
+// and waits for the rollout to complete.
 func UpdateDaemonSetArgs(ctx context.Context, t *testing.T,
 	client klient.Client, daemonsetName string, containerName string,
 	args map[string]string) error {
 	t.Helper()
 
-	var originalDaemonSet *appsv1.DaemonSet
-
 	t.Logf("Updating daemonset %s/%s with args %v", NVSentinelNamespace, daemonsetName, args)
 
 	err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
 		daemonSet := &appsv1.DaemonSet{}
 		if err := client.Resources().Get(ctx, daemonsetName, NVSentinelNamespace, daemonSet); err != nil {
 			return err
 		}
 
-		if originalDaemonSet == nil {
-			originalDaemonSet = daemonSet.DeepCopy()
-		}
-
 		containers := daemonSet.Spec.Template.Spec.Containers
 		// ... rest of function
 	})
 	// ...
 	waitForDaemonSetRollout(ctx, t, client, daemonsetName)
 
-	t.Logf("Waiting 10 seconds for daemonset pods to start")
-	time.Sleep(10 * time.Second)
-
 	return nil
 }

2318-2351: Inconsistent error handling pattern.

The function returns error but uses require.NoError on line 2343, which will fail the test immediately if an error occurs. This makes the error return value unreachable on failure. Either:

  • Remove the error return and use require.NoError consistently (like waitForDaemonSetRollout)
  • Or return the error and let the caller handle it

Also missing function comment for exported function.

Option 1: Remove error return (consistent with test helper pattern)
-func RemoveDaemonSetArgs(ctx context.Context, t *testing.T, client klient.Client,
+// RemoveDaemonSetArgs removes the specified arguments from a DaemonSet container
+// and waits for the rollout to complete.
+func RemoveDaemonSetArgs(ctx context.Context, t *testing.T, client klient.Client,
 	daemonsetName string,
 	containerName string, args map[string]string,
-) error {
+) {
 	t.Helper()
 	// ... existing code ...
 	require.NoError(t, err, "failed to remove args from daemonset %s/%s", NVSentinelNamespace, daemonsetName)
 	// ...
 	t.Log("DaemonSet restored successfully")
-
-	return nil
 }

2433-2477: Unused parameter and missing function comment.

The daemonsetName parameter is unused in the function body (only appears in the error message on line 2473). Either:

  • Use it to verify the pod belongs to the correct DaemonSet via owner references
  • Or remove it from the signature

Also missing function comment for exported function.

Suggested fix (verify ownership)
+// GetDaemonSetPodOnWorkerNode returns a running, ready pod from the specified DaemonSet
+// that is scheduled on a real worker node.
 func GetDaemonSetPodOnWorkerNode(ctx context.Context, t *testing.T, client klient.Client,
 	daemonsetName string, podNamePattern string) (*v1.Pod, error) {
 	t.Helper()
 
 	var resultPod *v1.Pod
 
 	require.Eventually(t, func() bool {
 		// Get the pod
 		pod, err := GetPodOnWorkerNode(ctx, t, client, NVSentinelNamespace, podNamePattern)
 		if err != nil {
 			t.Logf("Failed to get pod: %v", err)
 			return false
 		}
 
+		// Verify pod belongs to the expected DaemonSet
+		belongsToDaemonSet := false
+		for _, ref := range pod.OwnerReferences {
+			if ref.Kind == "DaemonSet" && ref.Name == daemonsetName {
+				belongsToDaemonSet = true
+				break
+			}
+		}
+		if !belongsToDaemonSet {
+			t.Logf("Pod %s does not belong to daemonset %s", pod.Name, daemonsetName)
+			return false
+		}
+
 		// Verify pod is not being deleted
 		// ... rest of function
data-models/protobufs/health_event.proto (1)

79-79: Add field comment for processingStrategy.

Per coding guidelines, Protocol Buffer messages should include comprehensive comments for all fields. While the enum itself is well-documented, the field on HealthEvent lacks a comment explaining its purpose.

πŸ“ Suggested comment
   BehaviourOverrides drainOverrides = 15;
-  ProcessingStrategy processingStrategy = 16;
+  // Processing strategy defines how downstream modules should handle this event.
+  // STORE_ONLY events are for observability only and should not modify cluster state.
+  ProcessingStrategy processingStrategy = 16;
 }
platform-connectors/pkg/server/platform_connector_server_test.go (1)

22-22: Consider using standard testing package instead of testify.

Based on learnings from this repository, testify should be avoided for simple equality/inequality checks. These assertions are straightforward and could use the standard testing package.

♻️ Proposed refactor using standard testing
 import (
 	"context"
 	"testing"
 
 	pb "github.com/nvidia/nvsentinel/data-models/pkg/protos"
-	"github.com/stretchr/testify/assert"
 )

Then update assertions:

-			assert.NoError(t, err)
-			assert.Equal(t, tt.expectedStrategy, healthEvents.Events[0].ProcessingStrategy)
+			if err != nil {
+				t.Errorf("unexpected error: %v", err)
+			}
+			if healthEvents.Events[0].ProcessingStrategy != tt.expectedStrategy {
+				t.Errorf("ProcessingStrategy = %v, want %v", healthEvents.Events[0].ProcessingStrategy, tt.expectedStrategy)
+			}
tests/fault_quarantine_test.go (1)

327-336: Consider explicitly setting ProcessingStrategy for the teardown healthy event.

The teardown sends a healthy event to clear the quarantine state but doesn't specify a ProcessingStrategy. Based on the AI summary, the platform connector normalizes UNSPECIFIED to EXECUTE_REMEDIATION, so this should work correctly. However, for consistency and explicit intent, consider adding .WithProcessingStrategy(int(protos.ProcessingStrategy_EXECUTE_REMEDIATION)) to ensure the healthy event clears the fault state as intended.

Based on learnings, healthy events can legitimately use EXECUTE_REMEDIATION when the Fault Quarantine Manager needs to act on them to clear previous fault states.

♻️ Suggested improvement
 feature.Teardown(func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
 	event := helpers.NewHealthEvent(testCtx.NodeName).
 		WithErrorCode("79").
 		WithHealthy(true).
 		WithAgent(helpers.SYSLOG_HEALTH_MONITOR_AGENT).
-		WithCheckName("SysLogsXIDError")
+		WithCheckName("SysLogsXIDError").
+		WithProcessingStrategy(int(protos.ProcessingStrategy_EXECUTE_REMEDIATION))
 	helpers.SendHealthEvent(ctx, t, event)

 	return helpers.TeardownQuarantineTest(ctx, t, c)
 })
tests/gpu_health_monitor_test.go (1)

732-733: Use the newly defined constants instead of hardcoded strings.

The teardown uses hardcoded strings for the DaemonSet and container names, but the constants GPUHealthMonitorDaemonSetName and GPUHealthMonitorContainerName were defined at the top of the file specifically for this purpose.

♻️ Proposed fix
-		err = helpers.RemoveDaemonSetArgs(ctx, t, client, "gpu-health-monitor-dcgm-4.x", "gpu-health-monitor", map[string]string{
+		err = helpers.RemoveDaemonSetArgs(ctx, t, client, GPUHealthMonitorDaemonSetName, GPUHealthMonitorContainerName, map[string]string{
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)

1502-1524: Test case appears to duplicate case 2.

This test case ("STORE_ONLY non fatal event should not create Kubernetes event") at lines 1502-1524 is very similar to the test case at lines 1422-1442 ("STORE_ONLY non-fatal event should not create Kubernetes event"). Both test that STORE_ONLY non-fatal events don't create Kubernetes events.

If this is intentional for additional coverage with different error codes, consider making the distinction clearer. Otherwise, consider removing the duplicate.

health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py (1)

125-132: Consider logging the strategy name instead of the numeric value.

Line 132 logs processing_strategy_value which is an integer (e.g., 1 for EXECUTE_REMEDIATION). For better readability in logs, consider logging the original string or the enum name.

♻️ Proposed improvement
-    log.info(f"Event handling strategy configured to: {processing_strategy_value}")
+    log.info(f"Event handling strategy configured to: {processing_strategy}")

Or to show both:

log.info(f"Event handling strategy configured to: {processing_strategy} ({processing_strategy_value})")
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 232c636 and 9c0336d.

⛔ Files ignored due to path filters (1)
  • data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (27)
  • data-models/protobufs/health_event.proto
  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml
  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml
  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml
  • docs/designs/025-processing-strategy-for-health-checks.md
  • fault-quarantine/pkg/initializer/init.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
  • health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
  • health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • platform-connectors/pkg/connectors/kubernetes/process_node_events.go
  • platform-connectors/pkg/server/platform_connector_server.go
  • platform-connectors/pkg/server/platform_connector_server_test.go
  • store-client/pkg/client/mongodb_pipeline_builder.go
  • store-client/pkg/client/pipeline_builder.go
  • store-client/pkg/client/postgresql_pipeline_builder.go
  • store-client/pkg/datastore/providers/postgresql/sql_filter_builder.go
  • tests/event_exporter_test.go
  • tests/fault_quarantine_test.go
  • tests/gpu_health_monitor_test.go
  • tests/helpers/event_exporter.go
  • tests/helpers/fault_quarantine.go
  • tests/helpers/kube.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • tests/helpers/event_exporter.go
  • store-client/pkg/client/postgresql_pipeline_builder.go
  • tests/event_exporter_test.go
  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml
🧰 Additional context used
📓 Path-based instructions (7)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • platform-connectors/pkg/server/platform_connector_server.go
  • fault-quarantine/pkg/initializer/init.go
  • tests/helpers/fault_quarantine.go
  • platform-connectors/pkg/server/platform_connector_server_test.go
  • store-client/pkg/datastore/providers/postgresql/sql_filter_builder.go
  • store-client/pkg/client/pipeline_builder.go
  • tests/gpu_health_monitor_test.go
  • store-client/pkg/client/mongodb_pipeline_builder.go
  • platform-connectors/pkg/connectors/kubernetes/process_node_events.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
  • tests/fault_quarantine_test.go
  • tests/helpers/kube.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • platform-connectors/pkg/server/platform_connector_server_test.go
  • tests/gpu_health_monitor_test.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • tests/fault_quarantine_test.go
**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code

Files:

  • health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
**/values.yaml

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/values.yaml: Document all values in Helm chart values.yaml with inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable

Files:

  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml
data-models/protobufs/**/*.proto

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

data-models/protobufs/**/*.proto: Define Protocol Buffer messages in data-models/protobufs/ directory
Use semantic versioning for breaking changes in Protocol Buffer messages
Include comprehensive comments for all fields in Protocol Buffer messages

Files:

  • data-models/protobufs/health_event.proto
**/daemonset*.yaml

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Explain DaemonSet variant selection logic in Helm chart documentation

Files:

  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml
distros/kubernetes/**/*daemonset*.yaml

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

distros/kubernetes/**/*daemonset*.yaml: Separate DaemonSets should be created for kata vs regular nodes using nodeAffinity based on kata.enabled label
Regular node DaemonSets should use /var/log volume mount for file-based logs
Kata node DaemonSets should use /run/log/journal and /var/log/journal volume mounts for systemd journal

Files:

  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml
🧠 Learnings (12)
📓 Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
📚 Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

Applied to files:

  • platform-connectors/pkg/server/platform_connector_server.go
  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
  • tests/fault_quarantine_test.go
  • data-models/protobufs/health_event.proto
  • docs/designs/025-processing-strategy-for-health-checks.md
  • health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
📚 Learning: 2025-11-10T10:25:19.443Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.

Applied to files:

  • platform-connectors/pkg/server/platform_connector_server.go
  • fault-quarantine/pkg/initializer/init.go
  • docs/designs/025-processing-strategy-for-health-checks.md
📚 Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • tests/helpers/fault_quarantine.go
  • tests/gpu_health_monitor_test.go
  • tests/fault_quarantine_test.go
📚 Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • platform-connectors/pkg/server/platform_connector_server_test.go
  • platform-connectors/pkg/connectors/kubernetes/process_node_events.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
  • data-models/protobufs/health_event.proto
  • docs/designs/025-processing-strategy-for-health-checks.md
  • health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • platform-connectors/pkg/server/platform_connector_server_test.go
  • tests/gpu_health_monitor_test.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • tests/fault_quarantine_test.go
📚 Learning: 2025-12-23T05:02:22.108Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: store-client/pkg/client/postgresql_pipeline_builder.go:119-132
Timestamp: 2025-12-23T05:02:22.108Z
Learning: In the NVSentinel codebase, protobuf fields stored in MongoDB should use lowercase field names (e.g., processingstrategy, componentclass, checkname). Ensure pipeline filters and queries that access protobuf fields in the database consistently use lowercase field names in the store-client package, avoiding camelCase mappings for MongoDB reads/writes.

Applied to files:

  • store-client/pkg/datastore/providers/postgresql/sql_filter_builder.go
  • store-client/pkg/client/pipeline_builder.go
  • store-client/pkg/client/mongodb_pipeline_builder.go
πŸ“š Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.

Applied to files:

  • platform-connectors/pkg/connectors/kubernetes/process_node_events.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • tests/fault_quarantine_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • tests/fault_quarantine_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on kata.enabled label

Applied to files:

  • tests/helpers/kube.go
πŸ“š Learning: 2025-11-04T06:31:02.147Z
Learnt from: Gyan172004
Repo: NVIDIA/NVSentinel PR: 223
File: platform-connectors/pkg/nodemetadata/processor.go:0-0
Timestamp: 2025-11-04T06:31:02.147Z
Learning: In platform-connectors/pkg/nodemetadata/processor.go, the NewProcessor function does not perform a nil check on the config parameter because the caller is expected to guarantee a non-nil config is provided.

Applied to files:

  • docs/designs/025-processing-strategy-for-health-checks.md
🧬 Code graph analysis (10)
platform-connectors/pkg/server/platform_connector_server.go (1)
data-models/pkg/protos/health_event.pb.go (6)
  • ProcessingStrategy (44-44)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (80-82)
  • ProcessingStrategy (89-91)
  • ProcessingStrategy_UNSPECIFIED (47-47)
  • ProcessingStrategy_EXECUTE_REMEDIATION (48-48)
tests/helpers/fault_quarantine.go (1)
tests/helpers/kube.go (1)
  • NVSentinelNamespace (64-64)
platform-connectors/pkg/server/platform_connector_server_test.go (1)
data-models/pkg/protos/health_event.pb.go (10)
  • ProcessingStrategy (44-44)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (80-82)
  • ProcessingStrategy (89-91)
  • ProcessingStrategy_UNSPECIFIED (47-47)
  • ProcessingStrategy_EXECUTE_REMEDIATION (48-48)
  • ProcessingStrategy_STORE_ONLY (49-49)
  • HealthEvent (264-284)
  • HealthEvent (297-297)
  • HealthEvent (312-314)
store-client/pkg/client/pipeline_builder.go (1)
store-client/pkg/client/mongodb_client.go (1)
  • BuildNonFatalUnhealthyInsertsPipeline (296-299)
tests/gpu_health_monitor_test.go (2)
tests/helpers/kube.go (7)
  • UpdateDaemonSetArgs (2276-2316)
  • GetDaemonSetPodOnWorkerNode (2433-2477)
  • NVSentinelNamespace (64-64)
  • ExecInPod (1563-1598)
  • EnsureNodeConditionNotPresent (1797-1818)
  • RemoveDaemonSetArgs (2318-2351)
  • RemoveNodeManagedByNVSentinelLabel (1411-1425)
tests/helpers/metadata.go (1)
  • CreateTestMetadata (59-108)
store-client/pkg/client/mongodb_pipeline_builder.go (2)
store-client/pkg/datastore/types.go (4)
  • ToPipeline (161-163)
  • D (131-133)
  • E (126-128)
  • A (136-138)
data-models/pkg/protos/health_event.pb.go (1)
  • ProcessingStrategy_EXECUTE_REMEDIATION (48-48)
platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)
data-models/pkg/protos/health_event.pb.go (7)
  • HealthEvent (264-284)
  • HealthEvent (297-297)
  • HealthEvent (312-314)
  • ProcessingStrategy (44-44)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (80-82)
  • ProcessingStrategy (89-91)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-18)
data-models/pkg/protos/health_event.pb.go (5)
  • ProcessingStrategy (44-44)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (80-82)
  • ProcessingStrategy (89-91)
  • ProcessingStrategy_EXECUTE_REMEDIATION (48-48)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-18)
data-models/pkg/protos/health_event.pb.go (5)
  • ProcessingStrategy (44-44)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (80-82)
  • ProcessingStrategy (89-91)
  • ProcessingStrategy_EXECUTE_REMEDIATION (48-48)
health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-18)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (44-44)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (80-82)
  • ProcessingStrategy (89-91)
πŸͺ› markdownlint-cli2 (0.18.1)
docs/designs/025-processing-strategy-for-health-checks.md

Hard tabs at column 1 (MD010, no-hard-tabs) reported at lines: 350-351, 353, 355-360, 362, 372-374, 383, 436-449, 457-467, 484-497, 505-515, 594-605
πŸ”‡ Additional comments (46)
tests/helpers/kube.go (4)

2232-2273: LGTM!

The waitForDaemonSetRollout function correctly verifies DaemonSet rollout by checking DesiredNumberScheduled, UpdatedNumberScheduled, and NumberReady status fields. Good use of t.Helper() and comprehensive logging.


2353-2381: LGTM!

The function correctly handles multiple argument styles (--flag=value, --flag, --flag value). The slice manipulation for inserting values is correct.
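The three argument styles can be tricky to update without corrupting the slice. A minimal Python sketch of the same idea (hypothetical helper name, not the repo's Go implementation in tests/helpers/kube.go):

```python
def update_arg(args, flag, value):
    """Update flag in an args list, handling --flag=value, bare --flag,
    and --flag value forms; append '--flag value' if the flag is absent."""
    result = list(args)
    for i, arg in enumerate(result):
        if arg == flag:
            # "--flag value" form: overwrite the value; bare "--flag": insert one
            if i + 1 < len(result) and not result[i + 1].startswith("--"):
                result[i + 1] = value
            else:
                result.insert(i + 1, value)
            return result
        if arg.startswith(flag + "="):
            result[i] = f"{flag}={value}"
            return result
    return result + [flag, value]
```

For example, `update_arg(["--a=1"], "--a", "2")` yields `["--a=2"]`, while an absent flag is appended as two elements.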


2383-2405: LGTM!

The function correctly updates existing arguments or appends new ones. Good use of the helper function tryUpdateExistingArg for deduplication.


2407-2431: LGTM!

The function correctly handles removal of arguments in multiple formats. The use of break after slice modification prevents index corruption issues.

store-client/pkg/datastore/providers/postgresql/sql_filter_builder.go (1)

404-404: LGTM!

The field name mapping is consistent with the existing pattern and correctly maps the lowercase MongoDB bson field name to the camelCase PostgreSQL JSON field name. Based on learnings, this aligns with the requirement to use lowercase field names for protobuf fields stored in MongoDB.

tests/helpers/fault_quarantine.go (1)

141-145: LGTM!

The conditional guard allows tests to skip custom configmap application while still backing up the existing configuration. This enables more flexible test setup for scenarios like TestFaultQuarantineWithProcessingStrategy where the default configuration is sufficient.

data-models/protobufs/health_event.proto (1)

32-40: LGTM!

The enum is well-designed following proto3 conventions with UNSPECIFIED=0 as the default. The comments clearly document the normalization behavior (platform-connector defaults UNSPECIFIED to EXECUTE_REMEDIATION) and the semantic distinction between the strategies.

store-client/pkg/client/mongodb_pipeline_builder.go (2)

87-113: LGTM!

The pipeline correctly uses lowercase field name processingstrategy for MongoDB queries, aligning with the codebase convention. The backward compatibility approach using $or to match both EXECUTE_REMEDIATION and missing fields ensures historical events without the field are still processed correctly.
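The backward-compatibility shape of that `$or` match can be sketched as a plain filter document (field path is illustrative, not necessarily the exact path used in the pipeline builder):

```python
def processable_event_filter():
    """Match events whose processingstrategy is EXECUTE_REMEDIATION or
    absent entirely (documents written before the field existed)."""
    return {
        "$or": [
            {"healthevent.processingstrategy": "EXECUTE_REMEDIATION"},
            {"healthevent.processingstrategy": {"$exists": False}},
        ]
    }
```

Matching on `$exists: False` is what lets historical events keep flowing through remediation without a data migration.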


129-156: LGTM!

This pipeline correctly extends the existing BuildNonFatalUnhealthyInsertsPipeline pattern with processingStrategy filtering. The approach maintains consistency with BuildProcessableHealthEventInsertsPipeline and preserves the agent exclusion filter for health-events-analyzer.

docs/designs/025-processing-strategy-for-health-checks.md (2)

102-104: LGTM!

The design document clearly specifies the enum values and their semantics. The UNSPECIFIED=0 default with normalization to EXECUTE_REMEDIATION is a sound design choice that ensures backward compatibility with custom monitors that don't set the field.


582-583: LGTM!

The backward compatibility explanation is clear and accurately describes why the $or pattern is needed for the health-events-analyzer queries but not for other modules that only process newly inserted events (which will always have the field set by platform-connector normalization).

platform-connectors/pkg/server/platform_connector_server.go (1)

57-62: LGTM! Sensible default for backward compatibility.

The normalization of UNSPECIFIED to EXECUTE_REMEDIATION ensures backward compatibility with custom monitors that don't explicitly set processingStrategy. The mutation occurs before pipeline processing and ring buffer enqueue, ensuring all downstream consumers see the normalized value.
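The normalization step is small enough to sketch. A hedged Python equivalent of the Go logic (enum values mirror the proto; the dict-based event is a stand-in for the protobuf message):

```python
UNSPECIFIED, EXECUTE_REMEDIATION, STORE_ONLY = 0, 1, 2  # mirrors ProcessingStrategy

def normalize_strategy(event):
    """Default UNSPECIFIED to EXECUTE_REMEDIATION so legacy monitors that
    never set the field keep today's remediation behavior."""
    if event.get("processing_strategy", UNSPECIFIED) == UNSPECIFIED:
        event["processing_strategy"] = EXECUTE_REMEDIATION
    return event
```

Because proto3 scalars default to 0, any monitor that omits the field is indistinguishable from one that sent UNSPECIFIED, which is exactly why the server normalizes before enqueueing.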

health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (1)

362-364: Appropriate default with documented follow-up.

Hardcoding EXECUTE_REMEDIATION is correct for CSP maintenance events since they need to trigger quarantine/recovery workflows. The TODO with a specific PR reference (#641) properly tracks the planned configurability enhancement.

platform-connectors/pkg/server/platform_connector_server_test.go (1)

25-67: Well-structured table-driven test with good coverage.

The test correctly validates all three ProcessingStrategy enum values and properly verifies the in-place mutation behavior. The table-driven approach aligns with coding guidelines.

distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml (1)

61-62: No action required. The processingStrategy value is properly defined in values.yaml with a sensible default (EXECUTE_REMEDIATION), with documentation explaining the available modes. The implementation is correct.

fault-quarantine/pkg/initializer/init.go (1)

66-66: Pipeline change correctly filters health events by processing strategy. The switch to BuildProcessableHealthEventInsertsPipeline() is intentional and well-designed. The function filters to only process health event inserts with EXECUTE_REMEDIATION strategy or missing strategy field (for backward compatibility with pre-upgrade events), excluding STORE_ONLY events. Both MongoDB and PostgreSQL implementations include proper backward compatibility handling for events created before the upgrade or from custom monitors.

tests/fault_quarantine_test.go (3)

234-244: LGTM! Well-structured test setup for ProcessingStrategy validation.

The test function follows the established e2e-framework pattern and correctly initializes the test context with an empty config file path, which appears intentional for testing default behavior.


246-282: LGTM! Comprehensive STORE_ONLY validation.

The test correctly verifies that events with STORE_ONLY processing strategy do not trigger node conditions, node events, or quarantine state changes. Good coverage of both fatal (SysLogsXIDError) and non-fatal (GpuPowerWatch) event types.


284-325: LGTM! Proper EXECUTE_REMEDIATION behavior validation.

The test correctly verifies that events with EXECUTE_REMEDIATION processing strategy trigger appropriate node conditions, node events, and quarantine state. The assertions cover the expected cluster state modifications.

health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go (3)

243-246: LGTM! Correct ProcessingStrategy expectation for quarantine events.

The test correctly expects ProcessingStrategy_EXECUTE_REMEDIATION for maintenance events mapped to health events. This aligns with the intended behavior where CSP maintenance events should trigger remediation actions.


272-274: LGTM! Correct ProcessingStrategy for healthy events.

The healthy event expectation correctly includes ProcessingStrategy_EXECUTE_REMEDIATION. Based on learnings, healthy events can legitimately use EXECUTE_REMEDIATION when the Fault Quarantine Manager needs to act on them to clear previous fault states.


336-338: LGTM! Consistent ProcessingStrategy for unknown action fallback.

The test correctly expects ProcessingStrategy_EXECUTE_REMEDIATION even when the RecommendedAction is unknown. This maintains consistent behavior across all maintenance event mappings.

health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py (6)

99-108: LGTM! Correct initialization with processing_strategy parameter.

The test correctly initializes PlatformConnectorEventProcessor with STORE_ONLY processing strategy.


301-302: LGTM! Proper verification of processingStrategy propagation.

The assertion correctly verifies that the processingStrategy is propagated through to the emitted health events.


421-421: LGTM! Consistent processingStrategy verification for multiple GPUs.

Both GPU events are correctly verified to have the expected STORE_ONLY processing strategy.

Also applies to: 436-436


493-493: LGTM! processingStrategy verified for DCGM connectivity failure events.

The assertion correctly verifies that system-level connectivity failure events also carry the configured processing strategy.


523-524: LGTM! Correct use of EXECUTE_REMEDIATION for connectivity restored events.

This test correctly uses EXECUTE_REMEDIATION for the connectivity restored scenario. Based on learnings, healthy events can legitimately use EXECUTE_REMEDIATION when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources.

Also applies to: 549-549


595-604: LGTM! Consistent processing_strategy for retry and cache cleanup test.

The test correctly initializes the processor with STORE_ONLY and focuses on retry/cache behavior rather than processing strategy propagation, which is appropriately tested elsewhere.

distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml (1)

56-62: LGTM! Well-documented configuration for processing strategy.

The processingStrategy option is well-documented with valid values and clear semantics. The default to EXECUTE_REMEDIATION maintains backward compatibility. Both daemonset templates (DCGM 3.x and 4.x) correctly consume this value via the --processing-strategy command-line flag. The inline comments follow the coding guidelines for Helm chart documentation.

store-client/pkg/client/pipeline_builder.go (1)

35-46: LGTM! Well-documented interface extensions for processable event pipelines.

The new interface methods are clearly documented with their purpose and intended consumers. Both implementations are present in the MongoDB and PostgreSQL builders, and they're correctly used via the GetPipelineBuilder() interface pattern. The naming convention follows the established pattern consistently.

tests/gpu_health_monitor_test.go (3)

41-48: LGTM! New constants and context keys for STORE_ONLY test.

The new constants and context keys are well-defined and follow the existing patterns in the file.


665-706: LGTM! Test setup correctly configures STORE_ONLY strategy and injects error.

The setup properly updates the DaemonSet with the --processing-strategy STORE_ONLY flag, waits for rollout, and injects a test error to verify that the cluster state remains unaffected.


708-724: LGTM! The Assess step correctly validates STORE_ONLY behavior.

The assertions properly verify that STORE_ONLY events don't create node conditions or cordon the node.

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)

1-51: Auto-generated protobuf code - no review needed.

This file is auto-generated by the protocol buffer compiler as indicated by the header comment. The changes correctly reflect the addition of the ProcessingStrategy enum and field to the HealthEvent message, with properly adjusted serialized offsets.

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)

1550-1630: LGTM! Well-structured table-driven test for ProcessingStrategy behavior.

The test properly validates that STORE_ONLY events don't modify cluster state while EXECUTE_REMEDIATION events do. Each test case uses isolated resources, ensuring test independence.

platform-connectors/pkg/connectors/kubernetes/process_node_events.go (3)

327-345: LGTM! Clean implementation of event filtering.

The filterProcessableEvents function correctly filters out STORE_ONLY events and provides appropriate logging for skipped events. This aligns with the PR's goal of allowing events to be stored without triggering remediation.
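The filtering contract can be sketched in a few lines (a Python analogue, not the repo's Go function; the dict events stand in for HealthEvent messages):

```python
STORE_ONLY = 2  # mirrors ProcessingStrategy_STORE_ONLY

def filter_processable(events):
    """Drop STORE_ONLY events before any cluster mutation; they are
    persisted upstream but must not cordon nodes or set conditions."""
    processable = []
    for ev in events:
        if ev.get("processing_strategy") == STORE_ONLY:
            continue  # stored for observability only; log-and-skip
        processable.append(ev)
    return processable
```

Downstream, an empty result short-circuits node-condition updates and event creation, which is the behavior the processHealthEvents comment below confirms.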


347-372: LGTM! Well-encapsulated Kubernetes event creation.

The createK8sEvent function properly encapsulates the event creation logic. The Event.Type being set to healthEvent.CheckName is confirmed as an intentional design choice for NVSentinel, based on learnings.


374-418: LGTM! Correct integration of filtering and event creation.

The processHealthEvents function properly uses filterProcessableEvents to ensure only actionable events modify cluster state. The logic correctly handles empty processableEvents by skipping node condition updates and event creation.

health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py (2)

24-24: LGTM! Import alias for ProcessingStrategy usage.

The import alias platformconnector_pb2 provides clear access to the ProcessingStrategy enum used throughout the CLI.


37-50: LGTM! Proper type hint for the new parameter.

The processing_strategy parameter has appropriate type annotation using platformconnector_pb2.ProcessingStrategy, maintaining consistency with other parameters in the function.

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (2)

14-18: LGTM - ProcessingStrategy enum properly defined.

The new enum follows standard protobuf stub patterns with the three expected values (UNSPECIFIED, EXECUTE_REMEDIATION, STORE_ONLY) and corresponding module-level constants.

Also applies to: 32-34


63-141: LGTM - HealthEvent correctly extended with processingStrategy field.

The field is properly added to __slots__, field numbers, type annotations, and __init__ signature. The _Optional[_Union[ProcessingStrategy, str]] type correctly allows both enum values and string representations per protobuf conventions.

health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py (4)

106-121: LGTM - processingStrategy correctly propagated to connectivity restoration event.

The strategy is properly included in the HealthEvent for clearing DCGM connectivity failures. Based on learnings, this correctly supports healthy events that may use EXECUTE_REMEDIATION to clear previous fault states.


206-223: LGTM - processingStrategy correctly propagated to failure health events.

The strategy is properly included when creating HealthEvent instances for GPU failures with entity impacts.


270-287: LGTM - processingStrategy correctly propagated to healthy state events.

The strategy is properly included when creating HealthEvent instances for healthy GPU states.


366-381: LGTM - processingStrategy correctly propagated to DCGM connectivity failure events.

All four HealthEvent creation points in this file (clear_dcgm_connectivity_failure, health_event_occurred failure path, health_event_occurred healthy path, and dcgm_connectivity_failed) are consistently updated to include the processing strategy field.

@github-actions

πŸ›‘οΈ CodeQL Analysis

🚨 Found 1 security alert(s)

πŸ”— View details

@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-in-gpu-monitor branch 2 times, most recently from 9c0336d to b616378 on January 12, 2026 at 14:48
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

πŸ€– Fix all issues with AI agents
In @tests/gpu_health_monitor_test.go:
- Around line 726-734: In the feature.Teardown closure, add a nil/type-safe
check when retrieving nodeName from context (replace
ctx.Value(keyNodeName).(string) with a guarded retrieval that handles nil or
wrong-type and fails the test with require.NotNil/require.IsType or similar) to
avoid a panic if setup failed; also replace the hardcoded
"gpu-health-monitor-dcgm-4.x" and "gpu-health-monitor" arguments passed to
helpers.RemoveDaemonSetArgs with the defined constants
GPUHealthMonitorDaemonSetName and GPUHealthMonitorContainerName so the teardown
uses the canonical names.
- Line 47: The constant keyOriginalDaemonSet is declared but never used; either
delete the keyOriginalDaemonSet declaration to remove dead code, or if it's
intentionally reserved for future tests, keep it and add a TODO comment on the
declaration (e.g., "// TODO: keep for future original DaemonSet comparison") so
its presence is documented.
🧹 Nitpick comments (8)
health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py (1)

99-108: Consider using keyword arguments for the processing_strategy parameter.

For maintainability and clarity, using keyword arguments (as done in later tests like test_dcgm_connectivity_failed) is preferable to positional arguments, especially with 8 parameters.

♻️ Suggested improvement
         platform_connector_test = platform_connector.PlatformConnectorEventProcessor(
-            socket_path,
-            node_name,
-            exit,
-            dcgm_errors_info_dict,
-            "statefile",
-            dcgm_health_conditions_categorization_mapping_config,
-            "/tmp/test_metadata.json",
-            platformconnector_pb2.STORE_ONLY,
+            socket_path=socket_path,
+            node_name=node_name,
+            exit=exit,
+            dcgm_errors_info_dict=dcgm_errors_info_dict,
+            state_file_path="statefile",
+            dcgm_health_conditions_categorization_mapping_config=dcgm_health_conditions_categorization_mapping_config,
+            metadata_path="/tmp/test_metadata.json",
+            processing_strategy=platformconnector_pb2.STORE_ONLY,
         )
health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py (1)

125-132: Consider improving log readability and filtering UNSPECIFIED.

Two observations:

  1. Log message readability: Line 132 logs the integer enum value. Consider logging the strategy name for better operator experience.

  2. UNSPECIFIED in valid options: ProcessingStrategy.keys() includes UNSPECIFIED, which users could technically pass. Consider whether this should be excluded from valid options.

♻️ Suggested improvement
     try:
         processing_strategy_value = platformconnector_pb2.ProcessingStrategy.Value(processing_strategy)
+        if processing_strategy_value == platformconnector_pb2.UNSPECIFIED:
+            log.fatal("UNSPECIFIED is not a valid processing_strategy. Use EXECUTE_REMEDIATION or STORE_ONLY.")
+            sys.exit(1)
     except ValueError:
-        valid_strategies = list(platformconnector_pb2.ProcessingStrategy.keys())
+        valid_strategies = [k for k in platformconnector_pb2.ProcessingStrategy.keys() if k != "UNSPECIFIED"]
         log.fatal(f"Invalid processing_strategy '{processing_strategy}'. " f"Valid options are: {valid_strategies}")
         sys.exit(1)

-    log.info(f"Event handling strategy configured to: {processing_strategy_value}")
+    log.info(f"Event handling strategy configured to: {processing_strategy}")
tests/helpers/kube.go (4)

2281-2293: Unused variable originalDaemonSet.

The originalDaemonSet variable is assigned but never used. This appears to be dead code, possibly leftover from a previous implementation that planned to restore the original state.

♻️ Suggested fix
-	var originalDaemonSet *appsv1.DaemonSet
-
 	t.Logf("Updating daemonset %s/%s with args %v", NVSentinelNamespace, daemonsetName, args)

 	err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
 		daemonSet := &appsv1.DaemonSet{}
 		if err := client.Resources().Get(ctx, daemonsetName, NVSentinelNamespace, daemonSet); err != nil {
 			return err
 		}

-		if originalDaemonSet == nil {
-			originalDaemonSet = daemonSet.DeepCopy()
-		}
-
 		containers := daemonSet.Spec.Template.Spec.Containers

2305-2315: Consider removing unnecessary sleep and wrapping error with context.

  1. The 10-second sleep after waitForDaemonSetRollout may be unnecessary since the rollout wait already ensures pods are ready. If additional stabilization time is needed, consider documenting why or using a named constant.

  2. Per coding guidelines, errors should be wrapped with context using fmt.Errorf.

♻️ Suggested fix
 	if err != nil {
-		return err
+		return fmt.Errorf("failed to update daemonset %s/%s args: %w", NVSentinelNamespace, daemonsetName, err)
 	}

 	t.Logf("Waiting for daemonset %s/%s rollout to complete", NVSentinelNamespace, daemonsetName)
 	waitForDaemonSetRollout(ctx, t, client, daemonsetName)

-	t.Logf("Waiting 10 seconds for daemonset pods to start")
-	time.Sleep(10 * time.Second)
-
 	return nil

2343-2350: Inconsistent error handling pattern.

The function uses require.NoError, which fails the test immediately on error, but then returns nil unconditionally. This is inconsistent with UpdateDaemonSetArgs which returns errors to the caller. Consider using consistent error handling:

♻️ Option 1: Return error like UpdateDaemonSetArgs
 	})
-	require.NoError(t, err, "failed to remove args from daemonset %s/%s", NVSentinelNamespace, daemonsetName)
+	if err != nil {
+		return fmt.Errorf("failed to remove args from daemonset %s/%s: %w", NVSentinelNamespace, daemonsetName, err)
+	}

 	t.Logf("Waiting for daemonset %s/%s rollout to complete after restoration", NVSentinelNamespace, daemonsetName)

2433-2477: Parameter daemonsetName is not used for filtering.

The daemonsetName parameter is only used in the error message but not for actual pod filtering. The function finds pods by podNamePattern alone via GetPodOnWorkerNode. If the intent is to verify the pod belongs to the specific DaemonSet, consider adding owner reference validation. Otherwise, consider removing the parameter or documenting that it's for logging only.

♻️ Option: Remove unused parameter or add owner verification

If the parameter is only for logging:

-func GetDaemonSetPodOnWorkerNode(ctx context.Context, t *testing.T, client klient.Client,
-	daemonsetName string, podNamePattern string) (*v1.Pod, error) {
+func GetDaemonSetPodOnWorkerNode(ctx context.Context, t *testing.T, client klient.Client,
+	podNamePattern string) (*v1.Pod, error) {

Or add validation that pod is owned by the DaemonSet:

// After getting the pod, verify ownership
for _, ownerRef := range pod.OwnerReferences {
    if ownerRef.Kind == "DaemonSet" && ownerRef.Name == daemonsetName {
        // Pod belongs to expected DaemonSet
        break
    }
}
tests/gpu_health_monitor_test.go (2)

669-673: Use constants consistently for DaemonSet and container names.

The constants GPUHealthMonitorDaemonSetName and GPUHealthMonitorContainerName are defined but not consistently used. Line 669 correctly uses the constants, but line 673 duplicates the DaemonSet name string.

♻️ Suggested fix
-		gpuHealthMonitorPod, err := helpers.GetDaemonSetPodOnWorkerNode(ctx, t, client, GPUHealthMonitorDaemonSetName, "gpu-health-monitor-dcgm-4.x")
+		gpuHealthMonitorPod, err := helpers.GetDaemonSetPodOnWorkerNode(ctx, t, client, GPUHealthMonitorDaemonSetName, GPUHealthMonitorDaemonSetName)

691-694: Remove redundant context retrieval in setup.

The variables nodeName and podName are retrieved from context immediately after being stored, while testNodeName and gpuHealthMonitorPodName are still in scope. This is unnecessary indirection.

♻️ Suggested simplification
 		ctx = context.WithValue(ctx, keyNodeName, testNodeName)
 		ctx = context.WithValue(ctx, keyGpuHealthMonitorPodName, gpuHealthMonitorPodName)
 
 		restConfig := client.RESTConfig()
 
-		nodeName := ctx.Value(keyNodeName).(string)
-		podName := ctx.Value(keyGpuHealthMonitorPodName).(string)
-
-		t.Logf("Injecting Inforom error on node %s", nodeName)
+		t.Logf("Injecting Inforom error on node %s", testNodeName)
 		cmd := []string{"/bin/sh", "-c",
 			fmt.Sprintf("dcgmi test --host %s:%s --inject --gpuid 0 -f 84 -v 0",
 				dcgmServiceHost, dcgmServicePort)}
 
-		stdout, stderr, execErr := helpers.ExecInPod(ctx, restConfig, helpers.NVSentinelNamespace, podName, "", cmd)
+		stdout, stderr, execErr := helpers.ExecInPod(ctx, restConfig, helpers.NVSentinelNamespace, gpuHealthMonitorPodName, "", cmd)
πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 9c0336d and b616378.

β›” Files ignored due to path filters (1)
  • data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go
πŸ“’ Files selected for processing (10)
  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml
  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml
  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml
  • health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py
  • tests/event_exporter_test.go
  • tests/gpu_health_monitor_test.go
  • tests/helpers/event_exporter.go
  • tests/helpers/kube.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml
  • tests/event_exporter_test.go
  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml
  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml
🧰 Additional context used
πŸ““ Path-based instructions (3)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • tests/helpers/event_exporter.go
  • tests/gpu_health_monitor_test.go
  • tests/helpers/kube.go
**/*.py

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code

Files:

  • health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py
**/*_test.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • tests/gpu_health_monitor_test.go
🧠 Learnings (4)
πŸ““ Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
πŸ“š Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • tests/helpers/event_exporter.go
  • health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
πŸ“š Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • tests/gpu_health_monitor_test.go
πŸ“š Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

Applied to files:

  • health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
🧬 Code graph analysis (1)
health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py (2)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (45-45)
  • ProcessingStrategy (77-79)
  • ProcessingStrategy (81-83)
  • ProcessingStrategy (90-92)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-18)
πŸ”‡ Additional comments (14)
tests/helpers/event_exporter.go (2)

220-226: LGTM - Function signature updated correctly for processing strategy validation.

The new expectedProcessingStrategy parameter follows the established pattern. Consider updating the function's godoc comment to document all parameters, as per coding guidelines requiring function comments for exported functions.


245-245: Assertion correctly validates the processingStrategy field.

The new assertion follows the existing pattern and validates that the healthEvent contains the expected processing strategy. This aligns with the PR's objective of adding event handling strategy changes.

health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py (1)

301-302: Test coverage for processingStrategy propagation looks good.

The assertions correctly verify that:

  • Failure events propagate the configured STORE_ONLY strategy
  • Connectivity restored events propagate EXECUTE_REMEDIATION strategy (valid per the learning that healthy events can use EXECUTE_REMEDIATION when the Fault Quarantine Manager needs to clear previous fault states)

Also applies to: 421-421, 436-436, 493-493, 549-549

health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py (2)

41-66: Constructor changes are well-structured.

The new processing_strategy parameter:

  • Has proper type hint (platformconnector_pb2.ProcessingStrategy) per coding guidelines
  • Uses underscore prefix convention for internal state
  • Is stored for consistent propagation to all emitted events

106-121: Consistent propagation of processingStrategy across all event construction sites.

The processingStrategy field is correctly added to all four HealthEvent construction paths:

  1. clear_dcgm_connectivity_failure (Line 120)
  2. health_event_occurred - failure branch (Line 221)
  3. health_event_occurred - healthy branch (Line 285)
  4. dcgm_connectivity_failed (Line 380)

Also applies to: 206-222, 270-286, 366-381

health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py (2)

74-80: CLI option addition looks correct.

The --processing-strategy option is well-defined with:

  • Sensible default (EXECUTE_REMEDIATION)
  • Clear help text describing the two main options

28-51: Function signature update is correct.

The _init_event_processor function:

  • Has proper type hint for the new parameter (platformconnector_pb2.ProcessingStrategy) per coding guidelines
  • Correctly propagates the value to PlatformConnectorEventProcessor
tests/helpers/kube.go (4)

2232-2273: LGTM!

The waitForDaemonSetRollout function correctly checks all necessary DaemonSet status conditions (DesiredNumberScheduled, UpdatedNumberScheduled, NumberReady) and follows the same pattern as the existing WaitForDeploymentRollout function.


2353-2381: LGTM!

The tryUpdateExistingArg helper correctly handles both --flag=value and --flag value argument styles, including the case where a value needs to be inserted after an existing flag.
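A minimal sketch of that update logic (the real helper is Go in tests/helpers/kube.go; update_arg here is a hypothetical Python stand-in) covering the --flag=value, --flag value, and bare --flag styles:

```python
def update_arg(args, flag, value):
    """Set flag to value in an arg list; returns (new_args, found)."""
    out = list(args)
    for i, a in enumerate(out):
        if a.startswith(flag + "="):               # --flag=value style
            out[i] = f"{flag}={value}"
            return out, True
        if a == flag:                              # --flag value or bare --flag
            if i + 1 < len(out) and not out[i + 1].startswith("--"):
                out[i + 1] = value                 # replace the existing value
            else:
                out.insert(i + 1, value)           # insert value after bare flag
            return out, True
    return out, False

print(update_arg(["--a=1", "--b"], "--a", "2"))  # (['--a=2', '--b'], True)
```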


2383-2405: LGTM!

The function correctly sets container arguments, utilizing tryUpdateExistingArg for updates and appending new arguments when not found.


2407-2431: LGTM!

The function correctly removes container arguments, handling both --flag=value and --flag value styles. The early break after removal ensures safe slice modification.

tests/gpu_health_monitor_test.go (3)

708-724: Assess phase looks correct for STORE_ONLY verification.

The test correctly validates that in STORE_ONLY mode:

  1. Node conditions are not applied
  2. Node is not cordoned

This aligns with the expected behavior where events are stored but no cluster modifications occur.


745-749: Potential issue: Pod may have been replaced after DaemonSet args restoration.

After RemoveDaemonSetArgs restores the DaemonSet configuration at line 732, the pod referenced by podName may have been replaced by a new pod. Executing cleanup commands on this pod could fail if the pod no longer exists or is in a terminating state.

Consider either:

  1. Re-fetching the current pod after the DaemonSet update
  2. Adding error handling for the exec command (currently errors are silently ignored with _, _, _ =)

The current silent error handling (_, _, _ =) does provide resilience, but the cleanup may not actually execute. Verify whether this is acceptable for test reliability.


657-664: Test structure follows established patterns.

The test properly uses the features.New pattern with appropriate labels consistent with other tests in this file. The setup/assess/teardown structure is well-organized.

@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-in-gpu-monitor branch from b616378 to 5ce6685 Compare January 12, 2026 14:56

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

πŸ€– Fix all issues with AI agents
In @tests/gpu_health_monitor_test.go:
- Around line 732-734: Replace the hardcoded DaemonSet and container name
strings in the RemoveDaemonSetArgs call with the existing constants
GPUHealthMonitorDaemonSetName and GPUHealthMonitorContainerName, and align the
arg map value to match the setup phase by using "STORE_ONLY" (or an empty
string) for the "--processing-strategy" value so it’s semantically consistent;
update the call to RemoveDaemonSetArgs(ctx, t, client,
GPUHealthMonitorDaemonSetName, GPUHealthMonitorContainerName,
map[string]string{"--processing-strategy": "STORE_ONLY"}) (or use "" instead of
"STORE_ONLY") to make intent clear.

In @tests/helpers/kube.go:
- Around line 2275-2316: In UpdateDaemonSetArgs remove the unused
originalDaemonSet variable and its DeepCopy assignment, add explicit validation
that the target container was found/updated (e.g., track a bool like
foundContainer when iterating containers and return an error if false) so the
function fails instead of silently succeeding, and eliminate the hardcoded
time.Sleep(10 * time.Second) by relying on waitForDaemonSetRollout or making the
post-rollout wait configurable (e.g., add an optional delay parameter or use
context-based waiting) so no arbitrary 10s sleep remains.
🧹 Nitpick comments (4)
tests/helpers/kube.go (2)

2318-2351: Inconsistent error handling pattern.

The function signature returns error, but line 2343 uses require.NoError, which fails the test immediately (via t.FailNow) instead of returning the error. This is inconsistent with UpdateDaemonSetArgs, which returns the error to the caller.

Either:

  1. Remove the require.NoError and return the error (consistent with UpdateDaemonSetArgs), or
  2. Change the return type to match the behavior (return nothing, always fail the test on error)

Also, consider adding container validation similar to the suggestion for UpdateDaemonSetArgs.

Option 1: Return error consistently
 	err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
 		// ...
 	})
-	require.NoError(t, err, "failed to remove args from daemonset %s/%s", NVSentinelNamespace, daemonsetName)
+	if err != nil {
+		return fmt.Errorf("failed to remove args from daemonset %s/%s: %w", NVSentinelNamespace, daemonsetName, err)
+	}

2433-2477: Parameter daemonsetName is not used for filtering.

The daemonsetName parameter is only used in the error message (line 2473) but doesn't actually verify that the found pod belongs to the specified DaemonSet. The filtering relies solely on podNamePattern.

This could lead to incorrect assumptions by callers. Consider either:

  1. Using daemonsetName to verify ownership via labels (e.g., checking ownerReferences), or
  2. Removing the parameter if podNamePattern is sufficient, or
  3. Adding a comment documenting that podNamePattern is responsible for filtering
Option: Add ownership verification
 func GetDaemonSetPodOnWorkerNode(ctx context.Context, t *testing.T, client klient.Client,
 	daemonsetName string, podNamePattern string) (*v1.Pod, error) {
 	t.Helper()
 
 	var resultPod *v1.Pod
 
 	require.Eventually(t, func() bool {
 		// Get the pod
 		pod, err := GetPodOnWorkerNode(ctx, t, client, NVSentinelNamespace, podNamePattern)
 		if err != nil {
 			t.Logf("Failed to get pod: %v", err)
 			return false
 		}
 
+		// Verify pod belongs to the expected DaemonSet
+		belongsToDaemonSet := false
+		for _, ownerRef := range pod.OwnerReferences {
+			if ownerRef.Kind == "DaemonSet" && strings.HasPrefix(ownerRef.Name, daemonsetName) {
+				belongsToDaemonSet = true
+				break
+			}
+		}
+		if !belongsToDaemonSet {
+			t.Logf("Pod %s does not belong to daemonset %s", pod.Name, daemonsetName)
+			return false
+		}
+
 		// Verify pod is not being deleted
tests/gpu_health_monitor_test.go (2)

46-48: Unused context key keyOriginalDaemonSet.

The context key keyOriginalDaemonSet is declared but never used anywhere in the test. Either remove it or implement the intended functionality to store/restore the original DaemonSet state.

🧹 Suggested fix
 const (
 	keyGpuHealthMonitorPodName contextKey = "gpuHealthMonitorPodName"
-	keyOriginalDaemonSet       contextKey = "originalDaemonSet"
 )

691-694: Redundant context value extraction.

Variables testNodeName and gpuHealthMonitorPodName are already in scope from lines 677-678. Extracting them again from context is unnecessary.

♻️ Suggested fix
 		ctx = context.WithValue(ctx, keyNodeName, testNodeName)
 		ctx = context.WithValue(ctx, keyGpuHealthMonitorPodName, gpuHealthMonitorPodName)

 		restConfig := client.RESTConfig()

-		nodeName := ctx.Value(keyNodeName).(string)
-		podName := ctx.Value(keyGpuHealthMonitorPodName).(string)
-
-		t.Logf("Injecting Inforom error on node %s", nodeName)
+		t.Logf("Injecting Inforom error on node %s", testNodeName)
 		cmd := []string{"/bin/sh", "-c",
 			fmt.Sprintf("dcgmi test --host %s:%s --inject --gpuid 0 -f 84 -v 0",
 				dcgmServiceHost, dcgmServicePort)}

-		stdout, stderr, execErr := helpers.ExecInPod(ctx, restConfig, helpers.NVSentinelNamespace, podName, "", cmd)
+		stdout, stderr, execErr := helpers.ExecInPod(ctx, restConfig, helpers.NVSentinelNamespace, gpuHealthMonitorPodName, "", cmd)
πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between b616378 and 5ce6685.

πŸ“’ Files selected for processing (3)
  • tests/gpu_health_monitor_test.go
  • tests/helpers/event_exporter.go
  • tests/helpers/kube.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/helpers/event_exporter.go
🧰 Additional context used
πŸ““ Path-based instructions (2)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • tests/gpu_health_monitor_test.go
  • tests/helpers/kube.go
**/*_test.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • tests/gpu_health_monitor_test.go
🧠 Learnings (3)
πŸ““ Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
πŸ“š Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • tests/gpu_health_monitor_test.go
πŸ“š Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • tests/gpu_health_monitor_test.go
🧬 Code graph analysis (1)
tests/gpu_health_monitor_test.go (1)
tests/helpers/kube.go (7)
  • UpdateDaemonSetArgs (2276-2316)
  • GetDaemonSetPodOnWorkerNode (2433-2477)
  • NVSentinelNamespace (64-64)
  • SetNodeManagedByNVSentinel (1389-1408)
  • ExecInPod (1563-1598)
  • RemoveDaemonSetArgs (2318-2351)
  • RemoveNodeManagedByNVSentinelLabel (1411-1425)
πŸ”‡ Additional comments (6)
tests/helpers/kube.go (4)

2232-2273: LGTM with minor observation.

The function correctly validates DaemonSet rollout by checking DesiredNumberScheduled, UpdatedNumberScheduled, and NumberReady. The logging provides good visibility into rollout progress.

Consider adding an ObservedGeneration check (similar to WaitForDeploymentRollout at line 1020) to ensure the DaemonSet controller has processed the latest spec changes, though this is optional for test helpers.


2353-2381: LGTM!

The function correctly handles multiple argument styles (--flag=value, --flag value, and boolean --flag). The slice insertion logic at line 2373 correctly preserves existing arguments while inserting the new value.


2383-2405: LGTM!

The function follows the same pattern as setEnvVarsOnContainer and correctly handles both updating existing arguments and appending new ones.


2407-2431: LGTM!

The function correctly handles removal of arguments in both --flag=value and --flag value formats. The immediate break after slice modification avoids index issues during iteration.

tests/gpu_health_monitor_test.go (2)

657-765: Test structure and coverage for store-only mode looks good.

The test correctly validates that in STORE_ONLY mode:

  • Node conditions are not applied (EnsureNodeConditionNotPresent)
  • Node is not cordoned (AssertQuarantineState)

The setup/assess/teardown pattern follows the established conventions in the file.


34-43: LGTM on new constants.

The exported constants GPUHealthMonitorContainerName and GPUHealthMonitorDaemonSetName are well-named and useful for test consistency.

@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-in-gpu-monitor branch from 5ce6685 to a3f7bef Compare January 14, 2026 10:52
Signed-off-by: Tanisha goyal <[email protected]>
@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-in-gpu-monitor branch from 9ff31fd to 07739d8 Compare January 14, 2026 10:59

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

πŸ€– Fix all issues with AI agents
In `@tests/gpu_health_monitor_test.go`:
- Around line 728-744: The teardown handler inside feature.Teardown reads
ctx.Value(keyNodeName) and ctx.Value(keyGpuHealthMonitorOriginalArgs) and casts
them directly, which can panic if those keys are unset; add nil checks like the
existing podNameVal check: retrieve nodeNameVal and originalArgsVal, if either
is nil log a message (e.g., "Skipping teardown: nodeName/originalArgs not set")
and return ctx, otherwise cast to string and []string respectively before
calling helpers.RestoreDaemonSetArgs and proceeding; ensure you reference
keyNodeName and keyGpuHealthMonitorOriginalArgs and keep behavior consistent
with the podNameVal guard.
🧹 Nitpick comments (3)
tests/gpu_health_monitor_test.go (1)

45-49: Remove unused constant keyOriginalDaemonSet.

The constant is declared at line 47 but is never used anywhere in the codebase. The test uses keyGpuHealthMonitorOriginalArgs instead. Remove this dead code or add a TODO comment if reserved for future use.

health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py (2)

74-80: Consider using click.Choice for built-in validation and auto-completion.

The current approach with string type and manual validation works, but click.Choice would provide better UX with automatic validation and shell auto-completion support.

♻️ Optional enhancement
 @click.option(
     "--processing-strategy",
-    type=str,
+    type=click.Choice(["EXECUTE_REMEDIATION", "STORE_ONLY"], case_sensitive=True),
     default="EXECUTE_REMEDIATION",
     help="Event processing strategy: EXECUTE_REMEDIATION or STORE_ONLY",
     required=False,
 )

Note: This would require adjusting the validation logic at lines 125-130 since Click would handle invalid input.


125-132: Log message could be more user-friendly.

Line 132 logs the enum integer value (e.g., 1) rather than the human-readable name. Consider logging the original string for clarity.

♻️ Proposed improvement
-    log.info(f"Event handling strategy configured to: {processing_strategy_value}")
+    log.info(f"Event handling strategy configured to: {processing_strategy}")

This will log "Event handling strategy configured to: EXECUTE_REMEDIATION" instead of "Event handling strategy configured to: 1".

πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 5ce6685 and 07739d8.

πŸ“’ Files selected for processing (7)
  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml
  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml
  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml
  • health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py
  • tests/gpu_health_monitor_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml
🧰 Additional context used
πŸ““ Path-based instructions (6)
**/*.py

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code

Files:

  • health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py
**/values.yaml

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/values.yaml: Document all values in Helm chart values.yaml with inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable

Files:

  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • tests/gpu_health_monitor_test.go
**/*_test.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • tests/gpu_health_monitor_test.go
**/daemonset*.yaml

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

Explain DaemonSet variant selection logic in Helm chart documentation

Files:

  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml
distros/kubernetes/**/*daemonset*.yaml

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

distros/kubernetes/**/*daemonset*.yaml: Separate DaemonSets should be created for kata vs regular nodes using nodeAffinity based on kata.enabled label
Regular node DaemonSets should use /var/log volume mount for file-based logs
Kata node DaemonSets should use /run/log/journal and /var/log/journal volume mounts for systemd journal

Files:

  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml
🧠 Learnings (8)
πŸ““ Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: health-monitors/syslog-health-monitor/main.go:164-172
Timestamp: 2026-01-14T06:30:15.804Z
Learning: In NVSentinel's syslog-health-monitor, the processing strategy flag accepts UNSPECIFIED from configuration, and platform_connector normalizes any UNSPECIFIED value to EXECUTE_REMEDIATION to maintain consistency with the default execution mode. This normalization happens in platform_connector_server.go around lines 59-60.
πŸ“š Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

Applied to files:

  • health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml
πŸ“š Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
πŸ“š Learning: 2026-01-14T06:30:15.804Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: health-monitors/syslog-health-monitor/main.go:164-172
Timestamp: 2026-01-14T06:30:15.804Z
Learning: In NVSentinel's syslog-health-monitor, the processing strategy flag accepts UNSPECIFIED from configuration, and platform_connector normalizes any UNSPECIFIED value to EXECUTE_REMEDIATION to maintain consistency with the default execution mode. This normalization happens in platform_connector_server.go around lines 59-60.

Applied to files:

  • health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py
  • distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml
  • health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py
πŸ“š Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • tests/gpu_health_monitor_test.go
πŸ“š Learning: 2026-01-14T02:33:00.058Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 689
File: janitor/pkg/controller/rebootnode_controller_test.go:371-436
Timestamp: 2026-01-14T02:33:00.058Z
Learning: In the NVSentinel janitor controller tests, tests that demonstrate original bugs or issues that were fixed by a PR should be kept for posterity, even if they reference removed functionality like MaxRebootRetries or RetryCount fields. These historical test cases serve as documentation of what problem was being solved.

Applied to files:

  • tests/gpu_health_monitor_test.go
πŸ“š Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • tests/gpu_health_monitor_test.go
πŸ“š Learning: 2026-01-12T05:13:24.947Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:347-372
Timestamp: 2026-01-12T05:13:24.947Z
Learning: In the NVSentinel codebase, all HealthEvent creation paths across health monitors (syslog-health-monitor, kubernetes-object-monitor, csp-health-monitor) consistently set GeneratedTimestamp to timestamppb.New(time.Now()), ensuring it is never nil in production flows.

Applied to files:

  • tests/gpu_health_monitor_test.go
🧬 Code graph analysis (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py (2)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (44-44)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (80-82)
  • ProcessingStrategy (89-91)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-18)
tests/gpu_health_monitor_test.go (4)
tests/helpers/kube.go (8)
  • UpdateDaemonSetArgs (2300-2346)
  • GetDaemonSetPodOnWorkerNode (2450-2494)
  • NVSentinelNamespace (64-64)
  • SetNodeManagedByNVSentinel (1401-1420)
  • ExecInPod (1575-1610)
  • EnsureNodeConditionNotPresent (1811-1832)
  • RestoreDaemonSetArgs (2350-2392)
  • RemoveNodeManagedByNVSentinelLabel (1423-1437)
tests/helpers/metadata.go (3)
  • CreateTestMetadata (59-108)
  • InjectMetadata (110-181)
  • DeleteMetadata (183-229)
commons/pkg/auditlogger/auditlogger.go (1)
  • Log (114-134)
tests/helpers/fault_quarantine.go (2)
  • AssertQuarantineState (317-384)
  • QuarantineAssertion (56-60)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: ko-build-test (health-monitors/kubernetes-object-monitor, .)
  • GitHub Check: ko-build-test (labeler, .)
  • GitHub Check: ko-build-test (fault-remediation, .)
  • GitHub Check: container-build-test (syslog-health-monitor, make -C health-monitors/syslog-health-monitor docker...
  • GitHub Check: container-build-test (gpu-health-monitor-dcgm4, make -C health-monitors/gpu-health-monitor docker...
  • GitHub Check: container-build-test (file-server-cleanup, make -C log-collector docker-build-file-server-cleanup)
  • GitHub Check: container-build-test (gpu-health-monitor-dcgm3, make -C health-monitors/gpu-health-monitor docker...
  • GitHub Check: container-build-test (log-collector, make -C log-collector docker-build-log-collector)
  • GitHub Check: E2E Tests (ARM64 + PostgreSQL)
  • GitHub Check: E2E Tests (ARM64 + MongoDB)
  • GitHub Check: E2E Tests (AMD64 + PostgreSQL)
  • GitHub Check: E2E Tests (AMD64 + MongoDB)
  • GitHub Check: CodeQL PR Analysis
  • GitHub Check: simple-lint (scripts, make -C scripts lint, Run shellcheck on scripts)
  • GitHub Check: modules-lint-test (fault-quarantine)
  • GitHub Check: modules-lint-test (labeler)
  • GitHub Check: modules-lint-test (platform-connectors)
  • GitHub Check: modules-lint-test (node-drainer)
  • GitHub Check: health-monitors-lint-test (csp-health-monitor)
  • GitHub Check: health-monitors-lint-test (kubernetes-object-monitor)
πŸ”‡ Additional comments (19)
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml (1)

56-62: Well-documented processingStrategy configuration.

The inline comments clearly explain both modes and their behavior. The default EXECUTE_REMEDIATION aligns with the platform_connector's normalization behavior (UNSPECIFIED β†’ EXECUTE_REMEDIATION). Based on learnings, this is consistent with how syslog-health-monitor handles the same flag.

Consider adding a note that UNSPECIFIED is also accepted but will be normalized to EXECUTE_REMEDIATION by the platform connector, for completeness.

distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml (1)

61-62: LGTM!

The --processing-strategy argument is correctly added and sourced from .Values.processingStrategy. The quoting ensures proper string handling.

tests/gpu_health_monitor_test.go (2)

658-708: Test setup properly configures STORE_ONLY mode and injects errors.

The setup correctly:

  1. Updates DaemonSet args to use STORE_ONLY strategy
  2. Retrieves the pod after rollout
  3. Injects test metadata and error
  4. Stores original args for restoration

One observation: The error injection happens in setup (lines 698-705) rather than in an Assess step. This is fine for this test since you're verifying the absence of cluster changes, but consider documenting this design choice with a comment.


710-726: Assess step correctly validates STORE_ONLY behavior.

The assertions properly verify that:

  1. Node condition GpuInforomWatch is not applied
  2. Node is not cordoned and has no quarantine annotation

This aligns with the PR objective of verifying that STORE_ONLY mode persists/exports events without modifying cluster resources.

health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py (5)

41-66: Clean implementation of processing_strategy parameter.

The new processing_strategy parameter is properly typed with platformconnector_pb2.ProcessingStrategy and stored as an instance variable. This follows PEP 8 and the coding guidelines requirement for type hints.


106-121: processingStrategy correctly propagated in clear_dcgm_connectivity_failure.

The HealthEvent includes processingStrategy=self._processing_strategy, ensuring connectivity restoration events respect the configured strategy.


206-223: processingStrategy correctly propagated in health_event_occurred (failure path).

The HealthEvent for GPU failures includes the processing strategy field.


270-287: processingStrategy correctly propagated in health_event_occurred (healthy path).

The HealthEvent for healthy status includes the processing strategy field, maintaining consistency across all event types.


366-381: processingStrategy correctly propagated in dcgm_connectivity_failed.

The DCGM connectivity failure HealthEvent includes the processing strategy field, completing the coverage of all HealthEvent creation paths.
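The propagation pattern the five comments above describe — store the configured strategy once in the constructor, then stamp it onto every emitted event — can be sketched like this. A plain dict stands in for the protobuf `HealthEvent` message, and the enum values are hypothetical stand-ins for the generated constants; this is not the actual `platform_connector.py` code.

```python
from enum import IntEnum
from typing import Any


class ProcessingStrategy(IntEnum):
    # Hypothetical stand-ins for the platformconnector_pb2 constants.
    EXECUTE_REMEDIATION = 1
    STORE_ONLY = 2


class PlatformConnectorEventProcessor:
    """Sketch: the strategy is stored once and applied uniformly to
    every HealthEvent creation path (failure, healthy, and DCGM
    connectivity events alike)."""

    def __init__(self, processing_strategy: ProcessingStrategy) -> None:
        self._processing_strategy = processing_strategy

    def _build_event(self, check_name: str, is_healthy: bool) -> dict[str, Any]:
        # Dict used in place of the real protobuf HealthEvent message.
        return {
            "checkName": check_name,
            "isHealthy": is_healthy,
            "processingStrategy": self._processing_strategy,
        }
```

Centralizing the strategy in the constructor is what makes the per-path review comments above boil down to the same one-line change: each event builder reads `self._processing_strategy` instead of taking the value as an argument.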

health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py (6)

99-108: Test correctly updated to include processing_strategy parameter.

The PlatformConnectorEventProcessor instantiation now includes platformconnector_pb2.STORE_ONLY as the processing_strategy argument, aligning with the updated constructor signature.


301-302: Good assertion for processingStrategy propagation.

Verifying that nvlink_failure_event.processingStrategy == platformconnector_pb2.STORE_ONLY confirms the field is correctly propagated through the HealthEvent pipeline.


421-436: Assertions verify processingStrategy for multi-GPU events.

Both GPU 0 and GPU 1 events are checked for the correct processingStrategy value, ensuring consistent propagation across multiple entities.


493-494: DCGM connectivity failure event correctly asserts processingStrategy.

The test verifies the processingStrategy field is set correctly for connectivity failure events.


523-549: Good coverage of EXECUTE_REMEDIATION strategy.

This test uses EXECUTE_REMEDIATION (line 523) and verifies the restored event carries this strategy (line 549), providing coverage for both strategy values in the test suite.


595-604: Retry test correctly updated with processing_strategy.

The cache cleanup and retry test includes the processing_strategy parameter, maintaining consistency with other test cases.

health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py (4)

24-24: LGTM!

The alias platformconnector_pb2 clearly indicates the module's purpose and is used consistently throughout the file.


28-51: LGTM!

The function signature correctly includes the new processing_strategy parameter with proper type hints as per coding guidelines, and the parameter is properly propagated to PlatformConnectorEventProcessor.


81-91: LGTM!

The new processing_strategy parameter is correctly added to the CLI function signature, consistent with the existing parameter style.


137-150: LGTM!

The processing_strategy_value is correctly propagated to the event processor initialization, matching the expected protobuf enum type.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Signed-off-by: Tanisha goyal <[email protected]>
@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-in-gpu-monitor branch from 80494cc to 9424707 Compare January 15, 2026 09:26
@lalitadithya lalitadithya merged commit 6ab82e4 into NVIDIA:main Jan 15, 2026
54 checks passed