feat: add event handling strategy changes in syslog health monitor #609

tanishagoyal2 · 2025-12-22T07:00:26Z

Summary

Type of Change

Component(s) Affected

Testing

Tests pass locally
Manual testing completed
No breaking changes (or documented)

Checklist

Self-review completed
Documentation updated (if needed)
Ready for review

Testing

Tested on dev cluster nvs-dgxc-k8s-oci-lhr-dev3 with nvsentinel installed from main.
Updated the syslog-health-monitor image with the changes from this branch and ran syslog-monitor with the STORE_ONLY strategy.
Injected XID 13 on a GPU node.
The event was created with the STORE_ONLY strategy.

The event exporter also exported the event with the STORE_ONLY strategy.

Nodes were not cordoned and no node condition was applied.

Summary by CodeRabbit

New Features
- Added a processing-strategy option for the syslog health monitor with two modes: EXECUTE_REMEDIATION (default) and STORE_ONLY; configurable via flag and DaemonSet args.
Tests
- Added reusable setup/teardown helpers and DaemonSet arg utilities for syslog tests.
- Added STORE_ONLY integration test and updated assertions to validate the configured processing strategy.


coderabbitai · 2025-12-22T07:00:36Z

📝 Walkthrough


This PR adds a configurable processing strategy (EXECUTE_REMEDIATION or STORE_ONLY) via CLI flag and Helm value, threads the chosen strategy through SyslogMonitor and per-check handlers, and includes it in emitted HealthEvent payloads and tests.

Changes

Cohort / File(s)	Summary
Helm Configuration `distros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpl`, `distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml`	Added `processingStrategy` Helm value (default `EXECUTE_REMEDIATION`, option `STORE_ONLY`) and injected `--processing-strategy {{ .Values.processingStrategy }}` into DaemonSet container args.
Main Entry Point `health-monitors/syslog-health-monitor/main.go`	Added `--processing-strategy` CLI flag, validated against enum, logged selection, and passed strategy into `NewSyslogMonitor`.
SyslogMonitor Core `health-monitors/syslog-health-monitor/pkg/syslog-monitor/...` `.../syslogmonitor.go`, `.../types.go`, `.../syslogmonitor_test.go`	SyslogMonitor now stores `processingStrategy`; constructors (`NewSyslogMonitor`, `NewSyslogMonitorWithFactory`) accept and propagate it to handlers; health events include `ProcessingStrategy`; tests updated.
Handlers (XID / SXID / GPUFallen) `health-monitors/syslog-health-monitor/pkg/{xid,sxid,gpufallen}/...` `.../types.go`, `.../_handler.go`, `.../_handler_test.go`	Each handler gained a `processingStrategy pb.ProcessingStrategy` field; constructors updated to accept strategy; emitted events carry `ProcessingStrategy`; tests adjusted to pass/assert strategy.
Event Validation Tests `tests/helpers/event_exporter.go`, `tests/event_exporter_test.go`	`ValidateCloudEvent` signature extended with expected processing strategy and now asserts `healthEvent["processingStrategy"]` equals it; tests updated to pass the strategy.
Kubernetes Test Helpers `tests/helpers/kube.go`	Added helpers to update/remove DaemonSet container args, wait for DaemonSet rollout, and retrieve DaemonSet pod on a worker node: `UpdateDaemonSetArgs`, `RemoveDaemonSetArgs`, `GetDaemonSetPodOnWorkerNode` (and internal rollout helpers).
Syslog Test Orchestration Helpers `tests/helpers/syslog-health-monitor.go`	Added `SetUpSyslogHealthMonitor` and `TearDownSyslogHealthMonitor` to manage DaemonSet arg updates, pod retrieval, metadata injection, port-forwarding, and cleanup.
Syslog Health Monitor Tests `tests/syslog_health_monitor_test.go`	Tests refactored to use new setup/teardown helpers; added `TestSyslogHealthMonitorStoreOnlyStrategy` validating STORE_ONLY path (no node condition/quarantine changes); updated port/port-forward usage.
Platform Connector Test Adjustments `platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go`	Strengthened assertions in `TestProcessHealthEvents_StoreOnlyStrategy` to require non-empty condition/event collections before inspecting items.
Misc Tests Adjustments `tests/helpers/*`	Updated/added helpers and tests to support propagation and validation of the processing strategy across test flows.
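
The propagation path summarized in the table above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `ProcessingStrategy`, `HealthEvent`, `Handler`, `Monitor`, `NewHandler`, and `NewMonitor` are simplified, hypothetical stand-ins for the generated `pb` types and the real `SyslogMonitor` and handler structs, whose constructors take more parameters. The enum numbers are illustrative only.

```go
package main

import "fmt"

// ProcessingStrategy is a local stand-in for the generated pb.ProcessingStrategy enum.
type ProcessingStrategy int32

const (
	ExecuteRemediation ProcessingStrategy = iota
	StoreOnly
)

// HealthEvent is a simplified stand-in for pb.HealthEvent.
type HealthEvent struct {
	CheckName          string
	Message            string
	ProcessingStrategy ProcessingStrategy
}

// Handler is a hypothetical per-check handler (XID, SXID, GPUFallen in the PR).
type Handler struct {
	checkName          string
	processingStrategy ProcessingStrategy
}

// NewHandler stores the strategy so every event the handler emits carries it.
func NewHandler(checkName string, strategy ProcessingStrategy) *Handler {
	return &Handler{checkName: checkName, processingStrategy: strategy}
}

// ProcessLine turns a log line into an event stamped with the handler's strategy.
func (h *Handler) ProcessLine(line string) HealthEvent {
	return HealthEvent{
		CheckName:          h.checkName,
		Message:            line,
		ProcessingStrategy: h.processingStrategy,
	}
}

// Monitor is a hypothetical stand-in for SyslogMonitor.
type Monitor struct {
	processingStrategy ProcessingStrategy
	handlers           []*Handler
}

// NewMonitor records the strategy once and passes it to each handler it builds.
func NewMonitor(strategy ProcessingStrategy, checks ...string) *Monitor {
	m := &Monitor{processingStrategy: strategy}
	for _, c := range checks {
		m.handlers = append(m.handlers, NewHandler(c, strategy))
	}
	return m
}

func main() {
	m := NewMonitor(StoreOnly, "xid", "sxid")
	for _, h := range m.handlers {
		ev := h.ProcessLine("example log line")
		fmt.Printf("%s -> strategy %d\n", ev.CheckName, ev.ProcessingStrategy)
	}
}
```

Setting the strategy once at construction time means a handler cannot emit events with a strategy different from the one the monitor was started with.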

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A flag on helm, a hop so spry,
EXECUTE or STORE beneath the sky,
From main to handler each event will keep,
A rabbit notes the strategy in a leap,
Hooray — the monitor sings and never sleeps!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 40.74% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat: add event handling strategy changes in syslog health monitor' accurately describes the main change: adding a new processingStrategy parameter throughout the syslog health monitor codebase to enable different event handling strategies (EXECUTE_REMEDIATION vs STORE_ONLY).


📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 49d8021 and 2ddafe5.

📒 Files selected for processing (1)

distros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpl

🧰 Additional context used

🧠 Learnings (4)

📓 Common learnings

Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Note truthy value requirements in Helm chart documentation where applicable

Applied to files:

distros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpl

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Include examples for non-obvious configurations in Helm chart documentation

Applied to files:

distros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpl

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Document all values in Helm chart `values.yaml` with inline comments

Applied to files:

distros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpl

🔇 Additional comments (1)

distros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpl (1)

103-104: No changes needed—processingStrategy is properly defined with a default value.

The values.yaml already defines processingStrategy: EXECUTE_REMEDIATION with clear documentation explaining valid values and their behavior. The template will correctly use this default if not overridden by the user, so no additional defensive checks in the template are required.

Likely an incorrect or invalid review comment.


coderabbitai

Actionable comments posted: 6

🧹 Nitpick comments (10)

distros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpl (1)
103-104: Consider quoting the processing strategy value for consistency.

The processingStrategy value is passed unquoted, while similar string arguments like $kataLabel at line 94 use the quote function. For consistency and to follow YAML best practices, consider quoting this value.
🔎 Proposed fix
             - "--processing-strategy"
-            - {{ $root.Values.processingStrategy }}
+            - {{ $root.Values.processingStrategy | quote }}
tests/helpers/healthevent.go (2)
153-156: Add godoc comment for the exported method.

Per coding guidelines, exported Go functions should have function comments. Other With* methods in this file lack comments too, but new code should set a better example.
Suggested documentation
+// WithProcessingStrategy sets the processing strategy for the health event.
+// Use protos.ProcessingStrategy_EXECUTE_REMEDIATION (0) or protos.ProcessingStrategy_STORE_ONLY (1).
 func (h *HealthEventTemplate) WithProcessingStrategy(strategy int) *HealthEventTemplate {
 	h.ProcessingStrategy = strategy
 	return h
 }
48-48: Consider using the protobuf type for consistency.

The protobuf-generated Go code uses int32 for ProcessingStrategy. While int works for JSON serialization in tests, using int32 would be more consistent with the generated types.
Suggested change
-	ProcessingStrategy  int                  `json:"processingStrategy,omitempty"`
+	ProcessingStrategy  int32                `json:"processingStrategy,omitempty"`
This would require updating the setter signature as well:
-func (h *HealthEventTemplate) WithProcessingStrategy(strategy int) *HealthEventTemplate {
+func (h *HealthEventTemplate) WithProcessingStrategy(strategy int32) *HealthEventTemplate {
data-models/protobufs/health_event.proto (1)
77-77: Add field-level comment for processingStrategy.

Per coding guidelines for protobuf files: "Include comprehensive comments for all fields in Protocol Buffer messages." The enum declaration has good comments, but the field itself should also have a brief comment explaining its purpose in the context of the HealthEvent message.
Suggested addition
   BehaviourOverrides drainOverrides = 15;
+  // Determines how downstream modules should handle this event (remediation vs observability-only).
   ProcessingStrategy processingStrategy = 16;
 }
health-monitors/syslog-health-monitor/main.go (1)
164-171: Consider enhancing the error message with valid options.

The validation correctly uses the generated pb.ProcessingStrategy_value map. Consider listing valid values in the error message to improve user experience.
Suggested improvement
 	value, ok := pb.ProcessingStrategy_value[*processingStrategyFlag]
 	if !ok {
-		return fmt.Errorf("unexpected processingStrategy value: %q", *processingStrategyFlag)
+		return fmt.Errorf("unexpected processingStrategy value: %q (valid values: EXECUTE_REMEDIATION, STORE_ONLY)", *processingStrategyFlag)
 	}
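
As a variation on the suggestion above, the list of valid values could be derived from the enum map instead of being hardcoded, so the message stays correct if values are added later. The sketch below assumes a map shaped like the generated `pb.ProcessingStrategy_value` (enum name to number); `processingStrategyValue` and `parseProcessingStrategy` are hypothetical local names, and the numbers are illustrative.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// processingStrategyValue is a local stand-in for the generated
// pb.ProcessingStrategy_value map; the numeric values are illustrative.
var processingStrategyValue = map[string]int32{
	"EXECUTE_REMEDIATION": 0,
	"STORE_ONLY":          1,
}

// parseProcessingStrategy is a hypothetical helper that validates a flag value
// and, on failure, lists every accepted name in the error message.
func parseProcessingStrategy(name string) (int32, error) {
	if value, ok := processingStrategyValue[name]; ok {
		return value, nil
	}

	valid := make([]string, 0, len(processingStrategyValue))
	for k := range processingStrategyValue {
		valid = append(valid, k)
	}
	// Map iteration order is not stable, so sort for a deterministic message.
	sort.Strings(valid)

	return 0, fmt.Errorf("unexpected processingStrategy value: %q (valid values: %s)",
		name, strings.Join(valid, ", "))
}

func main() {
	// Lookups are case-sensitive, so a lowercase value is rejected.
	if _, err := parseProcessingStrategy("store_only"); err != nil {
		fmt.Println(err)
	}
}
```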
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go (1)
239-242: Consider using the enum constant directly instead of accessing the handler field.

Line 241 accesses handler2.processingStrategy directly for the assertion. While this works, using the expected enum constant (pb.ProcessingStrategy_EXECUTE_REMEDIATION) would be more explicit and resilient to future encapsulation changes.
🔎 Suggested change
-		assert.Equal(t, handler2.processingStrategy, events.Events[0].ProcessingStrategy)
+		assert.Equal(t, pb.ProcessingStrategy_EXECUTE_REMEDIATION, events.Events[0].ProcessingStrategy)
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go (1)
191-195: Brittle entity index assertions may break if entity order changes.

The assertions use hardcoded indices (EntitiesImpacted[3], EntitiesImpacted[4]) to access GPU entities. If the entity ordering in ProcessLine changes, these tests will fail with confusing errors.

Consider iterating over entities by type, similar to the pattern used in gpufallen_handler_test.go (lines 124-136).
🔎 Suggested approach
-	// Verify GPU entity
-	assert.Equal(t, "GPU", event.EntitiesImpacted[3].EntityType)
-	assert.Equal(t, "1", event.EntitiesImpacted[3].EntityValue)
-	assert.Equal(t, "GPU_UUID", event.EntitiesImpacted[4].EntityType)
-	assert.Equal(t, "GPU-aaaabbbb-cccc-dddd-eeee-ffffffffffff", event.EntitiesImpacted[4].EntityValue)
+	// Verify GPU entities by type rather than assuming order
+	var hasGPU, hasGPUUUID bool
+	for _, entity := range event.EntitiesImpacted {
+		switch entity.EntityType {
+		case "GPU":
+			hasGPU = true
+			assert.Equal(t, "1", entity.EntityValue)
+		case "GPU_UUID":
+			hasGPUUUID = true
+			assert.Equal(t, "GPU-aaaabbbb-cccc-dddd-eeee-ffffffffffff", entity.EntityValue)
+		}
+	}
+	assert.True(t, hasGPU, "Should have GPU entity")
+	assert.True(t, hasGPUUUID, "Should have GPU_UUID entity")
tests/helpers/syslog-health-monitor.go (1)
15-15: Missing package-level godoc comment.

As per coding guidelines, package-level godoc is required for all Go packages.
Proposed fix
+// Package helpers provides test utility functions for NVSentinel integration tests.
 package helpers
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (1)
427-446: Consider adding a test case to validate ProcessingStrategy is stored in handler.

The existing test cases verify handler construction but don't assert that processingStrategy is correctly stored. Consider adding an assertion.
Proposed enhancement
 			} else {
 				assert.NoError(t, err)
 				require.NotNil(t, handler)
 				assert.Equal(t, tc.nodeName, handler.nodeName)
 				assert.Equal(t, tc.agentName, handler.defaultAgentName)
 				assert.Equal(t, tc.componentClass, handler.defaultComponentClass)
 				assert.Equal(t, tc.checkName, handler.checkName)
+				assert.Equal(t, pb.ProcessingStrategy_EXECUTE_REMEDIATION, handler.processingStrategy)
 				assert.NotNil(t, handler.pciToGPUUUID)
 				assert.NotNil(t, handler.parser)
 				assert.NotNil(t, handler.metadataReader)
 			}
tests/helpers/kube.go (1)
2346-2390: Consider clarifying the daemonsetName parameter usage.

The daemonsetName parameter is only used in the error message but not for any actual filtering logic (which relies on podNamePattern). This could be confusing to callers. Consider either using it for filtering or documenting this clearly.
🔎 Suggested documentation improvement
+// GetDaemonSetPodOnWorkerNode returns a running, ready pod matching the given name pattern on a real worker node.
+// Note: daemonsetName is used for error messages; filtering is done via podNamePattern.
 func GetDaemonSetPodOnWorkerNode(ctx context.Context, t *testing.T, client klient.Client,
 	daemonsetName string, podNamePattern string) (*v1.Pod, error) {

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 61f47cb and a216b57.

⛔ Files ignored due to path filters (1)

data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go

📒 Files selected for processing (34)

data-models/protobufs/health_event.proto
distros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpl
distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml
event-exporter/pkg/transformer/cloudevents.go
event-exporter/pkg/transformer/cloudevents_test.go
fault-quarantine/pkg/evaluator/rule_evaluator_test.go
fault-quarantine/pkg/initializer/init.go
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
health-monitors/syslog-health-monitor/main.go
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
health-monitors/syslog-health-monitor/pkg/gpufallen/types.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
health-monitors/syslog-health-monitor/pkg/sxid/types.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go
health-monitors/syslog-health-monitor/pkg/xid/types.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
platform-connectors/pkg/connectors/kubernetes/process_node_events.go
store-client/pkg/client/mongodb_pipeline_builder.go
store-client/pkg/client/pipeline_builder.go
store-client/pkg/client/pipeline_builder_test.go
store-client/pkg/client/postgresql_pipeline_builder.go
tests/event_exporter_test.go
tests/helpers/event_exporter.go
tests/helpers/healthevent.go
tests/helpers/kube.go
tests/helpers/syslog-health-monitor.go
tests/syslog_health_monitor_test.go

🧰 Additional context used

📓 Path-based instructions (5)

**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

tests/helpers/event_exporter.go
store-client/pkg/client/pipeline_builder_test.go
tests/helpers/healthevent.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go
event-exporter/pkg/transformer/cloudevents.go
store-client/pkg/client/mongodb_pipeline_builder.go
health-monitors/syslog-health-monitor/pkg/sxid/types.go
fault-quarantine/pkg/initializer/init.go
fault-quarantine/pkg/evaluator/rule_evaluator_test.go
event-exporter/pkg/transformer/cloudevents_test.go
store-client/pkg/client/pipeline_builder.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
health-monitors/syslog-health-monitor/pkg/xid/types.go
store-client/pkg/client/postgresql_pipeline_builder.go
health-monitors/syslog-health-monitor/main.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go
tests/helpers/syslog-health-monitor.go
platform-connectors/pkg/connectors/kubernetes/process_node_events.go
health-monitors/syslog-health-monitor/pkg/gpufallen/types.go
tests/event_exporter_test.go
tests/helpers/kube.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
tests/syslog_health_monitor_test.go

**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

store-client/pkg/client/pipeline_builder_test.go
fault-quarantine/pkg/evaluator/rule_evaluator_test.go
event-exporter/pkg/transformer/cloudevents_test.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
tests/event_exporter_test.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
tests/syslog_health_monitor_test.go

data-models/protobufs/**/*.proto

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

data-models/protobufs/**/*.proto: Define Protocol Buffer messages in data-models/protobufs/ directory
Use semantic versioning for breaking changes in Protocol Buffer messages
Include comprehensive comments for all fields in Protocol Buffer messages

Files:

data-models/protobufs/health_event.proto

**/values.yaml

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/values.yaml: Document all values in Helm chart values.yaml with inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable

Files:

distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml

**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code

Files:

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py

🧠 Learnings (8)

📚 Learning: 2025-11-07T04:16:43.079Z

Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.

Applied to files:

health-monitors/syslog-health-monitor/pkg/sxid/types.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
health-monitors/syslog-health-monitor/pkg/xid/types.go
health-monitors/syslog-health-monitor/main.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go
tests/syslog_health_monitor_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Each health monitor service should implement the gRPC service from `data-models/protobufs/`

Applied to files:

health-monitors/syslog-health-monitor/pkg/sxid/types.go

📚 Learning: 2025-11-10T10:25:19.443Z

Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.

Applied to files:

fault-quarantine/pkg/initializer/init.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
tests/event_exporter_test.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
tests/syslog_health_monitor_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

tests/event_exporter_test.go
tests/syslog_health_monitor_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `testify/assert` and `testify/require` for assertions in Go tests

Applied to files:

tests/event_exporter_test.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods

Applied to files:

tests/event_exporter_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

tests/syslog_health_monitor_test.go

🧬 Code graph analysis (17)

store-client/pkg/client/pipeline_builder_test.go (3)

store-client/pkg/client/pipeline_builder.go (1)

PipelineBuilder (26-47)

store-client/pkg/client/mongodb_pipeline_builder.go (1)

NewMongoDBPipelineBuilder (29-31)

store-client/pkg/client/postgresql_pipeline_builder.go (1)

NewPostgreSQLPipelineBuilder (29-31)

tests/helpers/healthevent.go (2)

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

ProcessingStrategy (14-17)

data-models/pkg/protos/health_event.pb.go (4)

ProcessingStrategy (43-43)

ProcessingStrategy (72-74)

ProcessingStrategy (76-78)

ProcessingStrategy (85-87)

event-exporter/pkg/transformer/cloudevents.go (2)

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

ProcessingStrategy (14-17)

data-models/pkg/protos/health_event.pb.go (4)

ProcessingStrategy (43-43)

ProcessingStrategy (72-74)

ProcessingStrategy (76-78)

ProcessingStrategy (85-87)

store-client/pkg/client/mongodb_pipeline_builder.go (2)

store-client/pkg/datastore/types.go (4)

ToPipeline (161-163)

D (131-133)

E (126-128)

A (136-138)

data-models/pkg/protos/health_event.pb.go (1)

ProcessingStrategy_EXECUTE_REMEDIATION (46-46)

event-exporter/pkg/transformer/cloudevents_test.go (2)

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

ProcessingStrategy (14-17)

data-models/pkg/protos/health_event.pb.go (5)

ProcessingStrategy (43-43)

ProcessingStrategy (72-74)

ProcessingStrategy (76-78)

ProcessingStrategy (85-87)

ProcessingStrategy_STORE_ONLY (47-47)

health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (3)

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

ProcessingStrategy (14-17)

data-models/pkg/protos/health_event.pb.go (4)

ProcessingStrategy (43-43)

ProcessingStrategy (72-74)

ProcessingStrategy (76-78)

ProcessingStrategy (85-87)

health-monitors/syslog-health-monitor/pkg/xid/types.go (1)

XIDHandler (29-39)

store-client/pkg/client/postgresql_pipeline_builder.go (2)

store-client/pkg/datastore/types.go (4)

ToPipeline (161-163)

D (131-133)

E (126-128)

A (136-138)

data-models/pkg/protos/health_event.pb.go (1)

ProcessingStrategy_EXECUTE_REMEDIATION (46-46)

health-monitors/syslog-health-monitor/main.go (1)

data-models/pkg/protos/health_event.pb.go (1)

ProcessingStrategy_value (56-59)

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (2)

data-models/pkg/protos/health_event.pb.go (17)

HealthEvent (260-280)

HealthEvent (293-293)

HealthEvent (308-310)

Entity (208-214)

Entity (227-227)

Entity (242-244)

RecommendedAction (89-89)

RecommendedAction (139-141)

RecommendedAction (143-145)

RecommendedAction (152-154)

RecommendedAction_CONTACT_SUPPORT (94-94)

ProcessingStrategy (43-43)

ProcessingStrategy (72-74)

ProcessingStrategy (76-78)

ProcessingStrategy (85-87)

ProcessingStrategy_STORE_ONLY (47-47)

ProcessingStrategy_EXECUTE_REMEDIATION (46-46)

platform-connectors/pkg/connectors/kubernetes/k8s_connector.go (1)

NewK8sConnector (47-58)

health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (3)

data-models/pkg/protos/health_event.pb.go (6)

ProcessingStrategy_EXECUTE_REMEDIATION (46-46)

ProcessingStrategy_STORE_ONLY (47-47)

ProcessingStrategy (43-43)

ProcessingStrategy (72-74)

ProcessingStrategy (76-78)

ProcessingStrategy (85-87)

health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)

NewXIDHandler (33-58)

health-monitors/syslog-health-monitor/pkg/xid/types.go (1)

XIDHandler (29-39)

health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go (1)

health-monitors/syslog-health-monitor/pkg/gpufallen/types.go (1)

GPUFallenHandler (47-57)

platform-connectors/pkg/connectors/kubernetes/process_node_events.go (2)

data-models/pkg/protos/health_event.pb.go (11)

HealthEvents (156-162)

HealthEvents (175-175)

HealthEvents (190-192)

HealthEvent (260-280)

HealthEvent (293-293)

HealthEvent (308-310)

ProcessingStrategy (43-43)

ProcessingStrategy (72-74)

ProcessingStrategy (76-78)

ProcessingStrategy (85-87)

ProcessingStrategy_STORE_ONLY (47-47)

platform-connectors/pkg/connectors/kubernetes/k8s_connector.go (1)

K8sConnector (37-45)

tests/event_exporter_test.go (1)

tests/helpers/event_exporter.go (1)

ValidateCloudEvent (221-247)

health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go (1)

data-models/pkg/protos/health_event.pb.go (2)

ProcessingStrategy_STORE_ONLY (47-47)

ProcessingStrategy_EXECUTE_REMEDIATION (46-46)

health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go (1)

data-models/pkg/protos/health_event.pb.go (2)

ProcessingStrategy_STORE_ONLY (47-47)

ProcessingStrategy_EXECUTE_REMEDIATION (46-46)

health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go (2)

health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go (1)

NewSXIDHandler (30-42)

data-models/pkg/protos/health_event.pb.go (3)

ProcessingStrategy_EXECUTE_REMEDIATION (46-46)

ProcessingStrategy_STORE_ONLY (47-47)

RecommendedAction_CONTACT_SUPPORT (94-94)

tests/syslog_health_monitor_test.go (3)

tests/helpers/syslog-health-monitor.go (3)

SetUpSyslogHealthMonitor (34-74)

StubJournalHTTPPort (28-28)

TearDownSyslogHealthMonitor (77-116)

tests/helpers/kube.go (2)

InjectSyslogMessages (1796-1819)

PortForwardPod (1590-1634)

tests/helpers/fault_quarantine.go (2)

AssertQuarantineState (315-382)

QuarantineAssertion (56-60)

🔇 Additional comments (65)

fault-quarantine/pkg/initializer/init.go (1)
66-66: Verify the pipeline filtering aligns with fault-quarantine requirements.

The switch from BuildAllHealthEventInsertsPipeline() to BuildProcessableHealthEventInsertsPipeline() means the fault-quarantine module will now only observe health events marked with the EXECUTE_REMEDIATION processing strategy. Events marked STORE_ONLY will be excluded from triggering quarantine actions.

Run the following script to confirm the new pipeline builder exists and verify its filtering logic:
#!/bin/bash
# Verify BuildProcessableHealthEventInsertsPipeline implementation and filtering behavior

# Find the pipeline builder implementation
echo "=== Locating BuildProcessableHealthEventInsertsPipeline implementation ==="
ast-grep --pattern $'func ($$$) BuildProcessableHealthEventInsertsPipeline() $$$'

# Find usages to understand the filtering behavior
echo -e "\n=== Finding references to ProcessingStrategy enum ==="
rg -n -C3 --type=go 'ProcessingStrategy|EXECUTE_REMEDIATION|STORE_ONLY' store-client/

# Find the builder interface/struct definition
echo -e "\n=== Locating PipelineBuilder interface ==="
rg -n -C3 --type=go 'type.*PipelineBuilder|BuildProcessableHealthEventInsertsPipeline'

fault-quarantine/pkg/evaluator/rule_evaluator_test.go (1)

263-263: LGTM! Default enum value correctly validated.

The test correctly expects processingStrategy to be 0 (EXECUTE_REMEDIATION), which is the protobuf default for the new enum field.
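
The zero-value behaviour this comment relies on can be shown with a small stand-alone sketch. The `ProcessingStrategy` type below is a hand-written stand-in for the generated protobuf enum, `HealthEvent` is a hypothetical trimmed-down struct, and the value 1 for STORE_ONLY is an assumption; only EXECUTE_REMEDIATION = 0 is confirmed by the review.

```go
package main

import "fmt"

// ProcessingStrategy is a hand-written stand-in for the generated protobuf enum.
type ProcessingStrategy int32

const (
	// ExecuteRemediation is the zero value, so it is the implicit default.
	ExecuteRemediation ProcessingStrategy = 0
	// StoreOnly is assumed to be 1 for this illustration.
	StoreOnly ProcessingStrategy = 1
)

// HealthEvent is a hypothetical, trimmed-down event used only for illustration.
type HealthEvent struct {
	NodeName           string
	ProcessingStrategy ProcessingStrategy
}

func main() {
	// The strategy is never set, so the field keeps its zero value.
	ev := HealthEvent{NodeName: "node-a"}
	fmt.Println("defaults to EXECUTE_REMEDIATION:", ev.ProcessingStrategy == ExecuteRemediation)
}
```

An event built by code that predates the field would therefore read as EXECUTE_REMEDIATION once decoded into the Go struct, which is why the test expects 0.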

event-exporter/pkg/transformer/cloudevents.go (1)

66-66: LGTM! Consistent enum-to-string conversion.

The implementation correctly uses ProcessingStrategy.String() to include the strategy in CloudEvent data, matching the pattern used for recommendedAction at line 61.

store-client/pkg/client/pipeline_builder_test.go (1)

69-86: LGTM! Well-structured table-driven test.

The test correctly validates the new BuildProcessableHealthEventInsertsPipeline() method for both MongoDB and PostgreSQL builders, following the established patterns in this test file.

health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go (1)

53-53: LGTM! Safe addition of unexported field.

The processingStrategy field is correctly added as an unexported field to store the health event processing strategy for this monitor instance.

health-monitors/syslog-health-monitor/pkg/xid/types.go (1)

20-20: LGTM! Consistent import and field addition.

The protobuf import and processingStrategy field are correctly added to enable strategy propagation through the XID handler to health events.

Also applies to: 34-34

health-monitors/syslog-health-monitor/pkg/gpufallen/types.go (1)

23-23: LGTM! Consistent with other handler implementations.

The protobuf import and processingStrategy field addition follow the same pattern used in the XID and SXID handlers.

Also applies to: 52-52

distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml (1)

51-56: LGTM! Well-documented configuration option.

The processingStrategy configuration is clearly documented with valid values, default behavior, and behavioral differences. The default value EXECUTE_REMEDIATION maintains backward compatibility with existing deployments.

As per coding guidelines, the inline documentation follows Helm best practices for values.yaml files.

data-models/protobufs/health_event.proto (1)

32-38: Good enum design with clear documentation.

The enum is well-designed with EXECUTE_REMEDIATION=0 as the default (zero value), ensuring backward compatibility for existing events that don't set this field. The comments clearly explain the behavioral difference between the two strategies.

store-client/pkg/client/mongodb_pipeline_builder.go (1)

87-100: Verify behavior when processingStrategy field is missing or null.

The filter matches documents where fullDocument.healthevent.processingstrategy equals 0 (EXECUTE_REMEDIATION). In MongoDB, an equality match against 0 does not match documents in which that field is absent, so events stored without the field would be excluded. This could cause backward compatibility issues with existing health events that were created before this field was added.

If existing events should be treated as EXECUTE_REMEDIATION (the intended default), consider updating the filter to handle missing fields:
Potential fix if missing fields need to match
 func (b *MongoDBPipelineBuilder) BuildProcessableHealthEventInsertsPipeline() datastore.Pipeline {
 	return datastore.ToPipeline(
 		datastore.D(
 			datastore.E("$match", datastore.D(
 				datastore.E("operationType", datastore.D(
 					datastore.E("$in", datastore.A("insert")),
 				)),
-				datastore.E("fullDocument.healthevent.processingstrategy", int32(protos.ProcessingStrategy_EXECUTE_REMEDIATION)),
+				datastore.E("$or", datastore.A(
+					datastore.D(datastore.E("fullDocument.healthevent.processingstrategy", int32(protos.ProcessingStrategy_EXECUTE_REMEDIATION))),
+					datastore.D(datastore.E("fullDocument.healthevent.processingstrategy", datastore.D(datastore.E("$exists", false)))),
+				)),
 			)),
 		),
 	)
 }
Please verify whether existing health events in the database have this field set, or if all events going forward will include it.
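
The difference between the two filters can be illustrated without a database. The sketch below is a plain-Go simulation of the matching rules, not MongoDB driver code; `doc`, `matchesStrict`, and `matchesLenient` are hypothetical names introduced for the example.

```go
package main

import "fmt"

// doc is a simplified stand-in for the healthevent sub-document of a change event.
type doc map[string]any

const field = "processingstrategy"

// matchesStrict mimics an equality match on the field: absent fields never match.
func matchesStrict(d doc) bool {
	v, ok := d[field]
	return ok && v == int32(0)
}

// matchesLenient mimics the $or variant: field equals 0, or field is absent.
func matchesLenient(d doc) bool {
	_, ok := d[field]
	return !ok || matchesStrict(d)
}

func main() {
	docs := map[string]doc{
		"legacy (no field)":   {},
		"EXECUTE_REMEDIATION": {field: int32(0)},
		"STORE_ONLY":          {field: int32(1)},
	}
	for name, d := range docs {
		fmt.Printf("%-20s strict=%v lenient=%v\n", name, matchesStrict(d), matchesLenient(d))
	}
}
```

Only the legacy document is treated differently by the two variants, which is the case the comment asks the author to verify.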

store-client/pkg/client/pipeline_builder.go (1)

35-38: LGTM!

The new interface method is well-documented with clear explanation of:

What it does: watches for "processable" health event inserts with processingStrategy=EXECUTE_REMEDIATION

Who uses it: fault-quarantine

Why: to ignore observability-only events (STORE_ONLY)

This follows the established pattern of the other pipeline builder methods.

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)

1-6: This file is auto-generated by the protobuf compiler and is correctly checked into the repository. The "NO CHECKED-IN PROTOBUF GENCODE" comment is a standard protobuf compiler-generated header, not a project policy directive. For Python (non-C++), checking in generated protobuf files is a standard practice supported by the protobuf project. The repository properly manages this via the protos-generate Makefile target for regeneration and the protos-lint target for CI validation to ensure generated files stay up to date with their source .proto files.

Likely an incorrect or invalid review comment.

tests/helpers/event_exporter.go (1)

221-246: LGTM!

The new expectedProcessingStrategy parameter integrates cleanly with the existing validation pattern. The assertion at line 245 correctly validates the new field in the CloudEvent payload.

health-monitors/syslog-health-monitor/pkg/sxid/types.go (1)

20-36: LGTM!

The new processingStrategy field is correctly typed and logically positioned among the configuration fields in SXIDHandler.

event-exporter/pkg/transformer/cloudevents_test.go (1)

69-108: LGTM!

The test correctly validates the new processingStrategy field in CloudEvent output. The explicit STORE_ONLY value in the first comprehensive test case provides good coverage, while other focused test cases appropriately default to the zero value (EXECUTE_REMEDIATION).

tests/event_exporter_test.go (1)

85-85: LGTM!

The test correctly expects "EXECUTE_REMEDIATION" as the processing strategy, which aligns with the default behavior for health events.

health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go (2)

29-51: LGTM!

The processingStrategy parameter is correctly threaded through the constructor to the handler struct, following the established pattern for other configuration fields.

225-239: LGTM!

The ProcessingStrategy field is correctly populated from the handler's configuration when creating health events.

health-monitors/syslog-health-monitor/main.go (2)

70-71: LGTM!

The new --processing-strategy flag with sensible default and clear help text follows the existing flag pattern in this file.

173-184: LGTM!

The processingStrategy is correctly converted and passed to NewSyslogMonitor. The overall wiring integrates cleanly with the existing initialization flow.
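
For readers unfamiliar with the flag-to-enum wiring, a minimal sketch of validating a `--processing-strategy` value follows. It is not the code in main.go: the `validStrategies` map stands in for the generated protobuf name-to-value table, `parseStrategy` is a hypothetical helper, and the value 1 for STORE_ONLY is an assumption.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// validStrategies stands in for the generated enum name-to-value table.
var validStrategies = map[string]int32{
	"EXECUTE_REMEDIATION": 0,
	"STORE_ONLY":          1, // assumed value
}

// parseStrategy converts a flag value to its enum number or reports an error.
func parseStrategy(name string) (int32, error) {
	value, ok := validStrategies[name]
	if !ok {
		return 0, fmt.Errorf("invalid processing strategy %q", name)
	}
	return value, nil
}

func main() {
	strategy := flag.String("processing-strategy", "EXECUTE_REMEDIATION",
		"event processing strategy: EXECUTE_REMEDIATION or STORE_ONLY")
	flag.Parse()

	value, err := parseStrategy(*strategy)
	if err != nil {
		// Fail fast so a typo in the DaemonSet args is caught at startup.
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("processing strategy: %s (%d)\n", *strategy, value)
}
```

Rejecting unknown names at startup keeps a misconfigured Helm value from silently falling back to the default.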

health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go (2)

30-42: LGTM! Constructor correctly extended with ProcessingStrategy parameter.

The constructor signature update properly accepts the new processingStrategy parameter and assigns it to the handler struct. This is consistent with the pattern used across other handlers (XID, GPUFallen) in this PR.

109-109: ProcessingStrategy correctly propagated to emitted HealthEvent.

The handler's processingStrategy is properly assigned to the HealthEvent.ProcessingStrategy field, ensuring events carry the configured strategy for downstream processing.

health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go (3)

167-175: LGTM! Test updated to validate ProcessingStrategy propagation.

The test correctly passes pb.ProcessingStrategy_STORE_ONLY to the handler constructor, following the table-driven test pattern and using testify/require for assertions as per coding guidelines.

185-185: Good assertion coverage for ProcessingStrategy field.

This assertion ensures the emitted event carries the correct processing strategy, validating the end-to-end propagation from handler construction to event emission.

193-319: Comprehensive XID tracking tests properly updated with ProcessingStrategy.

All test cases in TestXIDTracking consistently use pb.ProcessingStrategy_EXECUTE_REMEDIATION, maintaining test consistency while validating the new parameter integration.

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (3)

1391-1506: Well-structured table-driven test for ProcessingStrategy behavior.

The test covers key scenarios:

STORE_ONLY events should not create node conditions or K8s events

EXECUTE_REMEDIATION events should create node conditions

Mixed strategies properly filter only EXECUTE_REMEDIATION events

This follows the coding guidelines for table-driven tests with descriptive names.

1550-1562: Reasonable approach to count NVSentinel-specific conditions.

The logic excludes standard Kubernetes node conditions (NodeReady, NodeMemoryPressure, etc.) to isolate NVSentinel-added conditions. The logging at line 1560 aids debugging.
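
The counting approach described here can be sketched as a simple exclusion filter. The snippet works on plain strings rather than corev1 types; the standard condition names are the string values of the Kubernetes constants, while `customConditionTypes` and `ExampleGpuCondition` are hypothetical.

```go
package main

import "fmt"

// standardConditionTypes lists the built-in Kubernetes node condition types.
var standardConditionTypes = map[string]bool{
	"Ready":              true,
	"MemoryPressure":     true,
	"DiskPressure":       true,
	"PIDPressure":        true,
	"NetworkUnavailable": true,
}

// customConditionTypes returns every condition type that is not a built-in one.
func customConditionTypes(all []string) []string {
	var custom []string
	for _, conditionType := range all {
		if !standardConditionTypes[conditionType] {
			custom = append(custom, conditionType)
		}
	}
	return custom
}

func main() {
	// ExampleGpuCondition is a made-up name standing in for a monitor-added condition.
	all := []string{"Ready", "MemoryPressure", "ExampleGpuCondition"}
	fmt.Println(customConditionTypes(all))
}
```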

1508-1589: Good test isolation with per-case setup.

Each test case creates its own localClientSet, ringBuffer, and connector, preventing state leakage between test runs. The defer close(stopCh) ensures proper cleanup.

health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go (3)

18-25: Imports correctly updated for ProcessingStrategy testing.

New imports for os, filepath, and the pb alias are properly added to support the temp file creation and protobuf type usage in tests.

143-196: Good integration test with realistic metadata setup.

This test creates a temporary gpu_metadata.json, exercises the full ProcessLine flow with valid NVSwitch topology, and validates the emitted event including the new ProcessingStrategy field.

225-249: Test cases properly updated with ProcessingStrategy parameter.

The TestProcessLine table-driven tests correctly pass pb.ProcessingStrategy_EXECUTE_REMEDIATION and validate it in assertions, following the established pattern. Uses testify/assert and testify/require per coding guidelines.

health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go (7)

309-317: Test correctly validates ProcessingStrategy initialization.

The test verifies that NewSyslogMonitor properly stores the processingStrategy parameter. The assertion at line 317 confirms the monitor's field is set to the expected value.

329-334: Good coverage testing both constructor variants.

Testing NewSyslogMonitorWithFactory with EXECUTE_REMEDIATION ensures both constructor paths properly handle the new parameter.

391-403: TestJournalProcessingLogic updated with ProcessingStrategy.

The test correctly passes pb.ProcessingStrategy_EXECUTE_REMEDIATION to maintain consistency with the expected handler behavior during journal processing.

494-544: TestJournalStateManagement properly tests state persistence with ProcessingStrategy.

Both monitor instances (sm and sm2) are created with consistent EXECUTE_REMEDIATION strategy, ensuring the state persistence test accurately reflects real-world usage.

574-593: TestBootIDChangeHandling correctly includes ProcessingStrategy.

The boot ID detection test is updated to include the new parameter, maintaining test consistency across the test suite.

624-644: TestRunMultipleChecks validates handler initialization with ProcessingStrategy.

The test confirms that both XID and SXID handlers are properly initialized when multiple checks are configured, with the processing strategy parameter correctly passed through.

665-679: TestGPUFallenOffHandlerInitialization updated for completeness.

The GPU Fallen Off handler initialization test includes the ProcessingStrategy parameter, ensuring all handler types are tested with the new field.

platform-connectors/pkg/connectors/kubernetes/process_node_events.go (4)

325-343: LGTM: Clean filtering implementation for STORE_ONLY events.

The function correctly filters out STORE_ONLY health events with appropriate info-level logging for observability. The implementation aligns with the broader PR changes introducing ProcessingStrategy.
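
A minimal sketch of this kind of filter is shown below, assuming a simplified event type. `HealthEvent`, `filterProcessableEvents`, and the enum constants are stand-ins written for the example, not the connector's actual definitions.

```go
package main

import "log/slog"

// ProcessingStrategy is a hand-written stand-in for the protobuf enum.
type ProcessingStrategy int32

const (
	ExecuteRemediation ProcessingStrategy = iota
	StoreOnly
)

// HealthEvent is a hypothetical, trimmed-down health event.
type HealthEvent struct {
	NodeName           string
	CheckName          string
	ProcessingStrategy ProcessingStrategy
}

// filterProcessableEvents drops STORE_ONLY events and logs each skipped one.
func filterProcessableEvents(events []HealthEvent) []HealthEvent {
	processable := make([]HealthEvent, 0, len(events))
	for _, ev := range events {
		if ev.ProcessingStrategy == StoreOnly {
			slog.Info("skipping STORE_ONLY health event",
				"node", ev.NodeName, "check", ev.CheckName)
			continue
		}
		processable = append(processable, ev)
	}
	return processable
}

func main() {
	events := []HealthEvent{
		{NodeName: "node-a", CheckName: "xid", ProcessingStrategy: ExecuteRemediation},
		{NodeName: "node-b", CheckName: "xid", ProcessingStrategy: StoreOnly},
	}
	kept := filterProcessableEvents(events)
	slog.Info("processable events", "count", len(kept))
}
```

Node condition updates and Kubernetes event creation would then iterate over the returned slice only.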

345-370: LGTM: Good extraction of K8s event creation logic.

The function consolidates event creation into a reusable helper, improving code organization. All required fields are properly populated from the health event.

388-400: LGTM: Correct integration of filtered events with node condition updates.

The timing instrumentation and error handling are properly applied to the filtered processableEvents set.

402-413: LGTM: Node event creation correctly uses filtered events.

The loop properly iterates over processableEvents and uses the new createK8sEvent helper for event creation.

tests/syslog_health_monitor_test.go (6)

25-31: LGTM: Import organization for centralized helpers.

The imports correctly bring in the test helpers package and required Kubernetes types for DaemonSet management.

36-41: LGTM: Context keys for test state management.

The context keys are well-named and follow a consistent pattern for storing test state across setup, assess, and teardown phases.

163-175: LGTM: Teardown correctly uses centralized helper.

The teardown properly retrieves context values and delegates to TearDownSyslogHealthMonitor for cleanup.

200-208: LGTM: Consistent use of helper constant for port.

Using helpers.StubJournalHTTPPort instead of a local constant improves maintainability.

368-385: LGTM: New test for STORE_ONLY strategy.

The test setup correctly passes true for updatedDaemonSet to configure the STORE_ONLY processing strategy.

411-422: LGTM: Teardown correctly restores DaemonSet.

The teardown passes true for updatedDaemonSet to restore the original DaemonSet configuration after the test.

health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (2)

33-58: LGTM: Clean addition of processingStrategy parameter.

The new parameter is properly integrated into the constructor and stored in the handler struct. This follows the same pattern used in other handlers (SXID, GPUFallen) as per the PR objectives.

189-189: LGTM: ProcessingStrategy correctly propagated to HealthEvent.

The strategy is properly assigned to the generated health event, enabling downstream filtering in the platform connector.

tests/helpers/syslog-health-monitor.go (3)

27-31: LGTM: Well-defined constants for test configuration.

The constants provide clear, maintainable values for port and resource names used across syslog health monitor tests.

33-74: LGTM: Comprehensive test setup helper.

The function properly:

Conditionally updates DaemonSet processing strategy

Locates the syslog pod on a worker node

Injects test metadata

Establishes port-forward

Sets node management label

Good use of require for critical setup steps that should fail the test if unsuccessful.

76-116: Consider handling originalDaemonSet nil case when updatedDaemonSet is true.

If SetUpSyslogHealthMonitor returns a nil originalDaemonSet when updatedDaemonSet is true (due to an error during setup that wasn't caught), RestoreDaemonSet could receive an unexpected nil. However, since SetUpSyslogHealthMonitor uses require.NoError, this scenario is unlikely in practice.

The teardown logic is otherwise well-structured with proper error handling and cleanup sequencing.

health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (5)

88-91: LGTM: Updated constructor call with processing strategy.

The test correctly passes pb.ProcessingStrategy_EXECUTE_REMEDIATION to the updated NewXIDHandler signature.

142-142: Good coverage of both processing strategies in test cases.

The test cases appropriately use different strategies (EXECUTE_REMEDIATION on line 142, STORE_ONLY on line 157) to ensure both paths are validated.

Also applies to: 157-157

194-194: LGTM: ProcessingStrategy assertions added to validation.

The new assertions correctly verify that the ProcessingStrategy field on generated HealthEvent objects matches the strategy configured on the handler.

Also applies to: 231-231

306-308: LGTM: Test handler initialized with processing strategy.

The test correctly initializes the handler with EXECUTE_REMEDIATION strategy for the TestCreateHealthEventFromResponse test cases.

348-348: LGTM: Consistent ProcessingStrategy validation in event creation tests.

Both test cases properly validate that the handler's processing strategy is propagated to the created health events.

Also applies to: 378-378

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

14-17: LGTM - Protobuf stubs correctly reflect ProcessingStrategy additions.

The generated .pyi stub file properly exposes the new ProcessingStrategy enum with EXECUTE_REMEDIATION and STORE_ONLY values, and extends HealthEvent with the processingStrategy field. The type hints follow protobuf Python conventions.

Also applies to: 31-32, 78-78, 104-104, 120-120, 138-138

health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go (3)

41-57: LGTM - Constructor signatures correctly extended with processingStrategy.

The processingStrategy parameter is properly threaded through both NewSyslogMonitor and NewSyslogMonitorWithFactory, maintaining the constructor chain. Adding as the last parameter is good practice for API evolution.

90-136: LGTM - ProcessingStrategy correctly propagated to all handlers.

The processingStrategy is properly stored in the SyslogMonitor struct and consistently passed to all three handler types (XID, SXID, GPUFallen). This ensures uniform processing behavior across different error check types.

802-814: LGTM - HealthEvent correctly includes ProcessingStrategy field.

The prepareHealthEventWithAction method properly sets ProcessingStrategy: sm.processingStrategy when constructing health events, ensuring the configured strategy is propagated through emitted events.

tests/helpers/kube.go (3)

2210-2249: LGTM - DaemonSet rollout wait logic is correct.

The function properly checks DesiredNumberScheduled, UpdatedNumberScheduled, and NumberReady status fields to determine rollout completion. The logging provides good visibility into rollout progress.
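
The completion check can be expressed as a small predicate over the three status fields. The struct below mirrors only the field names mentioned above; `rolloutComplete` is a hypothetical helper, and the guard against a zero desired count is an assumption added for the sketch.

```go
package main

import "fmt"

// daemonSetStatus mirrors the three DaemonSet status fields named in the review.
type daemonSetStatus struct {
	DesiredNumberScheduled int32
	UpdatedNumberScheduled int32
	NumberReady            int32
}

// rolloutComplete reports whether every desired pod is updated and ready.
func rolloutComplete(s daemonSetStatus) bool {
	// Assumption: a DaemonSet with no desired pods is treated as not rolled out.
	return s.DesiredNumberScheduled > 0 &&
		s.UpdatedNumberScheduled == s.DesiredNumberScheduled &&
		s.NumberReady == s.DesiredNumberScheduled
}

func main() {
	inProgress := daemonSetStatus{DesiredNumberScheduled: 3, UpdatedNumberScheduled: 2, NumberReady: 2}
	done := daemonSetStatus{DesiredNumberScheduled: 3, UpdatedNumberScheduled: 3, NumberReady: 3}
	fmt.Println("in progress:", rolloutComplete(inProgress))
	fmt.Println("done:", rolloutComplete(done))
}
```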

2252-2268: LGTM - Container argument update logic handles both flag formats.

The function correctly handles both --processing-strategy=VALUE and --processing-strategy VALUE argument formats, and properly appends the flag if it doesn't exist. The early return after modification prevents duplicate additions.
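
Handling both argument formats can be sketched roughly as follows. `setArg` is a hypothetical helper written for illustration and may differ from the helper in tests/helpers/kube.go.

```go
package main

import (
	"fmt"
	"strings"
)

// setArg sets name to value in args, handling "--name=value" and "--name value".
// It modifies args in place when the flag already exists and appends otherwise.
func setArg(args []string, name, value string) []string {
	for i, arg := range args {
		switch {
		case strings.HasPrefix(arg, name+"="):
			// "--name=old" form: rewrite the single element.
			args[i] = name + "=" + value
			return args
		case arg == name && i+1 < len(args):
			// "--name old" form: the value is the next element.
			args[i+1] = value
			return args
		case arg == name:
			// Flag is the last element with no value following it.
			return append(args, value)
		}
	}
	// Flag not present: append it in "--name=value" form.
	return append(args, name+"="+value)
}

func main() {
	const flagName = "--processing-strategy"
	fmt.Println(setArg([]string{flagName + "=EXECUTE_REMEDIATION"}, flagName, "STORE_ONLY"))
	fmt.Println(setArg([]string{flagName, "EXECUTE_REMEDIATION"}, flagName, "STORE_ONLY"))
	fmt.Println(setArg([]string{"--verbose"}, flagName, "STORE_ONLY"))
}
```

Returning as soon as the flag is found is what prevents the same flag from being added twice.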

2312-2344: LGTM - RestoreDaemonSet properly restores original container configuration.

The function correctly handles nil input, uses retry.RetryOnConflict for safe updates, and waits for rollout completion. Replacing the entire containers slice ensures complete restoration of the original state.

platform-connectors/pkg/connectors/kubernetes/process_node_events.go

store-client/pkg/client/mongodb_pipeline_builder.go

store-client/pkg/client/postgresql_pipeline_builder.go

tests/helpers/kube.go

tests/syslog_health_monitor_test.go

github-actions · 2025-12-22T08:05:03Z

Merging this branch will decrease overall coverage

Impacted Packages	Coverage Δ	🤖
github.com/nvidia/nvsentinel/fault-quarantine/pkg/evaluator	41.95% (-0.57%)	👎
github.com/nvidia/nvsentinel/fault-quarantine/pkg/initializer	0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor	0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/gpufallen	0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/sxid	0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/syslog-monitor	0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid	0.00% (ø)
github.com/nvidia/nvsentinel/tests	0.00% (ø)
github.com/nvidia/nvsentinel/tests/helpers	0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File	Coverage Δ	Total	Missed
github.com/nvidia/nvsentinel/fault-quarantine/pkg/initializer/init.go	0.00% (ø)	280	280
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/main.go	0.00% (ø)	0	0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go	0.00% (ø)	0	0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/gpufallen/types.go	0.00% (ø)	0	0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go	0.00% (ø)	0	0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/sxid/types.go	0.00% (ø)	0	0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go	0.00% (ø)	0	0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go	0.00% (ø)	0	0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/types.go	0.00% (ø)	0	0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go	0.00% (ø)	0	0
github.com/nvidia/nvsentinel/tests/helpers/event_exporter.go	0.00% (ø)	0	0
github.com/nvidia/nvsentinel/tests/helpers/healthevent.go	0.00% (ø)	0	0
github.com/nvidia/nvsentinel/tests/helpers/kube.go	0.00% (ø)	0	0
github.com/nvidia/nvsentinel/tests/helpers/syslog-health-monitor.go	0.00% (ø)	0	0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

github.com/nvidia/nvsentinel/fault-quarantine/pkg/evaluator/rule_evaluator_test.go
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
github.com/nvidia/nvsentinel/tests/event_exporter_test.go
github.com/nvidia/nvsentinel/tests/syslog_health_monitor_test.go

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

tests/syslog_health_monitor_test.go (1)

51-52: Remove redundant error check.

Line 52 checks require.NoError(t, err, ...) but SetUpSyslogHealthMonitor returns (string, *v1.Pod, chan struct{}) with no error. The err variable is undefined.
🔎 Proposed fix
 	testNodeName, syslogPod, stopChan := helpers.SetUpSyslogHealthMonitor(ctx, t, client, nil)
-	require.NoError(t, err, "failed to set up syslog health monitor")

tests/helpers/kube.go (1)

2272-2277: Missing error when container is not found.

The loop that searches for the container by name silently does nothing if no match is found. This could mask test configuration errors where the wrong container name is provided.
🔎 Proposed fix to validate container exists
+	found := false
 	containers := daemonSet.Spec.Template.Spec.Containers
 	for i := range containers {
 		if containers[i].Name == containerName {
 			setArgsOnContainer(t, &containers[i], args)
+			found = true
 			break
 		}
 	}
+
+	if !found {
+		return fmt.Errorf("container %q not found in daemonset %s/%s", containerName, NVSentinelNamespace, daemonsetName)
+	}
 
 	return client.Resources().Update(ctx, daemonSet)

Based on learnings, this issue was previously flagged but remains unresolved.

🧹 Nitpick comments (4)

tests/helpers/kube.go (4)

2251-2255: Add godoc comment for exported function.

As per coding guidelines, exported functions require documentation comments. Add a godoc comment explaining the purpose, parameters, and behavior of this function.
🔎 Proposed godoc comment
+// UpdateDaemonSetArgs updates the specified container arguments in a DaemonSet and waits for rollout completion.
+// If containerName is empty, updates all containers. Otherwise, updates only the named container.
+// Uses retry.RetryOnConflict for automatic retry handling.
 func UpdateDaemonSetArgs(ctx context.Context, t *testing.T,
 	client klient.Client, daemonsetName string, containerName string,
 	args map[string]string) error {
As per coding guidelines, function comments are required for all exported Go functions.

2288-2289: Consider removing hard-coded sleep.

The 10-second sleep after rollout completion may not be necessary since waitForDaemonSetRollout already ensures pods are ready. If pods need additional time to stabilize, consider using require.Eventually to poll for the specific condition instead of a fixed delay.
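
One way to replace a fixed delay with polling is sketched below using only the standard library, not testify's require.Eventually. `waitFor` and `pollThreeTimes` are hypothetical names, and the counter-based condition stands in for whatever readiness check the test actually needs.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitFor polls cond every interval until it returns true or ctx is done.
func waitFor(ctx context.Context, interval time.Duration, cond func() bool) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if cond() {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			// Fall through to the next evaluation of cond.
		}
	}
}

// pollThreeTimes demonstrates waitFor with a condition that succeeds on its third call.
func pollThreeTimes() (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	calls := 0
	err := waitFor(ctx, 5*time.Millisecond, func() bool {
		calls++
		return calls >= 3
	})
	return calls, err
}

func main() {
	calls, err := pollThreeTimes()
	fmt.Println("condition evaluations:", calls, "error:", err)
}
```

Polling returns as soon as the condition holds, so the test waits only as long as needed and still fails with a clear timeout error if the condition never becomes true.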

2294-2297: Add godoc comment for exported function.

Add documentation for this exported function explaining its purpose, parameters, and behavior.
🔎 Proposed godoc comment
+// RemoveDaemonSetArgs removes the specified arguments from a DaemonSet container and waits for rollout completion.
+// If containerName is empty, removes from all containers. Otherwise, removes only from the named container.
+// Uses retry.RetryOnConflict for automatic retry handling.
 func RemoveDaemonSetArgs(ctx context.Context, t *testing.T, client klient.Client,
 	daemonsetName string,
 	containerName string, args map[string]string,
As per coding guidelines, function comments are required for all exported Go functions.

2411-2413: Add godoc comment for exported function.

Add documentation for this exported function explaining its purpose, parameters, return values, and the validation it performs.
🔎 Proposed godoc comment
+// GetDaemonSetPodOnWorkerNode returns a ready, running pod from the specified DaemonSet on a real worker node.
+// It waits for a pod matching podNamePattern that is not being deleted and has all containers ready.
+// Returns an error if no suitable pod is found within the timeout period.
 func GetDaemonSetPodOnWorkerNode(ctx context.Context, t *testing.T, client klient.Client,
 	daemonsetName string, podNamePattern string) (*v1.Pod, error) {
As per coding guidelines, function comments are required for all exported Go functions.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a216b57 and 776a34b.

📒 Files selected for processing (3)

tests/helpers/kube.go
tests/helpers/syslog-health-monitor.go
tests/syslog_health_monitor_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

tests/helpers/syslog-health-monitor.go

🧰 Additional context used

📓 Path-based instructions (2)

**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

tests/helpers/kube.go
tests/syslog_health_monitor_test.go

**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

tests/syslog_health_monitor_test.go

🧠 Learnings (7)

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/daemonset*.yaml : Explain DaemonSet variant selection logic in Helm chart documentation

Applied to files:

tests/helpers/kube.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on kata.enabled label

Applied to files:

tests/helpers/kube.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

tests/syslog_health_monitor_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

tests/syslog_health_monitor_test.go

📚 Learning: 2025-12-22T16:16:24.320Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

tests/syslog_health_monitor_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

tests/syslog_health_monitor_test.go

📚 Learning: 2025-11-07T04:16:43.079Z

Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.

Applied to files:

tests/syslog_health_monitor_test.go

🧬 Code graph analysis (1)

tests/syslog_health_monitor_test.go (2)

tests/helpers/syslog-health-monitor.go (3)

SetUpSyslogHealthMonitor (33-71)

StubJournalHTTPPort (27-27)

TearDownSyslogHealthMonitor (74-113)

tests/helpers/kube.go (1)

PortForwardPod (1590-1634)

🔇 Additional comments (8)

tests/syslog_health_monitor_test.go (8)

89-89: LGTM: Consistent use of StubJournalHTTPPort.

The test correctly uses the centralized helpers.StubJournalHTTPPort constant for syslog message injection.

Also applies to: 123-123

160-171: LGTM: Teardown properly centralized.

The teardown correctly extracts context values and delegates to helpers.TearDownSyslogHealthMonitor.

185-216: LGTM: Manual setup appropriate for metadata-less test.

This test intentionally skips metadata injection to verify XID detection without metadata, so the manual setup (without using SetUpSyslogHealthMonitor) is appropriate.

296-306: LGTM: Proper use of setup helper.

The setup correctly delegates to helpers.SetUpSyslogHealthMonitor and stores context values.

346-360: LGTM: Proper use of teardown helper.

The teardown correctly extracts context values and delegates to helpers.TearDownSyslogHealthMonitor.

362-380: LGTM: Well-structured STORE_ONLY strategy test.

The new test properly configures the syslog health monitor with STORE_ONLY strategy via setup helper arguments and stores necessary context values for teardown.

382-404: LGTM: Accurate test description and assertions.

The assess description correctly describes the test intent: verifying that XID errors in STORE_ONLY mode do not create node conditions or cordon nodes. The assertions properly validate this behavior.

406-419: LGTM: Proper strategy restoration in teardown.

The teardown correctly restores EXECUTE_REMEDIATION strategy to avoid impacting subsequent tests.

coderabbitai

Actionable comments posted: 6

🤖 Fix all issues with AI agents

In
@platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go:
- Around line 1606-1612: The test currently indexes nvsentinelConditions[0] when
tc.expectNodeConditions is true which can panic if the slice is empty; update
the assertion to first assert the slice has at least one element (e.g.,
assert.Greater(t, len(nvsentinelConditions), 0) or assert.Len/NotEmpty) and only
then assert the Type value against tc.expectedConditionType, referring to the
existing test variables tc.expectNodeConditions and nvsentinelConditions to
locate where to add the length check before accessing index 0.
- Around line 1620-1626: The test accesses events.Items[0] directly which can
panic if the slice is empty; when tc.expectKubernetesEvents is true first assert
that events.Items is non-empty (e.g. use assert.Greater(t, len(events.Items), 0)
or assert.NotEmpty) and only then compare events.Items[0].Type to
tc.expectedEventType, ensuring the check references tc.expectKubernetesEvents,
events.Items and tc.expectedEventType so the test fails gracefully instead of
panicking.
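
The two index-guard fixes above share one shape: check the length before reading element 0. A minimal sketch, using a hypothetical `condition` type and helper rather than the real test code, might look like this. In an actual test the error branch would typically become `t.Fatalf` or `require.NotEmpty`.

```go
package main

import "fmt"

// condition is a hypothetical stand-in for a node condition or event item.
type condition struct {
	Type string
}

// firstConditionType returns the Type of the first element, or an error
// when the slice is empty, instead of panicking on conditions[0].
func firstConditionType(conditions []condition) (string, error) {
	if len(conditions) == 0 {
		return "", fmt.Errorf("expected at least one condition, got none")
	}
	return conditions[0].Type, nil
}

func main() {
	if _, err := firstConditionType(nil); err != nil {
		fmt.Println("empty slice:", err)
	}
	typ, _ := firstConditionType([]condition{{Type: "SysLogsXIDError"}})
	fmt.Println("first type:", typ)
}
```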

In @tests/data/healthy-event.json:
- Around line 19-20: The processingStrategy for the healthy event is wrong:
update the JSON key "processingStrategy" in the healthy event object (the one
containing "nodeName": "NODE_NAME" and "isHealthy": true / "message": "No health
failures") from 1 to 2 so it uses STORE_ONLY rather than EXECUTE_REMEDIATION;
ensure the numeric value is changed to 2 to reflect observability-only behavior.

In @tests/helpers/kube.go:
- Around line 2334-2338: The loop over containers that calls
removeArgsFromContainer(&containers[i], args) must validate the target container
was found; add a boolean found flag (e.g., found := false) set to true when
containers[i].Name == containerName, break, and after the loop check found and
return an error (or propagate an appropriate error) if false so the caller knows
the container wasn't present—mirror the behavior in UpdateDaemonSetArgs and
reference the containerName, removeArgsFromContainer, and UpdateDaemonSetArgs
symbols when making the change.
- Around line 2296-2301: The loop over containers that compares
containers[i].Name to containerName and calls setArgsOnContainer(t,
&containers[i], args) misses handling when no match is found; update the code
that iterates over containers (the block referencing containers, containerName,
and setArgsOnContainer) to track whether a container was found (e.g., found :=
false), set found = true when you call setArgsOnContainer, and after the loop
either call t.Fatalf (or return an error from the helper) with a clear message
including containerName so the test fails loudly if no container matched.
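
Both container-lookup fixes above come down to reporting a miss instead of silently doing nothing. The sketch below uses an early return in place of an explicit `found` flag, which has the same effect. The `container` type and `setArgsByName` helper are hypothetical and only illustrate the pattern; they are not the helpers in tests/helpers/kube.go.

```go
package main

import "fmt"

// container is a hypothetical, simplified stand-in for a pod container spec.
type container struct {
	Name string
	Args []string
}

// setArgsByName appends args to the named container. It returns an error
// when no container matches, so the caller fails loudly rather than
// continuing with an unmodified spec.
func setArgsByName(containers []container, name string, args []string) error {
	for i := range containers {
		if containers[i].Name == name {
			containers[i].Args = append(containers[i].Args, args...)
			return nil
		}
	}
	return fmt.Errorf("container %q not found", name)
}

func main() {
	cs := []container{{Name: "syslog-health-monitor"}}
	args := []string{"--processing-strategy", "STORE_ONLY"}

	fmt.Println(setArgsByName(cs, "syslog-health-monitor", args), cs[0].Args)
	fmt.Println(setArgsByName(cs, "missing", args))
}
```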

In @tests/syslog_health_monitor_test.go:
- Around line 51-52: Remove the redundant error assertion: the call to
helpers.SetUpSyslogHealthMonitor(ctx, t, client, nil) returns (testNodeName,
syslogPod, stopChan) and not an error, so delete the require.NoError(t, err,
"failed to set up syslog health monitor") line (which references an
uninitialized err); ensure any later references to err in this test are handled
or removed as well.

🧹 Nitpick comments (5)

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)

1502-1524: Consider consolidating or differentiating duplicate test scenarios.

This test case ("STORE_ONLY non fatal event should not create Kubernetes event") appears to duplicate the scenario at lines 1423-1442. Both test STORE_ONLY with IsFatal: false expecting no node conditions and no Kubernetes events. If you intend to test different CheckName values, consider clarifying this in the test name or description; otherwise, consolidate to reduce redundancy.

data-models/protobufs/health_event.proto (1)

78-78: Add field-level documentation for processingStrategy.

While the enum itself is well-documented, the field lacks a comment explaining its purpose and usage context within the HealthEvent message.
📝 Suggested documentation
   BehaviourOverrides quarantineOverrides = 14;
   BehaviourOverrides drainOverrides = 15;
+  // Directs how downstream modules (platform connectors, event stores) should handle this event.
+  // Defaults to UNSPECIFIED if not explicitly set.
   ProcessingStrategy processingStrategy = 16;

Based on coding guidelines: "Include comprehensive comments for all fields in Protocol Buffer messages."

health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)

33-58: ProcessingStrategy correctly threaded through constructor.

The processingStrategy parameter is properly captured and stored in the XIDHandler struct. The initialization follows the same pattern as other configuration fields.

Consider whether processingStrategy should be validated in the constructor to reject ProcessingStrategy_UNSPECIFIED (the proto default value of 0). While validation currently occurs in main.go before constructors are called, adding validation here would provide defensive programming and catch configuration errors at the point of construction, consistent with best practices for enum parameters that must have meaningful values.
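
A constructor-level check of this kind could look roughly like the sketch below. The `ProcessingStrategy` type, its constants, and `newHandler` are local stand-ins written for illustration (mirroring UNSPECIFIED = 0, EXECUTE_REMEDIATION = 1, STORE_ONLY = 2); the real constructor signature in xid_handler.go is not reproduced here.

```go
package main

import (
	"errors"
	"fmt"
)

// ProcessingStrategy is a local stand-in for the generated proto enum.
type ProcessingStrategy int32

const (
	StrategyUnspecified        ProcessingStrategy = 0 // proto zero value
	StrategyExecuteRemediation ProcessingStrategy = 1
	StrategyStoreOnly          ProcessingStrategy = 2
)

var errUnspecifiedStrategy = errors.New("processing strategy must not be UNSPECIFIED")

// handler is a hypothetical, reduced version of a per-check handler.
type handler struct {
	nodeName string
	strategy ProcessingStrategy
}

// newHandler rejects the zero-value strategy at construction time, so a
// misconfiguration surfaces before any event is emitted.
func newHandler(nodeName string, strategy ProcessingStrategy) (*handler, error) {
	if strategy == StrategyUnspecified {
		return nil, fmt.Errorf("creating handler for node %q: %w", nodeName, errUnspecifiedStrategy)
	}
	return &handler{nodeName: nodeName, strategy: strategy}, nil
}

func main() {
	if _, err := newHandler("node-a", StrategyUnspecified); err != nil {
		fmt.Println("rejected:", err)
	}
	h, _ := newHandler("node-a", StrategyStoreOnly)
	fmt.Println("accepted strategy:", h.strategy)
}
```

The trade-off is a changed constructor signature (it now returns an error), which is why keeping validation only in main.go remains a reasonable choice.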

tests/helpers/healthevent.go (1)

31-31: Consider using the enum type for type safety.

The ProcessingStrategy field is declared as int rather than protos.ProcessingStrategy. While this may be intentional for JSON marshaling flexibility, using the enum type would provide stronger type safety and prevent invalid values at compile time.
♻️ Suggested refactor for type safety

If JSON marshaling works correctly with the enum type, consider:
 type HealthEventTemplate struct {
 	Version             int                  `json:"version"`
 	Agent               string               `json:"agent"`
 	ComponentClass      string               `json:"componentClass,omitempty"`
 	CheckName           string               `json:"checkName"`
 	IsFatal             bool                 `json:"isFatal"`
 	IsHealthy           bool                 `json:"isHealthy"`
 	Message             string               `json:"message"`
 	RecommendedAction   int                  `json:"recommendedAction,omitempty"`
 	ErrorCode           []string             `json:"errorCode,omitempty"`
 	EntitiesImpacted    []EntityImpacted     `json:"entitiesImpacted,omitempty"`
 	Metadata            map[string]string    `json:"metadata,omitempty"`
 	QuarantineOverrides *QuarantineOverrides `json:"quarantineOverrides,omitempty"`
 	NodeName            string               `json:"nodeName"`
-	ProcessingStrategy  int                  `json:"processingStrategy,omitempty"`
+	ProcessingStrategy  protos.ProcessingStrategy `json:"processingStrategy,omitempty"`
 }
And update the constructor:
 func NewHealthEvent(nodeName string) *HealthEventTemplate {
 	return &HealthEventTemplate{
 		Version:        1,
 		Agent:          "gpu-health-monitor",
 		ComponentClass: "GPU",
 		CheckName:      "GpuXidError",
 		IsFatal:        true,
 		IsHealthy:      false,
 		NodeName:       nodeName,
 		EntitiesImpacted: []EntityImpacted{
 			{
 				EntityType:  "GPU",
 				EntityValue: "0",
 			},
 		},
-		ProcessingStrategy: int(protos.ProcessingStrategy_EXECUTE_REMEDIATION),
+		ProcessingStrategy: protos.ProcessingStrategy_EXECUTE_REMEDIATION,
 	}
 }
And update the fluent setter signature:
-func (h *HealthEventTemplate) WithProcessingStrategy(strategy int) *HealthEventTemplate {
+func (h *HealthEventTemplate) WithProcessingStrategy(strategy protos.ProcessingStrategy) *HealthEventTemplate {
 	h.ProcessingStrategy = strategy
 	return h
 }
Also applies to: 49-49, 76-76
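
Whether JSON marshaling stays unchanged after switching to the enum type can be checked with a small experiment. The sketch below assumes the enum is a plain integer-based type with no custom JSON methods; `ProcessingStrategy` and `event` here are local stand-ins, so the behaviour of the generated `protos.ProcessingStrategy` type should be confirmed separately.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ProcessingStrategy is a local stand-in for the generated proto enum.
type ProcessingStrategy int32

const (
	StrategyUnspecified        ProcessingStrategy = 0
	StrategyExecuteRemediation ProcessingStrategy = 1
	StrategyStoreOnly          ProcessingStrategy = 2
)

// event is a hypothetical, reduced version of the test template struct.
type event struct {
	NodeName           string             `json:"nodeName"`
	ProcessingStrategy ProcessingStrategy `json:"processingStrategy,omitempty"`
}

// encode marshals the event with the standard library encoder.
func encode(e event) (string, error) {
	b, err := json.Marshal(e)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	withStrategy, _ := encode(event{NodeName: "n1", ProcessingStrategy: StrategyStoreOnly})
	fmt.Println(withStrategy) // {"nodeName":"n1","processingStrategy":2}

	zeroValue, _ := encode(event{NodeName: "n1"})
	fmt.Println(zeroValue) // {"nodeName":"n1"}
}
```

Under that assumption the typed field serializes as the same number the `int` field produced, and `omitempty` still drops the zero value.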

tests/fault_quarantine_test.go (1)

334-346: Teardown only resets SysLogsXIDError - consider resetting GpuPowerWatch as well.

The teardown sends a healthy event for SysLogsXIDError but doesn't explicitly clear the GpuPowerWatch state. While TeardownQuarantineTest may handle general cleanup, explicitly resetting both check types would make the test more self-contained and prevent potential state leakage to subsequent tests.
♻️ Suggested improvement
 feature.Teardown(func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
     event := helpers.NewHealthEvent(testCtx.NodeName).
         WithErrorCode("79").
         WithHealthy(true).
         WithAgent(helpers.SYSLOG_HEALTH_MONITOR_AGENT).
         WithCheckName("SysLogsXIDError")
     helpers.SendHealthEvent(ctx, t, event)

+    // Also reset GpuPowerWatch state
+    powerEvent := helpers.NewHealthEvent(testCtx.NodeName).
+        WithErrorCode("DCGM_FR_CLOCK_THROTTLE_POWER").
+        WithHealthy(true).
+        WithCheckName("GpuPowerWatch")
+    helpers.SendHealthEvent(ctx, t, powerEvent)
+
     return helpers.TeardownQuarantineTest(ctx, t, c)
 })

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 776a34b and 0a1dc9a.

⛔ Files ignored due to path filters (1)

data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go

📒 Files selected for processing (35)

data-models/protobufs/health_event.proto
distros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpl
distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml
fault-quarantine/pkg/evaluator/rule_evaluator_test.go
fault-quarantine/pkg/initializer/init.go
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
health-monitors/syslog-health-monitor/main.go
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
health-monitors/syslog-health-monitor/pkg/gpufallen/types.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
health-monitors/syslog-health-monitor/pkg/sxid/types.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go
health-monitors/syslog-health-monitor/pkg/xid/types.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
store-client/pkg/datastore/providers/postgresql/sql_filter_builder.go
tests/data/fatal-health-event.json
tests/data/healthy-event.json
tests/data/managed-by-nvsentinel-configmap.yaml
tests/data/unsupported-health-event.json
tests/event_exporter_test.go
tests/fault_quarantine_test.go
tests/helpers/event_exporter.go
tests/helpers/healthevent.go
tests/helpers/kube.go
tests/helpers/syslog-health-monitor.go
tests/syslog_health_monitor_test.go

🚧 Files skipped from review as they are similar to previous changes (9)

health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go
distros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpl
fault-quarantine/pkg/evaluator/rule_evaluator_test.go
tests/helpers/syslog-health-monitor.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go
fault-quarantine/pkg/initializer/init.go
distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml
tests/event_exporter_test.go

🧰 Additional context used

📓 Path-based instructions (4)

data-models/protobufs/**/*.proto

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

data-models/protobufs/**/*.proto: Define Protocol Buffer messages in data-models/protobufs/ directory
Use semantic versioning for breaking changes in Protocol Buffer messages
Include comprehensive comments for all fields in Protocol Buffer messages

Files:

data-models/protobufs/health_event.proto

**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

health-monitors/syslog-health-monitor/main.go
tests/helpers/healthevent.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
tests/fault_quarantine_test.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
health-monitors/syslog-health-monitor/pkg/sxid/types.go
tests/helpers/event_exporter.go
health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
tests/helpers/kube.go
store-client/pkg/datastore/providers/postgresql/sql_filter_builder.go
health-monitors/syslog-health-monitor/pkg/gpufallen/types.go
tests/syslog_health_monitor_test.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go
health-monitors/syslog-health-monitor/pkg/xid/types.go

**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
tests/fault_quarantine_test.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
tests/syslog_health_monitor_test.go

**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code

Files:

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py

🧠 Learnings (14)

📚 Learning: 2025-11-07T04:16:43.079Z

Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.

Applied to files:

health-monitors/syslog-health-monitor/main.go
health-monitors/syslog-health-monitor/pkg/sxid/types.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
tests/syslog_health_monitor_test.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go
health-monitors/syslog-health-monitor/pkg/xid/types.go

📚 Learning: 2025-12-22T16:16:31.660Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.

Applied to files:

tests/helpers/healthevent.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

tests/helpers/healthevent.go
tests/fault_quarantine_test.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
tests/syslog_health_monitor_test.go

📚 Learning: 2025-12-22T16:16:24.320Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
tests/fault_quarantine_test.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
tests/syslog_health_monitor_test.go

📚 Learning: 2025-11-06T16:18:09.952Z

Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

tests/fault_quarantine_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

tests/fault_quarantine_test.go
tests/syslog_health_monitor_test.go

📚 Learning: 2025-12-23T10:34:13.121Z

Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
tests/helpers/event_exporter.go
health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go

📚 Learning: 2025-10-29T15:37:49.210Z

Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.

Applied to files:

tests/data/managed-by-nvsentinel-configmap.yaml

📚 Learning: 2025-12-12T07:41:27.339Z

Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2190-2251
Timestamp: 2025-12-12T07:41:27.339Z
Learning: In tests/data/health-events-analyzer-config.yaml, the XID74Reg2Bit13Set rule intentionally omits the time window filter; tests should verify only the register bit pattern (bit 13 in REG2) on the incoming XID 74 event and should not rely on historical events or counts of repeats. If adding similar rules elsewhere, apply the same pattern and document that the time window filter is unnecessary for single-event bit checks.

Applied to files:

tests/data/managed-by-nvsentinel-configmap.yaml

📚 Learning: 2025-12-12T07:36:26.109Z

Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:26.109Z
Learning: In NVSentinel health-events analyzer config files (e.g., tests/data/health-events-analyzer-config.yaml and similar), XID 74 errors are guaranteed to include a GPU_UUID entry within healthevent.entitiesimpacted. Therefore, when writing XID 74-specific rules that filter for GPU_UUID entities, you can skip null checks (no $ifNull) for GPU_UUID. Ensure tests relying on this assumption are updated accordingly and document the guarantee in the rule descriptions.

Applied to files:

tests/data/managed-by-nvsentinel-configmap.yaml

📚 Learning: 2025-12-12T07:38:37.023Z

Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2025-2187
Timestamp: 2025-12-12T07:38:37.023Z
Learning: In NVSentinel, XID 74 errors always include an NVLINK entry in healthevent.entitiesimpacted, so null-checking with $ifNull is unnecessary when filtering for NVLINK entities in XID 74-specific rules. Apply this rule to YAML test fixtures under tests/ data (e.g., tests/data/health-events-analyzer-config.yaml) and any similar health-event configuration tests. If applying in code, ensure downstream filters rely on the presence of NVLINK in entitiesimpacted for XID 74 only, but continue to guard other fields and XIDs with appropriate null checks.

Applied to files:

tests/data/managed-by-nvsentinel-configmap.yaml

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Each health monitor service should implement the gRPC service from `data-models/protobufs/`

Applied to files:

health-monitors/syslog-health-monitor/pkg/sxid/types.go

📚 Learning: 2025-12-23T05:02:22.108Z

Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: store-client/pkg/client/postgresql_pipeline_builder.go:119-132
Timestamp: 2025-12-23T05:02:22.108Z
Learning: In the NVSentinel codebase, protobuf fields stored in MongoDB should use lowercase field names (e.g., processingstrategy, componentclass, checkname). Ensure pipeline filters and queries that access protobuf fields in the database consistently use lowercase field names in the store-client package, avoiding camelCase mappings for MongoDB reads/writes.

Applied to files:

store-client/pkg/datastore/providers/postgresql/sql_filter_builder.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

tests/syslog_health_monitor_test.go

🧬 Code graph analysis (10)

health-monitors/syslog-health-monitor/main.go (1)

data-models/pkg/protos/health_event.pb.go (1)

ProcessingStrategy_value (58-62)

tests/helpers/healthevent.go (2)

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

ProcessingStrategy (14-18)

data-models/pkg/protos/health_event.pb.go (5)

ProcessingStrategy (43-43)

ProcessingStrategy (75-77)

ProcessingStrategy (79-81)

ProcessingStrategy (88-90)

ProcessingStrategy_EXECUTE_REMEDIATION (47-47)

health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go (1)

data-models/pkg/protos/health_event.pb.go (2)

ProcessingStrategy_STORE_ONLY (48-48)

ProcessingStrategy_EXECUTE_REMEDIATION (47-47)

tests/fault_quarantine_test.go (5)

tests/helpers/fault_quarantine.go (3)

QuarantineTestContext (51-54)

SetupQuarantineTest (107-112)

TeardownQuarantineTest (181-242)

tests/helpers/kube.go (3)

SetNodeManagedByNVSentinel (1389-1408)

WaitForNodeConditionWithCheckName (1662-1706)

WaitForNodeEvent (349-385)

tests/helpers/healthevent.go (1)

NewHealthEvent (61-78)

tests/helpers/health_events_analyzer.go (1)

SYSLOG_HEALTH_MONITOR_AGENT (38-38)

data-models/pkg/protos/health_event.pb.go (2)

ProcessingStrategy_STORE_ONLY (48-48)

ProcessingStrategy_EXECUTE_REMEDIATION (47-47)

health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (2)

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

ProcessingStrategy (14-18)

data-models/pkg/protos/health_event.pb.go (5)

ProcessingStrategy (43-43)

ProcessingStrategy (75-77)

ProcessingStrategy (79-81)

ProcessingStrategy (88-90)

ProcessingStrategy_EXECUTE_REMEDIATION (47-47)

health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go (2)

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

ProcessingStrategy (14-18)

data-models/pkg/protos/health_event.pb.go (5)

ProcessingStrategy (43-43)

ProcessingStrategy (75-77)

ProcessingStrategy (79-81)

ProcessingStrategy (88-90)

ProcessingStrategy_EXECUTE_REMEDIATION (47-47)

health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (3)

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

ProcessingStrategy (14-18)

data-models/pkg/protos/health_event.pb.go (4)

ProcessingStrategy (43-43)

ProcessingStrategy (75-77)

ProcessingStrategy (79-81)

ProcessingStrategy (88-90)

health-monitors/syslog-health-monitor/pkg/xid/types.go (1)

XIDHandler (29-39)

tests/syslog_health_monitor_test.go (2)

tests/helpers/syslog-health-monitor.go (3)

SetUpSyslogHealthMonitor (33-71)

StubJournalHTTPPort (27-27)

TearDownSyslogHealthMonitor (74-113)

tests/helpers/kube.go (3)

InjectSyslogMessages (1820-1843)

PortForwardPod (1614-1658)

EnsureNodeConditionNotPresent (1797-1818)

health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go (3)

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

ProcessingStrategy (14-18)

data-models/pkg/protos/health_event.pb.go (4)

ProcessingStrategy (43-43)

ProcessingStrategy (75-77)

ProcessingStrategy (79-81)

ProcessingStrategy (88-90)

health-monitors/syslog-health-monitor/pkg/sxid/types.go (1)

SXIDHandler (29-36)

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

data-models/pkg/protos/health_event.pb.go (4)

ProcessingStrategy (43-43)

ProcessingStrategy (75-77)

ProcessingStrategy (79-81)

ProcessingStrategy (88-90)

🔇 Additional comments (57)

tests/data/fatal-health-event.json (1)

20-20: LGTM! Processing strategy addition is appropriate for fatal events.

The addition of processingStrategy: 1 (EXECUTE_REMEDIATION) aligns well with the fatal nature of this test event, as fatal health events would typically require remediation actions.

store-client/pkg/datastore/providers/postgresql/sql_filter_builder.go (1)

404-404: LGTM! Field mapping follows established conventions.

The new processingstrategy → processingStrategy mapping correctly handles the case conversion between MongoDB's lowercase bson field names and PostgreSQL's camelCase JSON field names. This is consistent with the existing pattern and with the project's field naming conventions.

Based on learnings, this ensures consistency with how protobuf fields are stored and queried across different data stores.
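
As a rough illustration of this kind of mapping, a lookup table from lowercase names to camelCase names could be sketched as below. The `jsonFieldNames` map and `toJSONFieldName` helper are hypothetical and do not reflect the actual structure of sql_filter_builder.go; the field names are the ones mentioned in this review.

```go
package main

import "fmt"

// jsonFieldNames maps lowercase (MongoDB-style) field names to their
// camelCase JSON equivalents. The entries are illustrative only.
var jsonFieldNames = map[string]string{
	"processingstrategy": "processingStrategy",
	"componentclass":     "componentClass",
	"checkname":          "checkName",
}

// toJSONFieldName returns the camelCase name for a known lowercase field
// and passes unknown names through unchanged.
func toJSONFieldName(lower string) string {
	if mapped, ok := jsonFieldNames[lower]; ok {
		return mapped
	}
	return lower
}

func main() {
	fmt.Println(toJSONFieldName("processingstrategy")) // processingStrategy
	fmt.Println(toJSONFieldName("message"))            // message
}
```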

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)

1550-1559: LGTM on test setup and isolation.

Good use of local context, clientset, and ring buffer per test case to ensure test isolation. The deferred close(stopCh) ensures proper cleanup.

tests/data/managed-by-nvsentinel-configmap.yaml (1)

43-59: LGTM! The new rule-set structure is correct and properly mirrors the existing GPU health monitor rule.

Using the same taint and cordon settings for both gpu-health-monitor and syslog-health-monitor XID errors ensures consistent node quarantine behavior regardless of the detection source. The syslog-health-monitor component emits HealthEvents with agent == 'syslog-health-monitor' and checkName == 'SysLogsXIDError' as expected (extensively verified in the test suite).

health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go (3)

243-246: LGTM: ProcessingStrategy correctly set for maintenance events.

The test expectations properly include ProcessingStrategy: pb.ProcessingStrategy_EXECUTE_REMEDIATION for CSP maintenance events, which aligns with the intended behavior where these events should trigger actual remediation workflows.

272-274: LGTM: Healthy event test updated correctly.

The ProcessingStrategy field is appropriately set to EXECUTE_REMEDIATION for healthy/recovery events, maintaining consistency with the quarantine event behavior.

336-338: LGTM: Unknown action case handles ProcessingStrategy consistently.

Even for events with unknown recommended actions (which default to NONE), the ProcessingStrategy is correctly set to EXECUTE_REMEDIATION, ensuring the event still participates in the remediation workflow.

data-models/protobufs/health_event.proto (1)

32-39: LGTM: Well-documented ProcessingStrategy enum.

The enum definition clearly documents the behavioral distinction between EXECUTE_REMEDIATION (modifies cluster state) and STORE_ONLY (observability-only). The three-value design (including UNSPECIFIED) follows protobuf best practices.

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

14-18: LGTM: Generated Python type stubs correctly reflect proto changes.

The auto-generated Python type stub properly includes the new ProcessingStrategy enum and integrates it into the HealthEvent message with appropriate type annotations. The structure is consistent with the protobuf definition.

Also applies to: 32-34, 80-80, 106-106, 122-122, 140-140

health-monitors/syslog-health-monitor/pkg/xid/types.go (1)

20-20: LGTM: ProcessingStrategy field added cleanly to XIDHandler.

The import and struct field addition follow Go conventions. The unexported field appropriately stores the processing strategy configuration for the handler, consistent with other configuration fields like nodeName and checkName.

Also applies to: 34-34

health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)

189-189: LGTM: ProcessingStrategy propagated to HealthEvent.

The handler's processingStrategy is correctly assigned to the generated HealthEvent, completing the data flow from configuration to emitted events.

tests/helpers/healthevent.go (1)

155-158: LGTM: Fluent setter follows existing pattern.

The WithProcessingStrategy method correctly follows the established fluent builder pattern used by other methods in this struct.

health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (1)

362-364: LGTM: TODO properly references tracking PR.

The hardcoded ProcessingStrategy_EXECUTE_REMEDIATION value is acceptable as an interim implementation, and the TODO comment appropriately references PR #641 for making it configurable via the config file.

tests/data/unsupported-health-event.json (1)

20-20: LGTM: Test data includes ProcessingStrategy field.

The processingStrategy value of 1 corresponds to EXECUTE_REMEDIATION and aligns with the new ProcessingStrategy enum support.

tests/helpers/event_exporter.go (1)

221-226: LGTM: Validation logic correctly extended.

The ValidateCloudEvent function signature and implementation are correctly updated to validate the processingStrategy field in CloudEvent payloads, following the existing validation pattern.

Also applies to: 245-245

health-monitors/syslog-health-monitor/pkg/sxid/types.go (1)

20-20: LGTM: Proper use of enum type for ProcessingStrategy.

The processingStrategy field correctly uses the pb.ProcessingStrategy enum type rather than int, providing type safety for production code. The import alias pb follows Go conventions.

Also applies to: 34-34

health-monitors/syslog-health-monitor/main.go (3)

70-72: LGTM!

The new --processing-strategy flag is well-defined with a sensible default (EXECUTE_REMEDIATION) and clear documentation of allowed values.

164-171: LGTM!

The validation logic correctly uses the protobuf-generated ProcessingStrategy_value map to validate the flag input, provides a clear error message for invalid values, and logs the configured strategy for observability.
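
The map-based validation described here can be sketched roughly as follows. This is a minimal illustration, not the code from main.go: the local map is a hypothetical stand-in for the generated ProcessingStrategy_value map, the helper name parseProcessingStrategy is invented, and the enum names and numbers other than EXECUTE_REMEDIATION = 1 are assumptions.

```go
package main

import (
	"fmt"
	"os"
)

// processingStrategyValue is a hypothetical stand-in for the generated
// pb.ProcessingStrategy_value map (enum name -> enum number).
// Only EXECUTE_REMEDIATION = 1 is taken from the review; the rest is assumed.
var processingStrategyValue = map[string]int32{
	"UNSPECIFIED":         0,
	"EXECUTE_REMEDIATION": 1,
	"STORE_ONLY":          2,
}

// parseProcessingStrategy looks the flag value up in the map and returns a
// descriptive error for any name that is not a known enum value.
func parseProcessingStrategy(name string) (int32, error) {
	value, ok := processingStrategyValue[name]
	if !ok {
		return 0, fmt.Errorf("invalid processing strategy %q", name)
	}
	return value, nil
}

func main() {
	strategy, err := parseProcessingStrategy("STORE_ONLY")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("processing strategy:", strategy)
}
```

Validating against the generated map keeps the accepted flag values in sync with the proto definition, so a new enum value needs no change to the flag parsing.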

173-184: LGTM!

The processingStrategy is correctly passed to NewSyslogMonitor, threading the configuration through to the monitor instance.

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)

1-51: Generated protobuf code - no manual review needed.

This file is auto-generated by the protocol buffer compiler. The changes correctly reflect the addition of the ProcessingStrategy enum and the new processingStrategy field in HealthEvent. Ensure this file is regenerated from the source .proto file rather than manually edited.

tests/fault_quarantine_test.go (4)

26-32: LGTM!

Import for the protos package is correctly added to access ProcessingStrategy enum values.

234-251: LGTM!

Test setup correctly initializes the quarantine test context and enables NVSentinel management on the node before running assertions.

253-289: LGTM!

Good coverage for STORE_ONLY behavior - the test correctly verifies that events with STORE_ONLY processing strategy do not trigger quarantine actions (no node condition, no cordoning, no annotation).

291-332: LGTM!

Good coverage for EXECUTE_REMEDIATION behavior - the test correctly verifies that events with EXECUTE_REMEDIATION processing strategy trigger the expected quarantine actions (node condition present, cordoned with annotation).

health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go (2)

30-42: LGTM!

The constructor correctly accepts the new processingStrategy parameter and initializes the handler field. The signature change aligns with the pattern used in other handlers (XID, GPUFallen).

95-110: LGTM!

The ProcessingStrategy is correctly propagated to the HealthEvent payload, ensuring the strategy configured at initialization flows through to emitted events.

health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go (7)

309-317: LGTM!

Test correctly verifies that NewSyslogMonitor accepts and stores the processingStrategy parameter. The assertion on line 317 confirms the field is properly set.

329-334: LGTM!

Test correctly verifies NewSyslogMonitorWithFactory with a different strategy (EXECUTE_REMEDIATION), providing coverage for both factory variants.

391-403: LGTM!

TestJournalProcessingLogic updated to include the processingStrategy parameter.

494-507: LGTM!

TestJournalStateManagement updated to include the processingStrategy parameter in both monitor instances.

574-586: LGTM!

TestBootIDChangeHandling updated to include the processingStrategy parameter.

624-637: LGTM!

TestRunMultipleChecks updated to include the processingStrategy parameter.

665-679: LGTM!

TestGPUFallenOffHandlerInitialization updated to include the processingStrategy parameter.

health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (8)

88-91: LGTM!

TestDetermineFatality correctly updated to include the processingStrategy parameter.

141-148: LGTM!

Table-driven test case for NVRM GPU Map Line correctly includes processingStrategy.

156-195: Good test coverage for STORE_ONLY processing strategy.

The "Valid XID Message" test case correctly uses STORE_ONLY and validates that the ProcessingStrategy is propagated to the event. This provides coverage for the non-default strategy path.

200-232: LGTM!

"Valid XID with GPU UUID" test case correctly validates EXECUTE_REMEDIATION propagation.

237-280: LGTM!

Error path test cases correctly updated with processingStrategy parameter.

306-349: LGTM!

TestCreateHealthEventFromResponse correctly validates that ProcessingStrategy is included in the generated health event.

351-379: LGTM!

"XID Event with GPU UUID" test case validates ProcessingStrategy propagation with entity enrichment.

427-446: LGTM!

TestNewXIDHandler table-driven tests correctly updated with processingStrategy parameter.

health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go (4)

17-25: LGTM!

New imports for os, filepath, and protobuf package are correctly added to support the new test functionality.

44-56: LGTM!

TestNewSXIDHandler correctly updated to include the processingStrategy parameter.

143-196: Excellent test coverage for SXID handler with valid topology.

This test validates the complete flow: metadata loading, SXID parsing, GPU resolution via NVSwitch link topology, and ProcessingStrategy propagation. The temporary metadata file approach is appropriate for isolated testing.

One minor observation: the test hardcodes entity indices (3, 4) for GPU assertions. If the entity order changes in the handler, these assertions would break. Consider using a helper to find entities by type, or document why these indices are expected.
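
One way to avoid index-based assertions is a small lookup helper, sketched below under stated assumptions: the Entity struct is a hypothetical stand-in for the generated pb.Entity message, and findEntity and the sample entity values are illustrative names, not part of the PR.

```go
package main

import "fmt"

// Entity is a hypothetical stand-in for the generated pb.Entity message.
type Entity struct {
	EntityType  string
	EntityValue string
}

// findEntity returns the first entity with the given type, so assertions
// do not depend on the order in which the handler appends entities.
func findEntity(entities []Entity, entityType string) (Entity, bool) {
	for _, e := range entities {
		if e.EntityType == entityType {
			return e, true
		}
	}
	return Entity{}, false
}

func main() {
	// Illustrative data only; the real entity types and values may differ.
	entities := []Entity{
		{EntityType: "NVSWITCH", EntityValue: "0"},
		{EntityType: "NVLINK", EntityValue: "28"},
		{EntityType: "GPU", EntityValue: "1"},
	}

	if gpu, ok := findEntity(entities, "GPU"); ok {
		fmt.Println("GPU entity value:", gpu.EntityValue)
	}
}
```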

198-249: LGTM!

TestProcessLine table-driven test correctly updated to include processingStrategy parameter and validate its propagation in the event.

tests/syslog_health_monitor_test.go (1)

362-422: LGTM! Well-structured STORE_ONLY strategy test.

The new test correctly:

Configures the syslog monitor with STORE_ONLY strategy during setup

Verifies that XID errors do not trigger node conditions or cordoning

Restores EXECUTE_REMEDIATION strategy during teardown to avoid affecting subsequent tests

health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go (3)

41-58: LGTM! Processing strategy parameter correctly threaded.

The processingStrategy parameter is properly added to NewSyslogMonitor and forwarded to NewSyslogMonitorWithFactory.

63-151: LGTM! Processing strategy initialization is correct.

The processingStrategy parameter is:

Added to the function signature

Stored in the SyslogMonitor struct (line 96)

Passed to all handler constructors (lines 110, 120, 130)

This correctly threads the strategy through the monitor initialization.

792-820: LGTM! Processing strategy correctly set on health events.

Line 813 correctly assigns sm.processingStrategy to event.ProcessingStrategy, ensuring the strategy is propagated to all health events generated by the monitor.

tests/helpers/kube.go (6)

387-409: LGTM!

The function correctly uses require.Never to assert that a specified event never appears on the node during the timeout period. The logic and error handling are appropriate for a negative assertion in tests.

2232-2273: LGTM!

The rollout wait logic correctly verifies that all DaemonSet pods are scheduled, updated, and ready before proceeding. The progress logging is helpful for test debugging.

2353-2383: LGTM!

The function correctly handles both --flag=value and --flag value argument styles, including proper bounds checking when inserting values.
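
The two argument styles and the bounds check can be illustrated with a self-contained sketch on a plain string slice. The function name setArg, the append style for missing flags, and the sample flags are assumptions for illustration; the helper in tests/helpers/kube.go operates on a container spec and may differ in detail. The sketch assumes a non-empty value.

```go
package main

import (
	"fmt"
	"strings"
)

// setArg sets flag to value in args and returns the updated slice.
// An existing --flag=value or --flag value pair is rewritten in place,
// a flag with no value after it gets one inserted, and a missing flag is
// appended as --flag=value (the append style is an assumption).
func setArg(args []string, flag, value string) []string {
	for i, arg := range args {
		if strings.HasPrefix(arg, flag+"=") {
			args[i] = flag + "=" + value
			return args
		}
		if arg != flag {
			continue
		}
		// Bounds check before looking at the element after the flag.
		if i+1 < len(args) && !strings.HasPrefix(args[i+1], "-") {
			args[i+1] = value
			return args
		}
		// Grow by one, shift the tail right, and place the value after the flag.
		args = append(args, "")
		copy(args[i+2:], args[i+1:])
		args[i+1] = value
		return args
	}
	return append(args, flag+"="+value)
}

func main() {
	// --other-flag is a hypothetical placeholder argument.
	args := []string{"--processing-strategy", "EXECUTE_REMEDIATION", "--other-flag"}
	fmt.Println(setArg(args, "--processing-strategy", "STORE_ONLY"))
}
```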

2385-2407: LGTM!

The function correctly updates existing arguments in place or appends new ones, handling both value and boolean flag styles appropriately.

2409-2433: LGTM!

The function correctly removes arguments in both --flag=value and --flag value styles, properly handling the deletion of both the flag and its associated value when applicable.
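
Removal in both styles can be sketched the same way on a plain string slice. The name removeArg and the sample flags are hypothetical, and the sketch assumes that a following element without a leading dash is the flag's value.

```go
package main

import (
	"fmt"
	"strings"
)

// removeArg deletes the first occurrence of flag from args, in either the
// --flag=value or the --flag value style, and returns the shortened slice.
func removeArg(args []string, flag string) []string {
	for i, arg := range args {
		if strings.HasPrefix(arg, flag+"=") {
			return append(args[:i], args[i+1:]...)
		}
		if arg == flag {
			end := i + 1
			// Also drop the value that follows the flag, if there is one.
			if end < len(args) && !strings.HasPrefix(args[end], "-") {
				end++
			}
			return append(args[:i], args[end:]...)
		}
	}
	return args
}

func main() {
	// --other-flag is a hypothetical placeholder argument.
	args := []string{"--processing-strategy", "STORE_ONLY", "--other-flag=1"}
	fmt.Println(removeArg(args, "--processing-strategy"))
}
```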

2435-2479: LGTM!

The function correctly waits for a DaemonSet pod to be fully ready on a worker node, with proper validation that the pod is not being deleted, is running, and has all readiness checks passing.

health-monitors/syslog-health-monitor/pkg/gpufallen/types.go (2)

52-52: LGTM!

The processingStrategy field is well-integrated into the struct. The field naming follows Go conventions, and its placement before the mutex fields is appropriate for configuration data that's set during initialization.

22-23: LGTM! The import alias pb for the protos package follows Go conventions and is correctly structured.

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go

tests/data/healthy-event.json

tests/helpers/kube.go

tests/syslog_health_monitor_test.go

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In
@platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go:
- Line 1509: Two test cases in k8s_platform_connector_test.go set ErrorCode to a
slice containing an empty string (ErrorCode: []string{""}) which should be an
empty slice; update both instances to use an empty slice literal (ErrorCode:
[]string{}) so they match other tests and semantic intent. Locate the ErrorCode
field in the failing test case structs (the occurrences near the existing DCGM
error-code tests) and replace []string{""} with []string{} in both places.

🧹 Nitpick comments (6)

tests/helpers/kube.go (3)
387-409: Consider failing fast on persistent API errors.

If GetNodeEvents fails repeatedly due to API issues, require.Never will continue checking until timeout, masking the real error. Consider tracking consecutive failures and failing the check once a threshold is reached. Because require.Never evaluates the condition in its own goroutine, where t.Fatalf must not be called, record the failure with t.Errorf and return true to end the check early.
♻️ Proposed enhancement
 func EnsureNodeEventNotPresent(ctx context.Context, t *testing.T,
 	c klient.Client, nodeName string, eventType, eventReason string) {
 	t.Helper()
 
+	consecutiveErrors := 0
+	maxConsecutiveErrors := 3
+
 	require.Never(t, func() bool {
 		events, err := GetNodeEvents(ctx, c, nodeName, eventType)
 		if err != nil {
 			t.Logf("failed to get events for node %s: %v", nodeName, err)
+			consecutiveErrors++
+			if consecutiveErrors >= maxConsecutiveErrors {
+				t.Errorf("failed to get events %d times in a row, likely API issue", consecutiveErrors)
+				return true
+			}
 			return false
 		}
+		consecutiveErrors = 0
 
 		for _, event := range events.Items {
 			if event.Type == eventType && event.Reason == eventReason {
 				t.Logf("node %s has event %v", nodeName, event)
 				return true
 			}
 		}
 
 		t.Logf("node %s does not have event %v", nodeName, eventType)
 
 		return false
 	}, NeverWaitTimeout, WaitInterval, "node %s should not have event %v", nodeName, eventType)
 }

2317-2318: Clarify or remove the hard-coded sleep.

The 10-second sleep after rollout completion seems arbitrary, since waitForDaemonSetRollout already verifies all pods are updated and ready. If this addresses a specific race condition (e.g., waiting for containers to initialize), document it. Otherwise, consider removing it.

2362-2392: Consider simplifying the nested append.

Line 2384 uses nested append calls which create two allocations. For test helper code this is acceptable, but it could be slightly more efficient and clearer.
♻️ Alternative approach
 	// Match --flag or --flag value style
 	if existingArg == flag {
 		if value != "" {
 			if j+1 < len(container.Args) && !strings.HasPrefix(container.Args[j+1], "-") {
 				container.Args[j+1] = value
 			} else {
-				container.Args = append(container.Args[:j+1], append([]string{value}, container.Args[j+1:]...)...)
+				// Insert value after flag
+				container.Args = append(container.Args, "")
+				copy(container.Args[j+2:], container.Args[j+1:])
+				container.Args[j+1] = value
 			}
 		}
 
 		return true
 	}

tests/helpers/syslog-health-monitor.go (3)
15-15: Add package-level godoc comment.

Per Go conventions and coding guidelines, packages should have a package-level documentation comment explaining their purpose.
📝 Suggested documentation
+// Package helpers provides test utilities for setting up and tearing down
+// test environments for NVSentinel health monitors.
 package helpers
Based on coding guidelines, package-level godoc is required for all Go packages.

32-34: Follow Go godoc conventions for exported function comments.

Function comments for exported functions should start with the function name according to Go documentation conventions.
📝 Proposed fix
-// helper function to set up syslog health monitor and port forward to it
+// SetUpSyslogHealthMonitor sets up the syslog health monitor test environment by
+// optionally updating DaemonSet args, injecting metadata, establishing port-forwarding,
+// and configuring node labels. Returns the test node name, syslog pod, and stop channel.
 func SetUpSyslogHealthMonitor(ctx context.Context, t *testing.T,
 	client klient.Client, args map[string]string) (string, *v1.Pod, chan struct{}) {
Based on coding guidelines, function comments are required for all exported Go functions and should follow standard Go conventions.

73-76: Follow Go godoc conventions for exported function comments.

Function comments for exported functions should start with the function name and clearly describe what the function does.
📝 Proposed fix
-// helper function to roll back syslog health monitor daemonset and stop the port forward
+// TearDownSyslogHealthMonitor tears down the syslog health monitor test environment by
+// stopping port-forwarding, optionally restoring DaemonSet args, restarting the pod,
+// cleaning up metadata, and removing node labels.
 func TearDownSyslogHealthMonitor(ctx context.Context, t *testing.T, client klient.Client,
 	nodeName string, stopChan chan struct{},
 	args map[string]string, podName string) {
Based on coding guidelines, function comments are required for all exported Go functions and should follow standard Go conventions.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0a1dc9a and 1df5558.

📒 Files selected for processing (4)

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
tests/helpers/kube.go
tests/helpers/syslog-health-monitor.go
tests/syslog_health_monitor_test.go

🧰 Additional context used

📓 Path-based instructions (2)

**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
tests/helpers/syslog-health-monitor.go
tests/syslog_health_monitor_test.go
tests/helpers/kube.go

**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
tests/syslog_health_monitor_test.go

🧠 Learnings (8)

📓 Common learnings

Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:40.921Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

📚 Learning: 2025-12-23T10:34:13.121Z

Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go

📚 Learning: 2025-12-22T16:16:24.320Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
tests/syslog_health_monitor_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

tests/syslog_health_monitor_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

tests/syslog_health_monitor_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

tests/syslog_health_monitor_test.go

📚 Learning: 2025-11-07T04:16:43.079Z

Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.

Applied to files:

tests/syslog_health_monitor_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on kata.enabled label

Applied to files:

tests/helpers/kube.go

🧬 Code graph analysis (2)

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (4)

data-models/pkg/protos/health_event.pb.go (17)

HealthEvent (263-283)

HealthEvent (296-296)

HealthEvent (311-313)

Entity (211-217)

Entity (230-230)

Entity (245-247)

RecommendedAction (92-92)

RecommendedAction (142-144)

RecommendedAction (146-148)

RecommendedAction (155-157)

ProcessingStrategy (43-43)

ProcessingStrategy (75-77)

ProcessingStrategy (79-81)

ProcessingStrategy (88-90)

HealthEvents (159-165)

HealthEvents (178-178)

HealthEvents (193-195)

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (5)

HealthEvent (63-141)

Entity (55-61)

RecommendedAction (20-30)

ProcessingStrategy (14-18)

HealthEvents (45-53)

platform-connectors/pkg/connectors/kubernetes/k8s_connector.go (1)

NewK8sConnector (47-58)

platform-connectors/pkg/pipeline/factory.go (1)

Create (32-39)

tests/syslog_health_monitor_test.go (2)

tests/helpers/syslog-health-monitor.go (2)

SetUpSyslogHealthMonitor (33-71)

StubJournalHTTPPort (27-27)

tests/helpers/kube.go (2)

InjectSyslogMessages (1820-1843)

PortForwardPod (1614-1658)

🔇 Additional comments (9)

tests/helpers/kube.go (5)

2232-2273: LGTM!

The rollout verification logic correctly checks that all DaemonSet pods are scheduled, updated, and ready before proceeding.

2323-2360: LGTM!

Container validation is properly implemented, and the function correctly handles the retry-on-conflict pattern.

2394-2416: LGTM!

The function correctly updates existing arguments or appends new ones.

2418-2442: LGTM!

The function correctly removes arguments in both --flag=value and --flag value formats. The break after removal ensures only the first occurrence is removed, which is appropriate for typical container args.

2444-2488: LGTM!

The function properly validates the pod is from the current rollout, not being deleted, running, and fully ready before returning it.

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (3)

1606-1614: Previous concern properly addressed.

The code now uses require.NotEmpty before accessing nvsentinelConditions[0], which correctly prevents the index-out-of-range panic that was flagged in earlier reviews. The test will fail with a clear message if the expected condition is not found.

1622-1630: Previous concern properly addressed.

The code now uses require.NotEmpty before accessing events.Items[0], which correctly prevents the index-out-of-range panic that was flagged in earlier reviews. The test will fail with a clear message if the expected event is not found.

1391-1635: Well-structured test for processing strategy validation.

The test provides comprehensive coverage of the new ProcessingStrategy field:

STORE_ONLY events correctly bypass node condition and Kubernetes event creation

EXECUTE_REMEDIATION events properly create node conditions (fatal) or Kubernetes events (non-fatal)

Mixed strategy batches are handled correctly

The test follows Go best practices with table-driven structure, proper test isolation via dedicated clientsets, and clear assertions using testify (consistent with existing patterns in this file).
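
The behavior this test covers can be summarized in a small decision function. The sketch below is only an illustration of the rules listed above: the healthEvent struct, the action names, and the STORE_ONLY = 2 value are assumptions, and the real connector works on pb.HealthEvent messages.

```go
package main

import "fmt"

// strategy is a hypothetical stand-in for pb.ProcessingStrategy.
type strategy int32

const (
	executeRemediation strategy = 1 // value taken from the review
	storeOnly          strategy = 2 // assumed value
)

// healthEvent carries only the fields this sketch needs.
type healthEvent struct {
	Strategy strategy
	IsFatal  bool
}

// clusterAction reports what the connector would write to the cluster:
// nothing for STORE_ONLY, a node condition for fatal events, and a
// Kubernetes event for non-fatal ones.
func clusterAction(e healthEvent) string {
	switch {
	case e.Strategy == storeOnly:
		return "none"
	case e.IsFatal:
		return "node-condition"
	default:
		return "kubernetes-event"
	}
}

func main() {
	// A mixed batch: each event is handled according to its own strategy.
	batch := []healthEvent{
		{Strategy: storeOnly, IsFatal: true},
		{Strategy: executeRemediation, IsFatal: true},
		{Strategy: executeRemediation, IsFatal: false},
	}
	for _, e := range batch {
		fmt.Println(clusterAction(e))
	}
}
```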

tests/syslog_health_monitor_test.go (1)

361-421: Well-structured test for STORE_ONLY strategy.

The new test correctly validates STORE_ONLY behavior by:

Setting up the monitor with --processing-strategy: STORE_ONLY

Verifying no node conditions are created after XID injection

Verifying no node cordoning occurs

Restoring EXECUTE_REMEDIATION strategy in teardown for proper cleanup

This follows the established test patterns and properly exercises the new processing strategy feature.

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In @tests/helpers/kube.go:
- Around line 2326-2366: The RemoveDaemonSetArgs function uses
require.NoError(t, err, ...) which fails the test immediately and prevents
returning the error to the caller; change this to propagate the error like
UpdateDaemonSetArgs by replacing the require.NoError call with standard error
handling (if err != nil { t.Logf("failed to remove args from daemonset %s/%s:
%v", NVSentinelNamespace, daemonsetName, err); return err }) so the function
returns the error to callers while preserving the existing logs and rollout wait
behavior; keep the function signature as RemoveDaemonSetArgs(... ) error and
ensure callers handle the returned error.

🧹 Nitpick comments (3)

tests/helpers/kube.go (2)
387-409: Consider improving log message specificity.

Line 405 logs only eventType, but the function also receives eventReason as a parameter. Including both in the log message would provide more context for debugging.
♻️ Proposed improvement
-		t.Logf("node %s does not have event %v", nodeName, eventType)
+		t.Logf("node %s does not have event type=%s reason=%s", nodeName, eventType, eventReason)
2320-2321: Consider replacing fixed sleep with condition-based wait.

The hardcoded 10-second sleep after rollout completion could be replaced with polling for the specific condition you're waiting for (e.g., pod readiness, specific container state). This would make tests more reliable and potentially faster.
tests/syslog_health_monitor_test.go (1)
361-421: LGTM: Well-structured STORE_ONLY strategy test.

The test correctly verifies that STORE_ONLY mode prevents both condition creation and node cordoning when XID errors are injected. The setup/teardown properly configure and restore the processing strategy.
Optional: Make assess description more concise and complete

The description could mention both aspects being verified (condition and cordoning) more concisely:
-	feature.Assess("Inject XID errors and verify no node condition is created when running in STORE_ONLY strategy", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
+	feature.Assess("Verify STORE_ONLY strategy prevents condition creation and cordoning", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1df5558 and e5a8295.

📒 Files selected for processing (4)

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
tests/helpers/kube.go
tests/helpers/syslog-health-monitor.go
tests/syslog_health_monitor_test.go

🚧 Files skipped from review as they are similar to previous changes (2)

tests/helpers/syslog-health-monitor.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go

🧰 Additional context used

🧠 Learnings (6)

📓 Common learnings

Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

tests/syslog_health_monitor_test.go

📚 Learning: 2025-12-22T16:16:24.320Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

tests/syslog_health_monitor_test.go

📚 Learning: 2025-11-07T04:16:43.079Z

Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.

Applied to files:

tests/syslog_health_monitor_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

tests/syslog_health_monitor_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

tests/syslog_health_monitor_test.go

🧬 Code graph analysis (1)

tests/syslog_health_monitor_test.go (2)

tests/helpers/syslog-health-monitor.go (3)

SetUpSyslogHealthMonitor (33-71)

StubJournalHTTPPort (27-27)

TearDownSyslogHealthMonitor (74-113)

tests/helpers/kube.go (3)

InjectSyslogMessages (1820-1843)

PortForwardPod (1614-1658)

EnsureNodeConditionNotPresent (1797-1818)

🔇 Additional comments (6)

tests/helpers/kube.go (3)

2232-2273: LGTM!

The rollout verification logic correctly checks all necessary DaemonSet status fields (DesiredNumberScheduled, UpdatedNumberScheduled, NumberReady) to ensure complete rollout.

2368-2448: LGTM!

The container argument manipulation helpers correctly handle multiple argument styles (--flag=value, --flag value, --flag) with proper slice operations for updates, insertions, and removals.

2450-2494: LGTM!

The function properly validates that the DaemonSet pod is running, ready, and not being deleted, ensuring tests interact with stable pods from the current rollout generation.

tests/syslog_health_monitor_test.go (3)

25-39: LGTM: Clean imports and context keys.

The helper import and updated context key names improve code clarity and maintainability.

42-173: LGTM: Well-refactored setup and teardown.

The test now uses centralized helper functions consistently, improving maintainability. The context-based value passing follows the e2e-framework pattern correctly.

290-359: LGTM: Consistent refactoring pattern.

The SXID detection test follows the same centralized setup/teardown pattern as the XID detection test, ensuring consistency across the test suite.

tests/helpers/kube.go

github-actions · 2026-01-07T15:18:03Z

Merging this branch will increase overall coverage

Impacted Packages	Coverage Δ	🤖
github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/kubernetes	82.99% (+0.21%)	👍
github.com/nvidia/nvsentinel/tests	0.00% (ø)
github.com/nvidia/nvsentinel/tests/helpers	0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File	Coverage Δ	Total	Covered	Missed	🤖
github.com/nvidia/nvsentinel/tests/helpers/kube.go	0.00% (ø)	0	0	0
github.com/nvidia/nvsentinel/tests/helpers/syslog-health-monitor.go	0.00% (ø)	0	0	0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
github.com/nvidia/nvsentinel/tests/syslog_health_monitor_test.go

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In @tests/syslog_health_monitor_test.go:
- Around line 405-418: The args map passed to TearDownSyslogHealthMonitor uses a
non-empty value for "--processing-strategy" even though removeArgsFromContainer
(and RemoveDaemonSetArgs) only inspects map keys; change the map entry to
"--processing-strategy": "" (empty string) or add an inline comment next to the
map literal documenting that only keys are used by
removeArgsFromContainer/RemoveDaemonSetArgs so values are ignored, so readers
aren't misled by a non-empty value.

🧹 Nitpick comments (4)

tests/helpers/kube.go (3)
2281-2293: Unused variable originalDaemonSet.

The variable originalDaemonSet is assigned on line 2292 but never used. This appears to be leftover code, possibly from a previous implementation that stored the original state for rollback.
♻️ Remove unused variable
 func UpdateDaemonSetArgs(ctx context.Context, t *testing.T,
 	client klient.Client, daemonsetName string, containerName string,
 	args map[string]string) error {
 	t.Helper()
 
-	var originalDaemonSet *appsv1.DaemonSet
-
 	t.Logf("Updating daemonset %s/%s with args %v", NVSentinelNamespace, daemonsetName, args)
 
 	err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
 		daemonSet := &appsv1.DaemonSet{}
 		if err := client.Resources().Get(ctx, daemonsetName, NVSentinelNamespace, daemonSet); err != nil {
 			return err
 		}
 
-		if originalDaemonSet == nil {
-			originalDaemonSet = daemonSet.DeepCopy()
-		}
-
 		containers := daemonSet.Spec.Template.Spec.Containers
2321-2322: Magic sleep after rollout.

The 10-second hardcoded sleep after rollout completion is a code smell. If pods need additional time after becoming "Ready", consider either:

- Adding a specific readiness check for the functionality you're waiting for
- Extracting this as a named constant with documentation explaining why it's needed

2452-2496: Parameter daemonsetName is underutilized.

The daemonsetName parameter is only used in the error message on line 2492. The function doesn't verify that the found pod actually belongs to the specified DaemonSet - it relies solely on podNamePattern matching.

This could be intentional if the pattern is specific enough, but consider either:

- Adding validation that the pod's owner reference matches the DaemonSet
- Removing daemonsetName if it's truly not needed
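The first option could be sketched roughly as follows. This is a minimal illustration and not the helper from this PR: `ownerRef` and `pod` are local stand-ins for `metav1.OwnerReference` and `corev1.Pod`, reduced to the fields the check needs, and `ownedByDaemonSet` is a hypothetical name.

```go
package main

import "fmt"

// ownerRef and pod are local stand-ins for metav1.OwnerReference and
// corev1.Pod, reduced to the fields this check needs (assumption).
type ownerRef struct {
	Kind string
	Name string
}

type pod struct {
	Name   string
	Owners []ownerRef
}

// ownedByDaemonSet reports whether p lists the named DaemonSet as an owner.
func ownedByDaemonSet(p pod, daemonsetName string) bool {
	for _, ref := range p.Owners {
		if ref.Kind == "DaemonSet" && ref.Name == daemonsetName {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical pod name; only the owner reference matters here.
	p := pod{
		Name:   "syslog-health-monitor-abc12",
		Owners: []ownerRef{{Kind: "DaemonSet", Name: "syslog-health-monitor"}},
	}
	fmt.Println(ownedByDaemonSet(p, "syslog-health-monitor"))
	fmt.Println(ownedByDaemonSet(p, "some-other-daemonset"))
}
```

With a check like this, a pod that merely matches podNamePattern but belongs to another workload would be rejected instead of silently accepted.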
tests/helpers/syslog-health-monitor.go (1)

73-113: Consider adding error handling for cleanup operations.

The teardown function logs warnings for errors during cleanup (lines 89, 111) but uses require.NoError for RemoveDaemonSetArgs (line 82). This inconsistency could cause test failures during teardown if the arg removal fails, while other cleanup steps are more lenient.

For consistency in teardown, consider either:

- Making all cleanup steps warn-only to ensure full cleanup attempts
- Or keeping require.NoError only for critical restoration steps

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e5a8295 and 0752eb5.

📒 Files selected for processing (4)

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
tests/helpers/kube.go
tests/helpers/syslog-health-monitor.go
tests/syslog_health_monitor_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go

🧰 Additional context used

📓 Path-based instructions (2)

**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

tests/syslog_health_monitor_test.go
tests/helpers/syslog-health-monitor.go
tests/helpers/kube.go

**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

tests/syslog_health_monitor_test.go

🧠 Learnings (7)

📓 Common learnings

Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

tests/syslog_health_monitor_test.go

📚 Learning: 2025-12-22T16:16:24.320Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

tests/syslog_health_monitor_test.go

📚 Learning: 2025-11-07T04:16:43.079Z

Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.

Applied to files:

tests/syslog_health_monitor_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

tests/syslog_health_monitor_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

tests/syslog_health_monitor_test.go

📚 Learning: 2025-12-22T16:16:31.660Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.

Applied to files:

tests/helpers/kube.go

🧬 Code graph analysis (1)

tests/syslog_health_monitor_test.go (3)

tests/helpers/syslog-health-monitor.go (3)

SetUpSyslogHealthMonitor (33-71)

StubJournalHTTPPort (27-27)

TearDownSyslogHealthMonitor (74-113)

tests/helpers/kube.go (2)

InjectSyslogMessages (1820-1843)

PortForwardPod (1614-1658)

tests/helpers/fault_quarantine.go (1)

AssertQuarantineState (315-382)

🔇 Additional comments (15)

tests/helpers/kube.go (5)

387-409: LGTM - Clean implementation of negative event assertion.

The function correctly uses require.Never to assert that a specific event type/reason never appears on a node within the timeout period. The error handling on lines 393-396 appropriately returns false (meaning "condition not met, keep checking") rather than failing the test.

2232-2273: LGTM - Robust rollout wait implementation.

The function correctly checks all necessary conditions for a complete DaemonSet rollout:

- DesiredNumberScheduled > 0
- UpdatedNumberScheduled == DesiredNumberScheduled
- NumberReady == DesiredNumberScheduled

Good logging at each step for debugging.
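For readers without the diff open, the three conditions can be expressed as a single predicate. The sketch below is illustrative only: `daemonSetStatus` is a local stand-in for the corresponding fields of `appsv1.DaemonSetStatus`, and `rolloutComplete` is a hypothetical name, not the function in tests/helpers/kube.go.

```go
package main

import "fmt"

// daemonSetStatus is a local stand-in for the three appsv1.DaemonSetStatus
// fields named in the review; it is not the Kubernetes type itself.
type daemonSetStatus struct {
	DesiredNumberScheduled int32
	UpdatedNumberScheduled int32
	NumberReady            int32
}

// rolloutComplete reports whether at least one pod is desired and every
// desired pod is both on the updated template and ready.
func rolloutComplete(s daemonSetStatus) bool {
	return s.DesiredNumberScheduled > 0 &&
		s.UpdatedNumberScheduled == s.DesiredNumberScheduled &&
		s.NumberReady == s.DesiredNumberScheduled
}

func main() {
	fmt.Println(rolloutComplete(daemonSetStatus{DesiredNumberScheduled: 3, UpdatedNumberScheduled: 3, NumberReady: 3})) // true
	fmt.Println(rolloutComplete(daemonSetStatus{DesiredNumberScheduled: 3, UpdatedNumberScheduled: 3, NumberReady: 2})) // false
	fmt.Println(rolloutComplete(daemonSetStatus{}))                                                                     // false
}
```

The `DesiredNumberScheduled > 0` guard matters because a freshly fetched status with all zeros would otherwise satisfy both equality checks.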

2370-2400: Slice insertion logic is correct but complex.

The slice manipulation on lines 2391-2393 correctly inserts a value after the flag when the next argument is another flag (starts with -). The logic handles both --flag=value and --flag value styles appropriately.

2402-2424: LGTM - Clean implementation of arg setting.

The function correctly:

- Attempts to update existing args via tryUpdateExistingArg
- Falls back to appending new args if not found
- Handles both --flag=value and boolean flag styles

2426-2450: LGTM - Correct arg removal implementation.

The function properly handles both --flag=value and --flag value styles when removing arguments. Breaking after modification avoids issues with modifying the slice while iterating over it.

tests/syslog_health_monitor_test.go (7)

36-39: LGTM - Clear context key definitions.

The context keys follow Go naming conventions and provide clear identifiers for passing test state between setup, assess, and teardown phases.

47-58: LGTM - Centralized setup pattern.

The setup correctly uses the new helpers.SetUpSyslogHealthMonitor helper, storing node name, pod name, and stop channel in context for use in assess and teardown phases.

159-170: LGTM - Centralized teardown pattern.

The teardown correctly retrieves context values and delegates to helpers.TearDownSyslogHealthMonitor. Passing nil for args indicates no DaemonSet argument restoration is needed.

195-203: LGTM - Consistent port configuration.

Using helpers.StubJournalHTTPPort for both local and remote ports ensures consistency across tests and centralizes the port configuration.

295-359: LGTM - Consistent setup/teardown pattern.

The SXID detection test follows the same centralized setup/teardown pattern as the XID detection test, maintaining consistency across the test suite.

361-379: Well-structured STORE_ONLY strategy test.

The test correctly:

- Follows naming conventions per coding guidelines
- Sets up with --processing-strategy=STORE_ONLY argument
- Stores context values for teardown

Good use of the centralized setup helper with custom arguments.

381-403: LGTM - Correct STORE_ONLY behavior validation.

The assess phase correctly verifies that when running in STORE_ONLY mode:

- XID messages are injected
- No node condition is created (EnsureNodeConditionNotPresent)
- Node is not quarantined/cordoned (AssertQuarantineState)

This aligns with the expected behavior where STORE_ONLY events should be stored but not trigger remediation.
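The behavior being asserted can be pictured as a filter applied before any remediation step. The sketch below is an assumption about the general shape only, not the connector's actual code: `strategy` is a string stand-in for the generated ProcessingStrategy enum, and `healthEvent` and `remediable` are hypothetical names.

```go
package main

import "fmt"

// strategy is a hypothetical string stand-in for the generated
// ProcessingStrategy enum.
type strategy string

const (
	executeRemediation strategy = "EXECUTE_REMEDIATION"
	storeOnly          strategy = "STORE_ONLY"
)

// healthEvent carries only the fields this illustration needs.
type healthEvent struct {
	CheckName string
	Strategy  strategy
}

// remediable returns only the events that should reach remediation steps
// such as node conditions or cordoning.
func remediable(events []healthEvent) []healthEvent {
	var out []healthEvent
	for _, e := range events {
		if e.Strategy == storeOnly {
			continue // stored and exported, but never acted on
		}
		out = append(out, e)
	}
	return out
}

func main() {
	events := []healthEvent{
		{CheckName: "SysLogsXIDError", Strategy: storeOnly},
		{CheckName: "SysLogsXIDError", Strategy: executeRemediation},
	}
	fmt.Println(len(remediable(events)))
}
```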

tests/helpers/syslog-health-monitor.go (3)

26-30: LGTM - Clear constant definitions.

The constants properly define the port and resource names used by the syslog health monitor test helpers, making them easy to reference consistently across tests.

32-71: LGTM - Well-structured setup helper.

The function provides a clean setup sequence:

- Optionally updates DaemonSet args for test configuration
- Locates the syslog health monitor pod on a worker node
- Injects test metadata
- Establishes port-forwarding
- Configures the node label

Good use of require.NoError for critical setup steps and returning all necessary values for teardown.

91-101: No changes needed - the condition status check logic is correct.

The code at lines 93-100 properly checks for condition.Status == v1.ConditionFalse with reason "SysLogsXIDErrorIsHealthy" to verify the error has been cleared. This is consistent with the codebase pattern:

- Status=True with reason "IsNotHealthy" indicates an active error
- Status=False with reason "IsHealthy" indicates the error has been resolved and the condition is no longer active

The check is semantically correct and follows established conventions throughout the test suite.

tests/syslog_health_monitor_test.go

coderabbitai

Actionable comments posted: 3

🤖 Fix all issues with AI agents

In
@distros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpl:
- Around line 103-104: The argument value for the processing strategy is
unquoted and may break YAML parsing; update the template where the args list
includes the "--processing-strategy" entry to wrap {{
$root.Values.processingStrategy }} in quotes (i.e. change the value reference
used with the "--processing-strategy" arg to a quoted string) so it matches
other quoted arguments and ensures YAML safety.

In @platform-connectors/pkg/connectors/kubernetes/process_node_events.go:
- Around line 347-372: The createK8sEvent function does not guard against a nil
healthEvent.GeneratedTimestamp (AsTime() on a nil timestamp yields the Unix
epoch rather than a meaningful time); update createK8sEvent in K8sConnector to
defensively check healthEvent.GeneratedTimestamp before calling AsTime() and use
a fallback (e.g. metav1.Now() or zero time) when nil, assign the resulting
metav1.Time to a local variable and reuse it for FirstTimestamp and
LastTimestamp so both fields are safe and consistent.

In @tests/syslog_health_monitor_test.go:
- Around line 405-416: The teardown currently calls
helpers.TearDownSyslogHealthMonitor with
map["--processing-strategy":"EXECUTE_REMEDIATION"], but setup used "STORE_ONLY",
so RemoveDaemonSetArgs won't remove the real flag; change the
TearDownSyslogHealthMonitor call to either pass nil for the args map (to skip
arg removal) or pass "--processing-strategy":"STORE_ONLY" to match the setup;
update the invocation that supplies the args map (the call to
TearDownSyslogHealthMonitor) accordingly so cleanup actually removes the
argument.

🧹 Nitpick comments (9)

tests/fault_quarantine_test.go (1)
327-336: Consider explicitly setting processingStrategy in teardown for clarity.

The healthy event relies on the default EXECUTE_REMEDIATION strategy to clear the quarantine state. While this works correctly (per project convention), being explicit improves test readability and makes the intent clearer.
💡 Optional improvement
 	feature.Teardown(func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
 		event := helpers.NewHealthEvent(testCtx.NodeName).
 			WithErrorCode("79").
 			WithHealthy(true).
 			WithAgent(helpers.SYSLOG_HEALTH_MONITOR_AGENT).
-			WithCheckName("SysLogsXIDError")
+			WithCheckName("SysLogsXIDError").
+			WithProcessingStrategy(int(protos.ProcessingStrategy_EXECUTE_REMEDIATION))
 		helpers.SendHealthEvent(ctx, t, event)
tests/helpers/syslog-health-monitor.go (1)
91-101: Clarify the condition check semantics in the comment.

The comment says "condition should be cleared" but the code checks condition.Status == v1.ConditionFalse, which verifies the condition exists with a healthy status rather than being removed. Consider updating the comment or assertion message to match the actual check.
💡 Suggested clarification
 	} else {
-		t.Logf("Waiting for SysLogsXIDError condition to be cleared from node %s", nodeName)
+		t.Logf("Waiting for SysLogsXIDError condition to become healthy on node %s", nodeName)
 		require.Eventually(t, func() bool {
 			condition, err := CheckNodeConditionExists(ctx, client, nodeName,
 				"SysLogsXIDError", "SysLogsXIDErrorIsHealthy")
 			if err != nil {
 				t.Logf("Failed to check node condition: %v", err)
 				return false
 			}

 			return condition != nil && condition.Status == v1.ConditionFalse
-		}, EventuallyWaitTimeout, WaitInterval, "SysLogsXIDError condition should be cleared")
+		}, EventuallyWaitTimeout, WaitInterval, "SysLogsXIDError condition should transition to healthy")
 	}
tests/helpers/kube.go (3)
2281-2293: Remove unused originalDaemonSet variable.

The variable originalDaemonSet is assigned on line 2292 but never used afterward. This appears to be leftover code from a potential restore/rollback feature that wasn't implemented.
🧹 Proposed fix
 func UpdateDaemonSetArgs(ctx context.Context, t *testing.T,
 	client klient.Client, daemonsetName string, containerName string,
 	args map[string]string) error {
 	t.Helper()

-	var originalDaemonSet *appsv1.DaemonSet
-
 	t.Logf("Updating daemonset %s/%s with args %v", NVSentinelNamespace, daemonsetName, args)

 	err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
 		daemonSet := &appsv1.DaemonSet{}
 		if err := client.Resources().Get(ctx, daemonsetName, NVSentinelNamespace, daemonSet); err != nil {
 			return err
 		}

-		if originalDaemonSet == nil {
-			originalDaemonSet = daemonSet.DeepCopy()
-		}
-
 		containers := daemonSet.Spec.Template.Spec.Containers
2321-2322: Consider removing or documenting the hardcoded sleep.

The waitForDaemonSetRollout already waits until NumberReady == DesiredNumberScheduled, which means pods should be ready. The additional 10-second sleep may be unnecessary or indicates that there's additional initialization happening after pod readiness that isn't captured by the rollout status.

If this sleep is necessary for container initialization or process startup, consider adding a comment explaining why, or replace it with an explicit readiness check.

2327-2368: Inconsistent error handling - require.NoError vs returning error.

The function signature returns error, but line 2360 uses require.NoError which will fail the test immediately if an error occurs. This means the function can never return a non-nil error in practice. Either:

- Remove the error return and always use require.NoError, or
- Return the error and let callers decide how to handle it (consistent with UpdateDaemonSetArgs)
♻️ Option 1: Align with UpdateDaemonSetArgs pattern
 	err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
 		// ... existing code ...
 	})
-	require.NoError(t, err, "failed to remove args from daemonset %s/%s", NVSentinelNamespace, daemonsetName)
+	if err != nil {
+		return err
+	}

 	t.Logf("Waiting for daemonset %s/%s rollout to complete after restoration", NVSentinelNamespace, daemonsetName)
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)

1422-1442: Consider consolidating near-duplicate test cases.

Test cases "STORE_ONLY non-fatal event should not create Kubernetes event" (lines 1423-1442) and "STORE_ONLY non fatal event should not create Kubernetes event" (lines 1503-1524) appear to test very similar scenarios with only minor differences (different CheckName and ErrorCode).

Consider consolidating these into a single parameterized test case or making the distinction between them clearer in the descriptions if they're testing meaningfully different code paths.

Also applies to: 1502-1524
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go (1)
388-435: Test uses EXECUTE_REMEDIATION but doesn't assert strategy on emitted events.

The TestJournalProcessingLogic test creates a handler with EXECUTE_REMEDIATION but the mockHandler.ProcessLine (lines 444-469) doesn't set ProcessingStrategy on the returned event. Consider updating the mock to propagate the strategy for more comprehensive testing.
♻️ Suggested enhancement for mockHandler
 type mockHandler struct {
 	nodeName              string
 	defaultAgentName      string
 	defaultComponentClass string
 	checkName             string
+	processingStrategy    pb.ProcessingStrategy
 }
 
 func (mh *mockHandler) ProcessLine(message string) (*pb.HealthEvents, error) {
 	if !strings.Contains(message, "sxid123") {
 		return nil, nil
 	}
 	event := &pb.HealthEvent{
 		Version:            1,
 		Agent:              mh.defaultAgentName,
 		CheckName:          mh.checkName,
 		ComponentClass:     mh.defaultComponentClass,
 		GeneratedTimestamp: timestamppb.New(time.Now()),
 		EntitiesImpacted: []*pb.Entity{
 			{EntityType: "GPU", EntityValue: "44"},
 		},
 		Message:           "TestMessage",
 		IsFatal:           true,
 		IsHealthy:         false,
 		NodeName:          mh.nodeName,
 		RecommendedAction: pb.RecommendedAction_RESTART_BM,
 		ErrorCode:         []string{"123"},
+		ProcessingStrategy: mh.processingStrategy,
 	}
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go (1)
191-196: Fragile entity assertions using hardcoded array indices.

The assertions at lines 192-195 assume specific positions in the EntitiesImpacted array ([3] and [4]). If the order of entities changes in the implementation, these tests will fail even if the correct entities are present.

Consider using a pattern similar to TestProcessLine in gpufallen_handler_test.go (lines 124-136) which iterates through entities to find them by type rather than assuming positions.
♻️ Suggested refactor for robust entity assertions
-	// Verify GPU entity
-	assert.Equal(t, "GPU", event.EntitiesImpacted[3].EntityType)
-	assert.Equal(t, "1", event.EntitiesImpacted[3].EntityValue)
-	assert.Equal(t, "GPU_UUID", event.EntitiesImpacted[4].EntityType)
-	assert.Equal(t, "GPU-aaaabbbb-cccc-dddd-eeee-ffffffffffff", event.EntitiesImpacted[4].EntityValue)
+	// Verify GPU entities by type rather than position
+	var hasGPU, hasGPUUUID bool
+	for _, entity := range event.EntitiesImpacted {
+		switch entity.EntityType {
+		case "GPU":
+			hasGPU = true
+			assert.Equal(t, "1", entity.EntityValue)
+		case "GPU_UUID":
+			hasGPUUUID = true
+			assert.Equal(t, "GPU-aaaabbbb-cccc-dddd-eeee-ffffffffffff", entity.EntityValue)
+		}
+	}
+	assert.True(t, hasGPU, "Should have GPU entity")
+	assert.True(t, hasGPUUUID, "Should have GPU_UUID entity")
platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)
377-388: Consider simplifying unused nodeConditions slice construction.

The nodeConditions slice is constructed but only used to check if it's non-empty (line 390). The actual node condition updates are performed inside updateNodeConditions using the raw processableEvents. You could simplify by counting qualifying events instead of building a full slice.
♻️ Optional simplification
-	var nodeConditions []corev1.NodeCondition
-
-	for _, healthEvent := range processableEvents {
-		if healthEvent.IsHealthy || healthEvent.IsFatal {
-			nodeConditions = append(nodeConditions, corev1.NodeCondition{
-				Type:               corev1.NodeConditionType(healthEvent.CheckName),
-				LastHeartbeatTime:  metav1.NewTime(healthEvent.GeneratedTimestamp.AsTime()),
-				LastTransitionTime: metav1.NewTime(healthEvent.GeneratedTimestamp.AsTime()),
-				Message:            r.fetchHealthEventMessage(healthEvent),
-			})
-		}
-	}
-
-	if len(nodeConditions) > 0 {
+	hasConditionUpdates := false
+	for _, healthEvent := range processableEvents {
+		if healthEvent.IsHealthy || healthEvent.IsFatal {
+			hasConditionUpdates = true
+			break
+		}
+	}
+
+	if hasConditionUpdates {

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0752eb5 and 7952551.

⛔ Files ignored due to path filters (1)

data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go

📒 Files selected for processing (32)

data-models/protobufs/health_event.proto
distros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpl
distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml
docs/designs/025-processing-strategy-for-health-checks.md
health-monitors/syslog-health-monitor/main.go
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
health-monitors/syslog-health-monitor/pkg/gpufallen/types.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
health-monitors/syslog-health-monitor/pkg/sxid/types.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go
health-monitors/syslog-health-monitor/pkg/xid/types.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
platform-connectors/pkg/connectors/kubernetes/process_node_events.go
platform-connectors/pkg/server/platform_connector_server.go
platform-connectors/pkg/server/platform_connector_server_test.go
store-client/pkg/client/mongodb_pipeline_builder.go
store-client/pkg/client/pipeline_builder.go
store-client/pkg/client/postgresql_pipeline_builder.go
tests/event_exporter_test.go
tests/fault_quarantine_test.go
tests/helpers/event_exporter.go
tests/helpers/fault_quarantine.go
tests/helpers/healthevent.go
tests/helpers/kube.go
tests/helpers/syslog-health-monitor.go
tests/syslog_health_monitor_test.go

🚧 Files skipped from review as they are similar to previous changes (8)

health-monitors/syslog-health-monitor/pkg/gpufallen/types.go
health-monitors/syslog-health-monitor/main.go
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go
tests/helpers/healthevent.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
health-monitors/syslog-health-monitor/pkg/sxid/types.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml

🧰 Additional context used

📓 Path-based instructions (3)

data-models/protobufs/**/*.proto

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

data-models/protobufs/**/*.proto: Define Protocol Buffer messages in data-models/protobufs/ directory
Use semantic versioning for breaking changes in Protocol Buffer messages
Include comprehensive comments for all fields in Protocol Buffer messages

Files:

data-models/protobufs/health_event.proto

**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

platform-connectors/pkg/server/platform_connector_server.go
platform-connectors/pkg/server/platform_connector_server_test.go
health-monitors/syslog-health-monitor/pkg/xid/types.go
tests/helpers/fault_quarantine.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go
store-client/pkg/client/pipeline_builder.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go
store-client/pkg/client/postgresql_pipeline_builder.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go
tests/helpers/event_exporter.go
store-client/pkg/client/mongodb_pipeline_builder.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
tests/helpers/kube.go
tests/helpers/syslog-health-monitor.go
tests/syslog_health_monitor_test.go
platform-connectors/pkg/connectors/kubernetes/process_node_events.go
tests/event_exporter_test.go
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
tests/fault_quarantine_test.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go

**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

platform-connectors/pkg/server/platform_connector_server_test.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
tests/syslog_health_monitor_test.go
tests/event_exporter_test.go
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
tests/fault_quarantine_test.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go

🧠 Learnings (14)

📓 Common learnings

Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

📚 Learning: 2026-01-07T09:54:49.335Z

Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

Applied to files:

data-models/protobufs/health_event.proto
platform-connectors/pkg/server/platform_connector_server.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go
tests/fault_quarantine_test.go
docs/designs/025-processing-strategy-for-health-checks.md

📚 Learning: 2025-12-23T10:34:13.121Z

Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

data-models/protobufs/health_event.proto
platform-connectors/pkg/server/platform_connector_server_test.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go
tests/helpers/event_exporter.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
platform-connectors/pkg/connectors/kubernetes/process_node_events.go
docs/designs/025-processing-strategy-for-health-checks.md

📚 Learning: 2025-12-22T16:16:24.320Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

platform-connectors/pkg/server/platform_connector_server_test.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
tests/syslog_health_monitor_test.go
tests/event_exporter_test.go
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
tests/fault_quarantine_test.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go

📚 Learning: 2025-11-07T04:16:43.079Z

Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.

Applied to files:

health-monitors/syslog-health-monitor/pkg/xid/types.go
tests/syslog_health_monitor_test.go

📚 Learning: 2025-11-06T16:18:09.952Z

Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

tests/helpers/fault_quarantine.go
tests/fault_quarantine_test.go

📚 Learning: 2025-12-23T05:02:22.108Z

Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: store-client/pkg/client/postgresql_pipeline_builder.go:119-132
Timestamp: 2025-12-23T05:02:22.108Z
Learning: In the NVSentinel codebase, protobuf fields stored in MongoDB should use lowercase field names (e.g., processingstrategy, componentclass, checkname). Ensure pipeline filters and queries that access protobuf fields in the database consistently use lowercase field names in the store-client package, avoiding camelCase mappings for MongoDB reads/writes.

Applied to files:

store-client/pkg/client/pipeline_builder.go
store-client/pkg/client/postgresql_pipeline_builder.go
store-client/pkg/client/mongodb_pipeline_builder.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
tests/syslog_health_monitor_test.go
tests/event_exporter_test.go
tests/fault_quarantine_test.go

📚 Learning: 2025-12-22T16:16:31.660Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.

Applied to files:

tests/helpers/kube.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

tests/syslog_health_monitor_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

tests/syslog_health_monitor_test.go
tests/event_exporter_test.go
tests/fault_quarantine_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods

Applied to files:

tests/event_exporter_test.go

📚 Learning: 2025-11-04T06:31:02.147Z

Learnt from: Gyan172004
Repo: NVIDIA/NVSentinel PR: 223
File: platform-connectors/pkg/nodemetadata/processor.go:0-0
Timestamp: 2025-11-04T06:31:02.147Z
Learning: In platform-connectors/pkg/nodemetadata/processor.go, the NewProcessor function does not perform a nil check on the config parameter because the caller is expected to guarantee a non-nil config is provided.

Applied to files:

docs/designs/025-processing-strategy-for-health-checks.md

📚 Learning: 2025-11-10T10:25:19.443Z

Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.

Applied to files:

docs/designs/025-processing-strategy-for-health-checks.md

🧬 Code graph analysis (14)

platform-connectors/pkg/server/platform_connector_server.go (2)

data-models/pkg/protos/health_event.pb.go (6)

ProcessingStrategy (44-44)

ProcessingStrategy (76-78)

ProcessingStrategy (80-82)

ProcessingStrategy (89-91)

ProcessingStrategy_UNSPECIFIED (47-47)

ProcessingStrategy_EXECUTE_REMEDIATION (48-48)

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

ProcessingStrategy (14-18)

platform-connectors/pkg/server/platform_connector_server_test.go (1)

data-models/pkg/protos/health_event.pb.go (10)

ProcessingStrategy (44-44)

ProcessingStrategy (76-78)

ProcessingStrategy (80-82)

ProcessingStrategy (89-91)

ProcessingStrategy_UNSPECIFIED (47-47)

ProcessingStrategy_EXECUTE_REMEDIATION (48-48)

ProcessingStrategy_STORE_ONLY (49-49)

HealthEvent (264-284)

HealthEvent (297-297)

HealthEvent (312-314)

tests/helpers/fault_quarantine.go (1)

tests/helpers/kube.go (1)

NVSentinelNamespace (64-64)

health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go (3)

data-models/pkg/protos/health_event.pb.go (4)

ProcessingStrategy (44-44)

ProcessingStrategy (76-78)

ProcessingStrategy (80-82)

ProcessingStrategy (89-91)

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

ProcessingStrategy (14-18)

health-monitors/syslog-health-monitor/pkg/sxid/types.go (1)

SXIDHandler (29-36)

store-client/pkg/client/pipeline_builder.go (1)

store-client/pkg/client/mongodb_client.go (1)

BuildNonFatalUnhealthyInsertsPipeline (296-299)

store-client/pkg/client/postgresql_pipeline_builder.go (2)

store-client/pkg/datastore/types.go (4)

ToPipeline (161-163)

D (131-133)

E (126-128)

A (136-138)

data-models/pkg/protos/health_event.pb.go (1)

ProcessingStrategy_EXECUTE_REMEDIATION (48-48)

health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go (2)

health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go (1)

NewSXIDHandler (30-42)

data-models/pkg/protos/health_event.pb.go (2)

ProcessingStrategy_EXECUTE_REMEDIATION (48-48)

RecommendedAction_CONTACT_SUPPORT (98-98)

health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go (5)

data-models/pkg/protos/health_event.pb.go (4)

ProcessingStrategy (44-44)

ProcessingStrategy (76-78)

ProcessingStrategy (80-82)

ProcessingStrategy (89-91)

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

ProcessingStrategy (14-18)

health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go (1)

SyslogMonitor (47-67)

health-monitors/syslog-health-monitor/pkg/syslog-monitor/journal_real.go (1)

GetDefaultJournalFactory (104-106)

health-monitors/syslog-health-monitor/pkg/syslog-monitor/journal_stub.go (1)

GetDefaultJournalFactory (210-254)

store-client/pkg/client/mongodb_pipeline_builder.go (2)

store-client/pkg/datastore/types.go (4)

ToPipeline (161-163)

D (131-133)

E (126-128)

A (136-138)

data-models/pkg/protos/health_event.pb.go (1)

ProcessingStrategy_EXECUTE_REMEDIATION (48-48)

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)

data-models/pkg/protos/health_event.pb.go (21)

HealthEvent (264-284)

HealthEvent (297-297)

HealthEvent (312-314)

Entity (212-218)

Entity (231-231)

Entity (246-248)

RecommendedAction (93-93)

RecommendedAction (143-145)

RecommendedAction (147-149)

RecommendedAction (156-158)

RecommendedAction_CONTACT_SUPPORT (98-98)

ProcessingStrategy (44-44)

ProcessingStrategy (76-78)

ProcessingStrategy (80-82)

ProcessingStrategy (89-91)

ProcessingStrategy_STORE_ONLY (49-49)

RecommendedAction_NONE (96-96)

ProcessingStrategy_EXECUTE_REMEDIATION (48-48)

HealthEvents (160-166)

HealthEvents (179-179)

HealthEvents (194-196)

platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)

data-models/pkg/protos/health_event.pb.go (10)

HealthEvents (160-166)

HealthEvents (179-179)

HealthEvents (194-196)

HealthEvent (264-284)

HealthEvent (297-297)

HealthEvent (312-314)

ProcessingStrategy (44-44)

ProcessingStrategy (76-78)

ProcessingStrategy (80-82)

ProcessingStrategy (89-91)

tests/event_exporter_test.go (1)

tests/helpers/event_exporter.go (1)

ValidateCloudEvent (221-247)

health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go (1)

data-models/pkg/protos/health_event.pb.go (2)

ProcessingStrategy_STORE_ONLY (49-49)

ProcessingStrategy_EXECUTE_REMEDIATION (48-48)

health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go (1)

data-models/pkg/protos/health_event.pb.go (2)

ProcessingStrategy_STORE_ONLY (49-49)

ProcessingStrategy_EXECUTE_REMEDIATION (48-48)

🪛 markdownlint-cli2 (0.18.1)

docs/designs/025-processing-strategy-for-health-checks.md

350-351, 353, 355-360, 362, 372-374, 383, 436-449, 457-467, 484-497, 505-515, 594-605: Hard tabs
Column: 1

(MD010, no-hard-tabs)

🔇 Additional comments (41)

tests/helpers/fault_quarantine.go (1)

141-145: LGTM!

The conditional guard is well-placed. The ConfigMap backup at lines 136-139 still executes unconditionally, ensuring teardown restoration works correctly regardless of whether a new ConfigMap was applied. This cleanly supports tests that need the setup infrastructure (node selection, deployment options) without requiring a custom ConfigMap, aligning with the STORE_ONLY processing strategy testing scenarios.

health-monitors/syslog-health-monitor/pkg/xid/types.go (1)

20-21: LGTM!

The new processingStrategy field is appropriately added as an unexported field alongside the existing handler configuration. The import alias pb follows the project's convention for protobuf imports.

Also applies to: 34-34

health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go (1)

53-53: LGTM!

The processingStrategy field is properly added to the SyslogMonitor struct, following the same pattern as other configuration fields and consistent with the handler structs.

tests/fault_quarantine_test.go (3)

234-244: LGTM! Well-structured test for processing strategy behavior.

The test properly validates the behavioral difference between STORE_ONLY (observability-only) and EXECUTE_REMEDIATION (full quarantine actions) processing strategies using the e2e-framework features pattern.

246-282: Good negative test coverage for STORE_ONLY strategy.

The test correctly verifies that STORE_ONLY events are recorded without triggering node conditions, events, or quarantine actions, validating the observability-only behavior.

284-325: Good positive test coverage for EXECUTE_REMEDIATION strategy.

The test correctly verifies that EXECUTE_REMEDIATION events trigger node conditions, events, and quarantine actions as expected.

tests/helpers/syslog-health-monitor.go (1)

32-71: LGTM! Well-structured setup helper.

The helper properly orchestrates the setup sequence: updating DaemonSet args, locating the pod, injecting metadata, establishing port-forwarding, and configuring node labels. The function returns all necessary handles for teardown.

tests/helpers/kube.go (5)

387-409: LGTM - Well-structured event absence assertion.

The function correctly uses require.Never to assert that a node event with specific type and reason never appears within the timeout period. The helper pattern matches existing functions like WaitForNoRebootNodeCR.

2232-2273: LGTM - Correct DaemonSet rollout verification.

The function properly checks all three rollout completion conditions: desired pods scheduled, all pods updated, and all pods ready. The unexported status is appropriate for an internal helper.
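
For readers who want the three rollout conditions spelled out, here is a minimal sketch. It is not the helper from tests/helpers/kube.go: `daemonSetStatus` is a hypothetical stand-in for the corresponding counters on `appsv1.DaemonSetStatus`, and the extra "desired > 0" guard is an assumption added so an unscheduled DaemonSet is not reported as rolled out.

```go
package main

import "fmt"

// daemonSetStatus is a hypothetical stand-in for the three counters that
// appsv1.DaemonSetStatus exposes; it keeps the sketch free of client-go imports.
type daemonSetStatus struct {
	DesiredNumberScheduled int32
	UpdatedNumberScheduled int32
	NumberReady            int32
}

// rolloutComplete reports whether pods are scheduled, all of them run the
// updated template, and all of them are ready.
func rolloutComplete(s daemonSetStatus) bool {
	return s.DesiredNumberScheduled > 0 && // assumption: nothing scheduled means not rolled out
		s.UpdatedNumberScheduled == s.DesiredNumberScheduled &&
		s.NumberReady == s.DesiredNumberScheduled
}

func main() {
	fmt.Println(rolloutComplete(daemonSetStatus{3, 3, 3}))
	fmt.Println(rolloutComplete(daemonSetStatus{3, 2, 3}))
}
```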

2372-2400: LGTM - Handles both arg styles correctly.

The function properly handles both --flag=value and --flag value argument styles, with appropriate logic for updating existing values.

2402-2450: LGTM - Arg manipulation helpers are well-implemented.

Both functions correctly handle argument manipulation with proper handling of both --flag=value and --flag value patterns.
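
The two argument styles mentioned above can be illustrated with a small sketch. `setArg` is a hypothetical function written for this note, not the helper in tests/helpers/kube.go, and `--polling-interval` is an illustrative flag name. It assumes a value never starts with `-`, which is a simplification.

```go
package main

import (
	"fmt"
	"strings"
)

// setArg returns a copy of args in which flag carries value. An existing
// "--flag=value" or "--flag value" entry is updated in place; otherwise
// "--flag=value" is appended.
func setArg(args []string, flag, value string) []string {
	out := append([]string(nil), args...)
	for i, a := range out {
		switch {
		case strings.HasPrefix(a, flag+"="):
			out[i] = flag + "=" + value
			return out
		case a == flag:
			// "--flag value" style: the next element is the value unless it is another flag.
			if i+1 < len(out) && !strings.HasPrefix(out[i+1], "-") {
				out[i+1] = value
				return out
			}
			// Flag present without a value: rewrite it in the "=" form.
			out[i] = flag + "=" + value
			return out
		}
	}
	return append(out, flag+"="+value)
}

func main() {
	args := []string{"--polling-interval", "30m", "--processing-strategy=EXECUTE_REMEDIATION"}
	fmt.Println(setArg(args, "--processing-strategy", "STORE_ONLY"))
	fmt.Println(setArg(args, "--polling-interval", "5m"))
}
```

Returning a copy rather than mutating the input keeps the original container args available for teardown.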

2452-2496: LGTM - Robust pod retrieval with proper readiness checks.

The function correctly waits for a pod that is not being deleted, is in Running phase, and has all containers ready. The delegation to GetPodOnWorkerNode for initial retrieval is appropriate.

docs/designs/025-processing-strategy-for-health-checks.md (3)

32-40: LGTM - Clear enum definition with appropriate values.

The ProcessingStrategy enum is well-defined with:

UNSPECIFIED = 0 as proto3 default

EXECUTE_REMEDIATION = 1 as the active processing mode

STORE_ONLY = 2 for observability-only mode

The descriptions in lines 33-35 clearly explain each value's purpose.

342-364: LGTM - Normalization strategy is sound.

The normalization of UNSPECIFIED to EXECUTE_REMEDIATION early in the gRPC handler ensures consistent behavior for legacy or custom monitors that don't set this field. This provides a safe default while maintaining backward compatibility.

582-610: LGTM - Backward compatibility approach is well-designed.

The use of $or to match both EXECUTE_REMEDIATION and missing processingstrategy field ensures historical events created before this feature are treated correctly. The comment on line 582-583 clearly explains the rationale.
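
The backward-compatible match can be sketched as a plain Go predicate. This is an in-memory illustration of the `$or` semantics, not the store-client pipeline code: the flat `processingstrategy` key and the `map[string]any` document shape are assumptions made for the example.

```go
package main

import "fmt"

// executeRemediation mirrors EXECUTE_REMEDIATION = 1 from the ProcessingStrategy enum.
const executeRemediation int32 = 1

// processable mimics the $or match: an event qualifies when its strategy is
// EXECUTE_REMEDIATION or when the field is absent (or null), as it is for
// events stored before the field was introduced.
func processable(event map[string]any) bool {
	v, present := event["processingstrategy"]
	if !present || v == nil {
		return true // historical event without the field
	}
	s, ok := v.(int32)
	return ok && s == executeRemediation
}

func main() {
	fmt.Println(processable(map[string]any{"checkname": "legacy"}))
	fmt.Println(processable(map[string]any{"processingstrategy": int32(1)}))
	fmt.Println(processable(map[string]any{"processingstrategy": int32(2)}))
}
```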

data-models/protobufs/health_event.proto (2)

32-40: LGTM - Well-documented enum with appropriate values.

The ProcessingStrategy enum follows proto3 best practices:

UNSPECIFIED = 0 as the default value

Clear documentation for each value

Explains the normalization behavior for UNSPECIFIED

79-79: LGTM - Field addition is wire-compatible.

Field number 16 is correctly chosen (sequentially after drainOverrides = 15). The new field is backward-compatible since proto3 treats missing fields as the default value (0 = UNSPECIFIED), which the platform connector normalizes to EXECUTE_REMEDIATION.

platform-connectors/pkg/server/platform_connector_server.go (1)

57-62: LGTM - Correct normalization placement.

The normalization loop is correctly placed before both pipeline processing and ring buffer enqueueing. This ensures:

All events stored in the database have an explicit strategy

All downstream consumers receive normalized events

Custom monitors without the field get safe defaults
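
A minimal sketch of the normalization described above, assuming simplified local types. `ProcessingStrategy`, `HealthEvent`, and `normalize` here are stand-ins written for this note, not the generated protobuf types or the actual server code.

```go
package main

import "fmt"

// ProcessingStrategy is a local stand-in for the generated protobuf enum.
type ProcessingStrategy int32

const (
	Unspecified        ProcessingStrategy = 0
	ExecuteRemediation ProcessingStrategy = 1
	StoreOnly          ProcessingStrategy = 2
)

// HealthEvent is a simplified stand-in carrying only the fields the sketch needs.
type HealthEvent struct {
	CheckName          string
	ProcessingStrategy ProcessingStrategy
}

// normalize rewrites UNSPECIFIED to EXECUTE_REMEDIATION in place, so every
// later stage sees an explicit strategy. Other values are left untouched.
func normalize(events []*HealthEvent) {
	for _, e := range events {
		if e != nil && e.ProcessingStrategy == Unspecified {
			e.ProcessingStrategy = ExecuteRemediation
		}
	}
}

func main() {
	events := []*HealthEvent{
		{CheckName: "legacy-monitor"}, // strategy left at the proto3 zero value
		{CheckName: "syslog", ProcessingStrategy: StoreOnly},
	}
	normalize(events)
	for _, e := range events {
		fmt.Println(e.CheckName, e.ProcessingStrategy)
	}
}
```

Mutating in place before any fan-out means the pipeline and the ring buffer observe the same normalized object.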

platform-connectors/pkg/server/platform_connector_server_test.go (1)

25-67: Well-structured table-driven test covering all normalization scenarios.

The test correctly validates:

UNSPECIFIED → EXECUTE_REMEDIATION normalization

EXECUTE_REMEDIATION preserved

STORE_ONLY preserved

The test verifies in-place mutation of the healthEvents object, which matches the implementation behavior.

One consideration: a learning retrieved for this repository suggests avoiding testify for simple equality checks. However, since testify is already an established dependency of this package, its use here is acceptable.

tests/event_exporter_test.go (2)

25-26: LGTM!

Import placement is correct with standard library imports grouped separately.

85-85: LGTM!

The ValidateCloudEvent call correctly includes the expected "EXECUTE_REMEDIATION" processing strategy, which aligns with the default behavior for health events that should trigger remediation.

tests/helpers/event_exporter.go (1)

220-246: LGTM!

The updated ValidateCloudEvent function correctly adds the expectedProcessingStrategy parameter and validates it against healthEvent["processingStrategy"]. The signature extension follows the existing parameter ordering pattern.

store-client/pkg/client/pipeline_builder.go (1)

35-47: LGTM!

The new interface methods are well-documented and follow the established naming convention. The comments clearly explain:

BuildProcessableHealthEventInsertsPipeline is for fault-quarantine to ignore STORE_ONLY events

BuildProcessableNonFatalUnhealthyInsertsPipeline is for health-events-analyzer pattern analysis

Both methods appropriately extend the PipelineBuilder interface for the new processing strategy filtering.

health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go (2)

95-110: LGTM!

The ProcessingStrategy field is correctly set on the HealthEvent from the handler's stored value. This ensures the strategy propagates through the health event pipeline for downstream filtering.

30-42: LGTM!

The constructor correctly accepts and stores the processingStrategy parameter. All callers—including the test cases and the production caller in syslogmonitor.go—have been properly updated to pass the new parameter.

store-client/pkg/client/mongodb_pipeline_builder.go (3)

17-21: LGTM!

Import of protos package added correctly for accessing ProcessingStrategy_EXECUTE_REMEDIATION constant.

129-156: LGTM with same consideration as above.

The implementation correctly extends BuildNonFatalUnhealthyInsertsPipeline with processing strategy filtering while maintaining the existing agent and ishealthy conditions. The $or clause for backward compatibility follows the same pattern as BuildProcessableHealthEventInsertsPipeline.

87-113: The pipeline code is correct as written. UNSPECIFIED enum values will never appear in MongoDB because the platform-connector normalizes them to EXECUTE_REMEDIATION before storage (see platform-connectors/pkg/server/platform_connector_server.go lines 57-61). Therefore, the $or condition correctly handles only the two cases that can exist in the database:

EXECUTE_REMEDIATION (1) for new events

Missing/null field for old events created before upgrade

No additional conditions are needed; $exists: false is the appropriate check for backward compatibility.

Likely an incorrect or invalid review comment.

health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go (2)

165-191: Good test coverage for ProcessingStrategy propagation.

The test correctly validates that the ProcessingStrategy passed to NewGPUFallenHandler is properly propagated to the generated HealthEvent. The table-driven approach with validateEvent callbacks allows for flexible event validation.

221-242: LGTM - Cross-validation of handler strategy with event output.

The test at line 241 properly validates that the handler's configured processingStrategy is correctly propagated to the generated event, providing good coverage for the strategy propagation path.

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)

1550-1635: Well-structured test with proper isolation.

The test correctly:

Creates isolated fake clients per test case

Properly filters NVSentinel-specific conditions from standard K8s conditions

Validates both node conditions and Kubernetes events based on ProcessingStrategy

Uses descriptive logging for debugging

health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go (1)

309-334: LGTM - Good coverage for ProcessingStrategy initialization.

The test correctly validates that:

NewSyslogMonitor accepts and stores ProcessingStrategy_STORE_ONLY

NewSyslogMonitorWithFactory accepts and stores ProcessingStrategy_EXECUTE_REMEDIATION

Both constructor variants are tested with different strategies, providing good coverage for the new parameter.

health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go (1)

143-196: Good addition of topology-aware SXID test with ProcessingStrategy.

This new test provides valuable coverage for:

SXID processing with valid NVSwitch topology metadata

ProcessingStrategy propagation to events

GPU entity lookup from NVLink topology

The temporary file setup and cleanup via t.TempDir() is clean and follows best practices.

tests/syslog_health_monitor_test.go (1)

361-420: Well-structured E2E test for STORE_ONLY strategy.

The test correctly:

Sets up the syslog monitor with --processing-strategy: STORE_ONLY

Injects XID messages and verifies no node condition is created

Verifies the node is not cordoned (quarantine state unchanged)

Resets to EXECUTE_REMEDIATION in teardown to restore the default behavior for other tests

The use of helpers.EnsureNodeConditionNotPresent and helpers.AssertQuarantineState provides clear negative assertions.

platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)

327-345: LGTM! Clean filtering implementation for STORE_ONLY events.

The function correctly filters out STORE_ONLY events with appropriate logging for observability. The log includes relevant context (node, checkName, agent) for debugging.
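
As an illustration of the filtering step, here is a sketch under simplified assumptions. The types and `filterProcessable` are hypothetical stand-ins, not the code in process_node_events.go; only the logged keys (node, checkName, agent) are taken from the comment above.

```go
package main

import (
	"fmt"
	"log/slog"
)

// ProcessingStrategy is a local stand-in for the generated protobuf enum.
type ProcessingStrategy int32

const (
	ExecuteRemediation ProcessingStrategy = 1
	StoreOnly          ProcessingStrategy = 2
)

// HealthEvent is a simplified stand-in carrying only the fields used here.
type HealthEvent struct {
	NodeName           string
	CheckName          string
	Agent              string
	ProcessingStrategy ProcessingStrategy
}

// filterProcessable drops STORE_ONLY events and logs each skipped event, so
// only the remaining events go on to produce node conditions or node events.
func filterProcessable(events []*HealthEvent) []*HealthEvent {
	kept := make([]*HealthEvent, 0, len(events))
	for _, e := range events {
		if e.ProcessingStrategy == StoreOnly {
			slog.Info("Skipping STORE_ONLY health event",
				"node", e.NodeName, "checkName", e.CheckName, "agent", e.Agent)
			continue
		}
		kept = append(kept, e)
	}
	return kept
}

func main() {
	events := []*HealthEvent{
		{NodeName: "node-a", CheckName: "SysLogsXIDError", Agent: "syslog-health-monitor", ProcessingStrategy: StoreOnly},
		{NodeName: "node-b", CheckName: "SysLogsXIDError", Agent: "syslog-health-monitor", ProcessingStrategy: ExecuteRemediation},
	}
	fmt.Println(len(filterProcessable(events)))
}
```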

store-client/pkg/client/postgresql_pipeline_builder.go (2)

119-146: LGTM! Well-documented backward-compatible pipeline.

The pipeline correctly:

Uses lowercase field name processingstrategy (per learnings for MongoDB/PostgreSQL consistency)

Implements backward compatibility via $or for missing/null fields

Matches the EXECUTE_REMEDIATION strategy using the protobuf enum value

Clear documentation explains the upgrade path

163-192: LGTM! Consistent implementation with clear documentation.

The pipeline correctly extends BuildNonFatalUnhealthyInsertsPipeline with processing strategy filtering while maintaining:

Same backward compatibility pattern

Consistent field naming

Clear documentation about health-events-analyzer use case

health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go (4)

40-58: LGTM! Clean parameter threading through constructor.

The processingStrategy parameter is correctly added to NewSyslogMonitor and forwarded to NewSyslogMonitorWithFactory. The parameter ordering is consistent with the existing pattern.

90-104: LGTM! Correct storage and propagation of processingStrategy.

The processingStrategy is properly stored in the SyslogMonitor struct and will be available for health event generation.

106-141: LGTM! Consistent propagation to all handler types.

All three handler types (XID, SXID, GPUFallen) correctly receive the processingStrategy parameter, ensuring consistent event generation across all check types.

792-820: LGTM! Correct inclusion of ProcessingStrategy in health events.

The ProcessingStrategy field is correctly set from the monitor's stored strategy, ensuring all events generated by prepareHealthEventWithAction carry the configured processing strategy.

github-actions · 2026-01-12T04:56:08Z

Merging this branch will increase overall coverage

Impacted Packages	Coverage Δ	🤖
github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/kubernetes	83.06% (+0.21%)	👍
github.com/nvidia/nvsentinel/tests	0.00% (ø)
github.com/nvidia/nvsentinel/tests/helpers	0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File	Coverage Δ	Total	Covered	Missed	🤖
github.com/nvidia/nvsentinel/tests/helpers/kube.go	0.00% (ø)	0	0	0
github.com/nvidia/nvsentinel/tests/helpers/syslog-health-monitor.go	0.00% (ø)	0	0	0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
github.com/nvidia/nvsentinel/tests/syslog_health_monitor_test.go

Signed-off-by: Tanisha goyal <[email protected]>

coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In @tests/helpers/kube.go:
- Around line 2275-2325: Remove the dead variable and the hard-coded sleep:
delete the unused originalDaemonSet declaration and the assignment to
originalDaemonSet inside UpdateDaemonSetArgs (the DeepCopy where
originalDaemonSet is set) since it is never read, and remove the explicit
time.Sleep(10 * time.Second) after waitForDaemonSetRollout; rely on
waitForDaemonSetRollout to ensure pods are ready (no changes needed to
setArgsOnContainer or the client update logic).

🧹 Nitpick comments (5)

health-monitors/syslog-health-monitor/main.go (1)
164-171: Consider rejecting UNSPECIFIED as a valid input.

The validation uses pb.ProcessingStrategy_value which includes UNSPECIFIED (value 0), but the flag description only lists EXECUTE_REMEDIATION and STORE_ONLY as valid values. If a user passes --processing-strategy=UNSPECIFIED, it would be accepted silently.
🔧 Suggested fix to explicitly reject UNSPECIFIED
 	value, ok := pb.ProcessingStrategy_value[*processingStrategyFlag]
 	if !ok {
 		return fmt.Errorf("unexpected processingStrategy value: %q", *processingStrategyFlag)
 	}
+	if value == int32(pb.ProcessingStrategy_UNSPECIFIED) {
+		return fmt.Errorf("processingStrategy cannot be UNSPECIFIED, use EXECUTE_REMEDIATION or STORE_ONLY")
+	}

 	slog.Info("Event handling strategy configured", "processingStrategy", *processingStrategyFlag)
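
The suggested check can be tried in isolation. The following is a minimal sketch, not the PR's code: `processingStrategyValue` is a hypothetical stand-in for the generated `pb.ProcessingStrategy_value` map, and the numeric value assigned to STORE_ONLY is an assumption.

```go
package main

import "fmt"

// processingStrategyValue is a hypothetical stand-in for the generated
// pb.ProcessingStrategy_value map. UNSPECIFIED=0 and EXECUTE_REMEDIATION=1
// follow the review text; STORE_ONLY=2 is assumed for illustration.
var processingStrategyValue = map[string]int32{
	"UNSPECIFIED":         0,
	"EXECUTE_REMEDIATION": 1,
	"STORE_ONLY":          2,
}

// parseProcessingStrategy rejects unknown names and the zero-value UNSPECIFIED.
func parseProcessingStrategy(name string) (int32, error) {
	value, ok := processingStrategyValue[name]
	if !ok {
		return 0, fmt.Errorf("unexpected processingStrategy value: %q", name)
	}
	if value == 0 {
		return 0, fmt.Errorf("processingStrategy cannot be UNSPECIFIED, use EXECUTE_REMEDIATION or STORE_ONLY")
	}
	return value, nil
}

func main() {
	for _, name := range []string{"STORE_ONLY", "UNSPECIFIED", "bogus"} {
		value, err := parseProcessingStrategy(name)
		fmt.Println(name, value, err)
	}
}
```

Keeping the lookup and the UNSPECIFIED check in one helper would also make the validation unit-testable without going through flag parsing.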
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (1)
427-446: TestNewXIDHandler updated for new constructor signature.

The test correctly passes the processingStrategy parameter. Consider adding an assertion to verify handler.processingStrategy matches the input value for completeness.
💡 Optional: Add assertion for processingStrategy field
 				assert.NotNil(t, handler.pciToGPUUUID)
 				assert.NotNil(t, handler.parser)
 				assert.NotNil(t, handler.metadataReader)
+				assert.Equal(t, pb.ProcessingStrategy_EXECUTE_REMEDIATION, handler.processingStrategy)
 			}
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go (1)
63-75: Consider using an options pattern for constructor readability.

The NewSyslogMonitorWithFactory function now has 11 parameters, which can be challenging to maintain. A functional options pattern or a config struct could improve readability and make future additions easier.
💡 Example: Config struct pattern
type SyslogMonitorConfig struct {
    NodeName              string
    Checks                []CheckDefinition
    PcClient              pb.PlatformConnectorClient
    DefaultAgentName      string
    DefaultComponentClass string
    PollingInterval       string
    StateFilePath         string
    JournalFactory        JournalFactory
    XidAnalyserEndpoint   string
    MetadataPath          string
    ProcessingStrategy    pb.ProcessingStrategy
}

func NewSyslogMonitorWithConfig(cfg SyslogMonitorConfig) (*SyslogMonitor, error) {
    // ...
}
This is a nice-to-have improvement that could be addressed in a follow-up refactor.
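
For comparison, the functional options alternative mentioned above might look roughly like this. It is only a sketch: `monitorConfig`, `Option`, the `With...` helpers, and the "30m" default are all hypothetical names and values chosen for illustration, not part of the PR.

```go
package main

import "fmt"

// monitorConfig and its fields are hypothetical; they only illustrate the pattern.
type monitorConfig struct {
	nodeName           string
	pollingInterval    string
	processingStrategy string
}

// Option mutates a monitorConfig during construction.
type Option func(*monitorConfig)

// WithPollingInterval overrides the assumed default polling interval.
func WithPollingInterval(interval string) Option {
	return func(c *monitorConfig) { c.pollingInterval = interval }
}

// WithProcessingStrategy overrides the default processing strategy.
func WithProcessingStrategy(strategy string) Option {
	return func(c *monitorConfig) { c.processingStrategy = strategy }
}

// newMonitorConfig applies defaults first, then each option in order.
func newMonitorConfig(nodeName string, opts ...Option) monitorConfig {
	cfg := monitorConfig{
		nodeName:           nodeName,
		pollingInterval:    "30m", // placeholder default, assumed
		processingStrategy: "EXECUTE_REMEDIATION",
	}
	for _, opt := range opts {
		opt(&cfg)
	}
	return cfg
}

func main() {
	cfg := newMonitorConfig("node-a", WithProcessingStrategy("STORE_ONLY"))
	fmt.Printf("%+v\n", cfg)
}
```

Compared with the config struct, options keep required arguments positional and let new settings be added without touching existing call sites, at the cost of a little more boilerplate per setting.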
tests/helpers/kube.go (2)
2452-2496: Parameter daemonsetName is unused for filtering.

The daemonsetName parameter is only used in the error message (line 2492) but not for actually filtering or validating that the found pod belongs to the specified DaemonSet. This could lead to returning pods from other DaemonSets if podNamePattern is not specific enough.

Consider either:

- Using daemonsetName to verify the pod's owner reference, or
- Removing the parameter if podNamePattern is sufficient for filtering.
♻️ Option: Verify pod ownership (if validation is needed)
 func GetDaemonSetPodOnWorkerNode(ctx context.Context, t *testing.T, client klient.Client,
 	daemonsetName string, podNamePattern string) (*v1.Pod, error) {
 	t.Helper()
 
 	var resultPod *v1.Pod
 
 	require.Eventually(t, func() bool {
 		// Get the pod
 		pod, err := GetPodOnWorkerNode(ctx, t, client, NVSentinelNamespace, podNamePattern)
 		if err != nil {
 			t.Logf("Failed to get pod: %v", err)
 			return false
 		}
 
+		// Verify pod belongs to the expected DaemonSet
+		belongsToDaemonSet := false
+		for _, ownerRef := range pod.OwnerReferences {
+			if ownerRef.Kind == "DaemonSet" && ownerRef.Name == daemonsetName {
+				belongsToDaemonSet = true
+				break
+			}
+		}
+		if !belongsToDaemonSet {
+			t.Logf("Pod %s does not belong to DaemonSet %s", pod.Name, daemonsetName)
+			return false
+		}
+
 		// Verify pod is not being deleted
2327-2330: Missing godoc comment for exported function.

Per coding guidelines, exported Go functions require function comments. RemoveDaemonSetArgs is missing a godoc comment explaining its purpose.
📝 Add godoc comment
+// RemoveDaemonSetArgs removes the specified arguments from a DaemonSet container
+// and waits for the rollout to complete.
 func RemoveDaemonSetArgs(ctx context.Context, t *testing.T, client klient.Client,
 	daemonsetName string,
 	containerName string, args map[string]string,
 ) error {

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7952551 and d8972e9.

⛔ Files ignored due to path filters (1)

data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go

📒 Files selected for processing (21)

distros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpl
distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml
health-monitors/syslog-health-monitor/main.go
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
health-monitors/syslog-health-monitor/pkg/gpufallen/types.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
health-monitors/syslog-health-monitor/pkg/sxid/types.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go
health-monitors/syslog-health-monitor/pkg/xid/types.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
tests/event_exporter_test.go
tests/helpers/event_exporter.go
tests/helpers/kube.go
tests/helpers/syslog-health-monitor.go
tests/syslog_health_monitor_test.go

🚧 Files skipped from review as they are similar to previous changes (7)

health-monitors/syslog-health-monitor/pkg/xid/types.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
health-monitors/syslog-health-monitor/pkg/gpufallen/types.go
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go
tests/helpers/syslog-health-monitor.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
distros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpl

🧰 Additional context used

📓 Path-based instructions (3)

**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
tests/event_exporter_test.go
tests/syslog_health_monitor_test.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
health-monitors/syslog-health-monitor/main.go
tests/helpers/event_exporter.go
tests/helpers/kube.go
health-monitors/syslog-health-monitor/pkg/sxid/types.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go

**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

tests/event_exporter_test.go
tests/syslog_health_monitor_test.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go

**/values.yaml

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/values.yaml: Document all values in Helm chart values.yaml with inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable

Files:

distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml

🧠 Learnings (14)

📓 Common learnings

Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

tests/event_exporter_test.go
tests/syslog_health_monitor_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

tests/event_exporter_test.go
tests/syslog_health_monitor_test.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go

📚 Learning: 2025-12-22T16:16:24.320Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

tests/event_exporter_test.go
tests/syslog_health_monitor_test.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods

Applied to files:

tests/event_exporter_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

tests/syslog_health_monitor_test.go

📚 Learning: 2025-11-07T04:16:43.079Z

Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.

Applied to files:

tests/syslog_health_monitor_test.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
health-monitors/syslog-health-monitor/main.go
health-monitors/syslog-health-monitor/pkg/sxid/types.go

📚 Learning: 2026-01-12T05:13:19.057Z

Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:347-372
Timestamp: 2026-01-12T05:13:19.057Z
Learning: In the NVSentinel codebase, all HealthEvent creation paths across health monitors (syslog-health-monitor, kubernetes-object-monitor, csp-health-monitor) consistently set GeneratedTimestamp to timestamppb.New(time.Now()), ensuring it is never nil in production flows.

Applied to files:

tests/syslog_health_monitor_test.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
health-monitors/syslog-health-monitor/pkg/sxid/types.go

📚 Learning: 2026-01-07T09:54:49.335Z

Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

Applied to files:

distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go

📚 Learning: 2025-12-23T10:34:13.121Z

Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml
tests/helpers/event_exporter.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/daemonset*.yaml : Explain DaemonSet variant selection logic in Helm chart documentation

Applied to files:

tests/helpers/kube.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on kata.enabled label

Applied to files:

tests/helpers/kube.go

📚 Learning: 2025-12-22T16:16:31.660Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.

Applied to files:

tests/helpers/kube.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Each health monitor service should implement the gRPC service from `data-models/protobufs/`

Applied to files:

health-monitors/syslog-health-monitor/pkg/sxid/types.go

🧬 Code graph analysis (8)

health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (2)

data-models/pkg/protos/health_event.pb.go (4)

ProcessingStrategy (45-45)

ProcessingStrategy (77-79)

ProcessingStrategy (81-83)

ProcessingStrategy (90-92)

health-monitors/syslog-health-monitor/pkg/xid/types.go (1)

XIDHandler (29-39)

tests/event_exporter_test.go (1)

tests/helpers/event_exporter.go (1)

ValidateCloudEvent (221-247)

health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go (3)

data-models/pkg/protos/health_event.pb.go (4)

ProcessingStrategy (45-45)

ProcessingStrategy (77-79)

ProcessingStrategy (81-83)

ProcessingStrategy (90-92)

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

ProcessingStrategy (14-18)

health-monitors/syslog-health-monitor/pkg/sxid/types.go (1)

SXIDHandler (29-36)

health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (4)

data-models/pkg/protos/health_event.pb.go (6)

ProcessingStrategy_EXECUTE_REMEDIATION (49-49)

ProcessingStrategy_STORE_ONLY (50-50)

ProcessingStrategy (45-45)

ProcessingStrategy (77-79)

ProcessingStrategy (81-83)

ProcessingStrategy (90-92)

health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)

NewXIDHandler (33-58)

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

ProcessingStrategy (14-18)

health-monitors/syslog-health-monitor/pkg/xid/types.go (1)

XIDHandler (29-39)

health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go (1)

data-models/pkg/protos/health_event.pb.go (2)

ProcessingStrategy_STORE_ONLY (50-50)

ProcessingStrategy_EXECUTE_REMEDIATION (49-49)

health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go (1)

data-models/pkg/protos/health_event.pb.go (2)

ProcessingStrategy_STORE_ONLY (50-50)

ProcessingStrategy_EXECUTE_REMEDIATION (49-49)

health-monitors/syslog-health-monitor/main.go (1)

data-models/pkg/protos/health_event.pb.go (1)

ProcessingStrategy_value (60-64)

health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go (3)

data-models/pkg/protos/health_event.pb.go (4)

ProcessingStrategy (45-45)

ProcessingStrategy (77-79)

ProcessingStrategy (81-83)

ProcessingStrategy (90-92)

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

ProcessingStrategy (14-18)

health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go (1)

SyslogMonitor (47-67)

🔇 Additional comments (30)

distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml (1)

53-59: LGTM! Well-documented configuration option.

The new processingStrategy value is properly documented with inline comments explaining both valid values and their behavioral differences. This follows the Helm chart documentation guidelines.

tests/event_exporter_test.go (1)

85-85: LGTM!

The updated call correctly passes "EXECUTE_REMEDIATION" to validate the expected processing strategy in the CloudEvent, aligning with the new helper function signature and the default strategy.

tests/helpers/event_exporter.go (1)

220-246: LGTM!

The helper function is properly extended to validate the processingStrategy field in CloudEvents. The assertion at line 245 correctly uses require.Equal for exact string matching, consistent with the other field validations in this function.

tests/syslog_health_monitor_test.go (2)

361-420: LGTM! Good test coverage for STORE_ONLY strategy.

The test correctly validates the STORE_ONLY behavior:

- Setup with --processing-strategy: STORE_ONLY
- Verifies no node condition is created after XID injection
- Confirms node quarantine state is unchanged
- Teardown uses EXECUTE_REMEDIATION to ensure proper cleanup (since STORE_ONLY won't modify cluster state)

The test follows the descriptive naming format per coding guidelines.

159-170: Consistent teardown pattern applied.

The centralized TearDownSyslogHealthMonitor helper improves test maintainability. The pattern of extracting context values and passing them to the helper is applied consistently across the updated tests.

health-monitors/syslog-health-monitor/pkg/sxid/types.go (1)

29-36: LGTM!

The processingStrategy field is correctly added as an unexported field using the protobuf enum type pb.ProcessingStrategy. This follows the existing struct pattern and ensures type safety for the processing strategy propagation.

health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go (1)

46-67: LGTM!

The processingStrategy field is correctly added to SyslogMonitor struct, positioned logically with other configuration fields. The field follows the same pattern as other handler types in the PR.

health-monitors/syslog-health-monitor/main.go (2)

70-71: LGTM! Clear flag definition.

The flag description correctly documents the two valid user-facing values (EXECUTE_REMEDIATION or STORE_ONLY) with an appropriate default.

173-184: LGTM! Proper wiring to NewSyslogMonitor.

The validated and converted processingStrategy is correctly passed to the NewSyslogMonitor constructor, completing the configuration flow from CLI flag to monitor initialization.

health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go (2)

30-42: Constructor update for processingStrategy looks correct.

The new processingStrategy parameter is properly accepted and stored in the handler struct. The signature change is consistent with the pattern used across other handlers (XID, GPUFallen).

95-110: ProcessingStrategy correctly propagated to HealthEvent.

The processingStrategy field is properly included in the HealthEvent construction, ensuring events carry the configured strategy for downstream processing.

health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (2)

33-57: Constructor properly updated with processingStrategy parameter.

The new parameter is correctly positioned, stored in the handler struct, and follows the same pattern as the SXID handler. The implementation is consistent across all handlers in this PR.

175-190: HealthEvent correctly includes ProcessingStrategy.

The processingStrategy is properly propagated to the health event, ensuring consistent behavior with other handlers.

health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (3)

153-195: Good test coverage for ProcessingStrategy propagation.

The test correctly uses pb.ProcessingStrategy_STORE_ONLY and validates that the generated event carries the same value (Line 194). This ensures the strategy is properly propagated through the handler.

197-232: Test validates EXECUTE_REMEDIATION strategy.

This test case complements the previous one by testing pb.ProcessingStrategy_EXECUTE_REMEDIATION, providing coverage for both primary strategy values.

306-350: TestCreateHealthEventFromResponse validates strategy in health event construction.

The test verifies that ProcessingStrategy is included in the generated health event. The assertion on Line 348 confirms the expected behavior.

health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go (3)

165-190: TestProcessLine correctly validates ProcessingStrategy.

The test uses pb.ProcessingStrategy_STORE_ONLY and verifies the strategy is propagated to the generated event (Line 185). Good coverage of the happy path.

221-242: Direct field access for ProcessingStrategy assertion.

Line 241 uses handler2.processingStrategy directly, which is valid since this is a test within the same package. This approach correctly validates that the event's ProcessingStrategy matches the handler's configured value.

193-219: XID tracking tests updated with ProcessingStrategy.

The XID tracking tests correctly pass the processingStrategy parameter to the handler constructor. The test logic for suppression behavior remains intact.

health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go (3)

309-317: Test validates ProcessingStrategy storage in SyslogMonitor.

Line 317 correctly asserts that the processingStrategy is stored in the monitor. This ensures the value is properly passed through and retained.

329-334: NewSyslogMonitorWithFactory tested with EXECUTE_REMEDIATION.

Good coverage of both constructor paths with different strategy values, ensuring both NewSyslogMonitor and NewSyslogMonitorWithFactory correctly handle the new parameter.

391-403: TestJournalProcessingLogic updated with processingStrategy.

The test correctly passes the strategy to the factory constructor. The test focuses on journal cursor handling, which is orthogonal to the strategy propagation.

health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go (4)

40-57: NewSyslogMonitor correctly accepts and forwards processingStrategy.

The parameter is properly added and forwarded to NewSyslogMonitorWithFactory. The implementation maintains backward compatibility with the factory pattern.

63-104: NewSyslogMonitorWithFactory stores and propagates processingStrategy.

The strategy is correctly stored in the SyslogMonitor struct (Line 96) and will be used when creating handler instances. The initialization is consistent with existing patterns.

106-141: Handler creation consistently passes processingStrategy.

All three handler types (XID, SXID, GPUFallen) receive the processingStrategy parameter, ensuring consistent behavior across all check types. The error handling for handler initialization is appropriate.

792-820: prepareHealthEventWithAction includes ProcessingStrategy in health events.

Line 813 correctly sets ProcessingStrategy: sm.processingStrategy in the generated health event. This ensures events created via this path (e.g., healthy events after reboot on Line 312) carry the configured strategy.

Based on learnings, healthy events with EXECUTE_REMEDIATION are valid when the Fault Quarantine Manager needs to act on them to clear previous fault states.

tests/helpers/kube.go (4)

2232-2273: LGTM!

The waitForDaemonSetRollout function correctly waits for a DaemonSet rollout to complete by verifying that all desired pods are scheduled, updated, and ready. The logging is helpful for debugging test failures.
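
The completion condition described here can be sketched as a pure predicate. This is an illustration under assumptions, not the helper's actual body: `daemonSetStatus` is a local stand-in for the relevant subset of `appsv1.DaemonSetStatus`, and treating a zero desired count as "not complete" is an assumed choice.

```go
package main

import "fmt"

// daemonSetStatus is a local stand-in for the subset of
// appsv1.DaemonSetStatus fields used in this sketch.
type daemonSetStatus struct {
	DesiredNumberScheduled int32
	UpdatedNumberScheduled int32
	NumberReady            int32
}

// rolloutComplete reports whether every desired pod is updated and ready.
// A DaemonSet with no desired pods is treated as not yet rolled out (assumption).
func rolloutComplete(s daemonSetStatus) bool {
	return s.DesiredNumberScheduled > 0 &&
		s.UpdatedNumberScheduled == s.DesiredNumberScheduled &&
		s.NumberReady == s.DesiredNumberScheduled
}

func main() {
	inProgress := daemonSetStatus{DesiredNumberScheduled: 3, UpdatedNumberScheduled: 2, NumberReady: 3}
	done := daemonSetStatus{DesiredNumberScheduled: 3, UpdatedNumberScheduled: 3, NumberReady: 3}
	fmt.Println(rolloutComplete(inProgress), rolloutComplete(done))
}
```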

2370-2400: LGTM - logic is correct.

The function properly handles both --flag=value and --flag value argument styles. The slice manipulation on line 2392 is complex but correct.

2402-2424: LGTM!

The function correctly sets or updates container arguments, handling both new and existing flags.
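
As a rough illustration of set-or-update semantics on an argument list, the sketch below works on a plain `[]string`. `setArgs` is a hypothetical name, not the helper in tests/helpers/kube.go, and normalizing every match to the `--flag=value` form is an assumed design choice.

```go
package main

import (
	"fmt"
	"strings"
)

// setArgs returns a copy of args with flag set to value. An existing
// "--flag=value" or "--flag value" entry is replaced; otherwise the flag
// is appended. The result always uses the "--flag=value" form.
func setArgs(args []string, flag, value string) []string {
	out := make([]string, 0, len(args)+1)
	replaced := false
	for i := 0; i < len(args); i++ {
		arg := args[i]
		switch {
		case strings.HasPrefix(arg, flag+"="):
			out = append(out, flag+"="+value)
			replaced = true
		case arg == flag:
			// Consume the following element when it looks like this flag's value.
			if i+1 < len(args) && !strings.HasPrefix(args[i+1], "-") {
				i++
			}
			out = append(out, flag+"="+value)
			replaced = true
		default:
			out = append(out, arg)
		}
	}
	if !replaced {
		out = append(out, flag+"="+value)
	}
	return out
}

func main() {
	args := []string{"--a=1", "--processing-strategy", "EXECUTE_REMEDIATION", "--b"}
	fmt.Println(setArgs(args, "--processing-strategy", "STORE_ONLY"))
}
```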

2426-2450: LGTM!

The function correctly removes container arguments, handling both --flag=value and --flag value styles.
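
Removal across both argument styles can be sketched the same way. `removeArgs` is again a hypothetical name operating on a plain `[]string`; the heuristic that a following element not starting with `-` is the flag's value is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// removeArgs returns a copy of args without flag, dropping both the
// "--flag=value" form and the "--flag value" form.
func removeArgs(args []string, flag string) []string {
	out := make([]string, 0, len(args))
	for i := 0; i < len(args); i++ {
		arg := args[i]
		if strings.HasPrefix(arg, flag+"=") {
			continue
		}
		if arg == flag {
			// Also skip the next element when it looks like this flag's value.
			if i+1 < len(args) && !strings.HasPrefix(args[i+1], "-") {
				i++
			}
			continue
		}
		out = append(out, arg)
	}
	return out
}

func main() {
	args := []string{"--a=1", "--processing-strategy", "STORE_ONLY", "--b"}
	fmt.Println(removeArgs(args, "--processing-strategy"))
}
```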

tests/helpers/kube.go

github-actions · 2026-01-12T09:07:10Z

🛡️ CodeQL Analysis

🚨 Found 1 security alert(s)

Signed-off-by: Tanisha goyal <[email protected]>

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In @tests/helpers/syslog-health-monitor.go:
- Line 15: Add a package-level godoc comment for the helpers package by
inserting a comment like "// Package helpers ...", placed immediately before the
"package helpers" declaration (you can add it in
tests/helpers/syslog-health-monitor.go or create a new doc.go); ensure the
comment briefly describes the package purpose and follows Go doc convention
starting with "Package helpers".

🧹 Nitpick comments (5)

tests/helpers/kube.go (4)
2315-2316: Consider replacing fixed sleep with a more deterministic wait.

The 10-second sleep after rollout completion is a workaround that may cause flaky tests or unnecessary delays. Consider polling for pod readiness or container startup instead.

However, since this is a test helper and the rollout wait already ensures pods are ready, this may be acceptable if it addresses timing issues with the syslog health monitor initialization.
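
One way to replace a fixed sleep is a small polling loop bounded by a context deadline. The sketch below is generic and makes no claim about what readiness signal the syslog health monitor exposes: `pollUntil` and `waitReady` are hypothetical names, and the time-based condition merely stands in for a real readiness check.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// pollUntil calls cond immediately and then every interval until it
// returns true or ctx is done.
func pollUntil(ctx context.Context, interval time.Duration, cond func() bool) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if cond() {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("condition not met: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}

// waitReady is a demo harness: the stand-in condition turns true after
// readyAfter, and polling gives up after timeout.
func waitReady(readyAfter, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	start := time.Now()
	ready := func() bool { return time.Since(start) >= readyAfter }
	return pollUntil(ctx, 5*time.Millisecond, ready)
}

func main() {
	fmt.Println("ready in time:", waitReady(20*time.Millisecond, time.Second) == nil)
	fmt.Println("timed out:", waitReady(time.Second, 30*time.Millisecond) != nil)
}
```

In a test helper, the same effect is available through testify's `require.Eventually`, which the surrounding helpers already use.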

2321-2360: Inconsistent error handling between UpdateDaemonSetArgs and RemoveDaemonSetArgs.

UpdateDaemonSetArgs returns an error, while RemoveDaemonSetArgs uses require.NoError directly and returns nothing. This inconsistency makes the API harder to use and understand.

Consider aligning the signatures for consistency:
♻️ Suggested fix
 func RemoveDaemonSetArgs(ctx context.Context, t *testing.T, client klient.Client,
 	daemonsetName string,
 	containerName string, args map[string]string,
-) {
+) error {
 	t.Helper()
 
 	t.Logf("Removing args %v from daemonset %s/%s", args, NVSentinelNamespace, daemonsetName)
 
 	err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
 		// ... existing code ...
 	})
-	require.NoError(t, err, "failed to remove args from daemonset %s/%s", NVSentinelNamespace, daemonsetName)
+	if err != nil {
+		return fmt.Errorf("failed to remove args from daemonset %s/%s: %w", NVSentinelNamespace, daemonsetName, err)
+	}
 
 	t.Logf("Waiting for daemonset %s/%s rollout to complete after restoration", NVSentinelNamespace, daemonsetName)
 	waitForDaemonSetRollout(ctx, t, client, daemonsetName)
 
 	t.Log("DaemonSet restored successfully")
+	return nil
 }
2418-2442: Parameter type could be simplified.

The args parameter is map[string]string but only the keys are used (values are ignored via range args). Consider using []string for clarity, or document that only keys matter.

This is a minor inconsistency with setArgsOnContainer which uses both keys and values.

2444-2488: Unused daemonsetName parameter.

The daemonsetName parameter is only used in the error message at line 2484. The actual pod lookup uses podNamePattern via GetPodOnWorkerNode. Consider either:

1. Removing the unused parameter
2. Using it to verify the pod belongs to the expected DaemonSet via owner references
♻️ Option 2: Verify DaemonSet ownership
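// After getting the pod, verify it belongs to the expected DaemonSet
ownedByDaemonSet := false
for _, ownerRef := range pod.OwnerReferences {
    if ownerRef.Kind == "DaemonSet" && ownerRef.Name == daemonsetName {
        ownedByDaemonSet = true
        break
    }
}
if !ownedByDaemonSet {
    t.Logf("Pod %s does not belong to DaemonSet %s", pod.Name, daemonsetName)
    return false
}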
tests/helpers/syslog-health-monitor.go (1)
32-34: Function comments should follow Go conventions.

Per coding guidelines, function comments are required for all exported Go functions, and by Go convention each comment should start with the function name.
♻️ Suggested fix
-// helper function to set up syslog health monitor and port forward to it
+// SetUpSyslogHealthMonitor sets up the syslog health monitor for tests.
+// If args is provided, it updates the DaemonSet with the specified arguments.
+// Returns the test node name, the syslog pod, and a stop channel for the port-forward.
 func SetUpSyslogHealthMonitor(ctx context.Context, t *testing.T,
 	client klient.Client, args map[string]string) (string, *v1.Pod, chan struct{}) {

-// helper function to roll back syslog health monitor daemonset and stop the port forward
+// TearDownSyslogHealthMonitor cleans up the syslog health monitor test environment.
+// It stops the port-forward, removes DaemonSet args if provided, restarts the pod,
+// cleans up metadata, and removes the ManagedByNVSentinel label.
 func TearDownSyslogHealthMonitor(ctx context.Context, t *testing.T, client klient.Client,
Also applies to: 73-76

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d8972e9 and 49d8021.

📒 Files selected for processing (4)

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
tests/helpers/kube.go
tests/helpers/syslog-health-monitor.go
tests/syslog_health_monitor_test.go

🧰 Additional context used

📓 Path-based instructions (2)

**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

tests/syslog_health_monitor_test.go
tests/helpers/syslog-health-monitor.go
tests/helpers/kube.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go

**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

tests/syslog_health_monitor_test.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go

🧠 Learnings (11)

📓 Common learnings

Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

tests/syslog_health_monitor_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

tests/syslog_health_monitor_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

tests/syslog_health_monitor_test.go

📚 Learning: 2025-12-22T16:16:24.320Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

tests/syslog_health_monitor_test.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go

📚 Learning: 2025-11-07T04:16:43.079Z

Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.

Applied to files:

tests/syslog_health_monitor_test.go

📚 Learning: 2026-01-12T05:13:19.057Z

Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:347-372
Timestamp: 2026-01-12T05:13:19.057Z
Learning: In the NVSentinel codebase, all HealthEvent creation paths across health monitors (syslog-health-monitor, kubernetes-object-monitor, csp-health-monitor) consistently set GeneratedTimestamp to timestamppb.New(time.Now()), ensuring it is never nil in production flows.

Applied to files:

tests/syslog_health_monitor_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on kata.enabled label

Applied to files:

tests/helpers/kube.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/daemonset*.yaml : Explain DaemonSet variant selection logic in Helm chart documentation

Applied to files:

tests/helpers/kube.go

📚 Learning: 2025-12-22T16:16:31.660Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.

Applied to files:

tests/helpers/kube.go

📚 Learning: 2025-12-23T10:34:13.121Z

Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go

🧬 Code graph analysis (2)

tests/syslog_health_monitor_test.go (3)

tests/helpers/syslog-health-monitor.go (2)

SetUpSyslogHealthMonitor (33-71)

StubJournalHTTPPort (27-27)

tests/helpers/kube.go (2)

InjectSyslogMessages (1820-1843)

PortForwardPod (1614-1658)

tests/helpers/fault_quarantine.go (2)

AssertQuarantineState (317-384)

QuarantineAssertion (56-60)

tests/helpers/syslog-health-monitor.go (3)

tests/helpers/kube.go (8)

UpdateDaemonSetArgs (2276-2319)

GetDaemonSetPodOnWorkerNode (2444-2488)

NVSentinelNamespace (64-64)

PortForwardPod (1614-1658)

RemoveDaemonSetArgs (2321-2360)

EventuallyWaitTimeout (61-61)

WaitInterval (63-63)

RemoveNodeManagedByNVSentinelLabel (1411-1425)

tests/helpers/metadata.go (3)

CreateTestMetadata (59-108)

InjectMetadata (110-181)

DeleteMetadata (183-229)

commons/pkg/auditlogger/auditlogger.go (1)

Log (114-134)

🔇 Additional comments (14)

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (2)

1606-1614: LGTM! Good defensive coding pattern.

Using require.NotEmpty before accessing nvsentinelConditions[0] prevents an index-out-of-range panic if the slice is unexpectedly empty. Using require rather than assert stops the test immediately, so the subsequent [0] access is never reached. Using assert.Empty for the negative case is cleaner and more idiomatic.

1622-1630: LGTM! Consistent safe access pattern.

Same defensive pattern correctly applied to Kubernetes events assertions - require.NotEmpty guards the subsequent events.Items[0] access.

tests/helpers/kube.go (3)

2232-2273: LGTM! Well-structured DaemonSet rollout helper.

The waitForDaemonSetRollout function correctly checks all necessary conditions for a complete rollout: DesiredNumberScheduled > 0, UpdatedNumberScheduled matches desired, and NumberReady matches desired. Good progress logging for debugging.
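
The three conditions listed above can be sketched as a single predicate. This is an illustration only: `daemonSetStatus` is a local stand-in for the corresponding `appsv1.DaemonSetStatus` fields, and `rolloutComplete` is a hypothetical name, not the helper in `tests/helpers/kube.go`.

```go
package main

// daemonSetStatus mirrors the three appsv1.DaemonSetStatus fields named
// in the review comment. Local stand-in, not the Kubernetes type.
type daemonSetStatus struct {
	DesiredNumberScheduled int32
	UpdatedNumberScheduled int32
	NumberReady            int32
}

// rolloutComplete reports whether every desired pod has been updated and
// is ready. A desired count of zero is treated as "not rolled out yet".
func rolloutComplete(s daemonSetStatus) bool {
	return s.DesiredNumberScheduled > 0 &&
		s.UpdatedNumberScheduled == s.DesiredNumberScheduled &&
		s.NumberReady == s.DesiredNumberScheduled
}
```

The `> 0` guard matters: without it, a DaemonSet whose status has not been populated yet would compare `0 == 0` and look complete.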

2362-2392: LGTM! Comprehensive argument update logic.

The function correctly handles both --flag=value and --flag value argument styles. The slice manipulation for inserting a value after a standalone flag is correct.
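
For readers unfamiliar with the two argument styles, a minimal sketch of such an update is shown below. It is not the PR's implementation: `setArg` is a hypothetical name, and it assumes that an element starting with `--` is always another flag rather than a value.

```go
package main

import "strings"

// setArg sets flag to value in args and returns the updated slice.
// The "=" and "flag value" cases modify args in place; the insert and
// append cases may return a new slice, so callers should use the result.
func setArg(args []string, flag, value string) []string {
	for i, arg := range args {
		switch {
		case strings.HasPrefix(arg, flag+"="):
			// "--flag=old" style: rewrite the element.
			args[i] = flag + "=" + value
			return args
		case arg == flag:
			if i+1 < len(args) && !strings.HasPrefix(args[i+1], "--") {
				// "--flag old" style: overwrite the following value.
				args[i+1] = value
				return args
			}
			// Standalone flag: insert the value right after it.
			out := make([]string, 0, len(args)+1)
			out = append(out, args[:i+1]...)
			out = append(out, value)
			return append(out, args[i+1:]...)
		}
	}
	// Flag not present: append it in "--flag=value" form.
	return append(args, flag+"="+value)
}
```

Building a fresh slice for the insert case avoids the aliasing bugs that `append(args[:i+1], ...)` can cause when the tail of the slice is still needed.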

2394-2416: LGTM!

Clear logic for setting container arguments with proper handling of both existing and new arguments.

tests/syslog_health_monitor_test.go (5)

25-26: LGTM!

Import correctly references the helpers package with appropriate alias.

47-58: LGTM!

Setup correctly uses the new helper function with nil args (default strategy) and properly stores context values for teardown.

184-215: LGTM!

This test intentionally uses manual setup instead of SetUpSyslogHealthMonitor because it specifically tests behavior without metadata injection. The setup correctly:

Gets the syslog pod directly

Sets up port-forward manually

Sets ManagedByNVSentinel=false without injecting metadata

361-420: LGTM! Well-structured test for STORE_ONLY strategy.

The test correctly:

Sets up with --processing-strategy=STORE_ONLY via helper args

Injects XID errors that would normally trigger conditions

Verifies no node condition is created (EnsureNodeConditionNotPresent)

Verifies node was not cordoned (AssertQuarantineState with ExpectCordoned: false)

Tears down by removing the same args that were added

This validates the STORE_ONLY processing strategy correctly prevents remediation actions.
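
The behaviour this test exercises can be summarised in a small sketch: accept only the two documented strategy names, and treat only EXECUTE_REMEDIATION as permitting remediation. The type and function names below are hypothetical and use plain strings; the PR's own code may represent the strategy differently.

```go
package main

import "fmt"

// processingStrategy is a hypothetical string-backed stand-in for the
// strategy value threaded through the monitor.
type processingStrategy string

const (
	executeRemediation processingStrategy = "EXECUTE_REMEDIATION"
	storeOnly          processingStrategy = "STORE_ONLY"
)

// parseProcessingStrategy validates a flag value against the two
// documented strategy names. Matching is case-sensitive in this sketch.
func parseProcessingStrategy(s string) (processingStrategy, error) {
	switch processingStrategy(s) {
	case executeRemediation, storeOnly:
		return processingStrategy(s), nil
	default:
		return "", fmt.Errorf("invalid processing strategy %q", s)
	}
}

// allowsRemediation reports whether events with this strategy should
// lead to actions such as node conditions or cordoning.
func (p processingStrategy) allowsRemediation() bool {
	return p == executeRemediation
}
```

With STORE_ONLY, `allowsRemediation` is false, which corresponds to the test's expectations that no node condition appears and the node stays uncordoned.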

159-170: LGTM!

Teardown correctly passes nil for args since the setup didn't modify DaemonSet arguments. The helper will skip the RemoveDaemonSetArgs call.

tests/helpers/syslog-health-monitor.go (4)

26-30: LGTM!

Constants are well-defined and match the expected DaemonSet and container names. StubJournalHTTPPort = 9091 is correctly exported for use in tests.

32-71: LGTM! Well-designed setup helper.

The function:

Conditionally updates DaemonSet args (for strategy testing)

Gets the pod from the current rollout on a worker node

Injects test metadata for GPU/NVSwitch topology

Establishes port-forward for syslog message injection

Sets the ManagedByNVSentinel label

The return values provide all necessary context for tests and teardown.

103-111: LGTM! Resilient teardown error handling.

The teardown appropriately uses warning logs instead of failing assertions for cleanup steps. This ensures:

Metadata is cleaned up even if pod deletion failed

Label removal attempts even if earlier steps had issues

Test teardown doesn't mask actual test failures with cleanup errors
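
The pattern described here, running every cleanup step and reporting failures as warnings, can be sketched as follows. All names are hypothetical; `errPodDelete` exists only as a stand-in failure for trying the helper out.

```go
package main

import "errors"

// errPodDelete is a stand-in failure used to exercise the helper.
var errPodDelete = errors.New("pod delete failed")

// cleanupStep pairs a human-readable name with the cleanup action.
type cleanupStep struct {
	name string
	run  func() error
}

// runCleanup executes every step in order. A failing step is reported
// through warn instead of aborting, so later steps still run.
// It returns the number of steps that failed.
func runCleanup(steps []cleanupStep, warn func(name string, err error)) int {
	failed := 0
	for _, step := range steps {
		if err := step.run(); err != nil {
			failed++
			warn(step.name, err)
		}
	}
	return failed
}
```

In a test, `warn` would typically wrap `t.Logf`, so a cleanup failure shows up in the log without changing the test result.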

91-100: The condition check logic is correct and follows the established pattern throughout the codebase.

When the monitor detects XID errors, it sets the condition with Reason: "SysLogsXIDErrorIsNotHealthy" and Status: ConditionTrue. When errors are cleared (as in the teardown after pod restart), the condition changes to Reason: "SysLogsXIDErrorIsHealthy" and Status: ConditionFalse. The code correctly waits for this healthy state transition, as confirmed by the same pattern used in smoke_test.go and other health monitor implementations.
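
The healthy-state check described above can be sketched as a predicate over node conditions. `nodeCondition` is a local stand-in for the relevant `v1.NodeCondition` fields, the condition type is passed in because its name is not stated here, and treating a missing condition as "not yet healthy" is an assumption of this sketch.

```go
package main

// nodeCondition mirrors the v1.NodeCondition fields used by the check.
// Status holds "True", "False", or "Unknown".
type nodeCondition struct {
	Type   string
	Status string
	Reason string
}

// xidConditionHealthy reports whether the condition of the given type is
// present and in its healthy state: Status "False" with the healthy reason.
// A missing condition returns false (assumption: keep waiting).
func xidConditionHealthy(conds []nodeCondition, condType string) bool {
	for _, c := range conds {
		if c.Type == condType {
			return c.Status == "False" && c.Reason == "SysLogsXIDErrorIsHealthy"
		}
	}
	return false
}
```

A teardown helper would poll this predicate until it returns true or the wait timeout expires.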

tests/helpers/syslog-health-monitor.go

github-actions · 2026-01-12T09:27:33Z

🛡️ CodeQL Analysis

🚨 Found 1 security alert(s)

Signed-off-by: Tanisha goyal <[email protected]>

coderabbitai bot reviewed Dec 22, 2025

tanishagoyal2 changed the title ~~390 event handling in syslog~~ feat: add event handling strategy changes in syslog health monitor Dec 22, 2025

coderabbitai bot reviewed Dec 23, 2025

tanishagoyal2 force-pushed the 390-event-handling-in-syslog branch from 776a34b to 0a1dc9a Compare January 7, 2026 09:02

coderabbitai bot reviewed Jan 7, 2026

tanishagoyal2 force-pushed the 390-event-handling-in-syslog branch from 0a1dc9a to 1df5558 Compare January 7, 2026 10:12

coderabbitai bot reviewed Jan 7, 2026

tanishagoyal2 force-pushed the 390-event-handling-in-syslog branch from 1df5558 to e5a8295 Compare January 7, 2026 14:57

coderabbitai bot reviewed Jan 7, 2026

tanishagoyal2 force-pushed the 390-event-handling-in-syslog branch from e5a8295 to 0752eb5 Compare January 7, 2026 15:04

coderabbitai bot reviewed Jan 7, 2026

tanishagoyal2 force-pushed the 390-event-handling-in-syslog branch from 0752eb5 to 7952551 Compare January 12, 2026 04:33

coderabbitai bot reviewed Jan 12, 2026

tanishagoyal2 added 9 commits January 12, 2026 14:06

feat: add processing strategy field in health events

3e003d7

Signed-off-by: Tanisha goyal <[email protected]>

feat: add e2e tests

f43b568

Signed-off-by: Tanisha goyal <[email protected]>

feat: make review comment changes

7683d56

Signed-off-by: Tanisha goyal <[email protected]>

fix: pipeline failures

3c14753

Signed-off-by: Tanisha goyal <[email protected]>

fix: postgres failures

b67c540

Signed-off-by: Tanisha goyal <[email protected]>

chore: handle old events present in db

5f745e5

Signed-off-by: Tanisha goyal <[email protected]>

chore: some more review changes

1c9da0c

Signed-off-by: Tanisha goyal <[email protected]>

chore: revert changes from cm file

f8b7d1a

Signed-off-by: Tanisha goyal <[email protected]>

feat: add event handling strategy in syslog health monitor

74f1c71

Signed-off-by: Tanisha goyal <[email protected]>

tanishagoyal2 force-pushed the 390-event-handling-in-syslog branch from 7952551 to d8972e9 Compare January 12, 2026 08:48

coderabbitai bot reviewed Jan 12, 2026

chore: coderabbit review comment changes

49d8021

Signed-off-by: Tanisha goyal <[email protected]>

tanishagoyal2 force-pushed the 390-event-handling-in-syslog branch from d8972e9 to 49d8021 Compare January 12, 2026 09:10

coderabbitai bot reviewed Jan 12, 2026

tanishagoyal2 and others added 2 commits January 13, 2026 12:48

chore: add quote in template

2ddafe5

Signed-off-by: Tanisha goyal <[email protected]>

Merge branch 'main' into 390-event-handling-in-syslog

2b54d46

github-actions bot commented Dec 22, 2025

Merging this branch will decrease overall coverage

Changed files (no unit tests)

Changed unit test files

github-actions bot commented Jan 7, 2026

Merging this branch will increase overall coverage

Changed files (no unit tests)

Changed unit test files

github-actions bot commented Jan 12, 2026

Merging this branch will increase overall coverage

Changed files (no unit tests)

Changed unit test files
