@tanishagoyal2 tanishagoyal2 commented Dec 22, 2025

Summary

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Testing

  1. Tested on dev cluster nvs-dgxc-k8s-oci-lhr-dev3
  2. Updated all modules with the main branch changes, and updated the KOM pod image with this branch's changes
  3. Ran KOM in STORE_ONLY mode with the updated config
Screenshot 2026-01-16 at 11 26 57 PM
  1. Updated the testCondition to false on the node
  2. KOM published the event with EXECUTE_REMEDIATION
Screenshot 2026-01-16 at 10 59 01 PM
  1. Ran KOM in EXECUTE_REMEDIATION mode and updated the KOM config to override the policy with the STORE_ONLY strategy
Screenshot 2026-01-16 at 11 21 22 PM
  1. Updated testCondition on the node to false, and KOM published the event with the STORE_ONLY strategy
Screenshot 2026-01-16 at 11 22 15 PM
  1. The node was not cordoned and no node condition was applied
  2. Event exporter exported the event with the correct strategy
Screenshot 2026-01-16 at 11 32 46 PM

Summary by CodeRabbit

  • New Features

    • Added processingStrategy configuration option to control health event handling behavior. Two modes available: EXECUTE_REMEDIATION (default) for active remediation, and STORE_ONLY for observability-only event tracking.
  • Tests

    • Added test coverage for STORE_ONLY strategy and policy override scenarios.

coderabbitai bot commented Dec 22, 2025

Walkthrough

This pull request introduces a new processingStrategy configuration option for the Kubernetes Object Monitor, allowing health events to either execute remediation actions (EXECUTE_REMEDIATION, default) or operate in observe-only mode (STORE_ONLY). The feature threads through Helm configuration, application initialization, and event publishing, with corresponding test additions.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Helm Configuration: distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml, templates/configmap.yaml, templates/deployment.yaml | Added new top-level processingStrategy field with default EXECUTE_REMEDIATION to values.yaml; configmap template now renders this field under healthEvent; deployment args include --processing-strategy flag sourced from values. |
| Application Entry & Initialization: health-monitors/kubernetes-object-monitor/main.go, pkg/initializer/initializer.go | Added processing-strategy CLI flag (default EXECUTE_REMEDIATION) in main.go; initializer.go now accepts, validates, and propagates processingStrategy through Params, with logging of configured strategy; passes strategy to Publisher and registerControllers. |
| Publisher & Event Emission: health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go | Publisher struct and constructor extended with processingStrategy field; PublishHealthEvent now resolves effective strategy (publisher default or policy override) with validation, then attaches resolved strategy to emitted HealthEvent payload. |
| Configuration & Annotations: health-monitors/kubernetes-object-monitor/pkg/config/types.go, pkg/annotations/manager.go | HealthEventSpec struct gained ProcessingStrategy field (toml-tagged); Manager struct and NewManager constructor extended to accept and store processingStrategy parameter. |
| Test Core & Helpers: health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go | Updated all NewManager calls across test setup functions to pass protos.ProcessingStrategy_EXECUTE_REMEDIATION parameter, reflecting constructor signature change. |
| Test Functions & Utilities: tests/kubernetes_object_monitor_test.go, tests/helpers/kube.go, tests/helpers/health_events_analyzer.go, tests/helpers/kuberntest_object_monitor.go | Added two new test functions (TestKubernetesObjectMonitorWithStoreOnlyStrategy, TestKubernetesObjectMonitorWithRuleOverride) with expanded setup/teardown for STORE_ONLY and rule override scenarios; introduced ApplyNewConfigMap helper function and new KubernetesObjectMonitorTestContext struct; refactored health_events_analyzer helper to use new ApplyNewConfigMap. |
| Test Data: tests/data/k8s-rule-override.yaml | New ConfigMap manifest containing policy configuration for Kubernetes object monitor testing with node-specific health event metadata and processingStrategy field. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A strategy springs forth, both swift and wise,
To mend the cluster, or just observe with eyes,
Through charts and code it threads its way,
From config down to health events at play,
EXECUTE or STORE, the choice is clear and bright!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 13.16% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat: add event handling strategy changes in kubernetes object monitor' clearly and specifically describes the main feature being added: support for event handling strategy configuration in the Kubernetes Object Monitor, which aligns with the primary changeset. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (8)
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml (1)

107-112: Well-documented configuration option.

The new processingStrategy field is properly documented with valid values and behavioral explanations. The default EXECUTE_REMEDIATION maintains backward compatibility.

Consider adding validation in the Helm templates to fail early if an invalid value is provided (e.g., by testing .Values.processingStrategy with regexMatch "^(EXECUTE_REMEDIATION|STORE_ONLY)$" and calling fail when it does not match; mustRegexMatch on its own returns false on a mismatch rather than aborting the render), though this is optional if validation happens at the application level.

tests/helpers/kube.go (1)

2341-2366: Missing container-not-found check.

Unlike SetDeploymentArgs (lines 2282-2284), RemoveDeploymentArgs doesn't return an error if the specified containerName is not found in the deployment. This inconsistency could hide configuration errors.

🔎 Proposed fix
 func RemoveDeploymentArgs(
 	ctx context.Context, c klient.Client, deploymentName, namespace, containerName string, args map[string]string,
 ) error {
 	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
 		deployment := &appsv1.Deployment{}
 		if err := c.Resources().Get(ctx, deploymentName, namespace, deployment); err != nil {
 			return err
 		}

 		if len(deployment.Spec.Template.Spec.Containers) == 0 {
 			return fmt.Errorf("deployment %s/%s has no containers", namespace, deploymentName)
 		}

+		found := false
+
 		for i := range deployment.Spec.Template.Spec.Containers {
 			container := &deployment.Spec.Template.Spec.Containers[i]

 			if containerName != "" && container.Name != containerName {
 				continue
 			}

+			found = true
+
 			removeArgsFromContainer(container, args)
 		}

+		if containerName != "" && !found {
+			return fmt.Errorf("container %q not found in deployment %s/%s", containerName, namespace, deploymentName)
+		}
+
 		return c.Resources().Update(ctx, deployment)
 	})
 }
health-monitors/kubernetes-object-monitor/pkg/config/types.go (2)

49-50: Consider adding validation for ProcessingStrategy values.

The ProcessingStrategy field is a string without validation, which could allow invalid values to be silently accepted. Consider validating against allowed values (e.g., "EXECUTE_REMEDIATION", "STORE_ONLY") either at config load time or through a custom unmarshal function.
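
One way to get load-time validation is a small string type that implements encoding.TextUnmarshaler. The sketch below is illustrative only: the type name ProcessingStrategyName and the allowedStrategies map are hypothetical, and it assumes the project's TOML decoder honors encoding.TextUnmarshaler, which is not shown in this PR.

```go
package main

import "fmt"

// ProcessingStrategyName is a hypothetical string type for the config field.
type ProcessingStrategyName string

// allowedStrategies lists the two values named in this review.
var allowedStrategies = map[string]bool{
	"EXECUTE_REMEDIATION": true,
	"STORE_ONLY":          true,
}

// UnmarshalText implements encoding.TextUnmarshaler, so a decoder that honors
// that interface rejects unknown values while the config is being loaded.
func (s *ProcessingStrategyName) UnmarshalText(text []byte) error {
	value := string(text)
	// An empty value is accepted and means "no per-policy override".
	if value != "" && !allowedStrategies[value] {
		return fmt.Errorf("invalid processingStrategy %q", value)
	}
	*s = ProcessingStrategyName(value)
	return nil
}

func main() {
	var s ProcessingStrategyName
	fmt.Println(s.UnmarshalText([]byte("STORE_ONLY")), s)
	fmt.Println(s.UnmarshalText([]byte("store_only")))
}
```

With this shape, a typo such as store_only surfaces as a config error at startup instead of at the first publish.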


50-50: Add godoc comment for exported field.

The exported ProcessingStrategy field should have a godoc comment that describes its purpose and allowed values, following Go conventions.

As per coding guidelines: "Function comments required for all exported Go functions" (applies to exported fields as well).

health-monitors/kubernetes-object-monitor/main.go (1)

71-75: Consider validating the processing-strategy flag value.

The flag accepts any string value but only "EXECUTE_REMEDIATION" and "STORE_ONLY" are valid. Consider adding validation in the run() function to fail fast with a clear error message if an invalid value is provided.

🔎 Suggested validation
 func run() error {
 	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
 	defer stop()
 
+	// Validate processing strategy
+	validStrategies := []string{"EXECUTE_REMEDIATION", "STORE_ONLY"}
+	isValid := false
+	for _, valid := range validStrategies {
+		if *processingStrategyFlag == valid {
+			isValid = true
+			break
+		}
+	}
+	if !isValid {
+		return fmt.Errorf("invalid processing-strategy %q, must be one of: %v", *processingStrategyFlag, validStrategies)
+	}
+
 	params := initializer.Params{
 		PolicyConfigPath:        *policyConfigPath,

Also applies to: 101-101

health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (1)

50-73: Strategy override logic is well-designed.

The pattern of defaulting to the publisher's strategy while allowing per-policy overrides is flexible. The validation against the generated pb.ProcessingStrategy_value map ensures consistency with the protobuf definition.

One minor observation: consider adding context to the error message (e.g., "policy %s: unexpected processingStrategy...") to help identify which policy has the invalid configuration during debugging.

🔎 Optional: Add policy context to error message
 	if policy.HealthEvent.ProcessingStrategy != "" {
 		value, ok := pb.ProcessingStrategy_value[policy.HealthEvent.ProcessingStrategy]
 		if !ok {
-			return fmt.Errorf("unexpected processingStrategy value: %q", policy.HealthEvent.ProcessingStrategy)
+			return fmt.Errorf("policy %q: unexpected processingStrategy value: %q", policy.Name, policy.HealthEvent.ProcessingStrategy)
 		}
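
For readers without the diff open, the default-plus-override resolution discussed for publisher.go can be sketched roughly as follows. This is a standalone approximation, not the PR's code: the enum and lookup map are hand-written stand-ins for the generated protobuf symbols, resolveStrategy is a hypothetical helper, and STORE_ONLY = 1 is an assumption (only EXECUTE_REMEDIATION = 0 is confirmed in this review).

```go
package main

import "fmt"

// ProcessingStrategy is a hand-written stand-in for the generated protobuf enum.
type ProcessingStrategy int32

const (
	ExecuteRemediation ProcessingStrategy = 0
	StoreOnly          ProcessingStrategy = 1 // assumed value
)

// strategyByName mimics the generated name-to-value lookup map.
var strategyByName = map[string]int32{
	"EXECUTE_REMEDIATION": 0,
	"STORE_ONLY":          1,
}

// resolveStrategy returns the policy override when one is set and valid,
// and otherwise falls back to the publisher-wide default.
func resolveStrategy(policyName, override string, def ProcessingStrategy) (ProcessingStrategy, error) {
	if override == "" {
		return def, nil
	}
	value, ok := strategyByName[override]
	if !ok {
		return def, fmt.Errorf("policy %q: unexpected processingStrategy value: %q", policyName, override)
	}
	return ProcessingStrategy(value), nil
}

func main() {
	s, err := resolveStrategy("node-test", "STORE_ONLY", ExecuteRemediation)
	fmt.Println(s, err)
	_, err = resolveStrategy("node-test", "BOGUS", ExecuteRemediation)
	fmt.Println(err)
}
```

Including the policy name in the error, as suggested above, makes an invalid override traceable to a single policy entry.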
platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)

325-343: Consider using Debug log level for filtered events.

Using slog.Info for every skipped STORE_ONLY event could generate high log volume in production. Consider slog.Debug for consistency with similar skip-logging patterns elsewhere in the codebase (e.g., manager.go lines 51, 70).
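
To illustrate the log-volume point, here is a minimal standard-library sketch of how a handler's minimum level drops Debug records. The countLines helper and the log messages are made up for this example; how the project actually configures its slog handler is not shown in this PR.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
	"strings"
)

// countLines logs one Info and one Debug record through a text handler and
// returns how many lines were written. When debug is false the handler's
// minimum level is Info, so the Debug record is dropped.
func countLines(debug bool) int {
	level := slog.LevelInfo
	if debug {
		level = slog.LevelDebug
	}

	var buf bytes.Buffer
	logger := slog.New(slog.NewTextHandler(&buf, &slog.HandlerOptions{Level: level}))

	logger.Info("processing health event", "node", "node-a")
	logger.Debug("skipping STORE_ONLY health event", "node", "node-a")

	return strings.Count(buf.String(), "\n")
}

func main() {
	fmt.Println(countLines(false)) // Info-level handler
	fmt.Println(countLines(true))  // Debug-level handler
}
```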

🔎 Proposed change
 	for _, healthEvent := range healthEvents.Events {
 		if healthEvent.ProcessingStrategy == protos.ProcessingStrategy_STORE_ONLY {
-			slog.Info("Skipping STORE_ONLY health event (no node conditions / node events)",
+			slog.Debug("Skipping STORE_ONLY health event (no node conditions / node events)",
 				"node", healthEvent.NodeName,
 				"checkName", healthEvent.CheckName,
 				"agent", healthEvent.Agent)
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go (1)

116-116: Minor: duplicate cast of strategyValue.

The pb.ProcessingStrategy(strategyValue) cast is performed twice (once on line 91 for the publisher, and again on line 116 for registerControllers). Consider storing the cast result in a variable to avoid repetition.

🔎 Proposed refactor
+	processingStrategy := pb.ProcessingStrategy(strategyValue)
+
 	slog.Info("Event handling strategy configured", "processingStrategy", params.ProcessingStrategy)

-	pub := publisher.New(pcClient, pb.ProcessingStrategy(strategyValue))
+	pub := publisher.New(pcClient, processingStrategy)

 	// ... later ...

-	if err := registerControllers(mgr, evaluator, pub, cfg.Policies, params.MaxConcurrentReconciles, pb.ProcessingStrategy(strategyValue)); err != nil {
+	if err := registerControllers(mgr, evaluator, pub, cfg.Policies, params.MaxConcurrentReconciles, processingStrategy); err != nil {
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 82e7180 and 3e2fd8b.

⛔ Files ignored due to path filters (1)
  • data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (26)
  • data-models/protobufs/health_event.proto
  • distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/configmap.yaml
  • distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/deployment.yaml
  • distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml
  • event-exporter/pkg/transformer/cloudevents.go
  • event-exporter/pkg/transformer/cloudevents_test.go
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • fault-quarantine/pkg/initializer/init.go
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
  • health-monitors/kubernetes-object-monitor/main.go
  • health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
  • health-monitors/kubernetes-object-monitor/pkg/config/types.go
  • health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go
  • health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • platform-connectors/pkg/connectors/kubernetes/process_node_events.go
  • store-client/pkg/client/mongodb_pipeline_builder.go
  • store-client/pkg/client/pipeline_builder.go
  • store-client/pkg/client/pipeline_builder_test.go
  • store-client/pkg/client/postgresql_pipeline_builder.go
  • tests/event_exporter_test.go
  • tests/helpers/event_exporter.go
  • tests/helpers/healthevent.go
  • tests/helpers/kube.go
  • tests/kubernetes_object_monitor_test.go
🧰 Additional context used
📓 Path-based instructions (5)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • health-monitors/kubernetes-object-monitor/pkg/config/types.go
  • tests/helpers/healthevent.go
  • fault-quarantine/pkg/initializer/init.go
  • store-client/pkg/client/postgresql_pipeline_builder.go
  • tests/event_exporter_test.go
  • store-client/pkg/client/pipeline_builder.go
  • event-exporter/pkg/transformer/cloudevents.go
  • event-exporter/pkg/transformer/cloudevents_test.go
  • tests/kubernetes_object_monitor_test.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • store-client/pkg/client/mongodb_pipeline_builder.go
  • health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go
  • health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go
  • store-client/pkg/client/pipeline_builder_test.go
  • health-monitors/kubernetes-object-monitor/main.go
  • tests/helpers/kube.go
  • health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
  • tests/helpers/event_exporter.go
  • platform-connectors/pkg/connectors/kubernetes/process_node_events.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • tests/event_exporter_test.go
  • event-exporter/pkg/transformer/cloudevents_test.go
  • tests/kubernetes_object_monitor_test.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • store-client/pkg/client/pipeline_builder_test.go
**/values.yaml

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/values.yaml: Document all values in Helm chart values.yaml with inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable

Files:

  • distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml
data-models/protobufs/**/*.proto

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

data-models/protobufs/**/*.proto: Define Protocol Buffer messages in data-models/protobufs/ directory
Use semantic versioning for breaking changes in Protocol Buffer messages
Include comprehensive comments for all fields in Protocol Buffer messages

Files:

  • data-models/protobufs/health_event.proto
**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code

Files:

  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
🧠 Learnings (8)
📚 Learning: 2025-11-10T10:25:19.443Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.

Applied to files:

  • fault-quarantine/pkg/initializer/init.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `testify/assert` and `testify/require` for assertions in Go tests

Applied to files:

  • tests/event_exporter_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • tests/event_exporter_test.go
  • tests/kubernetes_object_monitor_test.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • tests/helpers/kube.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • tests/event_exporter_test.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods

Applied to files:

  • tests/event_exporter_test.go
📚 Learning: 2025-11-04T06:31:02.147Z
Learnt from: Gyan172004
Repo: NVIDIA/NVSentinel PR: 223
File: platform-connectors/pkg/nodemetadata/processor.go:0-0
Timestamp: 2025-11-04T06:31:02.147Z
Learning: In platform-connectors/pkg/nodemetadata/processor.go, the NewProcessor function does not perform a nil check on the config parameter because the caller is expected to guarantee a non-nil config is provided.

Applied to files:

  • health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to data-models/protobufs/**/*.proto : Define Protocol Buffer messages in `data-models/protobufs/` directory

Applied to files:

  • health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
🧬 Code graph analysis (12)
tests/helpers/healthevent.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-17)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
tests/event_exporter_test.go (1)
tests/helpers/event_exporter.go (1)
  • ValidateCloudEvent (257-283)
event-exporter/pkg/transformer/cloudevents_test.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-17)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
tests/kubernetes_object_monitor_test.go (4)
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (1)
  • New (41-46)
health-monitors/kubernetes-object-monitor/pkg/config/types.go (1)
  • Config (16-18)
tests/helpers/kube.go (9)
  • SetDeploymentArgs (2255-2288)
  • NVSentinelNamespace (64-64)
  • WaitForDeploymentRollout (960-1101)
  • SetNodeConditionStatus (1709-1770)
  • GetNodeByName (442-451)
  • NeverWaitTimeout (62-62)
  • WaitInterval (63-63)
  • EventuallyWaitTimeout (61-61)
  • RemoveDeploymentArgs (2341-2366)
commons/pkg/auditlogger/auditlogger.go (1)
  • Log (114-134)
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)
data-models/pkg/protos/health_event.pb.go (16)
  • HealthEvent (260-280)
  • HealthEvent (293-293)
  • HealthEvent (308-310)
  • Entity (208-214)
  • Entity (227-227)
  • Entity (242-244)
  • RecommendedAction (89-89)
  • RecommendedAction (139-141)
  • RecommendedAction (143-145)
  • RecommendedAction (152-154)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
  • ProcessingStrategy_STORE_ONLY (47-47)
  • ProcessingStrategy_EXECUTE_REMEDIATION (46-46)
store-client/pkg/client/mongodb_pipeline_builder.go (2)
store-client/pkg/datastore/types.go (4)
  • ToPipeline (161-163)
  • D (131-133)
  • E (126-128)
  • A (136-138)
data-models/pkg/protos/health_event.pb.go (1)
  • ProcessingStrategy_EXECUTE_REMEDIATION (46-46)
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (3)
data-models/pkg/protos/health_event_grpc.pb.go (1)
  • PlatformConnectorClient (43-45)
health-monitors/kubernetes-object-monitor/pkg/config/types.go (1)
  • Policy (20-27)
data-models/pkg/protos/health_event.pb.go (1)
  • ProcessingStrategy_value (56-59)
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go (3)
data-models/pkg/protos/health_event.pb.go (1)
  • ProcessingStrategy_value (56-59)
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (1)
  • New (41-46)
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
  • NewManager (43-45)
store-client/pkg/client/pipeline_builder_test.go (3)
store-client/pkg/client/pipeline_builder.go (1)
  • PipelineBuilder (26-47)
store-client/pkg/client/mongodb_pipeline_builder.go (1)
  • NewMongoDBPipelineBuilder (29-31)
store-client/pkg/client/postgresql_pipeline_builder.go (1)
  • NewPostgreSQLPipelineBuilder (29-31)
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
data-models/pkg/protos/health_event.pb.go (1)
  • ProcessingStrategy_STORE_ONLY (47-47)
platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)
data-models/pkg/protos/health_event.pb.go (7)
  • HealthEvent (260-280)
  • HealthEvent (293-293)
  • HealthEvent (308-310)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
🔇 Additional comments (37)
fault-quarantine/pkg/evaluator/rule_evaluator_test.go (1)

263-263: LGTM!

The test expectation correctly includes the new processingStrategy field with float64(0) (representing EXECUTE_REMEDIATION, the default enum value). This aligns with the proto changes where ProcessingStrategy defaults to 0.

data-models/protobufs/health_event.proto (2)

32-38: Well-designed enum with correct default.

Good proto3 design: EXECUTE_REMEDIATION = 0 ensures backward compatibility since unset fields default to 0 (the normal remediation behavior). The comments clearly document both strategies per coding guidelines.
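
The backward-compatibility argument rests on zero values. A small Go sketch of the same idea, using hand-written stand-ins rather than the generated code (the HealthEvent struct here is trimmed to two fields, and STORE_ONLY = 1 is an assumption):

```go
package main

import "fmt"

// ProcessingStrategy is a hand-written stand-in for the generated enum type.
type ProcessingStrategy int32

const (
	ExecuteRemediation ProcessingStrategy = 0 // zero value, matches the proto default
	StoreOnly          ProcessingStrategy = 1 // assumed value
)

// String returns the enum name, similar in spirit to the generated String method.
func (s ProcessingStrategy) String() string {
	switch s {
	case ExecuteRemediation:
		return "EXECUTE_REMEDIATION"
	case StoreOnly:
		return "STORE_ONLY"
	default:
		return fmt.Sprintf("UNKNOWN(%d)", int32(s))
	}
}

// HealthEvent is a trimmed, hypothetical version of the message.
type HealthEvent struct {
	NodeName           string
	ProcessingStrategy ProcessingStrategy
}

func main() {
	// A producer that predates the new field never sets it.
	legacy := HealthEvent{NodeName: "node-a"}
	fmt.Println(legacy.ProcessingStrategy) // prints EXECUTE_REMEDIATION
}
```

An event from an older producer therefore reads as EXECUTE_REMEDIATION, which is the pre-existing behavior.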


77-77: Appropriate field placement.

Field number 16 continues the sequential numbering after drainOverrides = 15. This is a non-breaking addition that maintains wire compatibility with existing messages.

tests/helpers/kube.go (2)

2208-2249: LGTM!

Clean implementation of WaitForDaemonSetRollout that follows the same pattern as WaitForDeploymentRollout. The rollout completion checks (desired, updated, ready pods) are correct.


2290-2336: Approve helper implementation.

The setArgsOnContainer helper correctly handles the three argument styles (--flag=value, --flag, --flag value) with appropriate insertion and update logic. The break statements prevent index issues during slice modification.

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (2)

14-17: LGTM!

The ProcessingStrategy enum class and top-level constants are correctly defined, matching the protobuf definition.

Also applies to: 31-32


78-78: LGTM!

The HealthEvent message correctly includes the new processingStrategy field in __slots__, field number constant, attribute declaration, and __init__ signature. The type hint _Optional[_Union[ProcessingStrategy, str]] is appropriate for protobuf enum fields.

Also applies to: 104-104, 120-120, 138-138

tests/event_exporter_test.go (2)

25-25: LGTM!

The import addition aligns with the updated test helper function signature.


85-85: LGTM!

The test correctly validates the expected processing strategy value. The change properly extends the test to verify the new processingStrategy field in CloudEvents.

event-exporter/pkg/transformer/cloudevents.go (1)

66-66: LGTM!

The addition of processingStrategy to the CloudEvent data payload correctly uses the .String() method to serialize the enum value. The placement is consistent with other health event fields.

distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/configmap.yaml (1)

51-53: LGTM!

The conditional rendering of processingStrategy follows the same pattern as the errorCode field above (lines 48-50) and properly quotes the value for TOML format. The optional nature ensures backward compatibility when the field is not specified.

event-exporter/pkg/transformer/cloudevents_test.go (2)

69-69: LGTM!

Setting ProcessingStrategy_STORE_ONLY provides good test coverage for the non-default enum value.


106-108: LGTM!

The validation correctly verifies that the processingStrategy field is properly serialized as "STORE_ONLY" in the CloudEvent data. This confirms the .String() method works as expected.

store-client/pkg/client/mongodb_pipeline_builder.go (1)

19-19: LGTM!

The import is necessary to reference ProcessingStrategy_EXECUTE_REMEDIATION in the new pipeline.

distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/deployment.yaml (1)

57-57: No action needed. The processingStrategy value has a default value of EXECUTE_REMEDIATION already defined in values.yaml at line 112.

fault-quarantine/pkg/initializer/init.go (1)

66-66: LGTM! Filtering STORE_ONLY events is the intended behavior.

The change from BuildAllHealthEventInsertsPipeline() to BuildProcessableHealthEventInsertsPipeline() correctly enables fault-quarantine to process only events with ProcessingStrategy = EXECUTE_REMEDIATION, excluding observability-only STORE_ONLY events. This aligns with the PR objectives for conditional event processing.

Producers are properly configured to set ProcessingStrategy based on policy definitions, with EXECUTE_REMEDIATION as the default. Observability-only events must be explicitly marked as STORE_ONLY via policy configuration, so critical events won't be accidentally filtered.

store-client/pkg/client/pipeline_builder_test.go (1)

69-86: LGTM! Test follows established patterns.

The new test for BuildProcessableHealthEventInsertsPipeline correctly mirrors the structure of existing pipeline tests, using table-driven approach across both MongoDB and PostgreSQL builders.

store-client/pkg/client/pipeline_builder.go (1)

35-38: LGTM! Clear interface addition with good documentation.

The new method is well-documented with clear use case and filtering behavior.

tests/helpers/healthevent.go (1)

48-48: LGTM! Builder pattern follows established conventions.

The ProcessingStrategy field and builder method follow the existing patterns in this test helper. Using int type provides flexibility for testing edge cases beyond the defined enum values.

Also applies to: 153-156

store-client/pkg/client/postgresql_pipeline_builder.go (1)

119-132: LGTM! Implementation correctly filters for EXECUTE_REMEDIATION strategy.

The pipeline matches insert operations where processingStrategy equals EXECUTE_REMEDIATION, consistent with the interface contract and MongoDB implementation.
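
As a rough picture of what such a filter looks like, the sketch below builds a MongoDB-style pipeline with a single $match stage out of plain Go maps. It is not the builder from this PR: buildProcessableInsertsPipeline is a hypothetical name, and the field path under fullDocument is an assumption, since the stored document layout is not shown here. Only operationType and fullDocument are standard change-event fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// executeRemediation mirrors the enum's zero value discussed in this review.
const executeRemediation = 0

// buildProcessableInsertsPipeline returns a one-stage pipeline that keeps only
// insert events whose strategy equals EXECUTE_REMEDIATION.
// The processingstrategy field path is assumed for illustration.
func buildProcessableInsertsPipeline() []map[string]any {
	return []map[string]any{
		{
			"$match": map[string]any{
				"operationType": "insert",
				"fullDocument.healthevent.processingstrategy": executeRemediation,
			},
		},
	}
}

func main() {
	out, err := json.MarshalIndent(buildProcessableInsertsPipeline(), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

An equality match like this does not select documents where the field is absent, so the filter relies on the field being written for every stored event.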

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)

1391-1589: Test logic validates STORE_ONLY behavior correctly, but consider migration to envtest.

The test effectively validates that STORE_ONLY events don't create node conditions or Kubernetes events, while EXECUTE_REMEDIATION events do. Test cases cover the key scenarios well.

However, the test uses fake.NewSimpleClientset() rather than envtest for Kubernetes controller testing. Based on learnings, envtest is preferred for testing Kubernetes controllers.

Consider migrating this test to use envtest in a future refactor to align with the coding guidelines. The fake client is acceptable for unit tests but envtest provides better integration testing.

tests/helpers/event_exporter.go (2)

220-254: LGTM! Useful helper for finding events by check name.

The new FindEventByNodeAndCheckName function provides a clean way to search for events by multiple criteria including health status.


261-261: LGTM! Validation extended to include processingStrategy.

The ValidateCloudEvent function is correctly updated to validate the processingStrategy field in CloudEvent payloads. This ensures tests verify that the processing strategy is properly propagated through the event pipeline.

Also applies to: 281-281

tests/kubernetes_object_monitor_test.go (4)

128-159: LGTM! Test setup correctly configures STORE_ONLY strategy.

The setup properly:

  • Identifies a non-KWOK test node
  • Applies deployment args to enable STORE_ONLY processing
  • Waits for deployment rollout before proceeding

161-188: LGTM! Correct use of require.Never for negative assertion.

The test correctly validates that STORE_ONLY strategy does not create node annotations by using require.Never, which asserts the condition never becomes true within the timeout period.


190-217: Clarify the purpose of the "Node Ready recovery clears annotation" assessment.

This assessment expects an annotation to be cleared when the node becomes Ready, but in STORE_ONLY mode, the annotation should never have been created in the first place (as verified by the previous assessment).

Is this assessment intended to verify that:

  1. The annotation remains absent when the condition changes to True? (If so, require.Never would be more appropriate)
  2. Some edge case where an annotation might exist from a previous non-STORE_ONLY run?

Consider whether this assessment should use require.Never instead of require.Eventually to verify the annotation continues to not exist, or add a comment explaining why Eventually is appropriate here.


219-230: LGTM! Teardown properly restores deployment state.

The teardown correctly removes the STORE_ONLY args and waits for the deployment to stabilize, ensuring test isolation.

health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (4)

24-24: LGTM!

Import for the protobuf package is correctly added to support the new ProcessingStrategy type.


38-45: LGTM!

The Manager struct and constructor are correctly extended to accept and store the processingStrategy. The field naming follows Go conventions.


50-53: LGTM!

The early return for STORE_ONLY strategy correctly prevents annotation updates while maintaining debug logging for observability. This aligns with the PR objective to support different event handling strategies.


69-72: LGTM!

Consistent implementation with AddMatch - the STORE_ONLY guard correctly skips annotation removal with appropriate debug logging.
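The guard pattern in AddMatch and RemoveMatch might look roughly like the sketch below. It is an in-memory approximation under assumptions: the local ProcessingStrategy type stands in for the generated protobuf enum, the map replaces real node annotations, and the boolean return values and HasMatch are added here only to make the behavior observable.

```go
package main

import "fmt"

// ProcessingStrategy is a local stand-in for the generated protobuf enum.
type ProcessingStrategy int

const (
	ExecuteRemediation ProcessingStrategy = iota
	StoreOnly
)

// Manager is a hypothetical in-memory annotation manager.
type Manager struct {
	strategy    ProcessingStrategy
	annotations map[string]map[string]bool // node name -> matched policies
}

// NewManager stores the strategy once so every method can consult it.
func NewManager(strategy ProcessingStrategy) *Manager {
	return &Manager{strategy: strategy, annotations: map[string]map[string]bool{}}
}

// AddMatch records a policy match unless the manager runs in STORE_ONLY mode.
// It returns whether anything was written.
func (m *Manager) AddMatch(node, policy string) bool {
	if m.strategy == StoreOnly {
		return false // observe-only: skip the annotation update
	}
	if m.annotations[node] == nil {
		m.annotations[node] = map[string]bool{}
	}
	m.annotations[node][policy] = true
	return true
}

// RemoveMatch applies the same guard before clearing a match.
func (m *Manager) RemoveMatch(node, policy string) bool {
	if m.strategy == StoreOnly {
		return false
	}
	delete(m.annotations[node], policy)
	return true
}

// HasMatch reports whether a match is currently recorded.
func (m *Manager) HasMatch(node, policy string) bool {
	return m.annotations[node][policy]
}

func main() {
	observer := NewManager(StoreOnly)
	fmt.Println("STORE_ONLY wrote annotation:", observer.AddMatch("node-a", "test-policy"))

	active := NewManager(ExecuteRemediation)
	fmt.Println("EXECUTE_REMEDIATION wrote annotation:", active.AddMatch("node-a", "test-policy"))
}
```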

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)

1-51: Generated protobuf code - no manual review required.

This file is auto-generated by the protocol buffer compiler as indicated by the header comments. Ensure this file is regenerated from the source .proto file rather than manually edited. The changes correctly reflect the addition of the ProcessingStrategy enum and processingStrategy field in the HealthEvent message.

health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (1)

36-45: LGTM!

The Publisher struct and constructor are correctly extended to store and initialize the default processingStrategy. The naming convention follows Go standards.

platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)

372-416: LGTM on the overall flow.

The processHealthEvents function correctly:

  1. Filters out STORE_ONLY events before processing
  2. Updates node conditions only for processable healthy/fatal events
  3. Creates Kubernetes events for non-healthy, non-fatal events

The separation of filtering, condition updates, and event creation is clean and maintainable.
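The three steps above can be sketched as a single classification pass. This is a simplified model, not the connector's code: HealthEvent, Plan, and planHealthEvents are hypothetical names, and the real function talks to the Kubernetes API instead of returning slices.

```go
package main

import "fmt"

// Strategy is a local stand-in for the protobuf ProcessingStrategy enum.
type Strategy int

const (
	ExecuteRemediation Strategy = iota
	StoreOnly
)

// HealthEvent carries only the fields this sketch needs.
type HealthEvent struct {
	CheckName string
	IsHealthy bool
	IsFatal   bool
	Strategy  Strategy
}

// Plan separates events that update node conditions from those that
// become Kubernetes events.
type Plan struct {
	ConditionUpdates []HealthEvent
	NodeEvents       []HealthEvent
}

// planHealthEvents drops STORE_ONLY events, then routes the rest.
func planHealthEvents(events []HealthEvent) Plan {
	var plan Plan
	for _, ev := range events {
		if ev.Strategy == StoreOnly {
			continue // observe-only events never touch the node
		}
		if ev.IsHealthy || ev.IsFatal {
			plan.ConditionUpdates = append(plan.ConditionUpdates, ev)
			continue
		}
		// Non-healthy, non-fatal events are surfaced as node events.
		plan.NodeEvents = append(plan.NodeEvents, ev)
	}
	return plan
}

func main() {
	plan := planHealthEvents([]HealthEvent{
		{CheckName: "fatal-check", IsFatal: true},
		{CheckName: "warning-check"},
		{CheckName: "observed-check", IsFatal: true, Strategy: StoreOnly},
	})
	fmt.Println("condition updates:", len(plan.ConditionUpdates))
	fmt.Println("node events:", len(plan.NodeEvents))
}
```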

health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go (3)

47-55: LGTM!

The Params struct is correctly extended with ProcessingStrategy field. This allows the strategy to be configured externally (likely via command-line flag or environment variable based on the AI summary).


163-171: LGTM!

The registerControllers function signature is correctly updated to accept processingStrategy, and it's properly passed to annotations.NewManager. This ensures the strategy is consistently applied throughout the controller initialization path.


83-91: Default value handling is already provided at the flag level in main.go.

The processingStrategyFlag has a default value of "EXECUTE_REMEDIATION" (line 73), which is passed to the initializer via *processingStrategyFlag. This ensures that under normal execution, params.ProcessingStrategy will never be empty. However, the validation at lines 84-86 will still correctly reject invalid or empty values if the initializer is called directly with unvalidated input. Consider adding a comment to document this dependency on the caller providing a valid value.

@copy-pr-bot

copy-pr-bot bot commented Dec 23, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3e2fd8b and f6e73ed.

📒 Files selected for processing (1)
  • tests/kubernetes_object_monitor_test.go
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • tests/kubernetes_object_monitor_test.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • tests/kubernetes_object_monitor_test.go
🧠 Learnings (2)
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • tests/kubernetes_object_monitor_test.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • tests/kubernetes_object_monitor_test.go
🧬 Code graph analysis (1)
tests/kubernetes_object_monitor_test.go (4)
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (1)
  • New (41-46)
tests/helpers/kube.go (7)
  • SetDeploymentArgs (2255-2288)
  • NVSentinelNamespace (64-64)
  • WaitForDeploymentRollout (960-1101)
  • SetNodeConditionStatus (1709-1770)
  • GetNodeByName (442-451)
  • EventuallyWaitTimeout (61-61)
  • WaitInterval (63-63)
tests/helpers/fault_quarantine.go (2)
  • AssertQuarantineState (315-382)
  • QuarantineAssertion (56-60)
tests/helpers/event_exporter.go (3)
  • GetMockEvents (36-99)
  • FindEventByNodeAndCheckName (221-254)
  • ValidateCloudEvent (257-283)
🔇 Additional comments (1)
tests/kubernetes_object_monitor_test.go (1)

24-24: LGTM!

The import is correctly added to support the helper functions used in the new test.

@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-in-kubernetes-monitor branch from f6e73ed to 8fc966f Compare December 25, 2025 12:44
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
tests/kubernetes_object_monitor_test.go (1)

191-218: Redundant annotation clearing check.

This assess step waits for the annotation to be cleared, but the previous assess (lines 171-187) already verified that the annotation is never set when using the STORE_ONLY strategy. Waiting for an annotation to be cleared when it was never created is unnecessary and adds confusion to the test flow.

Consider removing this entire assess step, or if you want to keep it for defensive reasons, add a comment explaining why this check is necessary despite the earlier assertion that the annotation is never set.

🔎 Suggested simplification
-	feature.Assess("Node Ready recovery clears annotation", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
-		client, err := c.NewClient()
-		require.NoError(t, err)
-
-		nodeName := ctx.Value(k8sMonitorKeyNodeName).(string)
-		t.Logf("Setting TestCondition to True on node %s", nodeName)
-
-		helpers.SetNodeConditionStatus(ctx, t, client, nodeName, v1.NodeConditionType(testConditionType), v1.ConditionTrue)
-
-		t.Log("Waiting for policy match annotation to be cleared")
-		require.Eventually(t, func() bool {
-			node, err := helpers.GetNodeByName(ctx, client, nodeName)
-			if err != nil {
-				t.Logf("Failed to get node: %v", err)
-				return false
-			}
-
-			annotation, exists := node.Annotations[annotationKey]
-			if exists && annotation != "" {
-				t.Logf("Annotation still exists: %s", annotation)
-				return false
-			}
-
-			return true
-		}, helpers.EventuallyWaitTimeout, helpers.WaitInterval)
-
-		return ctx
-	})
-
🧹 Nitpick comments (4)
tests/helpers/kube.go (1)

387-409: LGTM! Function correctly implements negative event assertion.

The logic properly uses require.Never to ensure the specified event type and reason do not appear on the node.

Optional: Consider improving log message clarity

At line 405, the log message could include both eventType and eventReason for consistency with the assertion message at line 408:

-		t.Logf("node %s does not have event %v", nodeName, eventType)
+		t.Logf("node %s does not have event type=%s reason=%s", nodeName, eventType, eventReason)
tests/platform-connector_test.go (1)

28-32: Remove unused struct fields.

The ConfigMapBackup and TestNamespace fields are declared but never used anywhere in the test. Consider removing them to keep the code clean.

🔎 Proposed cleanup
 type PlatformConnectorTestContext struct {
 	NodeName        string
-	ConfigMapBackup []byte
-	TestNamespace   string
 }
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go (2)

84-87: Consider case-insensitive processing strategy validation.

The validation is case-sensitive and requires an exact match with the proto enum names ("EXECUTE_REMEDIATION" or "STORE_ONLY"). This could lead to user confusion if they provide lowercase values like "store_only".

Consider normalizing the input with strings.ToUpper() before validation, or enhance the error message to list the valid values.

🔎 Suggested improvement
-	strategyValue, ok := pb.ProcessingStrategy_value[params.ProcessingStrategy]
+	strategyValue, ok := pb.ProcessingStrategy_value[strings.ToUpper(params.ProcessingStrategy)]
 	if !ok {
-		return nil, fmt.Errorf("unexpected processingStrategy value: %q", params.ProcessingStrategy)
+		return nil, fmt.Errorf("unexpected processingStrategy value: %q (valid values: EXECUTE_REMEDIATION, STORE_ONLY)", params.ProcessingStrategy)
 	}
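As a standalone illustration of the suggestion, the sketch below normalizes the input before the lookup. It rests on assumptions: strategyValue is a local stand-in for the generated pb.ProcessingStrategy_value map, its numeric values are guesses, and trimming whitespace is an extra step not proposed in the review.

```go
package main

import (
	"fmt"
	"strings"
)

// strategyValue stands in for the generated name-to-number map.
// The numeric values are assumptions made for this sketch.
var strategyValue = map[string]int32{
	"EXECUTE_REMEDIATION": 0,
	"STORE_ONLY":          1,
}

// parseStrategy accepts any casing and reports the valid values on failure.
func parseStrategy(raw string) (int32, error) {
	normalized := strings.ToUpper(strings.TrimSpace(raw))
	value, ok := strategyValue[normalized]
	if !ok {
		return 0, fmt.Errorf(
			"unexpected processingStrategy value: %q (valid values: EXECUTE_REMEDIATION, STORE_ONLY)",
			raw,
		)
	}
	return value, nil
}

func main() {
	for _, raw := range []string{"store_only", "EXECUTE_REMEDIATION", "dry-run"} {
		value, err := parseStrategy(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%q parsed to %d\n", raw, value)
	}
}
```

Keeping the original input in the error message, rather than the normalized form, makes it easier for an operator to spot the typo in their configuration.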

54-54: Document the ProcessingStrategy field.

The new ProcessingStrategy field in the Params struct would benefit from a comment explaining its purpose and valid values, especially since it's part of a public API.

 	MaxConcurrentReconciles int
 	PlatformConnectorSocket string
+	// ProcessingStrategy determines how health events are processed.
+	// Valid values: "EXECUTE_REMEDIATION" (default behavior with remediation actions)
+	//               "STORE_ONLY" (events are stored but no remediation is triggered)
 	ProcessingStrategy      string
 }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f6e73ed and 8fc966f.

📒 Files selected for processing (19)
  • distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/configmap.yaml
  • distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/deployment.yaml
  • distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml
  • distros/kubernetes/nvsentinel/values-tilt-postgresql.yaml
  • docs/postgresql-schema.sql
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • fault-quarantine/pkg/initializer/init.go
  • health-monitors/kubernetes-object-monitor/main.go
  • health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
  • health-monitors/kubernetes-object-monitor/pkg/config/types.go
  • health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go
  • health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go
  • tests/event_exporter_test.go
  • tests/fault_quarantine_test.go
  • tests/helpers/event_exporter.go
  • tests/helpers/healthevent.go
  • tests/helpers/kube.go
  • tests/kubernetes_object_monitor_test.go
  • tests/platform-connector_test.go
🚧 Files skipped from review as they are similar to previous changes (7)
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/deployment.yaml
  • health-monitors/kubernetes-object-monitor/main.go
  • health-monitors/kubernetes-object-monitor/pkg/config/types.go
  • tests/event_exporter_test.go
  • health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go
  • health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
🧰 Additional context used
📓 Path-based instructions (3)
**/values.yaml

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/values.yaml: Document all values in Helm chart values.yaml with inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable

Files:

  • distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • tests/helpers/healthevent.go
  • tests/fault_quarantine_test.go
  • tests/helpers/event_exporter.go
  • fault-quarantine/pkg/initializer/init.go
  • tests/kubernetes_object_monitor_test.go
  • health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go
  • tests/helpers/kube.go
  • tests/platform-connector_test.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • tests/fault_quarantine_test.go
  • tests/kubernetes_object_monitor_test.go
  • tests/platform-connector_test.go
🧠 Learnings (8)
📚 Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml
  • tests/helpers/event_exporter.go
  • tests/platform-connector_test.go
  • distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/configmap.yaml
📚 Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • tests/fault_quarantine_test.go
  • tests/kubernetes_object_monitor_test.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • tests/fault_quarantine_test.go
  • tests/kubernetes_object_monitor_test.go
  • tests/platform-connector_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • tests/fault_quarantine_test.go
  • tests/kubernetes_object_monitor_test.go
  • tests/helpers/kube.go
  • tests/platform-connector_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • tests/fault_quarantine_test.go
📚 Learning: 2025-11-10T10:25:19.443Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.

Applied to files:

  • fault-quarantine/pkg/initializer/init.go
📚 Learning: 2025-11-04T06:31:02.147Z
Learnt from: Gyan172004
Repo: NVIDIA/NVSentinel PR: 223
File: platform-connectors/pkg/nodemetadata/processor.go:0-0
Timestamp: 2025-11-04T06:31:02.147Z
Learning: In platform-connectors/pkg/nodemetadata/processor.go, the NewProcessor function does not perform a nil check on the config parameter because the caller is expected to guarantee a non-nil config is provided.

Applied to files:

  • health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go
  • tests/platform-connector_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

  • tests/platform-connector_test.go
🧬 Code graph analysis (5)
tests/helpers/healthevent.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-17)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
tests/fault_quarantine_test.go (4)
tests/helpers/fault_quarantine.go (4)
  • QuarantineTestContext (51-54)
  • SetupQuarantineTest (107-112)
  • AssertQuarantineState (315-382)
  • QuarantineAssertion (56-60)
tests/helpers/kube.go (1)
  • SetNodeManagedByNVSentinel (1389-1408)
tests/helpers/healthevent.go (3)
  • NewHealthEvent (60-76)
  • SendHealthEvent (263-275)
  • SendHealthyEvent (277-287)
data-models/pkg/protos/health_event.pb.go (2)
  • ProcessingStrategy_STORE_ONLY (47-47)
  • ProcessingStrategy_EXECUTE_REMEDIATION (46-46)
tests/kubernetes_object_monitor_test.go (1)
tests/helpers/kube.go (9)
  • SetDeploymentArgs (2279-2312)
  • NVSentinelNamespace (64-64)
  • WaitForDeploymentRollout (984-1125)
  • SetNodeConditionStatus (1733-1794)
  • GetNodeByName (466-475)
  • NeverWaitTimeout (62-62)
  • WaitInterval (63-63)
  • EventuallyWaitTimeout (61-61)
  • RemoveDeploymentArgs (2365-2390)
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go (3)
data-models/pkg/protos/health_event.pb.go (1)
  • ProcessingStrategy_value (56-59)
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (1)
  • New (41-46)
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
  • NewManager (43-45)
tests/platform-connector_test.go (4)
tests/helpers/kube.go (5)
  • SelectTestNodeFromUnusedPool (413-437)
  • EnsureNodeConditionNotPresent (1797-1818)
  • EnsureNodeEventNotPresent (387-409)
  • CheckNodeConditionExists (1283-1300)
  • CheckNodeEventExists (1325-1353)
tests/helpers/healthevent.go (3)
  • NewHealthEvent (60-76)
  • SendHealthEvent (263-275)
  • SendHealthyEvent (277-287)
tests/helpers/health_events_analyzer.go (2)
  • ERRORCODE_79 (35-35)
  • ERRORCODE_31 (32-32)
data-models/pkg/protos/health_event.pb.go (2)
  • ProcessingStrategy_STORE_ONLY (47-47)
  • ProcessingStrategy_EXECUTE_REMEDIATION (46-46)
🔇 Additional comments (16)
tests/helpers/kube.go (3)

2232-2273: LGTM! DaemonSet rollout verification is correct.

The function properly checks all relevant status fields (DesiredNumberScheduled, UpdatedNumberScheduled, NumberReady) to verify rollout completion, with appropriate logging at each stage.


2275-2360: LGTM! Deployment args manipulation is well-implemented.

The function correctly uses retry.RetryOnConflict without wrapping errors (as per coding guidelines), and setArgsOnContainer properly handles the three argument formats: --flag=value, --flag value, and --flag (boolean).
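For readers unfamiliar with the three argument formats, here is a rough sketch of how such a rewrite could work. It is not the helper from tests/helpers/kube.go: setArg is a hypothetical function, the --processing-strategy flag name is assumed, and the sketch always writes the result in the --flag=value form. It also assumes that a token following a bare flag is that flag's value unless it starts with a dash.

```go
package main

import (
	"fmt"
	"strings"
)

// setArg returns a copy of args with flag set to value.
// An empty value produces a bare boolean flag ("--flag").
func setArg(args []string, flag, value string) []string {
	rendered := flag
	if value != "" {
		rendered = flag + "=" + value
	}

	out := make([]string, 0, len(args)+1)
	replaced := false
	for i := 0; i < len(args); i++ {
		arg := args[i]
		isEqualsForm := strings.HasPrefix(arg, flag+"=")
		if !isEqualsForm && arg != flag {
			out = append(out, arg) // unrelated argument, keep as-is
			continue
		}
		// "--flag value" form: skip the old value that follows the flag.
		if arg == flag && i+1 < len(args) && !strings.HasPrefix(args[i+1], "-") {
			i++
		}
		if !replaced {
			out = append(out, rendered)
			replaced = true
		}
	}
	if !replaced {
		out = append(out, rendered) // flag was absent, append it
	}
	return out
}

func main() {
	args := []string{"--processing-strategy", "EXECUTE_REMEDIATION", "--debug"}
	fmt.Println(setArg(args, "--processing-strategy", "STORE_ONLY"))
}
```

The input slice is never modified, which keeps the helper safe to call inside a retry loop that re-reads the deployment on each attempt.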


2362-2417: LGTM! Argument removal logic is correct.

The docstring correctly states "removes container arguments" and the implementation properly handles removal of arguments in different formats (--flag=value, --flag value, --flag). The function correctly uses retry.RetryOnConflict without wrapping errors.

Note: The past review comment about an incorrect docstring referencing environment variables appears to have been resolved in the current code.

tests/helpers/healthevent.go (2)

48-48: LGTM!

The ProcessingStrategy field is correctly defined with an appropriate type and JSON tag, consistent with other optional fields in the struct.


153-156: LGTM!

The builder method follows the established pattern and enables fluent chaining, consistent with all other With* methods in this helper.

tests/helpers/event_exporter.go (2)

220-254: LGTM!

The FindEventByNodeAndCheckName function correctly implements CloudEvent search by nodeName, checkName, and isHealthy status, following the same defensive pattern as the existing FindEventByNodeAndMessage function.


256-283: LGTM!

The ValidateCloudEvent function is correctly extended to validate the processingStrategy field, following the same assertion pattern as other health event fields.

distros/kubernetes/nvsentinel/values-tilt-postgresql.yaml (1)

219-222: LGTM!

The processing_strategy column is correctly added to the health_events table schema with an appropriate VARCHAR(50) type. The column is nullable, which aligns with the optional nature of this field in the proto definition and JSON serialization.

fault-quarantine/pkg/initializer/init.go (1)

66-66: LGTM!

Correctly switches to BuildProcessableHealthEventInsertsPipeline() to ensure fault-quarantine only processes EXECUTE_REMEDIATION events, filtering out STORE_ONLY observability events.

docs/postgresql-schema.sql (1)

106-109: LGTM!

The processing_strategy column is correctly added to the canonical PostgreSQL schema and matches the corresponding change in values-tilt-postgresql.yaml, maintaining consistency between the two schema definitions as required.

distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/configmap.yaml (1)

51-53: LGTM!

The processingStrategy field is correctly rendered in the ConfigMap template with proper conditional logic, indentation, and quoting, following the same pattern as the errorCode field.

tests/fault_quarantine_test.go (4)

26-26: LGTM!

The import of the protos package is necessary to reference the ProcessingStrategy enum constants used in the test.


233-250: LGTM!

The test setup correctly initializes the test context and marks the node as managed by NVSentinel, which is required for the quarantine behavior being tested.


252-286: LGTM!

The test assessments correctly verify that STORE_ONLY events do not trigger quarantine while EXECUTE_REMEDIATION events do, effectively validating the core behavior of the processing strategy feature.


288-294: LGTM!

The teardown properly cleans up by sending a healthy event and calling the standard teardown helper.

distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml (1)

107-112: LGTM!

The processingStrategy field is well-documented with clear explanations of both modes and sensible defaults. The default value EXECUTE_REMEDIATION maintains backward compatibility while enabling the new observability-only mode when needed.

@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-in-kubernetes-monitor branch from 8fc966f to 265ed4b Compare January 13, 2026 09:51
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @tests/platform-connector_test.go:
- Around line 82-93: The test is calling CheckNodeConditionExists and
CheckNodeEventExists but ignoring their return values, so assertions are
ineffective; update the test to capture each call's boolean/result and use
t.Fatalf/t.Errorf or require/assert to fail the test when the check returns
false (e.g., assign condOk := helpers.CheckNodeConditionExists(...); if !condOk
{ t.Fatalf("expected condition SysLogsXIDError on %s", testCtx.NodeName) }) and
do the same for the event call (eventOk := helpers.CheckNodeEventExists(...)) so
failures are reported and the test actually validates the expected state.
🧹 Nitpick comments (2)
tests/platform-connector_test.go (1)

28-32: Remove unused struct fields.

ConfigMapBackup and TestNamespace are defined but never used in this test. Consider removing them to avoid confusion.

♻️ Suggested refactor
 type PlatformConnectorTestContext struct {
 	NodeName        string
-	ConfigMapBackup []byte
-	TestNamespace   string
 }
tests/kubernetes_object_monitor_test.go (1)

191-218: Test assertion is trivially satisfied.

Since the previous Assess block verified that the annotation was never created with STORE_ONLY, this assertion for "annotation cleared" will pass trivially because the annotation doesn't exist to begin with. Consider either:

  1. Renaming the assess description to "Node Ready maintains no annotation"
  2. Adding explicit logging to clarify the expected behavior

This doesn't affect test correctness but could be misleading when reviewing test results.

♻️ Suggested clarification
-	feature.Assess("Node Ready recovery clears annotation", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
+	feature.Assess("Node Ready maintains no annotation under STORE_ONLY", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8fc966f and 265ed4b.

📒 Files selected for processing (16)
  • distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/configmap.yaml
  • distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/deployment.yaml
  • distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml
  • distros/kubernetes/nvsentinel/values-tilt-postgresql.yaml
  • docs/postgresql-schema.sql
  • health-monitors/kubernetes-object-monitor/main.go
  • health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
  • health-monitors/kubernetes-object-monitor/pkg/config/types.go
  • health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go
  • health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go
  • health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go
  • tests/event_exporter_test.go
  • tests/helpers/event_exporter.go
  • tests/helpers/kube.go
  • tests/kubernetes_object_monitor_test.go
  • tests/platform-connector_test.go
🚧 Files skipped from review as they are similar to previous changes (9)
  • tests/event_exporter_test.go
  • health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
  • docs/postgresql-schema.sql
  • health-monitors/kubernetes-object-monitor/main.go
  • tests/helpers/event_exporter.go
  • distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/configmap.yaml
  • distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml
  • distros/kubernetes/nvsentinel/values-tilt-postgresql.yaml
  • distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/deployment.yaml
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go
  • health-monitors/kubernetes-object-monitor/pkg/config/types.go
  • health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go
  • health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go
  • tests/helpers/kube.go
  • tests/platform-connector_test.go
  • tests/kubernetes_object_monitor_test.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go
  • tests/platform-connector_test.go
  • tests/kubernetes_object_monitor_test.go
🧠 Learnings (9)
📓 Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
📚 Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

Applied to files:

  • health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go
📚 Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • health-monitors/kubernetes-object-monitor/pkg/config/types.go
  • tests/platform-connector_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go
  • tests/kubernetes_object_monitor_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go
  • tests/platform-connector_test.go
  • tests/kubernetes_object_monitor_test.go
📚 Learning: 2025-11-04T06:31:02.147Z
Learnt from: Gyan172004
Repo: NVIDIA/NVSentinel PR: 223
File: platform-connectors/pkg/nodemetadata/processor.go:0-0
Timestamp: 2025-11-04T06:31:02.147Z
Learning: In platform-connectors/pkg/nodemetadata/processor.go, the NewProcessor function does not perform a nil check on the config parameter because the caller is expected to guarantee a non-nil config is provided.

Applied to files:

  • health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go
  • tests/platform-connector_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

  • tests/platform-connector_test.go
📚 Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • tests/kubernetes_object_monitor_test.go
🧬 Code graph analysis (4)
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (3)
data-models/pkg/protos/health_event_grpc.pb.go (1)
  • PlatformConnectorClient (43-45)
health-monitors/kubernetes-object-monitor/pkg/config/types.go (1)
  • Policy (20-27)
data-models/pkg/protos/health_event.pb.go (1)
  • ProcessingStrategy_value (59-63)
health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go (2)
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
  • NewManager (43-45)
data-models/pkg/protos/health_event.pb.go (1)
  • ProcessingStrategy_EXECUTE_REMEDIATION (48-48)
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go (3)
data-models/pkg/protos/health_event.pb.go (5)
  • ProcessingStrategy (44-44)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (80-82)
  • ProcessingStrategy (89-91)
  • ProcessingStrategy_value (59-63)
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (1)
  • New (41-46)
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
  • NewManager (43-45)
tests/platform-connector_test.go (4)
tests/helpers/kube.go (5)
  • SelectTestNodeFromUnusedPool (413-437)
  • EnsureNodeConditionNotPresent (1797-1818)
  • EnsureNodeEventNotPresent (387-409)
  • CheckNodeConditionExists (1283-1300)
  • CheckNodeEventExists (1325-1353)
tests/helpers/healthevent.go (3)
  • NewHealthEvent (60-76)
  • SendHealthEvent (263-275)
  • SendHealthyEvent (277-287)
tests/helpers/health_events_analyzer.go (2)
  • ERRORCODE_79 (35-35)
  • ERRORCODE_31 (32-32)
data-models/pkg/protos/health_event.pb.go (2)
  • ProcessingStrategy_STORE_ONLY (49-49)
  • ProcessingStrategy_EXECUTE_REMEDIATION (48-48)
🔇 Additional comments (21)
health-monitors/kubernetes-object-monitor/pkg/config/types.go (1)

49-50: LGTM!

The new ProcessingStrategy field in HealthEventSpec allows per-policy override of the global processing strategy. The string type is appropriate for TOML configuration and will be validated/mapped to the protobuf enum in the publisher.

tests/platform-connector_test.go (1)

34-51: LGTM on test setup structure.

The test setup correctly selects a test node from the unused pool and stores it in the context for use in subsequent assess/teardown phases.

tests/helpers/kube.go (4)

2232-2273: LGTM!

WaitForDaemonSetRollout is well-implemented with proper checks for DesiredNumberScheduled, UpdatedNumberScheduled, and NumberReady. Good use of t.Helper() and descriptive logging.


2275-2337: LGTM!

SetDeploymentArgs and setArgsOnContainer correctly handle both --flag=value and --flag value styles. The use of retry.RetryOnConflict ensures safe updates under concurrent access. As per coding guidelines, errors are returned without wrapping within the retry block to preserve retry behavior.


2339-2369: LGTM!

tryUpdateExistingArg handles the complexity of updating args in both --flag=value and --flag value styles correctly. The slice insertion logic at line 2361 properly handles the case where a value needs to be inserted after a standalone flag.


2371-2433: LGTM!

RemoveDeploymentArgs and removeArgsFromContainer correctly handle removal of both arg styles. The logic properly removes the flag and its associated value when using --flag value style (lines 2423-2424).

health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (2)

36-46: LGTM!

The Publisher struct now properly stores the processingStrategy and the constructor correctly initializes it. This aligns with the initialization flow that propagates the strategy from CLI flags through to the publisher.


48-74: LGTM!

The strategy resolution logic is well-designed:

  1. Uses the publisher's default strategy
  2. Allows per-policy override via policy.HealthEvent.ProcessingStrategy
  3. Validates override values against the protobuf enum map
  4. Fails fast with a clear error for invalid values

The ProcessingStrategy is correctly included in the HealthEvent payload.
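
The resolution order described above can be sketched in a few lines of Go. This is a minimal illustration under stated assumptions, not the code in publisher.go: the `ProcessingStrategy` type, its constants, and the `strategyByName` map are local stand-ins for the generated protobuf enum and `pb.ProcessingStrategy_value`, the numeric values are illustrative, and `resolveStrategy` is a hypothetical helper name.

```go
package main

import "fmt"

// ProcessingStrategy is a local stand-in for the generated protobuf enum.
// The numeric values here are illustrative only.
type ProcessingStrategy int32

const (
	Unspecified ProcessingStrategy = iota
	ExecuteRemediation
	StoreOnly
)

// strategyByName plays the role of pb.ProcessingStrategy_value in this sketch.
var strategyByName = map[string]ProcessingStrategy{
	"UNSPECIFIED":         Unspecified,
	"EXECUTE_REMEDIATION": ExecuteRemediation,
	"STORE_ONLY":          StoreOnly,
}

// resolveStrategy returns the publisher-wide default unless the policy
// supplies an override. An override that is not a known enum name is
// rejected immediately instead of being silently ignored.
func resolveStrategy(defaultStrategy ProcessingStrategy, override string) (ProcessingStrategy, error) {
	if override == "" {
		return defaultStrategy, nil
	}

	value, known := strategyByName[override]
	if !known {
		return Unspecified, fmt.Errorf("invalid processingStrategy %q", override)
	}

	return value, nil
}
```

Note that the lookup in this sketch is case-sensitive, so `store_only` is rejected rather than mapped to `STORE_ONLY`; whether the real publisher should be more lenient is a separate design question.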

health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go (4)

54-54: LGTM!

Adding ProcessingStrategy to Params allows the CLI flag value to be propagated through initialization.


83-91: LGTM!

The validation logic correctly:

  1. Looks up the strategy string in pb.ProcessingStrategy_value
  2. Returns a clear error for invalid values
  3. Logs the configured strategy for observability
  4. Creates the publisher with the validated strategy value

116-117: LGTM!

The processing strategy is correctly passed to registerControllers, ensuring the annotation manager receives the same strategy configuration.


170-172: LGTM!

The registerControllers function signature is updated to accept processingStrategy and correctly passes it to annotations.NewManager. This ensures the annotation manager respects the configured strategy when deciding whether to update/remove annotations.

tests/kubernetes_object_monitor_test.go (4)

24-25: LGTM!

Import added correctly for the helpers package.


128-160: LGTM!

The test setup correctly:

  1. Finds a non-KWOK node for testing
  2. Sets the --processing-strategy=STORE_ONLY deployment arg
  3. Waits for the deployment rollout to complete

This ensures the kubernetes-object-monitor is running with the STORE_ONLY strategy before assertions.


162-189: LGTM!

Good use of require.Never to assert that with STORE_ONLY strategy, no policy match annotation is applied when the node condition becomes unhealthy. This validates the core STORE_ONLY behavior.


220-232: LGTM!

Proper teardown that removes the --processing-strategy arg and waits for deployment rollout, restoring the original state for subsequent tests.

health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go (5)

37-37: LGTM!

Correct import of the protos package to access ProcessingStrategy_EXECUTE_REMEDIATION.


576-576: LGTM!

Updated to pass protos.ProcessingStrategy_EXECUTE_REMEDIATION to annotations.NewManager, matching the new constructor signature. Using EXECUTE_REMEDIATION preserves the expected behavior for existing tests.


631-631: LGTM!

Consistent with the other setup functions.


739-739: LGTM!

Consistent update in restartReconciler.


777-777: LGTM!

Consistent update in restartReconcilerWithCRD. All test setup paths now correctly pass the processing strategy to the annotation manager.

@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-in-kubernetes-monitor branch from 265ed4b to 112a762 Compare January 13, 2026 10:14
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
tests/helpers/kube.go (1)

2339-2369: Consider simplifying the --flag value update logic.

The slice insertion at line 2361 is complex. When the flag exists without a value (e.g., --verbose) and you want to add a value, inserting into the middle of a slice with append(container.Args[:j+1], append([]string{value}, container.Args[j+1:]...)...) works but is hard to follow.

Also, at lines 2358-2359, if the next argument starts with -, you insert the value after the flag. However, if --flag was originally a boolean flag (no value), this insertion may be unintended.

♻️ Suggested simplification

Consider documenting the expected behavior more explicitly or simplifying by always using --flag=value style when updating:

 // Match --flag or --flag value style
 if existingArg == flag {
     if value != "" {
         if j+1 < len(container.Args) && !strings.HasPrefix(container.Args[j+1], "-") {
+            // Update existing separate value
             container.Args[j+1] = value
         } else {
-            container.Args = append(container.Args[:j+1], append([]string{value}, container.Args[j+1:]...)...)
+            // Convert to --flag=value style for simplicity
+            container.Args[j] = flag + "=" + value
         }
     }
 
     return true
 }
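
To make the suggested behavior concrete, here is a small self-contained sketch of an argument updater that handles both `--flag=value` and `--flag value` and falls back to the `--flag=value` form instead of inserting into the middle of the slice. It is an assumption-laden illustration, not the helper in tests/helpers/kube.go: `setArg` and `joinArg` are hypothetical names, and the function works on a plain `[]string` rather than a Kubernetes container spec.

```go
package main

import "strings"

// joinArg renders a flag in --flag=value form, or as a bare flag when
// value is empty.
func joinArg(flag, value string) string {
	if value == "" {
		return flag
	}

	return flag + "=" + value
}

// setArg returns a copy of args in which flag carries value.
// It updates "--flag=old" in place, updates the separate value of
// "--flag old", rewrites a bare "--flag" as "--flag=value", and appends
// the flag when it is not present at all.
func setArg(args []string, flag, value string) []string {
	out := append([]string(nil), args...) // copy so the caller's slice is untouched

	for i, arg := range out {
		switch {
		case strings.HasPrefix(arg, flag+"="):
			out[i] = joinArg(flag, value)
			return out
		case arg == flag:
			if value == "" {
				return out
			}

			if i+1 < len(out) && !strings.HasPrefix(out[i+1], "-") {
				out[i+1] = value // existing separate value
			} else {
				out[i] = joinArg(flag, value) // no value follows: use --flag=value
			}

			return out
		}
	}

	return append(out, joinArg(flag, value))
}
```

The trade-off is that a bare boolean flag is rewritten rather than left with a separate value argument, which keeps the slice length stable and avoids the nested `append` expression.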
tests/kubernetes_object_monitor_test.go (3)

162-189: Test name doesn't match assertion.

The assess title says "triggers health event" but the test only verifies that annotations are NOT applied (STORE_ONLY behavior). Consider renaming to better reflect what's being tested, e.g., "Node NotReady does not apply annotation with STORE_ONLY strategy".

📝 Suggested rename
-	feature.Assess("Node NotReady triggers health event with STORE_ONLY strategy", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
+	feature.Assess("Node NotReady does not apply annotation with STORE_ONLY strategy", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {

191-218: Misleading test semantics in STORE_ONLY context.

In STORE_ONLY mode, the annotation is never applied (validated by the previous assess). This test checking "annotation to be cleared" is semantically misleading since there's nothing to clear. The test will pass immediately because the annotation doesn't exist.

Consider either:

  1. Removing this assess since it's redundant in STORE_ONLY mode, or
  2. Renaming to clarify intent, e.g., "Node Ready recovery keeps annotation absent with STORE_ONLY strategy"
📝 Suggested rename if keeping the test
-	feature.Assess("Node Ready recovery clears annotation", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
+	feature.Assess("Node Ready recovery keeps annotation absent with STORE_ONLY strategy", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
-		t.Log("Waiting for policy match annotation to be cleared")
+		t.Log("Verifying policy match annotation remains absent")

220-232: Consider adding node condition cleanup to teardown.

The teardown correctly restores the deployment args. However, if the test fails midway (e.g., after setting TestCondition to False but before recovery), the node condition might be left in an unhealthy state, potentially affecting other tests.

📝 Suggested addition for robustness
 	feature.Teardown(func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
 		client, err := c.NewClient()
 		require.NoError(t, err)

+		// Ensure node condition is restored to healthy state
+		nodeName := ctx.Value(k8sMonitorKeyNodeName).(string)
+		helpers.SetNodeConditionStatus(ctx, t, client, nodeName, v1.NodeConditionType(testConditionType), v1.ConditionTrue)
+
 		err = helpers.RemoveDeploymentArgs(ctx, client, "kubernetes-object-monitor", helpers.NVSentinelNamespace, "", map[string]string{
 			"--processing-strategy": "STORE_ONLY",
 		})
 		require.NoError(t, err)

 		helpers.WaitForDeploymentRollout(ctx, t, client, "kubernetes-object-monitor", helpers.NVSentinelNamespace)

 		return ctx
 	})
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 265ed4b and 112a762.

📒 Files selected for processing (6)
  • health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
  • health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go
  • health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go
  • tests/helpers/event_exporter.go
  • tests/helpers/kube.go
  • tests/kubernetes_object_monitor_test.go
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • tests/kubernetes_object_monitor_test.go
  • health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
  • tests/helpers/event_exporter.go
  • health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go
  • health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go
  • tests/helpers/kube.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • tests/kubernetes_object_monitor_test.go
  • health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go
🧠 Learnings (8)
📓 Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
📚 Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • tests/kubernetes_object_monitor_test.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • tests/kubernetes_object_monitor_test.go
  • health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • tests/kubernetes_object_monitor_test.go
  • health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
  • health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to data-models/protobufs/**/*.proto : Define Protocol Buffer messages in `data-models/protobufs/` directory

Applied to files:

  • health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
📚 Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • tests/helpers/event_exporter.go
📚 Learning: 2025-11-04T06:31:02.147Z
Learnt from: Gyan172004
Repo: NVIDIA/NVSentinel PR: 223
File: platform-connectors/pkg/nodemetadata/processor.go:0-0
Timestamp: 2025-11-04T06:31:02.147Z
Learning: In platform-connectors/pkg/nodemetadata/processor.go, the NewProcessor function does not perform a nil check on the config parameter because the caller is expected to guarantee a non-nil config is provided.

Applied to files:

  • health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go
🧬 Code graph analysis (3)
tests/kubernetes_object_monitor_test.go (1)
tests/helpers/kube.go (8)
  • SetDeploymentArgs (2279-2313)
  • NVSentinelNamespace (64-64)
  • WaitForDeploymentRollout (984-1125)
  • SetNodeConditionStatus (1733-1794)
  • GetNodeByName (466-475)
  • NeverWaitTimeout (62-62)
  • WaitInterval (63-63)
  • EventuallyWaitTimeout (61-61)
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
data-models/pkg/protos/health_event.pb.go (5)
  • ProcessingStrategy (44-44)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (80-82)
  • ProcessingStrategy (89-91)
  • ProcessingStrategy_STORE_ONLY (49-49)
health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go (1)
data-models/pkg/protos/health_event.pb.go (1)
  • ProcessingStrategy_EXECUTE_REMEDIATION (48-48)
🔇 Additional comments (17)
tests/helpers/event_exporter.go (2)

220-255: LGTM!

The new FindEventByNodeAndCheckName function follows the established pattern of FindEventByNodeAndMessage and correctly searches for events matching nodeName, checkName, and isHealthy status. The type assertions and nil checks are consistent with the existing code.


257-283: LGTM!

The ValidateCloudEvent function correctly validates the new processingStrategy field within the healthEvent payload. The addition aligns with the PR's objective to propagate processing strategy through CloudEvents.

health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (4)

24-24: LGTM!

Import alias pb for the protos package follows consistent naming conventions used elsewhere in the codebase.


38-45: LGTM!

The Manager struct extension and constructor update correctly propagate the processingStrategy configuration. The field is properly typed as pb.ProcessingStrategy enum.


47-65: LGTM!

The guard in AddMatch correctly prevents annotation updates when STORE_ONLY strategy is configured. The debug logging provides good observability for troubleshooting.


68-90: LGTM!

The guard in RemoveMatch mirrors the AddMatch guard, correctly skipping annotation removal for STORE_ONLY strategy. The implementation is consistent and the logging is helpful.

health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go (2)

37-37: LGTM!

Import for the protos package correctly added to support the ProcessingStrategy enum usage.


576-576: LGTM!

All test setup functions (setupTestWithPolicies, setupTestWithCRD, restartReconciler, restartReconcilerWithCRD) consistently pass protos.ProcessingStrategy_EXECUTE_REMEDIATION to annotations.NewManager. This ensures existing tests continue to verify annotation modification behavior.

Also applies to: 631-631, 739-739, 777-777

tests/helpers/kube.go (3)

2231-2273: LGTM!

WaitForDaemonSetRollout correctly implements rollout completion detection by verifying that UpdatedNumberScheduled and NumberReady both match DesiredNumberScheduled. The progress logging is helpful for debugging test failures.
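
The completion condition described here reduces to a comparison of three status counters. The sketch below is a simplified illustration, assuming only those counters matter: `daemonSetStatus` is a local stand-in that mirrors the identically named fields of `appsv1.DaemonSetStatus`, and `rolloutComplete` is a hypothetical name. The real helper may check more than this (for example, observed generation), which is not shown in this review.

```go
package main

// daemonSetStatus mirrors the three counters from appsv1.DaemonSetStatus
// that the rollout check relies on. It is a local stand-in so the sketch
// compiles without Kubernetes dependencies.
type daemonSetStatus struct {
	DesiredNumberScheduled int32
	UpdatedNumberScheduled int32
	NumberReady            int32
}

// rolloutComplete reports whether every desired pod has been updated to
// the current template and is ready.
func rolloutComplete(s daemonSetStatus) bool {
	return s.UpdatedNumberScheduled == s.DesiredNumberScheduled &&
		s.NumberReady == s.DesiredNumberScheduled
}
```

A polling loop would call a check like this on each fetch of the DaemonSet and log the three counters when it returns false, which matches the progress logging mentioned above.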


2275-2313: LGTM!

SetDeploymentArgs follows the established pattern of SetDeploymentEnvVars with proper retry-on-conflict handling and container targeting logic.


2409-2433: LGTM!

removeArgsFromContainer correctly handles both --flag=value and --flag value argument styles. The slice manipulation for removing 1 or 2 elements is correct.

health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go (4)

47-55: LGTM!

The ProcessingStrategy field in Params correctly captures the CLI flag value as a string for validation.


83-91: LGTM!

The validation uses the protobuf-generated pb.ProcessingStrategy_value map, which ensures only valid enum values are accepted. The error message is descriptive. The type conversion pb.ProcessingStrategy(strategyValue) is safe since strategyValue is obtained from the valid enum value map.


116-120: LGTM!

The processingStrategy is correctly propagated to registerControllers, maintaining consistency across the initialization flow.


164-172: LGTM!

The registerControllers function correctly receives the processingStrategy parameter and passes it to annotations.NewManager. The annotation manager is appropriately shared across all reconcilers for a given GVK.

tests/kubernetes_object_monitor_test.go (2)

24-25: LGTM!

The import addition is necessary for using the helper functions in the new test.


133-160: LGTM!

The setup correctly configures the STORE_ONLY strategy via deployment args and waits for the rollout to complete before proceeding with tests.

@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-in-kubernetes-monitor branch from 112a762 to 674f112 Compare January 16, 2026 18:04
@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-in-kubernetes-monitor branch from 674f112 to 41cb337 Compare January 19, 2026 10:54
@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-in-kubernetes-monitor branch from 41cb337 to 93b3c0b Compare January 19, 2026 10:57
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go`:
- Around line 50-55: LoadAllMatches currently loads annotations regardless of
m.processingStrategy, causing stale AnnotationKey entries to be kept when
switching to STORE_ONLY; update LoadAllMatches to ignore/filter out
AnnotationKey annotations when m.processingStrategy ==
pb.ProcessingStrategy_STORE_ONLY (or alternatively add startup logic in the
manager initialization to clear existing AnnotationKey annotations from nodes
when entering STORE_ONLY) so that no previous remediation state is loaded into
memory; reference the LoadAllMatches method, the m.processingStrategy field and
pb.ProcessingStrategy_STORE_ONLY constant and ensure AnnotationKey is excluded
or cleared before any in-memory match state is built.

In `@tests/helpers/kuberntest_object_monitor.go`:
- Around line 37-38: RestoreDeploymentArgs can clear a container's args when
originalArgs is nil; add a defensive nil guard in RestoreDeploymentArgs that
checks if originalArgs == nil and, if so, skip restoring or return early
(mirroring the configMapBackup nil-check pattern) so you don't call
make([]string, len(originalArgs)) and overwrite args with an empty slice;
reference RestoreDeploymentArgs, originalArgs, and SetDeploymentArgs when
locating the fix.
- Around line 11-34: Add package-level godoc and doc comments for the exported
constants K8S_DEPLOYMENT_NAME and K8S_CONTAINER_NAME, the exported type
KubernetesObjectMonitorTestContext, and the exported function
TeardownKubernetesObjectMonitor; each comment should be a one-line sentence
describing the identifier's purpose and follow Go doc comment format. Also
extract the literal "kubernetes-object-monitor" passed to
createConfigMapFromBytes into a new exported constant (e.g., K8S_CONFIGMAP_NAME)
and replace the hard-coded string in TeardownKubernetesObjectMonitor to use that
constant for consistency.
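
The nil-guard item above hinges on a Go detail: `make([]string, len(originalArgs))` on a nil slice yields an empty, non-nil slice, so writing it back would wipe the container's args. A minimal sketch of the guard follows, assuming the restore step can be separated from the Kubernetes update; `restoreArgs` is a hypothetical helper name and does not reflect the actual signature of RestoreDeploymentArgs.

```go
package main

// restoreArgs decides what to write back to a container's args.
// When no backup was captured (original == nil) it leaves the current
// args alone and reports false, so the caller can skip the update.
// An empty but non-nil backup is treated as a real "no args" state.
func restoreArgs(current, original []string) ([]string, bool) {
	if original == nil {
		return current, false
	}

	restored := make([]string, len(original))
	copy(restored, original)

	return restored, true
}
```

Distinguishing nil from empty matters here: a container that legitimately started with no args should still be restored to that state, while a missing backup should not be.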
🧹 Nitpick comments (3)
health-monitors/kubernetes-object-monitor/pkg/config/types.go (1)

49-50: Validate/normalize ProcessingStrategy values on load.

This is a free-form string; typos or lowercase values will silently bypass STORE_ONLY logic elsewhere. Consider validating against allowed enum names and normalizing empty/UNSPECIFIED to EXECUTE_REMEDIATION in config parsing, and document the accepted values in the comment. Based on learnings, other components already normalize UNSPECIFIED to EXECUTE_REMEDIATION.
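
One way the suggested validation could look is sketched below. It is an illustration under assumptions rather than the project's implementation: `normalizeStrategy` and `allowedStrategies` are hypothetical names, and the accepted names are taken from the two strategies this PR describes.

```go
package main

import (
	"fmt"
	"strings"
)

// allowedStrategies lists the strategy names this sketch accepts after
// normalization.
var allowedStrategies = map[string]bool{
	"EXECUTE_REMEDIATION": true,
	"STORE_ONLY":          true,
}

// normalizeStrategy trims and upper-cases a configured strategy name,
// maps empty and UNSPECIFIED to EXECUTE_REMEDIATION, and rejects any
// other unknown value so that typos fail at load time.
func normalizeStrategy(raw string) (string, error) {
	name := strings.ToUpper(strings.TrimSpace(raw))
	if name == "" || name == "UNSPECIFIED" {
		return "EXECUTE_REMEDIATION", nil
	}

	if !allowedStrategies[name] {
		return "", fmt.Errorf("unknown processingStrategy %q (want EXECUTE_REMEDIATION or STORE_ONLY)", raw)
	}

	return name, nil
}
```

One caveat: for the per-policy field, an empty value may be intended to mean "inherit the global strategy", in which case the empty-string case should be handled by the caller before a normalizer like this one is applied.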

health-monitors/kubernetes-object-monitor/pkg/controller/reconciler.go (1)

217-222: Normalize ProcessingStrategy before string compare.

Line 218 compares a config string to pb.ProcessingStrategy_STORE_ONLY.String(); this is case-sensitive and assumes canonical formatting. Consider normalizing (or parsing once into an enum) so typos/case differences don't silently re-enable annotation writes.

tests/kubernetes_object_monitor_test.go (1)

136-154: Consider extracting non-KWOK node selection into a helper.

The same selection loop appears in multiple tests; a small helper would reduce duplication and improve readability.
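
A rough sketch of such a helper, under stated assumptions: `node` is a minimal stand-in for `corev1.Node` to keep the example stdlib-only, and the `type=kwok` label used to recognize simulated nodes is an assumption, not taken from the test code. `firstNonKWOKNode` is a hypothetical name.

```go
package main

import (
	"errors"
	"fmt"
)

// node is a minimal stand-in for corev1.Node so the sketch has no
// Kubernetes dependencies.
type node struct {
	Name   string
	Labels map[string]string
}

// kwokLabelKey and kwokLabelValue are assumptions about how simulated nodes
// are marked; adjust them to whatever the existing tests check.
const (
	kwokLabelKey   = "type"
	kwokLabelValue = "kwok"
)

var errNoRealNode = errors.New("no non-KWOK node found")

// firstNonKWOKNode returns the name of the first node that is not labelled
// as a KWOK node, or errNoRealNode if every node is simulated.
func firstNonKWOKNode(nodes []node) (string, error) {
	for _, n := range nodes {
		// Reading from a nil label map is safe and yields the empty string.
		if n.Labels[kwokLabelKey] != kwokLabelValue {
			return n.Name, nil
		}
	}

	return "", errNoRealNode
}

func main() {
	nodes := []node{
		{Name: "kwok-node-0", Labels: map[string]string{kwokLabelKey: kwokLabelValue}},
		{Name: "worker-0", Labels: map[string]string{"kubernetes.io/os": "linux"}},
	}

	name, err := firstNonKWOKNode(nodes)
	fmt.Println(name, err)
}
```

Each test could then call the helper once and fail early on the error instead of repeating the loop.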

Comment on lines 50 to 55
if m.processingStrategy == pb.ProcessingStrategy_STORE_ONLY {
slog.Debug("Skipping match state update for STORE_ONLY strategy", "node", nodeName,
"stateKey", stateKey, "targetNode", targetNode)

return nil
}

⚠️ Potential issue | 🟑 Minor

🧩 Analysis chain

🏁 Script executed:

# First, find the relevant file
find . -type f -name "manager.go" | grep kubernetes-object-monitor

Repository: NVIDIA/NVSentinel

Length of output: 131


🏁 Script executed:

# Get the file structure to understand the repo layout
git ls-files | head -30

Repository: NVIDIA/NVSentinel

Length of output: 1069


🏁 Script executed:

# Read the specific file to understand context around lines 50-74
wc -l health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go

Repository: NVIDIA/NVSentinel

Length of output: 133


🏁 Script executed:

# Read the relevant section of the manager.go file
sed -n '40,90p' health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go

Repository: NVIDIA/NVSentinel

Length of output: 1574


🏁 Script executed:

# Read more of the file to find LoadAllMatches and other methods
sed -n '1,50p' health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go

Repository: NVIDIA/NVSentinel

Length of output: 1667


🏁 Script executed:

# Look for LoadAllMatches method
grep -n "LoadAllMatches" health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go

Repository: NVIDIA/NVSentinel

Length of output: 147


🏁 Script executed:

# Read LoadAllMatches and surrounding methods
sed -n '119,188p' health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go

Repository: NVIDIA/NVSentinel

Length of output: 2020


🏁 Script executed:

# Find GetMatches method
grep -n "GetMatches" health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go

Repository: NVIDIA/NVSentinel

Length of output: 209


🏁 Script executed:

# Search in all files to understand where GetMatches is called and what it does
rg "func.*GetMatches" health-monitors/kubernetes-object-monitor

Repository: NVIDIA/NVSentinel

Length of output: 224


🏁 Script executed:

# Read GetMatches method
sed -n '92,118p' health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go

Repository: NVIDIA/NVSentinel

Length of output: 817


🏁 Script executed:

# Search for STORE_ONLY usage and documentation
rg "STORE_ONLY" health-monitors/kubernetes-object-monitor --context 3

Repository: NVIDIA/NVSentinel

Length of output: 3468


🏁 Script executed:

# Check if there are any comments or documentation about STORE_ONLY mode
rg "STORE_ONLY" . --context 5 | head -100

Repository: NVIDIA/NVSentinel

Length of output: 7694


🏁 Script executed:

# Look for ProcessingStrategy definition to understand STORE_ONLY
rg "ProcessingStrategy" health-monitors/kubernetes-object-monitor -A 5 -B 5 | head -80

Repository: NVIDIA/NVSentinel

Length of output: 7595


🏁 Script executed:

# Search for where LoadAllMatches is called
rg "LoadAllMatches" health-monitors/kubernetes-object-monitor --context 3

Repository: NVIDIA/NVSentinel

Length of output: 1494


🏁 Script executed:

# Also check the reconciler to understand the full flow
sed -n '1,80p' health-monitors/kubernetes-object-monitor/pkg/controller/reconciler.go

Repository: NVIDIA/NVSentinel

Length of output: 2630


🏁 Script executed:

# Search for when LoadState is called and any related cleanup logic
rg "LoadState\|STORE_ONLY" health-monitors/kubernetes-object-monitor/pkg/controller/reconciler.go -B 5 -A 10

Repository: NVIDIA/NVSentinel

Length of output: 43


🏁 Script executed:

# Check if there's any initialization or cleanup logic related to STORE_ONLY
rg "processingStrategy" health-monitors/kubernetes-object-monitor/pkg/controller/reconciler.go -B 3 -A 3

Repository: NVIDIA/NVSentinel

Length of output: 43


🏁 Script executed:

# Look at the full reconciler to understand the flow better
sed -n '60,150p' health-monitors/kubernetes-object-monitor/pkg/controller/reconciler.go

Repository: NVIDIA/NVSentinel

Length of output: 2698


🏁 Script executed:

# Search for any initialization or cleanup that might handle STORE_ONLY annotations
rg "AnnotationKey" health-monitors/kubernetes-object-monitor --context 3

Repository: NVIDIA/NVSentinel

Length of output: 3450


🏁 Script executed:

# Check if there's documentation or comments about STORE_ONLY behavior
rg "STORE_ONLY" . -B 2 -A 2 | grep -E "(TODO|FIXME|NOTE|BUG|comment|//)" -A 2 -B 2

Repository: NVIDIA/NVSentinel

Length of output: 14129


🏁 Script executed:

# Look at the full annotations manager file to see all methods
wc -l health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go

Repository: NVIDIA/NVSentinel

Length of output: 133


🏁 Script executed:

# Check if there's any filtering logic in GetMatches or LoadAllMatches
sed -n '92,140p' health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go

Repository: NVIDIA/NVSentinel

Length of output: 1407


Filter stale annotations in STORE_ONLY mode or clear them on startup.

LoadAllMatches reads all node annotations without checking processingStrategy. If a cluster switches from EXECUTE_REMEDIATION to STORE_ONLY, stale annotations from the previous mode will be loaded into memory and used to emit healthy events on restart, violating STORE_ONLY's observability-only semantics. Either filter out AnnotationKey when loading in STORE_ONLY mode, or clear existing annotations on startup.
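
A minimal sketch of the filtering option, assuming the manager holds its strategy in a field: the `manager` type, `loadAllMatches` signature, and `annotationKey` value below are simplified stand-ins, not the actual code in manager.go, and node annotations are modelled as plain maps rather than fetched from the API server.

```go
package main

import "fmt"

// processingStrategy is a local stand-in for pb.ProcessingStrategy.
type processingStrategy int

const (
	executeRemediation processingStrategy = iota
	storeOnly
)

// annotationKey is a placeholder; the real AnnotationKey value is defined in
// the annotations package.
const annotationKey = "example.com/policy-matches"

// manager is a simplified stand-in for the annotations manager.
type manager struct {
	strategy processingStrategy
}

// loadAllMatches maps node name to the raw annotation value for every node
// carrying annotationKey. In STORE_ONLY mode it returns an empty map, so
// annotations left behind by an earlier EXECUTE_REMEDIATION run are never
// loaded into memory.
func (m *manager) loadAllMatches(nodeAnnotations map[string]map[string]string) map[string]string {
	matches := make(map[string]string)
	if m.strategy == storeOnly {
		return matches
	}

	for nodeName, annotations := range nodeAnnotations {
		if value, ok := annotations[annotationKey]; ok {
			matches[nodeName] = value
		}
	}

	return matches
}

func main() {
	cluster := map[string]map[string]string{
		"worker-0": {annotationKey: `{"policy-a":"worker-0"}`},
		"worker-1": {},
	}

	fmt.Println(len((&manager{strategy: executeRemediation}).loadAllMatches(cluster)))
	fmt.Println(len((&manager{strategy: storeOnly}).loadAllMatches(cluster)))
}
```

Filtering at load time leaves the stale annotations on the nodes; clearing them on startup would be the alternative if they should not survive a later switch back to EXECUTE_REMEDIATION.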

🤖 Prompt for AI Agents
In `@health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go` around
lines 50 - 55, LoadAllMatches currently loads annotations regardless of
m.processingStrategy, causing stale AnnotationKey entries to be kept when
switching to STORE_ONLY; update LoadAllMatches to ignore/filter out
AnnotationKey annotations when m.processingStrategy ==
pb.ProcessingStrategy_STORE_ONLY (or alternatively add startup logic in the
manager initialization to clear existing AnnotationKey annotations from nodes
when entering STORE_ONLY) so that no previous remediation state is loaded into
memory; reference the LoadAllMatches method, the m.processingStrategy field and
pb.ProcessingStrategy_STORE_ONLY constant and ensure AnnotationKey is excluded
or cleared before any in-memory match state is built.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tests/kubernetes_object_monitor_test.go`:
- Around line 263-264: The call to RestartDeployment currently ignores its
returned error; update the test to capture the error from
helpers.RestartDeployment(ctx, t, client, helpers.K8S_DEPLOYMENT_NAME,
helpers.NVSentinelNamespace) and handle it immediately (e.g., if err != nil then
t.Fatal or t.Fatalf with a clear message including the error) so rollout
failures fail the test and the test state isn't left inconsistent; keep the
surrounding context (ctx, t, client, helpers.K8S_DEPLOYMENT_NAME,
helpers.NVSentinelNamespace) unchanged.
🧹 Nitpick comments (1)
tests/kubernetes_object_monitor_test.go (1)

171-198: Add exported-event validation for STORE_ONLY strategy.

These assessments only verify annotation absence. To exercise the new processingStrategy path, also assert the exported event carries STORE_ONLY (and the expected check name) using existing event helper utilities.

Also applies to: 268-295

@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-in-kubernetes-monitor branch 2 times, most recently from 364996a to 1b13327 Compare January 19, 2026 14:10
Signed-off-by: Tanisha goyal <[email protected]>
@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-in-kubernetes-monitor branch from 1b13327 to c031572 Compare January 19, 2026 14:11
Signed-off-by: Tanisha goyal <[email protected]>