feat: add event handling strategy changes in kubernetes object monitor #639
base: main
Walkthrough
This pull request introduces a new …
Estimated code review effort: 3 (Moderate) | ~25 minutes
Pre-merge checks: 2 passed, 1 failed
Failed checks (1 warning)
Passed checks (2 passed)
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
golangci-lint (2.5.0)
level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"
Actionable comments posted: 3
Nitpick comments (8)
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml (1)
107-112: Well-documented configuration option. The new `processingStrategy` field is properly documented with valid values and behavioral explanations. The default `EXECUTE_REMEDIATION` maintains backward compatibility. Consider adding validation in the Helm templates to fail early if an invalid value is provided (e.g., via `.Values.processingStrategy | upper | mustRegexMatch "^(EXECUTE_REMEDIATION|STORE_ONLY)$"`), though this is optional if validation happens at the application level.
tests/helpers/kube.go (1)
2341-2366: Missing container-not-found check. Unlike `SetDeploymentArgs` (lines 2282-2284), `RemoveDeploymentArgs` doesn't return an error if the specified `containerName` is not found in the deployment. This inconsistency could hide configuration errors.
Proposed fix

     func RemoveDeploymentArgs(
         ctx context.Context, c klient.Client, deploymentName, namespace, containerName string, args map[string]string,
     ) error {
         return retry.RetryOnConflict(retry.DefaultRetry, func() error {
             deployment := &appsv1.Deployment{}
             if err := c.Resources().Get(ctx, deploymentName, namespace, deployment); err != nil {
                 return err
             }

             if len(deployment.Spec.Template.Spec.Containers) == 0 {
                 return fmt.Errorf("deployment %s/%s has no containers", namespace, deploymentName)
             }

    +        found := false
    +
             for i := range deployment.Spec.Template.Spec.Containers {
                 container := &deployment.Spec.Template.Spec.Containers[i]

                 if containerName != "" && container.Name != containerName {
                     continue
                 }

    +            found = true
    +
                 removeArgsFromContainer(container, args)
             }

    +        if containerName != "" && !found {
    +            return fmt.Errorf("container %q not found in deployment %s/%s", containerName, namespace, deploymentName)
    +        }
    +
             return c.Resources().Update(ctx, deployment)
         })
     }

health-monitors/kubernetes-object-monitor/pkg/config/types.go (2)
49-50: Consider adding validation for ProcessingStrategy values. The `ProcessingStrategy` field is a string without validation, which could allow invalid values to be silently accepted. Consider validating against allowed values (e.g., "EXECUTE_REMEDIATION", "STORE_ONLY") either at config load time or through a custom unmarshal function.
50-50: Add godoc comment for exported field. The exported `ProcessingStrategy` field should have a godoc comment that describes its purpose and allowed values, following Go conventions. As per coding guidelines: "Function comments required for all exported Go functions" (applies to exported fields as well).
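The "custom unmarshal function" option suggested for `ProcessingStrategy` could look roughly like the sketch below. It is an illustration only: the `Strategy` type, the `healthEventSpec` struct and the `parse` helper are hypothetical names, and `encoding/json` stands in for whatever decoder the monitor actually uses for its policy file. The approach relies on the decoder honoring `encoding.TextUnmarshaler`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Strategy is a hypothetical named type for the config field; in the PR the
// field is described as a plain string.
type Strategy string

// allowed holds the two values named in the review comment.
var allowed = map[Strategy]bool{
	"EXECUTE_REMEDIATION": true,
	"STORE_ONLY":          true,
}

// UnmarshalText implements encoding.TextUnmarshaler, so decoders that honor
// the interface reject unknown values while the config is being loaded.
// An empty value is accepted and means "no per-policy override".
func (s *Strategy) UnmarshalText(text []byte) error {
	v := Strategy(text)
	if v != "" && !allowed[v] {
		return fmt.Errorf("invalid processingStrategy %q", v)
	}
	*s = v
	return nil
}

// healthEventSpec is a hypothetical stand-in for the policy's health event section.
type healthEventSpec struct {
	ProcessingStrategy Strategy `json:"processingStrategy"`
}

// parse decodes one document and returns the validated strategy.
func parse(doc string) (Strategy, error) {
	var spec healthEventSpec
	if err := json.Unmarshal([]byte(doc), &spec); err != nil {
		return "", err
	}
	return spec.ProcessingStrategy, nil
}

func main() {
	docs := []string{
		`{"processingStrategy":"STORE_ONLY"}`,
		`{"processingStrategy":"store-only"}`,
	}
	for _, doc := range docs {
		s, err := parse(doc)
		fmt.Printf("strategy=%q err=%v\n", s, err)
	}
}
```

Validating inside the type keeps the check next to the data, so every code path that loads a policy gets it without an extra call.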
health-monitors/kubernetes-object-monitor/main.go (1)
71-75: Consider validating the processing-strategy flag value. The flag accepts any string value but only "EXECUTE_REMEDIATION" and "STORE_ONLY" are valid. Consider adding validation in the `run()` function to fail fast with a clear error message if an invalid value is provided.
Suggested validation

     func run() error {
         ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
         defer stop()

    +    // Validate processing strategy
    +    validStrategies := []string{"EXECUTE_REMEDIATION", "STORE_ONLY"}
    +    isValid := false
    +    for _, valid := range validStrategies {
    +        if *processingStrategyFlag == valid {
    +            isValid = true
    +            break
    +        }
    +    }
    +    if !isValid {
    +        return fmt.Errorf("invalid processing-strategy %q, must be one of: %v", *processingStrategyFlag, validStrategies)
    +    }
    +
         params := initializer.Params{
             PolicyConfigPath: *policyConfigPath,

Also applies to: 101-101
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (1)
50-73: Strategy override logic is well-designed. The pattern of defaulting to the publisher's strategy while allowing per-policy overrides is flexible. The validation against the generated `pb.ProcessingStrategy_value` map ensures consistency with the protobuf definition. One minor observation: consider adding context to the error message (e.g., `"policy %s: unexpected processingStrategy..."`) to help identify which policy has the invalid configuration during debugging.
Optional: Add policy context to error message

         if policy.HealthEvent.ProcessingStrategy != "" {
             value, ok := pb.ProcessingStrategy_value[policy.HealthEvent.ProcessingStrategy]
             if !ok {
    -            return fmt.Errorf("unexpected processingStrategy value: %q", policy.HealthEvent.ProcessingStrategy)
    +            return fmt.Errorf("policy %q: unexpected processingStrategy value: %q", policy.Name, policy.HealthEvent.ProcessingStrategy)
             }

platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)
325-343: Consider using Debug log level for filtered events. Using `slog.Info` for every skipped `STORE_ONLY` event could generate high log volume in production. Consider `slog.Debug` for consistency with similar skip-logging patterns elsewhere in the codebase (e.g., `manager.go` lines 51, 70).
Proposed change

         for _, healthEvent := range healthEvents.Events {
             if healthEvent.ProcessingStrategy == protos.ProcessingStrategy_STORE_ONLY {
    -            slog.Info("Skipping STORE_ONLY health event (no node conditions / node events)",
    +            slog.Debug("Skipping STORE_ONLY health event (no node conditions / node events)",
                     "node", healthEvent.NodeName,
                     "checkName", healthEvent.CheckName,
                     "agent", healthEvent.Agent)

health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go (1)
116-116: Minor: duplicate cast of strategyValue. The `pb.ProcessingStrategy(strategyValue)` cast is performed twice (once on line 91 for the publisher, and again on line 116 for `registerControllers`). Consider storing the cast result in a variable to avoid repetition.
Proposed refactor

    +    processingStrategy := pb.ProcessingStrategy(strategyValue)
    +
         slog.Info("Event handling strategy configured", "processingStrategy", params.ProcessingStrategy)

    -    pub := publisher.New(pcClient, pb.ProcessingStrategy(strategyValue))
    +    pub := publisher.New(pcClient, processingStrategy)

         // ... later ...

    -    if err := registerControllers(mgr, evaluator, pub, cfg.Policies, params.MaxConcurrentReconciles, pb.ProcessingStrategy(strategyValue)); err != nil {
    +    if err := registerControllers(mgr, evaluator, pub, cfg.Policies, params.MaxConcurrentReconciles, processingStrategy); err != nil {

Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Files ignored due to path filters (1)
data-models/pkg/protos/health_event.pb.go is excluded by `!**/*.pb.go`
Files selected for processing (26)
data-models/protobufs/health_event.proto
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/configmap.yaml
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/deployment.yaml
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml
event-exporter/pkg/transformer/cloudevents.go
event-exporter/pkg/transformer/cloudevents_test.go
fault-quarantine/pkg/evaluator/rule_evaluator_test.go
fault-quarantine/pkg/initializer/init.go
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
health-monitors/kubernetes-object-monitor/main.go
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
health-monitors/kubernetes-object-monitor/pkg/config/types.go
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
platform-connectors/pkg/connectors/kubernetes/process_node_events.go
store-client/pkg/client/mongodb_pipeline_builder.go
store-client/pkg/client/pipeline_builder.go
store-client/pkg/client/pipeline_builder_test.go
store-client/pkg/client/postgresql_pipeline_builder.go
tests/event_exporter_test.go
tests/helpers/event_exporter.go
tests/helpers/healthevent.go
tests/helpers/kube.go
tests/kubernetes_object_monitor_test.go
Additional context used
Path-based instructions (5)
**/*.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via `log/slog` in Go code
Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as `synced` over `ok` for cache sync checks
Use `client-go` for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
health-monitors/kubernetes-object-monitor/pkg/config/types.go
tests/helpers/healthevent.go
fault-quarantine/pkg/initializer/init.go
store-client/pkg/client/postgresql_pipeline_builder.go
tests/event_exporter_test.go
store-client/pkg/client/pipeline_builder.go
event-exporter/pkg/transformer/cloudevents.go
event-exporter/pkg/transformer/cloudevents_test.go
tests/kubernetes_object_monitor_test.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
fault-quarantine/pkg/evaluator/rule_evaluator_test.go
store-client/pkg/client/mongodb_pipeline_builder.go
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go
store-client/pkg/client/pipeline_builder_test.go
health-monitors/kubernetes-object-monitor/main.go
tests/helpers/kube.go
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
tests/helpers/event_exporter.go
platform-connectors/pkg/connectors/kubernetes/process_node_events.go
**/*_test.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Use `envtest` for testing Kubernetes controllers instead of fake clients
Use `testify/assert` and `testify/require` for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Files:
tests/event_exporter_test.go
event-exporter/pkg/transformer/cloudevents_test.go
tests/kubernetes_object_monitor_test.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
fault-quarantine/pkg/evaluator/rule_evaluator_test.go
store-client/pkg/client/pipeline_builder_test.go
**/values.yaml
CodeRabbit inference engine (.github/copilot-instructions.md)
**/values.yaml: Document all values in Helm chart `values.yaml` with inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable
Files:
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml
data-models/protobufs/**/*.proto
CodeRabbit inference engine (.github/copilot-instructions.md)
data-models/protobufs/**/*.proto: Define Protocol Buffer messages in `data-models/protobufs/` directory
Use semantic versioning for breaking changes in Protocol Buffer messages
Include comprehensive comments for all fields in Protocol Buffer messages
Files:
data-models/protobufs/health_event.proto
**/*.py
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code
Files:
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
Learnings (8)
Learning: 2025-11-10T10:25:19.443Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.
Applied to files:
fault-quarantine/pkg/initializer/init.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `testify/assert` and `testify/require` for assertions in Go tests
Applied to files:
tests/event_exporter_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/event_exporter_test.go
tests/kubernetes_object_monitor_test.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
tests/helpers/kube.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
tests/event_exporter_test.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods
Applied to files:
tests/event_exporter_test.go
Learning: 2025-11-04T06:31:02.147Z
Learnt from: Gyan172004
Repo: NVIDIA/NVSentinel PR: 223
File: platform-connectors/pkg/nodemetadata/processor.go:0-0
Timestamp: 2025-11-04T06:31:02.147Z
Learning: In platform-connectors/pkg/nodemetadata/processor.go, the NewProcessor function does not perform a nil check on the config parameter because the caller is expected to guarantee a non-nil config is provided.
Applied to files:
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to data-models/protobufs/**/*.proto : Define Protocol Buffer messages in `data-models/protobufs/` directory
Applied to files:
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
Code graph analysis (12)
tests/helpers/healthevent.go (2)
- health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1): ProcessingStrategy(14-17)
- data-models/pkg/protos/health_event.pb.go (4): ProcessingStrategy(43-43), ProcessingStrategy(72-74), ProcessingStrategy(76-78), ProcessingStrategy(85-87)
tests/event_exporter_test.go (1)
- tests/helpers/event_exporter.go (1): ValidateCloudEvent(257-283)
event-exporter/pkg/transformer/cloudevents_test.go (2)
- health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1): ProcessingStrategy(14-17)
- data-models/pkg/protos/health_event.pb.go (4): ProcessingStrategy(43-43), ProcessingStrategy(72-74), ProcessingStrategy(76-78), ProcessingStrategy(85-87)
tests/kubernetes_object_monitor_test.go (4)
- health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (1): New(41-46)
- health-monitors/kubernetes-object-monitor/pkg/config/types.go (1): Config(16-18)
- tests/helpers/kube.go (9): SetDeploymentArgs(2255-2288), NVSentinelNamespace(64-64), WaitForDeploymentRollout(960-1101), SetNodeConditionStatus(1709-1770), GetNodeByName(442-451), NeverWaitTimeout(62-62), WaitInterval(63-63), EventuallyWaitTimeout(61-61), RemoveDeploymentArgs(2341-2366)
- commons/pkg/auditlogger/auditlogger.go (1): Log(114-134)
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)
- data-models/pkg/protos/health_event.pb.go (16): HealthEvent(260-280), HealthEvent(293-293), HealthEvent(308-310), Entity(208-214), Entity(227-227), Entity(242-244), RecommendedAction(89-89), RecommendedAction(139-141), RecommendedAction(143-145), RecommendedAction(152-154), ProcessingStrategy(43-43), ProcessingStrategy(72-74), ProcessingStrategy(76-78), ProcessingStrategy(85-87), ProcessingStrategy_STORE_ONLY(47-47), ProcessingStrategy_EXECUTE_REMEDIATION(46-46)
store-client/pkg/client/mongodb_pipeline_builder.go (2)
- store-client/pkg/datastore/types.go (4): ToPipeline(161-163), D(131-133), E(126-128), A(136-138)
- data-models/pkg/protos/health_event.pb.go (1): ProcessingStrategy_EXECUTE_REMEDIATION(46-46)
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (3)
- data-models/pkg/protos/health_event_grpc.pb.go (1): PlatformConnectorClient(43-45)
- health-monitors/kubernetes-object-monitor/pkg/config/types.go (1): Policy(20-27)
- data-models/pkg/protos/health_event.pb.go (1): ProcessingStrategy_value(56-59)
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go (3)
- data-models/pkg/protos/health_event.pb.go (1): ProcessingStrategy_value(56-59)
- health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (1): New(41-46)
- health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1): NewManager(43-45)
store-client/pkg/client/pipeline_builder_test.go (3)
- store-client/pkg/client/pipeline_builder.go (1): PipelineBuilder(26-47)
- store-client/pkg/client/mongodb_pipeline_builder.go (1): NewMongoDBPipelineBuilder(29-31)
- store-client/pkg/client/postgresql_pipeline_builder.go (1): NewPostgreSQLPipelineBuilder(29-31)
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
- data-models/pkg/protos/health_event.pb.go (1): ProcessingStrategy_STORE_ONLY(47-47)
platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)
- data-models/pkg/protos/health_event.pb.go (7): HealthEvent(260-280), HealthEvent(293-293), HealthEvent(308-310), ProcessingStrategy(43-43), ProcessingStrategy(72-74), ProcessingStrategy(76-78), ProcessingStrategy(85-87)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
- data-models/pkg/protos/health_event.pb.go (4): ProcessingStrategy(43-43), ProcessingStrategy(72-74), ProcessingStrategy(76-78), ProcessingStrategy(85-87)
Additional comments (37)
fault-quarantine/pkg/evaluator/rule_evaluator_test.go (1)
263-263: LGTM! The test expectation correctly includes the new `processingStrategy` field with `float64(0)` (representing `EXECUTE_REMEDIATION`, the default enum value). This aligns with the proto changes where `ProcessingStrategy` defaults to 0.
data-models/protobufs/health_event.proto (2)
32-38: Well-designed enum with correct default. Good proto3 design: `EXECUTE_REMEDIATION = 0` ensures backward compatibility since unset fields default to 0 (the normal remediation behavior). The comments clearly document both strategies per coding guidelines.
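The zero-value argument above can be illustrated with a small sketch. The types here are hand-written stand-ins that only mirror the shape of protoc-generated Go code, so the example runs without the generated package; the numeric value 1 for `STORE_ONLY` is an assumption, since the review only confirms `EXECUTE_REMEDIATION = 0`.

```go
package main

import "fmt"

// ProcessingStrategy is a hand-written stand-in for the generated enum type.
type ProcessingStrategy int32

const (
	// StrategyExecuteRemediation is 0, matching EXECUTE_REMEDIATION = 0 in the proto.
	StrategyExecuteRemediation ProcessingStrategy = 0
	// StrategyStoreOnly is assumed to be 1 for this illustration.
	StrategyStoreOnly ProcessingStrategy = 1
)

// strategyNames maps enum numbers back to their proto names.
var strategyNames = map[ProcessingStrategy]string{
	StrategyExecuteRemediation: "EXECUTE_REMEDIATION",
	StrategyStoreOnly:          "STORE_ONLY",
}

// String returns the proto name, similar to what generated enums provide.
func (s ProcessingStrategy) String() string {
	if name, ok := strategyNames[s]; ok {
		return name
	}
	return fmt.Sprintf("ProcessingStrategy(%d)", int32(s))
}

// HealthEvent is a reduced, hypothetical version of the real message.
type HealthEvent struct {
	CheckName          string
	ProcessingStrategy ProcessingStrategy
}

func main() {
	// A producer that predates the new field never sets it, so the
	// consumer sees the zero value and keeps the remediation behavior.
	legacy := HealthEvent{CheckName: "node-ready"}
	fmt.Println(legacy.ProcessingStrategy)

	// Observability-only events have to opt in explicitly.
	observed := HealthEvent{CheckName: "node-ready", ProcessingStrategy: StrategyStoreOnly}
	fmt.Println(observed.ProcessingStrategy)
}
```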
77-77: Appropriate field placement. Field number 16 continues the sequential numbering after `drainOverrides = 15`. This is a non-breaking addition that maintains wire compatibility with existing messages.
tests/helpers/kube.go (2)
2208-2249: LGTM! Clean implementation of `WaitForDaemonSetRollout` that follows the same pattern as `WaitForDeploymentRollout`. The rollout completion checks (desired, updated, ready pods) are correct.
2290-2336: Approve helper implementation. The `setArgsOnContainer` helper correctly handles the three argument styles (`--flag=value`, `--flag`, `--flag value`) with appropriate insertion and update logic. The `break` statements prevent index issues during slice modification.
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (2)
14-17: LGTM! The `ProcessingStrategy` enum class and top-level constants are correctly defined, matching the protobuf definition.
Also applies to: 31-32
78-78: LGTM! The `HealthEvent` message correctly includes the new `processingStrategy` field in `__slots__`, field number constant, attribute declaration, and `__init__` signature. The type hint `_Optional[_Union[ProcessingStrategy, str]]` is appropriate for protobuf enum fields.
Also applies to: 104-104, 120-120, 138-138
tests/event_exporter_test.go (2)
25-25: LGTM! The import addition aligns with the updated test helper function signature.
85-85: LGTM! The test correctly validates the expected processing strategy value. The change properly extends the test to verify the new `processingStrategy` field in CloudEvents.
event-exporter/pkg/transformer/cloudevents.go (1)
66-66: LGTM! The addition of `processingStrategy` to the CloudEvent data payload correctly uses the `.String()` method to serialize the enum value. The placement is consistent with other health event fields.
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/configmap.yaml (1)
51-53: LGTM! The conditional rendering of `processingStrategy` follows the same pattern as the `errorCode` field above (lines 48-50) and properly quotes the value for TOML format. The optional nature ensures backward compatibility when the field is not specified.
event-exporter/pkg/transformer/cloudevents_test.go (2)
69-69: LGTM! Setting `ProcessingStrategy_STORE_ONLY` provides good test coverage for the non-default enum value.
106-108: LGTM! The validation correctly verifies that the `processingStrategy` field is properly serialized as "STORE_ONLY" in the CloudEvent data. This confirms the `.String()` method works as expected.
store-client/pkg/client/mongodb_pipeline_builder.go (1)
19-19: LGTM! The import is necessary to reference `ProcessingStrategy_EXECUTE_REMEDIATION` in the new pipeline.
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/deployment.yaml (1)
57-57: No action needed. The `processingStrategy` value has a default value of `EXECUTE_REMEDIATION` already defined in values.yaml at line 112.
fault-quarantine/pkg/initializer/init.go (1)
66-66: LGTM! Filtering STORE_ONLY events is the intended behavior. The change from `BuildAllHealthEventInsertsPipeline()` to `BuildProcessableHealthEventInsertsPipeline()` correctly enables fault-quarantine to process only events with `ProcessingStrategy = EXECUTE_REMEDIATION`, excluding observability-only `STORE_ONLY` events. This aligns with the PR objectives for conditional event processing.
Producers are properly configured to set `ProcessingStrategy` based on policy definitions, with `EXECUTE_REMEDIATION` as the default. Observability-only events must be explicitly marked as `STORE_ONLY` via policy configuration, so critical events won't be accidentally filtered.
store-client/pkg/client/pipeline_builder_test.go (1)
69-86: LGTM! Test follows established patterns. The new test for `BuildProcessableHealthEventInsertsPipeline` correctly mirrors the structure of existing pipeline tests, using table-driven approach across both MongoDB and PostgreSQL builders.
store-client/pkg/client/pipeline_builder.go (1)
35-38: LGTM! Clear interface addition with good documentation. The new method is well-documented with clear use case and filtering behavior.
tests/helpers/healthevent.go (1)
48-48: LGTM! Builder pattern follows established conventions. The `ProcessingStrategy` field and builder method follow the existing patterns in this test helper. Using `int` type provides flexibility for testing edge cases beyond the defined enum values.
Also applies to: 153-156
store-client/pkg/client/postgresql_pipeline_builder.go (1)
119-132: LGTM! Implementation correctly filters for EXECUTE_REMEDIATION strategy. The pipeline matches insert operations where `processingStrategy` equals `EXECUTE_REMEDIATION`, consistent with the interface contract and MongoDB implementation.
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)
1391-1589: Test logic validates STORE_ONLY behavior correctly, but consider migration to envtest. The test effectively validates that `STORE_ONLY` events don't create node conditions or Kubernetes events, while `EXECUTE_REMEDIATION` events do. Test cases cover the key scenarios well.
However, the test uses `fake.NewSimpleClientset()` rather than `envtest` for Kubernetes controller testing. Based on learnings, envtest is preferred for testing Kubernetes controllers.
Consider migrating this test to use `envtest` in a future refactor to align with the coding guidelines. The fake client is acceptable for unit tests but envtest provides better integration testing.
tests/helpers/event_exporter.go (2)
220-254: LGTM! Useful helper for finding events by check name. The new `FindEventByNodeAndCheckName` function provides a clean way to search for events by multiple criteria including health status.
261-261: LGTM! Validation extended to include processingStrategy. The `ValidateCloudEvent` function is correctly updated to validate the `processingStrategy` field in CloudEvent payloads. This ensures tests verify the processing strategy is properly propagated through the event pipeline.
Also applies to: 281-281
tests/kubernetes_object_monitor_test.go (4)
128-159: LGTM! Test setup correctly configures STORE_ONLY strategy.The setup properly:
- Identifies a non-KWOK test node
- Applies deployment args to enable STORE_ONLY processing
- Waits for deployment rollout before proceeding
161-188: LGTM! Correct use of require.Never for negative assertion. The test correctly validates that `STORE_ONLY` strategy does not create node annotations by using `require.Never`, which asserts the condition never becomes true within the timeout period.
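To make the difference between a negative and a positive polling assertion concrete, here is a simplified sketch using only the standard library. The `never` and `eventually` helpers approximate the semantics of testify's `require.Never` and `require.Eventually`; they are not testify's implementation, and the annotation key and the `demoWait`/`demoTick` durations are made up for the example.

```go
package main

import (
	"fmt"
	"time"
)

// Durations used by the demo; the real tests use their own timeouts.
const (
	demoWait = 50 * time.Millisecond
	demoTick = 10 * time.Millisecond
)

// never polls cond every tick until waitFor has elapsed and reports whether
// cond stayed false for the whole window.
func never(cond func() bool, waitFor, tick time.Duration) bool {
	deadline := time.Now().Add(waitFor)
	for time.Now().Before(deadline) {
		if cond() {
			return false
		}
		time.Sleep(tick)
	}
	return !cond()
}

// eventually is the positive counterpart: it succeeds as soon as cond
// becomes true and fails only if the window ends without that happening.
func eventually(cond func() bool, waitFor, tick time.Duration) bool {
	deadline := time.Now().Add(waitFor)
	for time.Now().Before(deadline) {
		if cond() {
			return true
		}
		time.Sleep(tick)
	}
	return cond()
}

func main() {
	// In STORE_ONLY mode the annotation map is expected to stay empty.
	annotations := map[string]string{}
	hasAnnotation := func() bool {
		_, ok := annotations["example.com/policy-matches"] // hypothetical key
		return ok
	}

	fmt.Println("never:", never(hasAnnotation, demoWait, demoTick))
	fmt.Println("eventually:", eventually(hasAnnotation, demoWait, demoTick))
}
```

A negative assertion always consumes the full window when it passes, which is why such tests tend to use a shorter timeout than positive ones.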
190-217: Clarify the purpose of the "Node Ready recovery clears annotation" assessment. This assessment expects an annotation to be cleared when the node becomes Ready, but in `STORE_ONLY` mode, the annotation should never have been created in the first place (as verified by the previous assessment).
Is this assessment intended to verify that:
- The annotation remains absent when the condition changes to True? (If so, `require.Never` would be more appropriate)
- Some edge case where an annotation might exist from a previous non-STORE_ONLY run?
Consider whether this assessment should use `require.Never` instead of `require.Eventually` to verify the annotation continues to not exist, or add a comment explaining why `Eventually` is appropriate here.
219-230: LGTM! Teardown properly restores deployment state.The teardown correctly removes the STORE_ONLY args and waits for the deployment to stabilize, ensuring test isolation.
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (4)
24-24: LGTM! Import for the protobuf package is correctly added to support the new `ProcessingStrategy` type.
38-45: LGTM! The `Manager` struct and constructor are correctly extended to accept and store the `processingStrategy`. The field naming follows Go conventions.
50-53: LGTM! The early return for `STORE_ONLY` strategy correctly prevents annotation updates while maintaining debug logging for observability. This aligns with the PR objective to support different event handling strategies.
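The guard described here can be sketched with an in-memory stand-in. Everything below is hypothetical: the real `Manager` talks to the Kubernetes API, and the boolean return values exist only so the example can show whether a write happened.

```go
package main

import "fmt"

// strategy is a local stand-in for the protobuf enum.
type strategy int

const (
	executeRemediation strategy = iota
	storeOnly
)

// manager is a hypothetical in-memory version of the annotation manager.
type manager struct {
	strategy    strategy
	annotations map[string]map[string]bool // node name -> matched policies
}

func newManager(s strategy) *manager {
	return &manager{strategy: s, annotations: map[string]map[string]bool{}}
}

// AddMatch records a policy match for a node, unless the manager runs in
// store-only mode, in which case it returns early without writing.
func (m *manager) AddMatch(node, policy string) bool {
	if m.strategy == storeOnly {
		return false
	}
	if m.annotations[node] == nil {
		m.annotations[node] = map[string]bool{}
	}
	m.annotations[node][policy] = true
	return true
}

// RemoveMatch applies the same guard before deleting a recorded match.
func (m *manager) RemoveMatch(node, policy string) bool {
	if m.strategy == storeOnly {
		return false
	}
	delete(m.annotations[node], policy)
	return true
}

func main() {
	observing := newManager(storeOnly)
	fmt.Println(observing.AddMatch("node-1", "node-not-ready"), len(observing.annotations))

	acting := newManager(executeRemediation)
	fmt.Println(acting.AddMatch("node-1", "node-not-ready"), len(acting.annotations))
}
```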
69-72: LGTM! Consistent implementation with `AddMatch` - the `STORE_ONLY` guard correctly skips annotation removal with appropriate debug logging.
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)
1-51: Generated protobuf code - no manual review required. This file is auto-generated by the protocol buffer compiler as indicated by the header comments. Ensure this file is regenerated from the source `.proto` file rather than manually edited. The changes correctly reflect the addition of the `ProcessingStrategy` enum and `processingStrategy` field in the `HealthEvent` message.
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (1)
36-45: LGTM! The `Publisher` struct and constructor are correctly extended to store and initialize the default `processingStrategy`. The naming convention follows Go standards.
platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)
372-416: LGTM on the overall flow. The `processHealthEvents` function correctly:
- Filters out `STORE_ONLY` events before processing
- Updates node conditions only for processable healthy/fatal events
- Creates Kubernetes events for non-healthy, non-fatal events
The separation of filtering, condition updates, and event creation is clean and maintainable.
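The three steps listed above can be sketched as a pure function that decides what would be written for a batch of events. The `event` struct and `plan` function are hypothetical simplifications; the real `processHealthEvents` operates on protobuf messages and calls the Kubernetes API.

```go
package main

import "fmt"

// event is a reduced, hypothetical view of a health event.
type event struct {
	Check     string
	StoreOnly bool
	IsHealthy bool
	IsFatal   bool
}

// plan splits a batch the way the review describes: STORE_ONLY events are
// dropped first, healthy or fatal events become node condition updates, and
// the remaining (non-healthy, non-fatal) events become Kubernetes events.
func plan(events []event) (conditions, k8sEvents []string) {
	for _, e := range events {
		switch {
		case e.StoreOnly:
			continue // stored elsewhere, nothing written to the node
		case e.IsHealthy || e.IsFatal:
			conditions = append(conditions, e.Check)
		default:
			k8sEvents = append(k8sEvents, e.Check)
		}
	}
	return conditions, k8sEvents
}

func main() {
	batch := []event{
		{Check: "gpu-xid", StoreOnly: true, IsFatal: true},
		{Check: "node-ready", IsFatal: true},
		{Check: "disk-pressure", IsHealthy: true},
		{Check: "thermal-warning"},
	}
	conditions, k8sEvents := plan(batch)
	fmt.Println("node conditions:", conditions)
	fmt.Println("kubernetes events:", k8sEvents)
}
```

Keeping the decision separate from the API calls makes the filtering rule easy to cover with a table-driven unit test.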
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go (3)
47-55: LGTM! The `Params` struct is correctly extended with `ProcessingStrategy` field. This allows the strategy to be configured externally (likely via command-line flag or environment variable based on the AI summary).
163-171: LGTM! The `registerControllers` function signature is correctly updated to accept `processingStrategy`, and it's properly passed to `annotations.NewManager`. This ensures the strategy is consistently applied throughout the controller initialization path.
83-91: Default value handling is already provided at the flag level in `main.go`. The `processingStrategyFlag` has a default value of `"EXECUTE_REMEDIATION"` (line 73), which is passed to the initializer via `*processingStrategyFlag`. This ensures that under normal execution, `params.ProcessingStrategy` will never be empty. However, the validation at line 84-86 will still correctly reject invalid or empty values if the initializer is called directly with unvalidated input. Consider adding a comment to document this dependency on the caller providing a valid value.
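The interplay between the flag default and the initializer's lookup might be sketched as follows. This is an approximation under assumptions: `parseStrategy` and the `strategyValues` map are illustrative stand-ins for the flag handling in `main.go` and for the generated `pb.ProcessingStrategy_value` map, and the value 1 for `STORE_ONLY` is assumed.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// strategyValues stands in for the generated name-to-number map.
var strategyValues = map[string]int32{
	"EXECUTE_REMEDIATION": 0,
	"STORE_ONLY":          1, // assumed number
}

// parseStrategy parses the command line and resolves the strategy name.
// The flag default covers the "not provided" case; the map lookup still
// rejects empty or unknown names supplied explicitly.
func parseStrategy(args []string) (int32, error) {
	fs := flag.NewFlagSet("kubernetes-object-monitor", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the example output
	name := fs.String("processing-strategy", "EXECUTE_REMEDIATION",
		"event handling strategy: EXECUTE_REMEDIATION or STORE_ONLY")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}

	value, ok := strategyValues[*name]
	if !ok {
		return 0, fmt.Errorf("invalid processing-strategy %q", *name)
	}
	return value, nil
}

func main() {
	for _, args := range [][]string{
		nil,
		{"--processing-strategy=STORE_ONLY"},
		{"--processing-strategy="},
	} {
		value, err := parseStrategy(args)
		fmt.Printf("args=%v value=%d err=%v\n", args, value, err)
	}
}
```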
Actionable comments posted: 4
Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Files selected for processing (1)
tests/kubernetes_object_monitor_test.go
π§° Additional context used
π Path-based instructions (2)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
tests/kubernetes_object_monitor_test.go
**/*_test.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Use `envtest` for testing Kubernetes controllers instead of fake clients
Use `testify/assert` and `testify/require` for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Files:
tests/kubernetes_object_monitor_test.go
Learnings (2)
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/kubernetes_object_monitor_test.go
Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
tests/kubernetes_object_monitor_test.go
Code graph analysis (1)
tests/kubernetes_object_monitor_test.go (4)
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (1)
New(41-46)
tests/helpers/kube.go (7)
SetDeploymentArgs(2255-2288), NVSentinelNamespace(64-64), WaitForDeploymentRollout(960-1101), SetNodeConditionStatus(1709-1770), GetNodeByName(442-451), EventuallyWaitTimeout(61-61), WaitInterval(63-63)
tests/helpers/fault_quarantine.go (2)
AssertQuarantineState(315-382), QuarantineAssertion(56-60)
tests/helpers/event_exporter.go (3)
GetMockEvents(36-99), FindEventByNodeAndCheckName(221-254), ValidateCloudEvent(257-283)
Additional comments (1)
tests/kubernetes_object_monitor_test.go (1)
24-24: LGTM! The import is correctly added to support the helper functions used in the new test.
f6e73ed to 8fc966f
Actionable comments posted: 2
Duplicate comments (1)
tests/kubernetes_object_monitor_test.go (1)
191-218: Redundant annotation clearing check. This assess step waits for the annotation to be cleared, but the previous assess (lines 171-187) already verified that the annotation is never set when using the STORE_ONLY strategy. Waiting for an annotation to be cleared when it was never created is unnecessary and adds confusion to the test flow.
Consider removing this entire assess step, or if you want to keep it for defensive reasons, add a comment explaining why this check is necessary despite the earlier assertion that the annotation is never set.
Suggested simplification
- feature.Assess("Node Ready recovery clears annotation", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
-     client, err := c.NewClient()
-     require.NoError(t, err)
-
-     nodeName := ctx.Value(k8sMonitorKeyNodeName).(string)
-     t.Logf("Setting TestCondition to True on node %s", nodeName)
-
-     helpers.SetNodeConditionStatus(ctx, t, client, nodeName, v1.NodeConditionType(testConditionType), v1.ConditionTrue)
-
-     t.Log("Waiting for policy match annotation to be cleared")
-     require.Eventually(t, func() bool {
-         node, err := helpers.GetNodeByName(ctx, client, nodeName)
-         if err != nil {
-             t.Logf("Failed to get node: %v", err)
-             return false
-         }
-
-         annotation, exists := node.Annotations[annotationKey]
-         if exists && annotation != "" {
-             t.Logf("Annotation still exists: %s", annotation)
-             return false
-         }
-
-         return true
-     }, helpers.EventuallyWaitTimeout, helpers.WaitInterval)
-
-     return ctx
- })
-
Nitpick comments (4)
tests/helpers/kube.go (1)
387-409: LGTM! Function correctly implements negative event assertion. The logic properly uses `require.Never` to ensure the specified event type and reason do not appear on the node.
Optional: Consider improving log message clarity
At line 405, the log message could include both `eventType` and `eventReason` for consistency with the assertion message at line 408:
- t.Logf("node %s does not have event %v", nodeName, eventType)
+ t.Logf("node %s does not have event type=%s reason=%s", nodeName, eventType, eventReason)
tests/platform-connector_test.go (1)
28-32: Remove unused struct fields. The `ConfigMapBackup` and `TestNamespace` fields are declared but never used anywhere in the test. Consider removing them to keep the code clean.
Proposed cleanup
  type PlatformConnectorTestContext struct {
      NodeName string
-     ConfigMapBackup []byte
-     TestNamespace string
  }
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go (2)
84-87: Consider case-insensitive processing strategy validation. The validation is case-sensitive and requires an exact match with the proto enum names ("EXECUTE_REMEDIATION" or "STORE_ONLY"). This could lead to user confusion if they provide lowercase values like "store_only".
Consider normalizing the input with `strings.ToUpper()` before validation, or enhance the error message to list the valid values.
Suggested improvement
- strategyValue, ok := pb.ProcessingStrategy_value[params.ProcessingStrategy]
+ strategyValue, ok := pb.ProcessingStrategy_value[strings.ToUpper(params.ProcessingStrategy)]
  if !ok {
-     return nil, fmt.Errorf("unexpected processingStrategy value: %q", params.ProcessingStrategy)
+     return nil, fmt.Errorf("unexpected processingStrategy value: %q (valid values: EXECUTE_REMEDIATION, STORE_ONLY)", params.ProcessingStrategy)
  }
54-54: Document the ProcessingStrategy field. The new `ProcessingStrategy` field in the `Params` struct would benefit from a comment explaining its purpose and valid values, especially since it's part of a public API.
  MaxConcurrentReconciles int
  PlatformConnectorSocket string
+ // ProcessingStrategy determines how health events are processed.
+ // Valid values: "EXECUTE_REMEDIATION" (default behavior with remediation actions)
+ // "STORE_ONLY" (events are stored but no remediation is triggered)
  ProcessingStrategy string
}
Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Files selected for processing (19)
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/configmap.yaml
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/deployment.yaml
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml
distros/kubernetes/nvsentinel/values-tilt-postgresql.yaml
docs/postgresql-schema.sql
fault-quarantine/pkg/evaluator/rule_evaluator_test.go
fault-quarantine/pkg/initializer/init.go
health-monitors/kubernetes-object-monitor/main.go
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
health-monitors/kubernetes-object-monitor/pkg/config/types.go
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go
tests/event_exporter_test.go
tests/fault_quarantine_test.go
tests/helpers/event_exporter.go
tests/helpers/healthevent.go
tests/helpers/kube.go
tests/kubernetes_object_monitor_test.go
tests/platform-connector_test.go
Files skipped from review as they are similar to previous changes (7)
- fault-quarantine/pkg/evaluator/rule_evaluator_test.go
- distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/deployment.yaml
- health-monitors/kubernetes-object-monitor/main.go
- health-monitors/kubernetes-object-monitor/pkg/config/types.go
- tests/event_exporter_test.go
- health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go
- health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
Additional context used
Path-based instructions (3)
**/values.yaml
CodeRabbit inference engine (.github/copilot-instructions.md)
**/values.yaml: Document all values in Helm chart `values.yaml` with inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable
Files:
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml
**/*.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via `log/slog` in Go code
Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as `synced` over `ok` for cache sync checks
Use `client-go` for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
tests/helpers/healthevent.go
tests/fault_quarantine_test.go
tests/helpers/event_exporter.go
fault-quarantine/pkg/initializer/init.go
tests/kubernetes_object_monitor_test.go
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go
tests/helpers/kube.go
tests/platform-connector_test.go
**/*_test.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Use `envtest` for testing Kubernetes controllers instead of fake clients
Use `testify/assert` and `testify/require` for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Files:
tests/fault_quarantine_test.go
tests/kubernetes_object_monitor_test.go
tests/platform-connector_test.go
Learnings (8)
Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.
Applied to files:
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml
tests/helpers/event_exporter.go
tests/platform-connector_test.go
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/configmap.yaml
Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".
Applied to files:
tests/fault_quarantine_test.go
tests/kubernetes_object_monitor_test.go
Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
tests/fault_quarantine_test.go
tests/kubernetes_object_monitor_test.go
tests/platform-connector_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/fault_quarantine_test.go
tests/kubernetes_object_monitor_test.go
tests/helpers/kube.go
tests/platform-connector_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
tests/fault_quarantine_test.go
Learning: 2025-11-10T10:25:19.443Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.
Applied to files:
fault-quarantine/pkg/initializer/init.go
Learning: 2025-11-04T06:31:02.147Z
Learnt from: Gyan172004
Repo: NVIDIA/NVSentinel PR: 223
File: platform-connectors/pkg/nodemetadata/processor.go:0-0
Timestamp: 2025-11-04T06:31:02.147Z
Learning: In platform-connectors/pkg/nodemetadata/processor.go, the NewProcessor function does not perform a nil check on the config parameter because the caller is expected to guarantee a non-nil config is provided.
Applied to files:
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go
tests/platform-connector_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Applied to files:
tests/platform-connector_test.go
Code graph analysis (5)
tests/helpers/healthevent.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-17)
data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(43-43), ProcessingStrategy(72-74), ProcessingStrategy(76-78), ProcessingStrategy(85-87)
tests/fault_quarantine_test.go (4)
tests/helpers/fault_quarantine.go (4)
QuarantineTestContext(51-54), SetupQuarantineTest(107-112), AssertQuarantineState(315-382), QuarantineAssertion(56-60)
tests/helpers/kube.go (1)
SetNodeManagedByNVSentinel(1389-1408)
tests/helpers/healthevent.go (3)
NewHealthEvent(60-76), SendHealthEvent(263-275), SendHealthyEvent(277-287)
data-models/pkg/protos/health_event.pb.go (2)
ProcessingStrategy_STORE_ONLY(47-47), ProcessingStrategy_EXECUTE_REMEDIATION(46-46)
tests/kubernetes_object_monitor_test.go (1)
tests/helpers/kube.go (9)
SetDeploymentArgs(2279-2312), NVSentinelNamespace(64-64), WaitForDeploymentRollout(984-1125), SetNodeConditionStatus(1733-1794), GetNodeByName(466-475), NeverWaitTimeout(62-62), WaitInterval(63-63), EventuallyWaitTimeout(61-61), RemoveDeploymentArgs(2365-2390)
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go (3)
data-models/pkg/protos/health_event.pb.go (1)
ProcessingStrategy_value(56-59)
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (1)
New(41-46)
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
NewManager(43-45)
tests/platform-connector_test.go (4)
tests/helpers/kube.go (5)
SelectTestNodeFromUnusedPool(413-437), EnsureNodeConditionNotPresent(1797-1818), EnsureNodeEventNotPresent(387-409), CheckNodeConditionExists(1283-1300), CheckNodeEventExists(1325-1353)
tests/helpers/healthevent.go (3)
NewHealthEvent(60-76), SendHealthEvent(263-275), SendHealthyEvent(277-287)
tests/helpers/health_events_analyzer.go (2)
ERRORCODE_79(35-35), ERRORCODE_31(32-32)
data-models/pkg/protos/health_event.pb.go (2)
ProcessingStrategy_STORE_ONLY(47-47), ProcessingStrategy_EXECUTE_REMEDIATION(46-46)
Additional comments (16)
tests/helpers/kube.go (3)
2232-2273: LGTM! DaemonSet rollout verification is correct. The function properly checks all relevant status fields (DesiredNumberScheduled, UpdatedNumberScheduled, NumberReady) to verify rollout completion, with appropriate logging at each stage.
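The completion condition described in that comment can be expressed as a small predicate. This is a sketch under stated assumptions: `daemonSetStatus` is a hypothetical, trimmed-down stand-in for the Kubernetes DaemonSet status type, `rolloutComplete` is an invented name, and treating a desired count of zero as incomplete is a choice made here, not something the review states.

```go
package main

import "fmt"

// daemonSetStatus is a hypothetical stand-in carrying only the three
// status fields named in the review comment.
type daemonSetStatus struct {
	DesiredNumberScheduled int32
	UpdatedNumberScheduled int32
	NumberReady            int32
}

// rolloutComplete reports whether every desired pod is both updated and
// ready. Treating a zero desired count as "not complete" is an
// assumption of this sketch.
func rolloutComplete(s daemonSetStatus) bool {
	return s.DesiredNumberScheduled > 0 &&
		s.UpdatedNumberScheduled == s.DesiredNumberScheduled &&
		s.NumberReady == s.DesiredNumberScheduled
}

func main() {
	fmt.Println(rolloutComplete(daemonSetStatus{3, 3, 3}))
	fmt.Println(rolloutComplete(daemonSetStatus{3, 3, 2}))
}
```

A wait helper would typically poll such a predicate until it returns true or a timeout expires.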
2275-2360: LGTM! Deployment args manipulation is well-implemented. The function correctly uses `retry.RetryOnConflict` without wrapping errors (as per coding guidelines), and `setArgsOnContainer` properly handles the three argument formats: `--flag=value`, `--flag value`, and `--flag` (boolean).
2362-2417: LGTM! Argument removal logic is correct. The docstring correctly states "removes container arguments" and the implementation properly handles removal of arguments in different formats (--flag=value, --flag value, --flag). The function correctly uses `retry.RetryOnConflict` without wrapping errors.
Note: The past review comment about an incorrect docstring referencing environment variables appears to have been resolved in the current code.
tests/helpers/healthevent.go (2)
48-48: LGTM! The `ProcessingStrategy` field is correctly defined with an appropriate type and JSON tag, consistent with other optional fields in the struct.
153-156: LGTM! The builder method follows the established pattern and enables fluent chaining, consistent with all other `With*` methods in this helper.
tests/helpers/event_exporter.go (2)
220-254: LGTM! The `FindEventByNodeAndCheckName` function correctly implements CloudEvent search by nodeName, checkName, and isHealthy status, following the same defensive pattern as the existing `FindEventByNodeAndMessage` function.
256-283: LGTM! The `ValidateCloudEvent` function is correctly extended to validate the `processingStrategy` field, following the same assertion pattern as other health event fields.
distros/kubernetes/nvsentinel/values-tilt-postgresql.yaml (1)
219-222: LGTM! The `processing_strategy` column is correctly added to the `health_events` table schema with an appropriate VARCHAR(50) type. The column is nullable, which aligns with the optional nature of this field in the proto definition and JSON serialization.
fault-quarantine/pkg/initializer/init.go (1)
66-66: LGTM! Correctly switches to `BuildProcessableHealthEventInsertsPipeline()` to ensure fault-quarantine only processes `EXECUTE_REMEDIATION` events, filtering out `STORE_ONLY` observability events.
docs/postgresql-schema.sql (1)
106-109: LGTM! The `processing_strategy` column is correctly added to the canonical PostgreSQL schema and matches the corresponding change in `values-tilt-postgresql.yaml`, maintaining consistency between the two schema definitions as required.
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/configmap.yaml (1)
51-53: LGTM! The `processingStrategy` field is correctly rendered in the ConfigMap template with proper conditional logic, indentation, and quoting, following the same pattern as the `errorCode` field.
tests/fault_quarantine_test.go (4)
26-26: LGTM! The import of the `protos` package is necessary to reference the `ProcessingStrategy` enum constants used in the test.
233-250: LGTM! The test setup correctly initializes the test context and marks the node as managed by NVSentinel, which is required for the quarantine behavior being tested.
252-286: LGTM! The test assessments correctly verify that `STORE_ONLY` events do not trigger quarantine while `EXECUTE_REMEDIATION` events do, effectively validating the core behavior of the processing strategy feature.
288-294: LGTM! The teardown properly cleans up by sending a healthy event and calling the standard teardown helper.
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml (1)
107-112: LGTM! The `processingStrategy` field is well-documented with clear explanations of both modes and sensible defaults. The default value `EXECUTE_REMEDIATION` maintains backward compatibility while enabling the new observability-only mode when needed.
8fc966f to 265ed4b
Actionable comments posted: 1
Fix all issues with AI agents
In @tests/platform-connector_test.go:
- Around line 82-93: The test is calling CheckNodeConditionExists and
CheckNodeEventExists but ignoring their return values, so assertions are
ineffective; update the test to capture each call's boolean/result and use
t.Fatalf/t.Errorf or require/assert to fail the test when the check returns
false (e.g., assign condOk := helpers.CheckNodeConditionExists(...); if !condOk
{ t.Fatalf("expected condition SysLogsXIDError on %s", testCtx.NodeName) }) and
do the same for the event call (eventOk := helpers.CheckNodeEventExists(...)) so
failures are reported and the test actually validates the expected state.
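The pattern the prompt asks for, capturing the check's result and failing on false, can be sketched outside the test suite as follows. Everything here is a stand-in: `checkNodeConditionExists` is a hypothetical simplification (the real `helpers.CheckNodeConditionExists` signature is not shown in this review), and the `fataler` interface exists only so the sketch runs without a `*testing.T`, which also satisfies it.

```go
package main

import "fmt"

// fataler is the subset of *testing.T that this sketch needs.
type fataler interface {
	Fatalf(format string, args ...any)
}

// checkNodeConditionExists is a hypothetical stand-in for the helper in
// tests/helpers/kube.go; the real signature may differ.
func checkNodeConditionExists(conditions map[string]bool, name string) bool {
	return conditions[name]
}

// assertCondition captures the boolean result instead of discarding it,
// so a missing condition actually fails the test.
func assertCondition(t fataler, conditions map[string]bool, node, name string) {
	if condOk := checkNodeConditionExists(conditions, name); !condOk {
		t.Fatalf("expected condition %s on %s", name, node)
	}
}

// recorder collects failures so the sketch can run as a plain program.
type recorder struct{ failures []string }

func (r *recorder) Fatalf(format string, args ...any) {
	r.failures = append(r.failures, fmt.Sprintf(format, args...))
}

func main() {
	r := &recorder{}
	assertCondition(r, map[string]bool{}, "node-1", "SysLogsXIDError")
	fmt.Println(r.failures)
}
```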
Nitpick comments (2)
tests/platform-connector_test.go (1)
28-32: Remove unused struct fields. `ConfigMapBackup` and `TestNamespace` are defined but never used in this test. Consider removing them to avoid confusion.
Suggested refactor
  type PlatformConnectorTestContext struct {
      NodeName string
-     ConfigMapBackup []byte
-     TestNamespace string
  }
tests/kubernetes_object_monitor_test.go (1)
191-218: Test assertion is trivially satisfied. Since the previous `Assess` block verified that the annotation was never created with `STORE_ONLY`, this assertion for "annotation cleared" will pass trivially because the annotation doesn't exist to begin with. Consider either:
- Renaming the assess description to "Node Ready maintains no annotation"
- Adding explicit logging to clarify the expected behavior
This doesn't affect test correctness but could be misleading when reviewing test results.
Suggested clarification
- feature.Assess("Node Ready recovery clears annotation", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
+ feature.Assess("Node Ready maintains no annotation under STORE_ONLY", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Files selected for processing (16)
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/configmap.yaml
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/deployment.yaml
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml
distros/kubernetes/nvsentinel/values-tilt-postgresql.yaml
docs/postgresql-schema.sql
health-monitors/kubernetes-object-monitor/main.go
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
health-monitors/kubernetes-object-monitor/pkg/config/types.go
health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go
tests/event_exporter_test.go
tests/helpers/event_exporter.go
tests/helpers/kube.go
tests/kubernetes_object_monitor_test.go
tests/platform-connector_test.go
Files skipped from review as they are similar to previous changes (9)
- tests/event_exporter_test.go
- health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
- docs/postgresql-schema.sql
- health-monitors/kubernetes-object-monitor/main.go
- tests/helpers/event_exporter.go
- distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/configmap.yaml
- distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml
- distros/kubernetes/nvsentinel/values-tilt-postgresql.yaml
- distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/deployment.yaml
Additional context used
Path-based instructions (2)
**/*.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via `log/slog` in Go code
Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as `synced` over `ok` for cache sync checks
Use `client-go` for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go
health-monitors/kubernetes-object-monitor/pkg/config/types.go
health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go
tests/helpers/kube.go
tests/platform-connector_test.go
tests/kubernetes_object_monitor_test.go
**/*_test.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Use `envtest` for testing Kubernetes controllers instead of fake clients
Use `testify/assert` and `testify/require` for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Files:
health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go
tests/platform-connector_test.go
tests/kubernetes_object_monitor_test.go
Learnings (9)
Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
Applied to files:
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go
Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.
Applied to files:
health-monitors/kubernetes-object-monitor/pkg/config/types.go
tests/platform-connector_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go
tests/kubernetes_object_monitor_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go
Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go
tests/platform-connector_test.go
tests/kubernetes_object_monitor_test.go
Learning: 2025-11-04T06:31:02.147Z
Learnt from: Gyan172004
Repo: NVIDIA/NVSentinel PR: 223
File: platform-connectors/pkg/nodemetadata/processor.go:0-0
Timestamp: 2025-11-04T06:31:02.147Z
Learning: In platform-connectors/pkg/nodemetadata/processor.go, the NewProcessor function does not perform a nil check on the config parameter because the caller is expected to guarantee a non-nil config is provided.
Applied to files:
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go
tests/platform-connector_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Applied to files:
tests/platform-connector_test.go
Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".
Applied to files:
tests/kubernetes_object_monitor_test.go
Code graph analysis (4)
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (3)
  data-models/pkg/protos/health_event_grpc.pb.go (1)
    PlatformConnectorClient(43-45)
  health-monitors/kubernetes-object-monitor/pkg/config/types.go (1)
    Policy(20-27)
  data-models/pkg/protos/health_event.pb.go (1)
    ProcessingStrategy_value(59-63)
health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go (2)
  health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
    NewManager(43-45)
  data-models/pkg/protos/health_event.pb.go (1)
    ProcessingStrategy_EXECUTE_REMEDIATION(48-48)
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go (3)
  data-models/pkg/protos/health_event.pb.go (5)
    ProcessingStrategy(44-44), ProcessingStrategy(76-78), ProcessingStrategy(80-82), ProcessingStrategy(89-91), ProcessingStrategy_value(59-63)
  health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (1)
    New(41-46)
  health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
    NewManager(43-45)
tests/platform-connector_test.go (4)
  tests/helpers/kube.go (5)
    SelectTestNodeFromUnusedPool(413-437), EnsureNodeConditionNotPresent(1797-1818), EnsureNodeEventNotPresent(387-409), CheckNodeConditionExists(1283-1300), CheckNodeEventExists(1325-1353)
  tests/helpers/healthevent.go (3)
    NewHealthEvent(60-76), SendHealthEvent(263-275), SendHealthyEvent(277-287)
  tests/helpers/health_events_analyzer.go (2)
    ERRORCODE_79(35-35), ERRORCODE_31(32-32)
  data-models/pkg/protos/health_event.pb.go (2)
    ProcessingStrategy_STORE_ONLY(49-49), ProcessingStrategy_EXECUTE_REMEDIATION(48-48)
Additional comments (21)
health-monitors/kubernetes-object-monitor/pkg/config/types.go (1)
49-50: LGTM! The new `ProcessingStrategy` field in `HealthEventSpec` allows per-policy override of the global processing strategy. The string type is appropriate for TOML configuration and will be validated/mapped to the protobuf enum in the publisher.
tests/platform-connector_test.go (1)
34-51: LGTM on test setup structure. The test setup correctly selects a test node from the unused pool and stores it in the context for use in subsequent assess/teardown phases.
tests/helpers/kube.go (4)
2232-2273: LGTM! `WaitForDaemonSetRollout` is well-implemented with proper checks for `DesiredNumberScheduled`, `UpdatedNumberScheduled`, and `NumberReady`. Good use of `t.Helper()` and descriptive logging.
2275-2337: LGTM! `SetDeploymentArgs` and `setArgsOnContainer` correctly handle both `--flag=value` and `--flag value` styles. The use of `retry.RetryOnConflict` ensures safe updates under concurrent access. As per coding guidelines, errors are returned without wrapping within the retry block to preserve retry behavior.
2339-2369: LGTM! `tryUpdateExistingArg` handles the complexity of updating args in both `--flag=value` and `--flag value` styles correctly. The slice insertion logic at line 2361 properly handles the case where a value needs to be inserted after a standalone flag.
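To make the two argument styles concrete, here is a minimal sketch of updating a flag in a container's args slice. `setArg` is a hypothetical helper written for illustration, not the PR's `tryUpdateExistingArg`; it assumes a non-empty value and that values never start with `-`.

```go
package main

import (
	"fmt"
	"strings"
)

// setArg is a hypothetical helper, not the PR's implementation.
// It sets flag to value in args, handling both "--flag=value" and
// "--flag value" styles. If the flag is absent, "--flag=value" is appended.
// The input slice is modified in place.
func setArg(args []string, flag, value string) []string {
	for i, arg := range args {
		switch {
		case strings.HasPrefix(arg, flag+"="):
			// "--flag=old" style: rewrite the single element.
			args[i] = flag + "=" + value
			return args
		case arg == flag:
			if i+1 < len(args) && !strings.HasPrefix(args[i+1], "-") {
				// "--flag old" style: overwrite the separate value.
				args[i+1] = value
			} else {
				// Bare flag: fold the value into the same element,
				// which avoids inserting into the middle of the slice.
				args[i] = flag + "=" + value
			}
			return args
		}
	}
	// Flag not present: append it in "--flag=value" style.
	return append(args, flag+"="+value)
}

func main() {
	args := []string{"--verbose", "--processing-strategy", "EXECUTE_REMEDIATION"}
	fmt.Println(setArg(args, "--processing-strategy", "STORE_ONLY"))
}
```

Folding a bare flag into `--flag=value` is a design choice of this sketch; the reviewed code inserts the value as a separate element instead.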
2371-2433: LGTM! `RemoveDeploymentArgs` and `removeArgsFromContainer` correctly handle removal of both arg styles. The logic properly removes the flag and its associated value when using `--flag value` style (lines 2423-2424).
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (2)
36-46: LGTM! The `Publisher` struct now properly stores the `processingStrategy` and the constructor correctly initializes it. This aligns with the initialization flow that propagates the strategy from CLI flags through to the publisher.
48-74: LGTM! The strategy resolution logic is well-designed:
- Uses the publisher's default strategy
- Allows per-policy override via `policy.HealthEvent.ProcessingStrategy`
- Validates override values against the protobuf enum map
- Fails fast with a clear error for invalid values
The `ProcessingStrategy` is correctly included in the `HealthEvent` payload.
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go (4)
54-54: LGTM! Adding `ProcessingStrategy` to `Params` allows the CLI flag value to be propagated through initialization.
83-91: LGTM! The validation logic correctly:
- Looks up the strategy string in `pb.ProcessingStrategy_value`
- Returns a clear error for invalid values
- Logs the configured strategy for observability
- Creates the publisher with the validated strategy value
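The lookup-and-fail-fast pattern described above can be sketched as follows. This is an illustration under assumptions, not the PR's code: the local `ProcessingStrategy` type and `strategyByName` map stand in for the generated `pb.ProcessingStrategy` and `pb.ProcessingStrategy_value`, the `UNSPECIFIED` name and the numeric value of `STORE_ONLY` are assumed, and `parseStrategy` / `resolveStrategy` are hypothetical names.

```go
package main

import "fmt"

// ProcessingStrategy stands in for the protobuf-generated enum type.
type ProcessingStrategy int32

const (
	StrategyUnspecified        ProcessingStrategy = 0 // assumed name and value
	StrategyExecuteRemediation ProcessingStrategy = 1
	StrategyStoreOnly          ProcessingStrategy = 2 // assumed value
)

// strategyByName mirrors the shape of a generated <Enum>_value map.
var strategyByName = map[string]int32{
	"UNSPECIFIED":         0,
	"EXECUTE_REMEDIATION": 1,
	"STORE_ONLY":          2,
}

// parseStrategy validates a strategy name against the enum map and
// returns an error for unknown names instead of silently defaulting.
func parseStrategy(name string) (ProcessingStrategy, error) {
	value, ok := strategyByName[name]
	if !ok {
		return StrategyUnspecified, fmt.Errorf("invalid processing strategy %q", name)
	}
	return ProcessingStrategy(value), nil
}

// resolveStrategy returns the default unless a per-policy override is set,
// in which case the override must be a valid enum name.
func resolveStrategy(def ProcessingStrategy, override string) (ProcessingStrategy, error) {
	if override == "" {
		return def, nil
	}
	return parseStrategy(override)
}

func main() {
	def, err := parseStrategy("EXECUTE_REMEDIATION")
	fmt.Println(def, err)

	effective, err := resolveStrategy(def, "STORE_ONLY")
	fmt.Println(effective, err)

	_, err = resolveStrategy(def, "store_only")
	fmt.Println(err)
}
```

The lookup is an exact, case-sensitive match, so a lowercase name is rejected rather than normalized.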
116-117: LGTM! The processing strategy is correctly passed to `registerControllers`, ensuring the annotation manager receives the same strategy configuration.
170-172: LGTM! The `registerControllers` function signature is updated to accept `processingStrategy` and correctly passes it to `annotations.NewManager`. This ensures the annotation manager respects the configured strategy when deciding whether to update/remove annotations.
tests/kubernetes_object_monitor_test.go (4)
24-25: LGTM! Import added correctly for the helpers package.
128-160: LGTM! The test setup correctly:
- Finds a non-KWOK node for testing
- Sets the `--processing-strategy=STORE_ONLY` deployment arg
- Waits for the deployment rollout to complete
This ensures the kubernetes-object-monitor is running with the STORE_ONLY strategy before assertions.
162-189: LGTM! Good use of `require.Never` to assert that with `STORE_ONLY` strategy, no policy match annotation is applied when the node condition becomes unhealthy. This validates the core STORE_ONLY behavior.
220-232: LGTM! Proper teardown that removes the `--processing-strategy` arg and waits for deployment rollout, restoring the original state for subsequent tests.
health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go (5)
37-37: LGTM! Correct import of the protos package to access `ProcessingStrategy_EXECUTE_REMEDIATION`.
576-576: LGTM! Updated to pass `protos.ProcessingStrategy_EXECUTE_REMEDIATION` to `annotations.NewManager`, matching the new constructor signature. Using `EXECUTE_REMEDIATION` preserves the expected behavior for existing tests.
631-631: LGTM! Consistent with the other setup functions.
739-739: LGTM! Consistent update in `restartReconciler`.
777-777: LGTM! Consistent update in `restartReconcilerWithCRD`. All test setup paths now correctly pass the processing strategy to the annotation manager.
265ed4b to
112a762
Compare
Actionable comments posted: 0
Nitpick comments (4)
tests/helpers/kube.go (1)
2339-2369: Consider simplifying the `--flag value` update logic. The slice insertion at line 2361 is complex. When the flag exists without a value (e.g., `--verbose`) and you want to add a value, inserting into the middle of a slice with `append(container.Args[:j+1], append([]string{value}, container.Args[j+1:]...)...)` works but is hard to follow.
Also, at line 2358-2359, if the next argument starts with `-`, you insert the value after the flag. However, if `--flag` was originally a boolean flag (no value), this insertion may be unintended.
Suggested simplification
Consider documenting the expected behavior more explicitly or simplifying by always using `--flag=value` style when updating:

     // Match --flag or --flag value style
     if existingArg == flag {
         if value != "" {
             if j+1 < len(container.Args) && !strings.HasPrefix(container.Args[j+1], "-") {
    +            // Update existing separate value
                 container.Args[j+1] = value
             } else {
    -            container.Args = append(container.Args[:j+1], append([]string{value}, container.Args[j+1:]...)...)
    +            // Convert to --flag=value style for simplicity
    +            container.Args[j] = flag + "=" + value
             }
         }
         return true
     }

tests/kubernetes_object_monitor_test.go (3)
162-189: Test name doesn't match assertion. The assess title says "triggers health event" but the test only verifies that annotations are NOT applied (STORE_ONLY behavior). Consider renaming to better reflect what's being tested, e.g., "Node NotReady does not apply annotation with STORE_ONLY strategy".
Suggested rename

    - feature.Assess("Node NotReady triggers health event with STORE_ONLY strategy", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
    + feature.Assess("Node NotReady does not apply annotation with STORE_ONLY strategy", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {

191-218: Misleading test semantics in STORE_ONLY context. In STORE_ONLY mode, the annotation is never applied (validated by the previous assess). This test checking "annotation to be cleared" is semantically misleading since there's nothing to clear. The test will pass immediately because the annotation doesn't exist.
Consider either:
- Removing this assess since it's redundant in STORE_ONLY mode, or
- Renaming to clarify intent, e.g., "Node Ready recovery keeps annotation absent with STORE_ONLY strategy"
Suggested rename if keeping the test

    - feature.Assess("Node Ready recovery clears annotation", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
    + feature.Assess("Node Ready recovery keeps annotation absent with STORE_ONLY strategy", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {

    - t.Log("Waiting for policy match annotation to be cleared")
    + t.Log("Verifying policy match annotation remains absent")

220-232: Consider adding node condition cleanup to teardown. The teardown correctly restores the deployment args. However, if the test fails midway (e.g., after setting TestCondition to False but before recovery), the node condition might be left in an unhealthy state, potentially affecting other tests.
Suggested addition for robustness

     feature.Teardown(func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
         client, err := c.NewClient()
         require.NoError(t, err)

    +    // Ensure node condition is restored to healthy state
    +    nodeName := ctx.Value(k8sMonitorKeyNodeName).(string)
    +    helpers.SetNodeConditionStatus(ctx, t, client, nodeName, v1.NodeConditionType(testConditionType), v1.ConditionTrue)
    +
         err = helpers.RemoveDeploymentArgs(ctx, client, "kubernetes-object-monitor", helpers.NVSentinelNamespace, "", map[string]string{
             "--processing-strategy": "STORE_ONLY",
         })
         require.NoError(t, err)

         helpers.WaitForDeploymentRollout(ctx, t, client, "kubernetes-object-monitor", helpers.NVSentinelNamespace)

         return ctx
     })
Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Files selected for processing (6)
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go
tests/helpers/event_exporter.go
tests/helpers/kube.go
tests/kubernetes_object_monitor_test.go
Additional context used
Path-based instructions (2)
**/*.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via `log/slog` in Go code
Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as `synced` over `ok` for cache sync checks
Use `client-go` for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
tests/kubernetes_object_monitor_test.go
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
tests/helpers/event_exporter.go
health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go
tests/helpers/kube.go
**/*_test.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Use `envtest` for testing Kubernetes controllers instead of fake clients
Use `testify/assert` and `testify/require` for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Files:
tests/kubernetes_object_monitor_test.go
health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go
Learnings (8)
Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".
Applied to files:
tests/kubernetes_object_monitor_test.go
Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
tests/kubernetes_object_monitor_test.go
health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/kubernetes_object_monitor_test.go
health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to data-models/protobufs/**/*.proto : Define Protocol Buffer messages in `data-models/protobufs/` directory
Applied to files:
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.
Applied to files:
tests/helpers/event_exporter.go
Learning: 2025-11-04T06:31:02.147Z
Learnt from: Gyan172004
Repo: NVIDIA/NVSentinel PR: 223
File: platform-connectors/pkg/nodemetadata/processor.go:0-0
Timestamp: 2025-11-04T06:31:02.147Z
Learning: In platform-connectors/pkg/nodemetadata/processor.go, the NewProcessor function does not perform a nil check on the config parameter because the caller is expected to guarantee a non-nil config is provided.
Applied to files:
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go
Code graph analysis (3)
tests/kubernetes_object_monitor_test.go (1)
  tests/helpers/kube.go (8)
    SetDeploymentArgs(2279-2313), NVSentinelNamespace(64-64), WaitForDeploymentRollout(984-1125), SetNodeConditionStatus(1733-1794), GetNodeByName(466-475), NeverWaitTimeout(62-62), WaitInterval(63-63), EventuallyWaitTimeout(61-61)
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
  data-models/pkg/protos/health_event.pb.go (5)
    ProcessingStrategy(44-44), ProcessingStrategy(76-78), ProcessingStrategy(80-82), ProcessingStrategy(89-91), ProcessingStrategy_STORE_ONLY(49-49)
health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go (1)
  data-models/pkg/protos/health_event.pb.go (1)
    ProcessingStrategy_EXECUTE_REMEDIATION(48-48)
Additional comments (17)
tests/helpers/event_exporter.go (2)
220-255: LGTM! The new `FindEventByNodeAndCheckName` function follows the established pattern of `FindEventByNodeAndMessage` and correctly searches for events matching `nodeName`, `checkName`, and `isHealthy` status. The type assertions and nil checks are consistent with the existing code.
257-283: LGTM! The `ValidateCloudEvent` function correctly validates the new `processingStrategy` field within the `healthEvent` payload. The addition aligns with the PR's objective to propagate processing strategy through CloudEvents.
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (4)
24-24: LGTM! Import alias `pb` for the protos package follows consistent naming conventions used elsewhere in the codebase.
38-45: LGTM! The `Manager` struct extension and constructor update correctly propagate the `processingStrategy` configuration. The field is properly typed as `pb.ProcessingStrategy` enum.
47-65: LGTM! The guard in `AddMatch` correctly prevents annotation updates when `STORE_ONLY` strategy is configured. The debug logging provides good observability for troubleshooting.
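The guard pattern described here can be sketched in isolation. This is a simplified stand-in, not the PR's `Manager`: an in-memory map replaces real node annotations, the enum is declared locally, and the numeric value of `StoreOnly` is an assumption.

```go
package main

import "fmt"

// ProcessingStrategy stands in for the generated protobuf enum.
type ProcessingStrategy int32

const (
	ExecuteRemediation ProcessingStrategy = 1
	StoreOnly          ProcessingStrategy = 2 // numeric value assumed
)

// Manager is a simplified stand-in: a map replaces real node annotations.
type Manager struct {
	strategy ProcessingStrategy
	matches  map[string]bool
}

// NewManager builds a Manager bound to one processing strategy.
func NewManager(strategy ProcessingStrategy) *Manager {
	return &Manager{strategy: strategy, matches: make(map[string]bool)}
}

// AddMatch records a match unless the manager runs in STORE_ONLY mode.
func (m *Manager) AddMatch(stateKey string) {
	if m.strategy == StoreOnly {
		return // observability only: leave annotations untouched
	}
	m.matches[stateKey] = true
}

// RemoveMatch applies the same guard before deleting a match.
func (m *Manager) RemoveMatch(stateKey string) {
	if m.strategy == StoreOnly {
		return
	}
	delete(m.matches, stateKey)
}

func main() {
	storeOnly := NewManager(StoreOnly)
	storeOnly.AddMatch("policy/node-a")
	fmt.Println(len(storeOnly.matches)) // prints 0

	active := NewManager(ExecuteRemediation)
	active.AddMatch("policy/node-a")
	fmt.Println(len(active.matches)) // prints 1
}
```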
68-90: LGTM! The guard in `RemoveMatch` mirrors the `AddMatch` guard, correctly skipping annotation removal for `STORE_ONLY` strategy. The implementation is consistent and the logging is helpful.
health-monitors/kubernetes-object-monitor/pkg/controller/reconciler_test.go (2)
37-37: LGTM! Import for the `protos` package correctly added to support the `ProcessingStrategy` enum usage.
576-576: LGTM! All test setup functions (`setupTestWithPolicies`, `setupTestWithCRD`, `restartReconciler`, `restartReconcilerWithCRD`) consistently pass `protos.ProcessingStrategy_EXECUTE_REMEDIATION` to `annotations.NewManager`. This ensures existing tests continue to verify annotation modification behavior.
Also applies to: 631-631, 739-739, 777-777
tests/helpers/kube.go (3)
2231-2273: LGTM! `WaitForDaemonSetRollout` correctly implements rollout completion detection by verifying that `UpdatedNumberScheduled` and `NumberReady` both match `DesiredNumberScheduled`. The progress logging is helpful for debugging test failures.
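The completion condition named above reduces to a small predicate. The sketch below is illustrative only: `daemonSetStatus` is a local struct that copies three field names from the Kubernetes `DaemonSetStatus` type so the example runs without dependencies, and `rolloutComplete` is a hypothetical name, not the helper from the PR.

```go
package main

import "fmt"

// daemonSetStatus copies the three status fields the check relies on.
// In real code these would come from the DaemonSet's status object.
type daemonSetStatus struct {
	DesiredNumberScheduled int32
	UpdatedNumberScheduled int32
	NumberReady            int32
}

// rolloutComplete reports whether every desired pod has been updated
// to the current template and is ready.
func rolloutComplete(s daemonSetStatus) bool {
	return s.UpdatedNumberScheduled == s.DesiredNumberScheduled &&
		s.NumberReady == s.DesiredNumberScheduled
}

func main() {
	inProgress := daemonSetStatus{DesiredNumberScheduled: 3, UpdatedNumberScheduled: 2, NumberReady: 3}
	done := daemonSetStatus{DesiredNumberScheduled: 3, UpdatedNumberScheduled: 3, NumberReady: 3}
	fmt.Println(rolloutComplete(inProgress), rolloutComplete(done))
}
```

A wait helper would typically poll this predicate until it returns true or a timeout expires.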
2275-2313: LGTM! `SetDeploymentArgs` follows the established pattern of `SetDeploymentEnvVars` with proper retry-on-conflict handling and container targeting logic.
2409-2433: LGTM! `removeArgsFromContainer` correctly handles both `--flag=value` and `--flag value` argument styles. The slice manipulation for removing 1 or 2 elements is correct.
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go (4)
47-55: LGTM! The `ProcessingStrategy` field in `Params` correctly captures the CLI flag value as a string for validation.
83-91: LGTM! The validation uses the protobuf-generated `pb.ProcessingStrategy_value` map, which ensures only valid enum values are accepted. The error message is descriptive. The type conversion `pb.ProcessingStrategy(strategyValue)` is safe since `strategyValue` is obtained from the valid enum value map.
116-120: LGTM! The `processingStrategy` is correctly propagated to `registerControllers`, maintaining consistency across the initialization flow.
164-172: LGTM! The `registerControllers` function correctly receives the `processingStrategy` parameter and passes it to `annotations.NewManager`. The annotation manager is appropriately shared across all reconcilers for a given GVK.
tests/kubernetes_object_monitor_test.go (2)
24-25: LGTM! The import addition is necessary for using the helper functions in the new test.
133-160: LGTM! The setup correctly configures the STORE_ONLY strategy via deployment args and waits for the rollout to complete before proceeding with tests.
112a762 to
674f112
Compare
Signed-off-by: Tanisha goyal <[email protected]>
Signed-off-by: Tanisha goyal <[email protected]>
Signed-off-by: Tanisha goyal <[email protected]>
Signed-off-by: Tanisha goyal <[email protected]>
Signed-off-by: Tanisha goyal <[email protected]>
Signed-off-by: Tanisha goyal <[email protected]>
674f112 to
41cb337
Compare
Signed-off-by: Tanisha goyal <[email protected]>
41cb337 to
93b3c0b
Compare
Actionable comments posted: 3
Fix all issues with AI agents
In `@health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go`:
- Around line 50-55: LoadAllMatches currently loads annotations regardless of
m.processingStrategy, causing stale AnnotationKey entries to be kept when
switching to STORE_ONLY; update LoadAllMatches to ignore/filter out
AnnotationKey annotations when m.processingStrategy ==
pb.ProcessingStrategy_STORE_ONLY (or alternatively add startup logic in the
manager initialization to clear existing AnnotationKey annotations from nodes
when entering STORE_ONLY) so that no previous remediation state is loaded into
memory; reference the LoadAllMatches method, the m.processingStrategy field and
pb.ProcessingStrategy_STORE_ONLY constant and ensure AnnotationKey is excluded
or cleared before any in-memory match state is built.
In `@tests/helpers/kuberntest_object_monitor.go`:
- Around line 37-38: RestoreDeploymentArgs can clear a container's args when
originalArgs is nil; add a defensive nil guard in RestoreDeploymentArgs that
checks if originalArgs == nil and, if so, skip restoring or return early
(mirroring the configMapBackup nil-check pattern) so you don't call
make([]string, len(originalArgs)) and overwrite args with an empty slice;
reference RestoreDeploymentArgs, originalArgs, and SetDeploymentArgs when
locating the fix.
- Around line 11-34: Add package-level godoc and doc comments for the exported
constants K8S_DEPLOYMENT_NAME and K8S_CONTAINER_NAME, the exported type
KubernetesObjectMonitorTestContext, and the exported function
TeardownKubernetesObjectMonitor; each comment should be a one-line sentence
describing the identifier's purpose and follow Go doc comment format. Also
extract the literal "kubernetes-object-monitor" passed to
createConfigMapFromBytes into a new exported constant (e.g., K8S_CONFIGMAP_NAME)
and replace the hard-coded string in TeardownKubernetesObjectMonitor to use that
constant for consistency.
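The nil guard requested for `RestoreDeploymentArgs` can be sketched as a pure function. `restoreArgs` is a hypothetical name used only for illustration; it assumes a nil backup means "no backup was captured", while an empty non-nil backup means "the container originally had no args".

```go
package main

import "fmt"

// restoreArgs returns the args a container should end up with.
// A nil backup is treated as "nothing to restore", so the current args
// are kept instead of being replaced by an empty slice.
func restoreArgs(current, original []string) []string {
	if original == nil {
		return current
	}
	// Copy so later edits to the result cannot alter the backup.
	restored := make([]string, len(original))
	copy(restored, original)
	return restored
}

func main() {
	current := []string{"--processing-strategy=STORE_ONLY"}
	fmt.Println(restoreArgs(current, nil))
	fmt.Println(restoreArgs(current, []string{"--verbose"}))
}
```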
Nitpick comments (3)
health-monitors/kubernetes-object-monitor/pkg/config/types.go (1)
49-50: Validate/normalize `ProcessingStrategy` values on load. This is a free-form string; typos or lowercase values will silently bypass STORE_ONLY logic elsewhere. Consider validating against allowed enum names and normalizing empty/`UNSPECIFIED` to `EXECUTE_REMEDIATION` in config parsing, and document the accepted values in the comment. Based on learnings, other components already normalize `UNSPECIFIED` to `EXECUTE_REMEDIATION`.
health-monitors/kubernetes-object-monitor/pkg/controller/reconciler.go (1)
217-222: Normalize `ProcessingStrategy` before string compare. Line 218 compares a config string to `pb.ProcessingStrategy_STORE_ONLY.String()`; this is case-sensitive and assumes canonical formatting. Consider normalizing (or parsing once into an enum) so typos/case differences don't silently re-enable annotation writes.
tests/kubernetes_object_monitor_test.go (1)
136-154: Consider extracting non-KWOK node selection into a helper. The same selection loop appears in multiple tests; a small helper would reduce duplication and improve readability.
    if m.processingStrategy == pb.ProcessingStrategy_STORE_ONLY {
        slog.Debug("Skipping match state update for STORE_ONLY strategy", "node", nodeName,
            "stateKey", stateKey, "targetNode", targetNode)

        return nil
    }
Analysis chain
Script executed:
# First, find the relevant file
find . -type f -name "manager.go" | grep kubernetes-object-monitor
Repository: NVIDIA/NVSentinel
Length of output: 131
Script executed:
# Get the file structure to understand the repo layout
git ls-files | head -30
Repository: NVIDIA/NVSentinel
Length of output: 1069
Script executed:
# Read the specific file to understand context around lines 50-74
wc -l health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
Repository: NVIDIA/NVSentinel
Length of output: 133
Script executed:
# Read the relevant section of the manager.go file
sed -n '40,90p' health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
Repository: NVIDIA/NVSentinel
Length of output: 1574
Script executed:
# Read more of the file to find LoadAllMatches and other methods
sed -n '1,50p' health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
Repository: NVIDIA/NVSentinel
Length of output: 1667
Script executed:
# Look for LoadAllMatches method
grep -n "LoadAllMatches" health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
Repository: NVIDIA/NVSentinel
Length of output: 147
Script executed:
# Read LoadAllMatches and surrounding methods
sed -n '119,188p' health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
Repository: NVIDIA/NVSentinel
Length of output: 2020
Script executed:
# Find GetMatches method
grep -n "GetMatches" health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
Repository: NVIDIA/NVSentinel
Length of output: 209
Script executed:
# Search in all files to understand where GetMatches is called and what it does
rg "func.*GetMatches" health-monitors/kubernetes-object-monitor
Repository: NVIDIA/NVSentinel
Length of output: 224
Script executed:
# Read GetMatches method
sed -n '92,118p' health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
Repository: NVIDIA/NVSentinel
Length of output: 817
Script executed:
# Search for STORE_ONLY usage and documentation
rg "STORE_ONLY" health-monitors/kubernetes-object-monitor --context 3
Repository: NVIDIA/NVSentinel
Length of output: 3468
Script executed:
# Check if there are any comments or documentation about STORE_ONLY mode
rg "STORE_ONLY" . --context 5 | head -100
Repository: NVIDIA/NVSentinel
Length of output: 7694
Script executed:
# Look for ProcessingStrategy definition to understand STORE_ONLY
rg "ProcessingStrategy" health-monitors/kubernetes-object-monitor -A 5 -B 5 | head -80
Repository: NVIDIA/NVSentinel
Length of output: 7595
Script executed:
# Search for where LoadAllMatches is called
rg "LoadAllMatches" health-monitors/kubernetes-object-monitor --context 3
Repository: NVIDIA/NVSentinel
Length of output: 1494
Script executed:
# Also check the reconciler to understand the full flow
sed -n '1,80p' health-monitors/kubernetes-object-monitor/pkg/controller/reconciler.go
Repository: NVIDIA/NVSentinel
Length of output: 2630
Script executed:
# Search for when LoadState is called and any related cleanup logic
rg "LoadState\|STORE_ONLY" health-monitors/kubernetes-object-monitor/pkg/controller/reconciler.go -B 5 -A 10
Repository: NVIDIA/NVSentinel
Length of output: 43
Script executed:
# Check if there's any initialization or cleanup logic related to STORE_ONLY
rg "processingStrategy" health-monitors/kubernetes-object-monitor/pkg/controller/reconciler.go -B 3 -A 3
Repository: NVIDIA/NVSentinel
Length of output: 43
Script executed:
# Look at the full reconciler to understand the flow better
sed -n '60,150p' health-monitors/kubernetes-object-monitor/pkg/controller/reconciler.go
Repository: NVIDIA/NVSentinel
Length of output: 2698
Script executed:
# Search for any initialization or cleanup that might handle STORE_ONLY annotations
rg "AnnotationKey" health-monitors/kubernetes-object-monitor --context 3
Repository: NVIDIA/NVSentinel
Length of output: 3450
Script executed:
# Check if there's documentation or comments about STORE_ONLY behavior
rg "STORE_ONLY" . -B 2 -A 2 | grep -E "(TODO|FIXME|NOTE|BUG|comment|//)" -A 2 -B 2
Repository: NVIDIA/NVSentinel
Length of output: 14129
Script executed:
# Look at the full annotations manager file to see all methods
wc -l health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
Repository: NVIDIA/NVSentinel
Length of output: 133
Script executed:
# Check if there's any filtering logic in GetMatches or LoadAllMatches
sed -n '92,140p' health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
Repository: NVIDIA/NVSentinel
Length of output: 1407
Filter stale annotations in STORE_ONLY mode or clear them on startup.
LoadAllMatches reads all node annotations without checking processingStrategy. If a cluster switches from EXECUTE_REMEDIATION to STORE_ONLY, stale annotations from the previous mode will be loaded into memory and used to emit healthy events on restartβviolating STORE_ONLY's observability-only semantics. Either filter out AnnotationKey when loading in STORE_ONLY mode, or clear existing annotations on startup.
Prompt for AI Agents
In `@health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go` around
lines 50 - 55, LoadAllMatches currently loads annotations regardless of
m.processingStrategy, causing stale AnnotationKey entries to be kept when
switching to STORE_ONLY; update LoadAllMatches to ignore/filter out
AnnotationKey annotations when m.processingStrategy ==
pb.ProcessingStrategy_STORE_ONLY (or alternatively add startup logic in the
manager initialization to clear existing AnnotationKey annotations from nodes
when entering STORE_ONLY) so that no previous remediation state is loaded into
memory; reference the LoadAllMatches method, the m.processingStrategy field and
pb.ProcessingStrategy_STORE_ONLY constant and ensure AnnotationKey is excluded
or cleared before any in-memory match state is built.
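One way to read the "filter on load" option above is sketched below. It is an assumption-laden illustration, not the repository's `LoadAllMatches`: the annotation key is a made-up placeholder, annotation values are treated as opaque strings, and the strategy is reduced to a boolean.

```go
package main

import "fmt"

// annotationKey is a hypothetical placeholder, not the project's real key.
const annotationKey = "example.com/policy-matches"

// loadAllMatches builds the in-memory match state from node annotations.
// In STORE_ONLY mode it returns an empty state, so annotations written
// under a previous strategy are never loaded.
func loadAllMatches(storeOnly bool, nodeAnnotations map[string]map[string]string) map[string]string {
	matches := make(map[string]string)
	if storeOnly {
		return matches
	}
	for node, annotations := range nodeAnnotations {
		if value, ok := annotations[annotationKey]; ok {
			matches[node] = value
		}
	}
	return matches
}

func main() {
	nodes := map[string]map[string]string{
		"node-a": {annotationKey: "stale-state"},
		"node-b": {"unrelated": "x"},
	}
	fmt.Println(len(loadAllMatches(true, nodes)))
	fmt.Println(len(loadAllMatches(false, nodes)))
}
```

This covers only the filtering alternative; clearing existing annotations on startup would additionally require writes to the nodes.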
Actionable comments posted: 1
Fix all issues with AI agents
In `@tests/kubernetes_object_monitor_test.go`:
- Around line 263-264: The call to RestartDeployment currently ignores its
returned error; update the test to capture the error from
helpers.RestartDeployment(ctx, t, client, helpers.K8S_DEPLOYMENT_NAME,
helpers.NVSentinelNamespace) and handle it immediately (e.g., if err != nil then
t.Fatalf or t.Fatalff with a clear message including the error) so rollout
failures fail the test and the test state isn't left inconsistent; keep the
surrounding context (ctx, t, client, helpers.K8S_DEPLOYMENT_NAME,
helpers.NVSentinelNamespace) unchanged.
Nitpick comments (1)
tests/kubernetes_object_monitor_test.go (1)
171-198: Add exported-event validation for STORE_ONLY strategy. These assessments only verify annotation absence. To exercise the new processingStrategy path, also assert the exported event carries STORE_ONLY (and the expected check name) using existing event helper utilities.
Also applies to: 268-295
364996a to
1b13327
Compare
Signed-off-by: Tanisha goyal <[email protected]>
1b13327 to
c031572
Compare
Signed-off-by: Tanisha goyal <[email protected]>
Summary
Type of Change
Component(s) Affected
Testing
Checklist
Testing
Summary by CodeRabbit
New Features
`processingStrategy` configuration option to control health event handling behavior. Two modes available: EXECUTE_REMEDIATION (default) for active remediation, and STORE_ONLY for observability-only event tracking.
Tests