feat: add event handling strategy changes in syslog health monitor #609
base: main
Walkthrough
This PR adds a configurable processing strategy (EXECUTE_REMEDIATION or STORE_ONLY) via a CLI flag and Helm value, threads the chosen strategy through SyslogMonitor and the per-check handlers, and includes it in emitted HealthEvent payloads and tests.
Estimated code review effort: 3 (Moderate) | ~25 minutes
Pre-merge checks: 2 passed, 1 failed (1 warning)
Actionable comments posted: 6
Nitpick comments (10)
distros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpl (1)
103-104: Consider quoting the processing strategy value for consistency.
The processingStrategy value is passed unquoted, while similar string arguments like $kataLabel at line 94 use the quote function. For consistency and to follow YAML best practices, consider quoting this value.
Proposed fix
      - "--processing-strategy"
    - - {{ $root.Values.processingStrategy }}
    + - {{ $root.Values.processingStrategy | quote }}
tests/helpers/healthevent.go (2)
153-156: Add godoc comment for the exported method.
Per coding guidelines, exported Go functions should have function comments. Other With* methods in this file lack comments too, but new code should set a better example.
Suggested documentation
    +// WithProcessingStrategy sets the processing strategy for the health event.
    +// Use protos.ProcessingStrategy_EXECUTE_REMEDIATION (0) or protos.ProcessingStrategy_STORE_ONLY (1).
     func (h *HealthEventTemplate) WithProcessingStrategy(strategy int) *HealthEventTemplate {
         h.ProcessingStrategy = strategy
         return h
     }
48-48: Consider using the protobuf type for consistency.
The protobuf-generated Go code uses int32 for ProcessingStrategy. While int works for JSON serialization in tests, using int32 would be more consistent with the generated types.
Suggested change
    -    ProcessingStrategy int `json:"processingStrategy,omitempty"`
    +    ProcessingStrategy int32 `json:"processingStrategy,omitempty"`
This would require updating the setter signature as well:
    -func (h *HealthEventTemplate) WithProcessingStrategy(strategy int) *HealthEventTemplate {
    +func (h *HealthEventTemplate) WithProcessingStrategy(strategy int32) *HealthEventTemplate {
data-models/protobufs/health_event.proto (1)
77-77: Add field-level comment for processingStrategy.
Per coding guidelines for protobuf files: "Include comprehensive comments for all fields in Protocol Buffer messages." The enum declaration has good comments, but the field itself should also have a brief comment explaining its purpose in the context of the HealthEvent message.
Suggested addition
       BehaviourOverrides drainOverrides = 15;
    +  // Determines how downstream modules should handle this event (remediation vs observability-only).
       ProcessingStrategy processingStrategy = 16;
     }
health-monitors/syslog-health-monitor/main.go (1)
164-171: Consider enhancing the error message with valid options.
The validation correctly uses the generated pb.ProcessingStrategy_value map. Consider listing valid values in the error message to improve user experience.
Suggested improvement
     value, ok := pb.ProcessingStrategy_value[*processingStrategyFlag]
     if !ok {
    -    return fmt.Errorf("unexpected processingStrategy value: %q", *processingStrategyFlag)
    +    return fmt.Errorf("unexpected processingStrategy value: %q (valid values: EXECUTE_REMEDIATION, STORE_ONLY)", *processingStrategyFlag)
     }
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go (1)
239-242: Consider using the enum constant directly instead of accessing the handler field.
Line 241 accesses handler2.processingStrategy directly for the assertion. While this works, using the expected enum constant (pb.ProcessingStrategy_EXECUTE_REMEDIATION) would be more explicit and resilient to future encapsulation changes.
Suggested change
    -assert.Equal(t, handler2.processingStrategy, events.Events[0].ProcessingStrategy)
    +assert.Equal(t, pb.ProcessingStrategy_EXECUTE_REMEDIATION, events.Events[0].ProcessingStrategy)
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go (1)
191-195: Brittle entity index assertions may break if entity order changes.
The assertions use hardcoded indices (EntitiesImpacted[3], EntitiesImpacted[4]) to access GPU entities. If the entity ordering in ProcessLine changes, these tests will fail with confusing errors.
Consider iterating over entities by type, similar to the pattern used in gpufallen_handler_test.go (lines 124-136).
Suggested approach
    -// Verify GPU entity
    -assert.Equal(t, "GPU", event.EntitiesImpacted[3].EntityType)
    -assert.Equal(t, "1", event.EntitiesImpacted[3].EntityValue)
    -assert.Equal(t, "GPU_UUID", event.EntitiesImpacted[4].EntityType)
    -assert.Equal(t, "GPU-aaaabbbb-cccc-dddd-eeee-ffffffffffff", event.EntitiesImpacted[4].EntityValue)
    +// Verify GPU entities by type rather than assuming order
    +var hasGPU, hasGPUUUID bool
    +for _, entity := range event.EntitiesImpacted {
    +    switch entity.EntityType {
    +    case "GPU":
    +        hasGPU = true
    +        assert.Equal(t, "1", entity.EntityValue)
    +    case "GPU_UUID":
    +        hasGPUUUID = true
    +        assert.Equal(t, "GPU-aaaabbbb-cccc-dddd-eeee-ffffffffffff", entity.EntityValue)
    +    }
    +}
    +assert.True(t, hasGPU, "Should have GPU entity")
    +assert.True(t, hasGPUUUID, "Should have GPU_UUID entity")
tests/helpers/syslog-health-monitor.go (1)
15-15: Missing package-level godoc comment.
As per coding guidelines, package-level godoc is required for all Go packages.
Proposed fix
    +// Package helpers provides test utility functions for NVSentinel integration tests.
     package helpers
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (1)
427-446: Consider adding a test case to validate ProcessingStrategy is stored in handler.
The existing test cases verify handler construction but don't assert that processingStrategy is correctly stored. Consider adding an assertion.
Proposed enhancement
     } else {
         assert.NoError(t, err)
         require.NotNil(t, handler)
         assert.Equal(t, tc.nodeName, handler.nodeName)
         assert.Equal(t, tc.agentName, handler.defaultAgentName)
         assert.Equal(t, tc.componentClass, handler.defaultComponentClass)
         assert.Equal(t, tc.checkName, handler.checkName)
    +    assert.Equal(t, pb.ProcessingStrategy_EXECUTE_REMEDIATION, handler.processingStrategy)
         assert.NotNil(t, handler.pciToGPUUUID)
         assert.NotNil(t, handler.parser)
         assert.NotNil(t, handler.metadataReader)
     }
tests/helpers/kube.go (1)
2346-2390: Consider clarifying the daemonsetName parameter usage.
The daemonsetName parameter is only used in the error message but not for any actual filtering logic (which relies on podNamePattern). This could be confusing to callers. Consider either using it for filtering or documenting this clearly.
Suggested documentation improvement
    +// GetDaemonSetPodOnWorkerNode returns a running, ready pod matching the given name pattern on a real worker node.
    +// Note: daemonsetName is used for error messages; filtering is done via podNamePattern.
     func GetDaemonSetPodOnWorkerNode(ctx context.Context, t *testing.T, client klient.Client, daemonsetName string, podNamePattern string) (*v1.Pod, error) {
Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Files ignored due to path filters (1)
data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go
Files selected for processing (34)
data-models/protobufs/health_event.proto
distros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpl
distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml
event-exporter/pkg/transformer/cloudevents.go
event-exporter/pkg/transformer/cloudevents_test.go
fault-quarantine/pkg/evaluator/rule_evaluator_test.go
fault-quarantine/pkg/initializer/init.go
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
health-monitors/syslog-health-monitor/main.go
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
health-monitors/syslog-health-monitor/pkg/gpufallen/types.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
health-monitors/syslog-health-monitor/pkg/sxid/types.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go
health-monitors/syslog-health-monitor/pkg/xid/types.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
platform-connectors/pkg/connectors/kubernetes/process_node_events.go
store-client/pkg/client/mongodb_pipeline_builder.go
store-client/pkg/client/pipeline_builder.go
store-client/pkg/client/pipeline_builder_test.go
store-client/pkg/client/postgresql_pipeline_builder.go
tests/event_exporter_test.go
tests/helpers/event_exporter.go
tests/helpers/healthevent.go
tests/helpers/kube.go
tests/helpers/syslog-health-monitor.go
tests/syslog_health_monitor_test.go
Additional context used
Path-based instructions (5)
**/*.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
tests/helpers/event_exporter.go
store-client/pkg/client/pipeline_builder_test.go
tests/helpers/healthevent.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go
event-exporter/pkg/transformer/cloudevents.go
store-client/pkg/client/mongodb_pipeline_builder.go
health-monitors/syslog-health-monitor/pkg/sxid/types.go
fault-quarantine/pkg/initializer/init.go
fault-quarantine/pkg/evaluator/rule_evaluator_test.go
event-exporter/pkg/transformer/cloudevents_test.go
store-client/pkg/client/pipeline_builder.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
health-monitors/syslog-health-monitor/pkg/xid/types.go
store-client/pkg/client/postgresql_pipeline_builder.go
health-monitors/syslog-health-monitor/main.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go
tests/helpers/syslog-health-monitor.go
platform-connectors/pkg/connectors/kubernetes/process_node_events.go
health-monitors/syslog-health-monitor/pkg/gpufallen/types.go
tests/event_exporter_test.go
tests/helpers/kube.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
tests/syslog_health_monitor_test.go
**/*_test.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior
Files:
store-client/pkg/client/pipeline_builder_test.go
fault-quarantine/pkg/evaluator/rule_evaluator_test.go
event-exporter/pkg/transformer/cloudevents_test.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
tests/event_exporter_test.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
tests/syslog_health_monitor_test.go
data-models/protobufs/**/*.proto
CodeRabbit inference engine (.github/copilot-instructions.md)
data-models/protobufs/**/*.proto: Define Protocol Buffer messages in data-models/protobufs/ directory
Use semantic versioning for breaking changes in Protocol Buffer messages
Include comprehensive comments for all fields in Protocol Buffer messages
Files:
data-models/protobufs/health_event.proto
**/values.yaml
CodeRabbit inference engine (.github/copilot-instructions.md)
**/values.yaml: Document all values in Helm chart values.yaml with inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable
Files:
distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml
**/*.py
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code
Files:
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
Learnings (8)
Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Applied to files:
health-monitors/syslog-health-monitor/pkg/sxid/types.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
health-monitors/syslog-health-monitor/pkg/xid/types.go
health-monitors/syslog-health-monitor/main.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go
tests/syslog_health_monitor_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Each health monitor service should implement the gRPC service from `data-models/protobufs/`
Applied to files:
health-monitors/syslog-health-monitor/pkg/sxid/types.go
Learning: 2025-11-10T10:25:19.443Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.
Applied to files:
fault-quarantine/pkg/initializer/init.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
tests/event_exporter_test.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
tests/syslog_health_monitor_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/event_exporter_test.go
tests/syslog_health_monitor_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `testify/assert` and `testify/require` for assertions in Go tests
Applied to files:
tests/event_exporter_test.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods
Applied to files:
tests/event_exporter_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Applied to files:
tests/syslog_health_monitor_test.go
Code graph analysis (17)
store-client/pkg/client/pipeline_builder_test.go (3)
store-client/pkg/client/pipeline_builder.go (1)
PipelineBuilder(26-47)
store-client/pkg/client/mongodb_pipeline_builder.go (1)
NewMongoDBPipelineBuilder(29-31)
store-client/pkg/client/postgresql_pipeline_builder.go (1)
NewPostgreSQLPipelineBuilder(29-31)
tests/helpers/healthevent.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-17)
data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(43-43), ProcessingStrategy(72-74), ProcessingStrategy(76-78), ProcessingStrategy(85-87)
event-exporter/pkg/transformer/cloudevents.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-17)
data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(43-43), ProcessingStrategy(72-74), ProcessingStrategy(76-78), ProcessingStrategy(85-87)
store-client/pkg/client/mongodb_pipeline_builder.go (2)
store-client/pkg/datastore/types.go (4)
ToPipeline(161-163), D(131-133), E(126-128), A(136-138)
data-models/pkg/protos/health_event.pb.go (1)
ProcessingStrategy_EXECUTE_REMEDIATION(46-46)
event-exporter/pkg/transformer/cloudevents_test.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-17)
data-models/pkg/protos/health_event.pb.go (5)
ProcessingStrategy(43-43), ProcessingStrategy(72-74), ProcessingStrategy(76-78), ProcessingStrategy(85-87), ProcessingStrategy_STORE_ONLY(47-47)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (3)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-17)
data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(43-43), ProcessingStrategy(72-74), ProcessingStrategy(76-78), ProcessingStrategy(85-87)
health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
XIDHandler(29-39)
store-client/pkg/client/postgresql_pipeline_builder.go (2)
store-client/pkg/datastore/types.go (4)
ToPipeline(161-163), D(131-133), E(126-128), A(136-138)
data-models/pkg/protos/health_event.pb.go (1)
ProcessingStrategy_EXECUTE_REMEDIATION(46-46)
health-monitors/syslog-health-monitor/main.go (1)
data-models/pkg/protos/health_event.pb.go (1)
ProcessingStrategy_value(56-59)
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (2)
data-models/pkg/protos/health_event.pb.go (17)
HealthEvent(260-280), HealthEvent(293-293), HealthEvent(308-310), Entity(208-214), Entity(227-227), Entity(242-244), RecommendedAction(89-89), RecommendedAction(139-141), RecommendedAction(143-145), RecommendedAction(152-154), RecommendedAction_CONTACT_SUPPORT(94-94), ProcessingStrategy(43-43), ProcessingStrategy(72-74), ProcessingStrategy(76-78), ProcessingStrategy(85-87), ProcessingStrategy_STORE_ONLY(47-47), ProcessingStrategy_EXECUTE_REMEDIATION(46-46)
platform-connectors/pkg/connectors/kubernetes/k8s_connector.go (1)
NewK8sConnector(47-58)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (3)
data-models/pkg/protos/health_event.pb.go (6)
ProcessingStrategy_EXECUTE_REMEDIATION(46-46), ProcessingStrategy_STORE_ONLY(47-47), ProcessingStrategy(43-43), ProcessingStrategy(72-74), ProcessingStrategy(76-78), ProcessingStrategy(85-87)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)
NewXIDHandler(33-58)
health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
XIDHandler(29-39)
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go (1)
health-monitors/syslog-health-monitor/pkg/gpufallen/types.go (1)
GPUFallenHandler(47-57)
platform-connectors/pkg/connectors/kubernetes/process_node_events.go (2)
data-models/pkg/protos/health_event.pb.go (11)
HealthEvents(156-162), HealthEvents(175-175), HealthEvents(190-192), HealthEvent(260-280), HealthEvent(293-293), HealthEvent(308-310), ProcessingStrategy(43-43), ProcessingStrategy(72-74), ProcessingStrategy(76-78), ProcessingStrategy(85-87), ProcessingStrategy_STORE_ONLY(47-47)
platform-connectors/pkg/connectors/kubernetes/k8s_connector.go (1)
K8sConnector(37-45)
tests/event_exporter_test.go (1)
tests/helpers/event_exporter.go (1)
ValidateCloudEvent(221-247)
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go (1)
data-models/pkg/protos/health_event.pb.go (2)
ProcessingStrategy_STORE_ONLY(47-47), ProcessingStrategy_EXECUTE_REMEDIATION(46-46)
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go (1)
data-models/pkg/protos/health_event.pb.go (2)
ProcessingStrategy_STORE_ONLY(47-47), ProcessingStrategy_EXECUTE_REMEDIATION(46-46)
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go (2)
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go (1)
NewSXIDHandler(30-42)
data-models/pkg/protos/health_event.pb.go (3)
ProcessingStrategy_EXECUTE_REMEDIATION(46-46), ProcessingStrategy_STORE_ONLY(47-47), RecommendedAction_CONTACT_SUPPORT(94-94)
tests/syslog_health_monitor_test.go (3)
tests/helpers/syslog-health-monitor.go (3)
SetUpSyslogHealthMonitor(34-74), StubJournalHTTPPort(28-28), TearDownSyslogHealthMonitor(77-116)
tests/helpers/kube.go (2)
InjectSyslogMessages(1796-1819), PortForwardPod(1590-1634)
tests/helpers/fault_quarantine.go (2)
AssertQuarantineState(315-382), QuarantineAssertion(56-60)
Additional comments (65)
fault-quarantine/pkg/initializer/init.go (1)
66-66: Verify the pipeline filtering aligns with fault-quarantine requirements.
The switch from BuildAllHealthEventInsertsPipeline() to BuildProcessableHealthEventInsertsPipeline() means the fault-quarantine module will now only observe health events marked with the EXECUTE_REMEDIATION processing strategy. Events marked STORE_ONLY will be excluded from triggering quarantine actions.
Run the following script to confirm the new pipeline builder exists and verify its filtering logic:
    #!/bin/bash
    # Verify BuildProcessableHealthEventInsertsPipeline implementation and filtering behavior

    # Find the pipeline builder implementation
    echo "=== Locating BuildProcessableHealthEventInsertsPipeline implementation ==="
    ast-grep --pattern $'func ($$$) BuildProcessableHealthEventInsertsPipeline() $$$'

    # Find usages to understand the filtering behavior
    echo -e "\n=== Finding references to ProcessingStrategy enum ==="
    rg -n -C3 --type=go 'ProcessingStrategy|EXECUTE_REMEDIATION|STORE_ONLY' store-client/

    # Find the builder interface/struct definition
    echo -e "\n=== Locating PipelineBuilder interface ==="
    rg -n -C3 --type=go 'type.*PipelineBuilder|BuildProcessableHealthEventInsertsPipeline'
fault-quarantine/pkg/evaluator/rule_evaluator_test.go (1)
263-263: LGTM! Default enum value correctly validated.
The test correctly expects processingStrategy to be 0 (EXECUTE_REMEDIATION), which is the protobuf default for the new enum field.
event-exporter/pkg/transformer/cloudevents.go (1)
66-66: LGTM! Consistent enum-to-string conversion.
The implementation correctly uses ProcessingStrategy.String() to include the strategy in CloudEvent data, matching the pattern used for recommendedAction at line 61.
store-client/pkg/client/pipeline_builder_test.go (1)
69-86: LGTM! Well-structured table-driven test.
The test correctly validates the new BuildProcessableHealthEventInsertsPipeline() method for both MongoDB and PostgreSQL builders, following the established patterns in this test file.
health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go (1)
53-53: LGTM! Safe addition of unexported field.
The processingStrategy field is correctly added as an unexported field to store the health event processing strategy for this monitor instance.
health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
20-20: LGTM! Consistent import and field addition.
The protobuf import and processingStrategy field are correctly added to enable strategy propagation through the XID handler to health events.
Also applies to: 34-34
health-monitors/syslog-health-monitor/pkg/gpufallen/types.go (1)
23-23: LGTM! Consistent with other handler implementations.
The protobuf import and processingStrategy field addition follow the same pattern used in the XID and SXID handlers.
Also applies to: 52-52
distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml (1)
51-56: LGTM! Well-documented configuration option.
The processingStrategy configuration is clearly documented with valid values, default behavior, and behavioral differences. The default value EXECUTE_REMEDIATION maintains backward compatibility with existing deployments.
As per coding guidelines, the inline documentation follows Helm best practices for values.yaml files.
data-models/protobufs/health_event.proto (1)
32-38: Good enum design with clear documentation.
The enum is well-designed with EXECUTE_REMEDIATION=0 as the default (zero value), ensuring backward compatibility for existing events that don't set this field. The comments clearly explain the behavioral difference between the two strategies.
store-client/pkg/client/mongodb_pipeline_builder.go (1)
87-100: Verify behavior when processingStrategy field is missing or null.
The filter matches documents where fullDocument.healthevent.processingstrategy equals 0 (EXECUTE_REMEDIATION). In MongoDB, documents missing this field will have null for that path, which does not equal 0. This could cause backward compatibility issues with existing health events that were created before this field was added.
If existing events should be treated as EXECUTE_REMEDIATION (the intended default), consider updating the filter to handle missing fields:
Potential fix if missing fields need to match
     func (b *MongoDBPipelineBuilder) BuildProcessableHealthEventInsertsPipeline() datastore.Pipeline {
         return datastore.ToPipeline(
             datastore.D(
                 datastore.E("$match", datastore.D(
                     datastore.E("operationType", datastore.D(
                         datastore.E("$in", datastore.A("insert")),
                     )),
    -                datastore.E("fullDocument.healthevent.processingstrategy", int32(protos.ProcessingStrategy_EXECUTE_REMEDIATION)),
    +                datastore.E("$or", datastore.A(
    +                    datastore.D(datastore.E("fullDocument.healthevent.processingstrategy", int32(protos.ProcessingStrategy_EXECUTE_REMEDIATION))),
    +                    datastore.D(datastore.E("fullDocument.healthevent.processingstrategy", datastore.D(datastore.E("$exists", false)))),
    +                )),
                 )),
             ),
         )
     }
Please verify whether existing health events in the database have this field set, or if all events going forward will include it.
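The difference between the two filters can be seen without a database. The following is a minimal sketch, not the store-client code: it models a stored document as nested maps (an assumed layout) and compares a plain equality match with one that also accepts a missing field. The names lookup, strictMatch, and lenientMatch are hypothetical.

```go
package main

import "fmt"

// executeRemediation mirrors the enum's zero value (EXECUTE_REMEDIATION = 0).
const executeRemediation int32 = 0

// lookup returns the stored processing strategy and whether the field exists.
// The nested-map document layout is an assumption made for this illustration.
func lookup(doc map[string]any) (int32, bool) {
	event, ok := doc["healthevent"].(map[string]any)
	if !ok {
		return 0, false
	}
	value, ok := event["processingstrategy"].(int32)
	return value, ok
}

// strictMatch behaves like a plain equality filter: a missing field never matches.
func strictMatch(doc map[string]any) bool {
	value, found := lookup(doc)
	return found && value == executeRemediation
}

// lenientMatch behaves like the proposed $or filter: equal to 0, or field absent.
func lenientMatch(doc map[string]any) bool {
	value, found := lookup(doc)
	return !found || value == executeRemediation
}

func main() {
	// legacy stands for an event written before the field existed.
	legacy := map[string]any{"healthevent": map[string]any{"agent": "syslog-health-monitor"}}
	storeOnly := map[string]any{"healthevent": map[string]any{"processingstrategy": int32(1)}}

	fmt.Println("legacy:", strictMatch(legacy), lenientMatch(legacy))
	fmt.Println("store-only:", strictMatch(storeOnly), lenientMatch(storeOnly))
}
```

In this sketch only the lenient variant accepts the legacy document, while both variants reject the STORE_ONLY one, which is the behavior the suggested filter change is aiming for.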
store-client/pkg/client/pipeline_builder.go (1)
35-38: LGTM!
The new interface method is well-documented with clear explanation of:
- What it does: watches for "processable" health event inserts with processingStrategy=EXECUTE_REMEDIATION
- Who uses it: fault-quarantine
- Why: to ignore observability-only events (STORE_ONLY)
This follows the established pattern of the other pipeline builder methods.
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)
1-6: This file is auto-generated by the protobuf compiler and is correctly checked into the repository. The "NO CHECKED-IN PROTOBUF GENCODE" comment is a standard protobuf compiler-generated header, not a project policy directive. For Python (non-C++), checking in generated protobuf files is a standard practice supported by the protobuf project. The repository properly manages this via the protos-generate Makefile target for regeneration and the protos-lint target for CI validation to ensure generated files stay up to date with their source .proto files.
Likely an incorrect or invalid review comment.
tests/helpers/event_exporter.go (1)
221-246: LGTM!
The new expectedProcessingStrategy parameter integrates cleanly with the existing validation pattern. The assertion at line 245 correctly validates the new field in the CloudEvent payload.
health-monitors/syslog-health-monitor/pkg/sxid/types.go (1)
20-36: LGTM!
The new processingStrategy field is correctly typed and logically positioned among the configuration fields in SXIDHandler.
event-exporter/pkg/transformer/cloudevents_test.go (1)
69-108: LGTM!
The test correctly validates the new processingStrategy field in CloudEvent output. The explicit STORE_ONLY value in the first comprehensive test case provides good coverage, while other focused test cases appropriately default to the zero value (EXECUTE_REMEDIATION).
tests/event_exporter_test.go (1)
85-85: LGTM!
The test correctly expects "EXECUTE_REMEDIATION" as the processing strategy, which aligns with the default behavior for health events.
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go (2)
29-51: LGTM!
The processingStrategy parameter is correctly threaded through the constructor to the handler struct, following the established pattern for other configuration fields.
225-239: LGTM!
The ProcessingStrategy field is correctly populated from the handler's configuration when creating health events.
health-monitors/syslog-health-monitor/main.go (2)
70-71: LGTM!
The new --processing-strategy flag with sensible default and clear help text follows the existing flag pattern in this file.
173-184: LGTM!
The processingStrategy is correctly converted and passed to NewSyslogMonitor. The overall wiring integrates cleanly with the existing initialization flow.
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go (2)
30-42: LGTM! Constructor correctly extended with ProcessingStrategy parameter.
The constructor signature update properly accepts the new processingStrategy parameter and assigns it to the handler struct. This is consistent with the pattern used across other handlers (XID, GPUFallen) in this PR.
109-109: ProcessingStrategy correctly propagated to emitted HealthEvent.The handler's
processingStrategyis properly assigned to theHealthEvent.ProcessingStrategyfield, ensuring events carry the configured strategy for downstream processing.health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go (3)
167-175: LGTM! Test updated to validate ProcessingStrategy propagation. The test correctly passes `pb.ProcessingStrategy_STORE_ONLY` to the handler constructor, following the table-driven test pattern and using `testify/require` for assertions as per coding guidelines.
185-185: Good assertion coverage for ProcessingStrategy field. This assertion ensures the emitted event carries the correct processing strategy, validating the end-to-end propagation from handler construction to event emission.
193-319: Comprehensive XID tracking tests properly updated with ProcessingStrategy. All test cases in `TestXIDTracking` consistently use `pb.ProcessingStrategy_EXECUTE_REMEDIATION`, maintaining test consistency while validating the new parameter integration.
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (3)
1391-1506: Well-structured table-driven test for ProcessingStrategy behavior. The test covers key scenarios:
- `STORE_ONLY` events should not create node conditions or K8s events
- `EXECUTE_REMEDIATION` events should create node conditions
- Mixed strategies properly filter only `EXECUTE_REMEDIATION` events
This follows the coding guidelines for table-driven tests with descriptive names.
1550-1562: Reasonable approach to count NVSentinel-specific conditions. The logic excludes standard Kubernetes node conditions (NodeReady, NodeMemoryPressure, etc.) to isolate NVSentinel-added conditions. The logging at line 1560 aids debugging.
1508-1589: Good test isolation with per-case setup. Each test case creates its own `localClientSet`, `ringBuffer`, and `connector`, preventing state leakage between test runs. The `defer close(stopCh)` ensures proper cleanup.
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go (3)
18-25: Imports correctly updated for ProcessingStrategy testing. New imports for `os`, `filepath`, and the `pb` alias are properly added to support the temp file creation and protobuf type usage in tests.
143-196: Good integration test with realistic metadata setup. This test creates a temporary `gpu_metadata.json`, exercises the full `ProcessLine` flow with valid NVSwitch topology, and validates the emitted event including the new `ProcessingStrategy` field.
225-249: Test cases properly updated with ProcessingStrategy parameter. The `TestProcessLine` table-driven tests correctly pass `pb.ProcessingStrategy_EXECUTE_REMEDIATION` and validate it in assertions, following the established pattern. Uses `testify/assert` and `testify/require` per coding guidelines.
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go (7)
309-317: Test correctly validates ProcessingStrategy initialization. The test verifies that `NewSyslogMonitor` properly stores the `processingStrategy` parameter. The assertion at line 317 confirms the monitor's field is set to the expected value.
329-334: Good coverage testing both constructor variants. Testing `NewSyslogMonitorWithFactory` with `EXECUTE_REMEDIATION` ensures both constructor paths properly handle the new parameter.
391-403: TestJournalProcessingLogic updated with ProcessingStrategy. The test correctly passes `pb.ProcessingStrategy_EXECUTE_REMEDIATION` to maintain consistency with the expected handler behavior during journal processing.
494-544: TestJournalStateManagement properly tests state persistence with ProcessingStrategy. Both monitor instances (sm and sm2) are created with consistent `EXECUTE_REMEDIATION` strategy, ensuring the state persistence test accurately reflects real-world usage.
574-593: TestBootIDChangeHandling correctly includes ProcessingStrategy. The boot ID detection test is updated to include the new parameter, maintaining test consistency across the test suite.
624-644: TestRunMultipleChecks validates handler initialization with ProcessingStrategy. The test confirms that both XID and SXID handlers are properly initialized when multiple checks are configured, with the processing strategy parameter correctly passed through.
665-679: TestGPUFallenOffHandlerInitialization updated for completeness. The GPU Fallen Off handler initialization test includes the `ProcessingStrategy` parameter, ensuring all handler types are tested with the new field.
platform-connectors/pkg/connectors/kubernetes/process_node_events.go (4)
325-343: LGTM: Clean filtering implementation for STORE_ONLY events. The function correctly filters out `STORE_ONLY` health events with appropriate info-level logging for observability. The implementation aligns with the broader PR changes introducing `ProcessingStrategy`.
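The filtering described in this comment can be sketched roughly as follows. This is not the connector's actual code: `HealthEvent`, `ProcessingStrategy`, its numeric values, and `filterProcessable` are hypothetical stand-ins chosen for illustration.

```go
package main

import (
	"fmt"
	"log/slog"
)

// ProcessingStrategy is a hypothetical stand-in for the protobuf enum;
// the numeric values are assumed.
type ProcessingStrategy int32

const (
	Unspecified        ProcessingStrategy = 0
	ExecuteRemediation ProcessingStrategy = 1
	StoreOnly          ProcessingStrategy = 2
)

// HealthEvent is a trimmed-down, hypothetical version of the real message.
type HealthEvent struct {
	NodeName  string
	CheckName string
	Strategy  ProcessingStrategy
}

// filterProcessable returns the events that should drive node conditions and
// Kubernetes events, skipping (and logging) those marked STORE_ONLY.
func filterProcessable(events []HealthEvent) []HealthEvent {
	processable := make([]HealthEvent, 0, len(events))
	for _, ev := range events {
		if ev.Strategy == StoreOnly {
			slog.Info("skipping STORE_ONLY health event",
				"node", ev.NodeName, "check", ev.CheckName)
			continue
		}
		processable = append(processable, ev)
	}
	return processable
}

func main() {
	events := []HealthEvent{
		{NodeName: "node-a", CheckName: "SysLogsXIDError", Strategy: StoreOnly},
		{NodeName: "node-a", CheckName: "GpuPowerWatch", Strategy: ExecuteRemediation},
	}
	for _, ev := range filterProcessable(events) {
		fmt.Println("processing:", ev.CheckName)
	}
}
```

Filtering once up front means later steps (node condition updates, event creation) can iterate over a single slice without repeating the strategy check.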
345-370: LGTM: Good extraction of K8s event creation logic. The function consolidates event creation into a reusable helper, improving code organization. All required fields are properly populated from the health event.
388-400: LGTM: Correct integration of filtered events with node condition updates. The timing instrumentation and error handling are properly applied to the filtered `processableEvents` set.
402-413: LGTM: Node event creation correctly uses filtered events. The loop properly iterates over `processableEvents` and uses the new `createK8sEvent` helper for event creation.
tests/syslog_health_monitor_test.go (6)
25-31: LGTM: Import organization for centralized helpers. The imports correctly bring in the test helpers package and required Kubernetes types for DaemonSet management.
36-41: LGTM: Context keys for test state management. The context keys are well-named and follow a consistent pattern for storing test state across setup, assess, and teardown phases.
163-175: LGTM: Teardown correctly uses centralized helper. The teardown properly retrieves context values and delegates to `TearDownSyslogHealthMonitor` for cleanup.
200-208: LGTM: Consistent use of helper constant for port. Using `helpers.StubJournalHTTPPort` instead of a local constant improves maintainability.
368-385: LGTM: New test for STORE_ONLY strategy. The test setup correctly passes `true` for `updatedDaemonSet` to configure the STORE_ONLY processing strategy.
411-422: LGTM: Teardown correctly restores DaemonSet. The teardown passes `true` for `updatedDaemonSet` to restore the original DaemonSet configuration after the test.
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (2)
33-58: LGTM: Clean addition of `processingStrategy` parameter. The new parameter is properly integrated into the constructor and stored in the handler struct. This follows the same pattern used in other handlers (SXID, GPUFallen) as per the PR objectives.
189-189: LGTM: ProcessingStrategy correctly propagated to HealthEvent. The strategy is properly assigned to the generated health event, enabling downstream filtering in the platform connector.
tests/helpers/syslog-health-monitor.go (3)
27-31: LGTM: Well-defined constants for test configuration. The constants provide clear, maintainable values for port and resource names used across syslog health monitor tests.
33-74: LGTM: Comprehensive test setup helper. The function properly:
- Conditionally updates DaemonSet processing strategy
- Locates the syslog pod on a worker node
- Injects test metadata
- Establishes port-forward
- Sets node management label
Good use of `require` for critical setup steps that should fail the test if unsuccessful.
76-116: Consider handling `originalDaemonSet` nil case when `updatedDaemonSet` is true. If `SetUpSyslogHealthMonitor` returns a nil `originalDaemonSet` when `updatedDaemonSet` is true (due to an error during setup that wasn't caught), `RestoreDaemonSet` could receive an unexpected nil. However, since `SetUpSyslogHealthMonitor` uses `require.NoError`, this scenario is unlikely in practice.
The teardown logic is otherwise well-structured with proper error handling and cleanup sequencing.
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (5)
88-91: LGTM: Updated constructor call with processing strategy. The test correctly passes `pb.ProcessingStrategy_EXECUTE_REMEDIATION` to the updated `NewXIDHandler` signature.
142-142: Good coverage of both processing strategies in test cases. The test cases appropriately use different strategies (`EXECUTE_REMEDIATION` on line 142, `STORE_ONLY` on line 157) to ensure both paths are validated.
Also applies to: 157-157
194-194: LGTM: ProcessingStrategy assertions added to validation. The new assertions correctly verify that the `ProcessingStrategy` field on generated `HealthEvent` objects matches the strategy configured on the handler.
Also applies to: 231-231
306-308: LGTM: Test handler initialized with processing strategy. The test correctly initializes the handler with `EXECUTE_REMEDIATION` strategy for the `TestCreateHealthEventFromResponse` test cases.
348-348: LGTM: Consistent ProcessingStrategy validation in event creation tests. Both test cases properly validate that the handler's processing strategy is propagated to the created health events.
Also applies to: 378-378
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
14-17: LGTM - Protobuf stubs correctly reflect ProcessingStrategy additions. The generated `.pyi` stub file properly exposes the new `ProcessingStrategy` enum with `EXECUTE_REMEDIATION` and `STORE_ONLY` values, and extends `HealthEvent` with the `processingStrategy` field. The type hints follow protobuf Python conventions.
Also applies to: 31-32, 78-78, 104-104, 120-120, 138-138
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go (3)
41-57: LGTM - Constructor signatures correctly extended with processingStrategy. The `processingStrategy` parameter is properly threaded through both `NewSyslogMonitor` and `NewSyslogMonitorWithFactory`, maintaining the constructor chain. Adding as the last parameter is good practice for API evolution.
90-136: LGTM - ProcessingStrategy correctly propagated to all handlers. The `processingStrategy` is properly stored in the `SyslogMonitor` struct and consistently passed to all three handler types (XID, SXID, GPUFallen). This ensures uniform processing behavior across different error check types.
802-814: LGTM - HealthEvent correctly includes ProcessingStrategy field. The `prepareHealthEventWithAction` method properly sets `ProcessingStrategy: sm.processingStrategy` when constructing health events, ensuring the configured strategy is propagated through emitted events.
tests/helpers/kube.go (3)
2210-2249: LGTM - DaemonSet rollout wait logic is correct. The function properly checks `DesiredNumberScheduled`, `UpdatedNumberScheduled`, and `NumberReady` status fields to determine rollout completion. The logging provides good visibility into rollout progress.
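As a rough illustration of the completion check described above, the predicate can be written over the three counters alone. The `daemonSetStatus` struct is a local stand-in that mirrors the field names from the comment, and treating a DaemonSet with zero desired pods as "not complete" is an assumption of this sketch, not something the review states.

```go
package main

import "fmt"

// daemonSetStatus mirrors the three status counters named in the review;
// it is a local stand-in for the Kubernetes DaemonSet status type.
type daemonSetStatus struct {
	DesiredNumberScheduled int32
	UpdatedNumberScheduled int32
	NumberReady            int32
}

// rolloutComplete reports whether every desired pod has been updated and is ready.
func rolloutComplete(s daemonSetStatus) bool {
	// Assumption: nothing scheduled yet means the rollout has not finished.
	if s.DesiredNumberScheduled == 0 {
		return false
	}
	return s.UpdatedNumberScheduled == s.DesiredNumberScheduled &&
		s.NumberReady == s.DesiredNumberScheduled
}

func main() {
	inProgress := daemonSetStatus{DesiredNumberScheduled: 3, UpdatedNumberScheduled: 2, NumberReady: 2}
	done := daemonSetStatus{DesiredNumberScheduled: 3, UpdatedNumberScheduled: 3, NumberReady: 3}
	fmt.Println("in progress complete?", rolloutComplete(inProgress))
	fmt.Println("done complete?", rolloutComplete(done))
}
```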
2252-2268: LGTM - Container argument update logic handles both flag formats. The function correctly handles both `--processing-strategy=VALUE` and `--processing-strategy VALUE` argument formats, and properly appends the flag if it doesn't exist. The early return after modification prevents duplicate additions.
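A minimal sketch of argument rewriting that handles both formats. `setArg` is a hypothetical helper, not the function under review; it modifies the slice in place where possible and does not try to distinguish a following value from a following flag.

```go
package main

import (
	"fmt"
	"strings"
)

// setArg sets name to value in args, handling "--name=value" and
// "--name value" forms, and appends "--name=value" if name is absent.
// It returns after the first match so the flag is never added twice.
func setArg(args []string, name, value string) []string {
	for i, arg := range args {
		// "--name=value" form: rewrite the element in place.
		if strings.HasPrefix(arg, name+"=") {
			args[i] = name + "=" + value
			return args
		}
		// "--name value" form: the value is the next element.
		if arg == name {
			if i+1 < len(args) {
				args[i+1] = value
				return args
			}
			// Flag was last with no value; supply one.
			return append(args, value)
		}
	}
	return append(args, name+"="+value)
}

func main() {
	fmt.Println(setArg([]string{"--processing-strategy=EXECUTE_REMEDIATION"}, "--processing-strategy", "STORE_ONLY"))
	fmt.Println(setArg([]string{"--processing-strategy", "EXECUTE_REMEDIATION"}, "--processing-strategy", "STORE_ONLY"))
	fmt.Println(setArg([]string{"--verbose"}, "--processing-strategy", "STORE_ONLY"))
}
```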
2312-2344: LGTM - RestoreDaemonSet properly restores original container configuration. The function correctly handles nil input, uses `retry.RetryOnConflict` for safe updates, and waits for rollout completion. Replacing the entire containers slice ensures complete restoration of the original state.
Merging this branch will decrease overall coverage
Coverage by file
Changed files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.
Changed unit test files
Actionable comments posted: 1
Duplicate comments (2)
tests/syslog_health_monitor_test.go (1)
51-52: Remove redundant error check. Line 52 checks `require.NoError(t, err, ...)` but `SetUpSyslogHealthMonitor` returns `(string, *v1.Pod, chan struct{})` with no error. The `err` variable is undefined.
Proposed fix
 testNodeName, syslogPod, stopChan := helpers.SetUpSyslogHealthMonitor(ctx, t, client, nil)
-require.NoError(t, err, "failed to set up syslog health monitor")
tests/helpers/kube.go (1)
2272-2277: Missing error when container is not found. The loop that searches for the container by name silently does nothing if no match is found. This could mask test configuration errors where the wrong container name is provided.
Proposed fix to validate container exists
+	found := false
 	containers := daemonSet.Spec.Template.Spec.Containers
 	for i := range containers {
 		if containers[i].Name == containerName {
 			setArgsOnContainer(t, &containers[i], args)
+			found = true
 			break
 		}
 	}
+
+	if !found {
+		return fmt.Errorf("container %q not found in daemonset %s/%s", containerName, NVSentinelNamespace, daemonsetName)
+	}
 
 	return client.Resources().Update(ctx, daemonSet)
Based on learnings, this issue was previously flagged but remains unresolved.
Nitpick comments (4)
tests/helpers/kube.go (4)
2251-2255: Add godoc comment for exported function. As per coding guidelines, exported functions require documentation comments. Add a godoc comment explaining the purpose, parameters, and behavior of this function.
Proposed godoc comment
+// UpdateDaemonSetArgs updates the specified container arguments in a DaemonSet and waits for rollout completion.
+// If containerName is empty, updates all containers. Otherwise, updates only the named container.
+// Uses retry.RetryOnConflict for automatic retry handling.
 func UpdateDaemonSetArgs(ctx context.Context, t *testing.T, client klient.Client, daemonsetName string, containerName string, args map[string]string) error {
As per coding guidelines, function comments are required for all exported Go functions.
2288-2289: Consider removing hard-coded sleep. The 10-second sleep after rollout completion may not be necessary since `waitForDaemonSetRollout` already ensures pods are ready. If pods need additional time to stabilize, consider using `require.Eventually` to poll for the specific condition instead of a fixed delay.
2294-2297: Add godoc comment for exported function. Add documentation for this exported function explaining its purpose, parameters, and behavior.
Proposed godoc comment
+// RemoveDaemonSetArgs removes the specified arguments from a DaemonSet container and waits for rollout completion.
+// If containerName is empty, removes from all containers. Otherwise, removes only from the named container.
+// Uses retry.RetryOnConflict for automatic retry handling.
 func RemoveDaemonSetArgs(ctx context.Context, t *testing.T, client klient.Client, daemonsetName string, containerName string, args map[string]string,
As per coding guidelines, function comments are required for all exported Go functions.
2411-2413: Add godoc comment for exported function. Add documentation for this exported function explaining its purpose, parameters, return values, and the validation it performs.
Proposed godoc comment
+// GetDaemonSetPodOnWorkerNode returns a ready, running pod from the specified DaemonSet on a real worker node.
+// It waits for a pod matching podNamePattern that is not being deleted and has all containers ready.
+// Returns an error if no suitable pod is found within the timeout period.
 func GetDaemonSetPodOnWorkerNode(ctx context.Context, t *testing.T, client klient.Client, daemonsetName string, podNamePattern string) (*v1.Pod, error) {
As per coding guidelines, function comments are required for all exported Go functions.
Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Files selected for processing (3)
- tests/helpers/kube.go
- tests/helpers/syslog-health-monitor.go
- tests/syslog_health_monitor_test.go
Files skipped from review as they are similar to previous changes (1)
- tests/helpers/syslog-health-monitor.go
Additional context used
Path-based instructions (2)
**/*.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via `log/slog` in Go code
Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as `synced` over `ok` for cache sync checks
Use `client-go` for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
- tests/helpers/kube.go
- tests/syslog_health_monitor_test.go
**/*_test.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Use `envtest` for testing Kubernetes controllers instead of fake clients
Use `testify/assert` and `testify/require` for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Files:
tests/syslog_health_monitor_test.go
Learnings (7)
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/daemonset*.yaml : Explain DaemonSet variant selection logic in Helm chart documentation
Applied to files:
tests/helpers/kube.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on kata.enabled label
Applied to files:
tests/helpers/kube.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Applied to files:
tests/syslog_health_monitor_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/syslog_health_monitor_test.go
Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
tests/syslog_health_monitor_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
tests/syslog_health_monitor_test.go
Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Applied to files:
tests/syslog_health_monitor_test.go
Code graph analysis (1)
tests/syslog_health_monitor_test.go (2)
tests/helpers/syslog-health-monitor.go (3)
- SetUpSyslogHealthMonitor (33-71)
- StubJournalHTTPPort (27-27)
- TearDownSyslogHealthMonitor (74-113)
tests/helpers/kube.go (1)
- PortForwardPod (1590-1634)
Additional comments (8)
tests/syslog_health_monitor_test.go (8)
89-89: LGTM: Consistent use of StubJournalHTTPPort. The test correctly uses the centralized `helpers.StubJournalHTTPPort` constant for syslog message injection.
Also applies to: 123-123
160-171: LGTM: Teardown properly centralized. The teardown correctly extracts context values and delegates to `helpers.TearDownSyslogHealthMonitor`.
185-216: LGTM: Manual setup appropriate for metadata-less test. This test intentionally skips metadata injection to verify XID detection without metadata, so the manual setup (without using `SetUpSyslogHealthMonitor`) is appropriate.
296-306: LGTM: Proper use of setup helper. The setup correctly delegates to `helpers.SetUpSyslogHealthMonitor` and stores context values.
346-360: LGTM: Proper use of teardown helper. The teardown correctly extracts context values and delegates to `helpers.TearDownSyslogHealthMonitor`.
362-380: LGTM: Well-structured STORE_ONLY strategy test. The new test properly configures the syslog health monitor with `STORE_ONLY` strategy via setup helper arguments and stores necessary context values for teardown.
382-404: LGTM: Accurate test description and assertions. The assess description correctly describes the test intent: verifying that XID errors in STORE_ONLY mode do not create node conditions or cordon nodes. The assertions properly validate this behavior.
406-419: LGTM: Proper strategy restoration in teardown. The teardown correctly restores `EXECUTE_REMEDIATION` strategy to avoid impacting subsequent tests.
776a34b to 0a1dc9a
Actionable comments posted: 6
Fix all issues with AI agents
In @platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go:
- Around line 1606-1612: The test currently indexes nvsentinelConditions[0] when
tc.expectNodeConditions is true which can panic if the slice is empty; update
the assertion to first assert the slice has at least one element (e.g.,
assert.Greater(t, len(nvsentinelConditions), 0) or assert.Len/NotEmpty) and only
then assert the Type value against tc.expectedConditionType, referring to the
existing test variables tc.expectNodeConditions and nvsentinelConditions to
locate where to add the length check before accessing index 0.
- Around line 1620-1626: The test accesses events.Items[0] directly which can
panic if the slice is empty; when tc.expectKubernetesEvents is true first assert
that events.Items is non-empty (e.g. use assert.Greater(t, len(events.Items), 0)
or assert.NotEmpty) and only then compare events.Items[0].Type to
tc.expectedEventType, ensuring the check references tc.expectKubernetesEvents,
events.Items and tc.expectedEventType so the test fails gracefully instead of
panicking.
In @tests/data/healthy-event.json:
- Around line 19-20: The processingStrategy for the healthy event is wrong:
update the JSON key "processingStrategy" in the healthy event object (the one
containing "nodeName": "NODE_NAME" and "isHealthy": true / "message": "No health
failures") from 1 to 2 so it uses STORE_ONLY rather than EXECUTE_REMEDIATION;
ensure the numeric value is changed to 2 to reflect observability-only behavior.
In @tests/helpers/kube.go:
- Around line 2334-2338: The loop over containers that calls
removeArgsFromContainer(&containers[i], args) must validate the target container
was found; add a boolean found flag (e.g., found := false) set to true when
containers[i].Name == containerName, break, and after the loop check found and
return an error (or propagate an appropriate error) if false so the caller knows
the container wasn't presentβmirror the behavior in UpdateDaemonSetArgs and
reference the containerName, removeArgsFromContainer, and UpdateDaemonSetArgs
symbols when making the change.
- Around line 2296-2301: The loop over containers that compares
containers[i].Name to containerName and calls setArgsOnContainer(t,
&containers[i], args) misses handling when no match is found; update the code
that iterates over containers (the block referencing containers, containerName,
and setArgsOnContainer) to track whether a container was found (e.g., found :=
false), set found = true when you call setArgsOnContainer, and after the loop
either call t.Fatalf (or return an error from the helper) with a clear message
including containerName so the test fails loudly if no container matched.
In @tests/syslog_health_monitor_test.go:
- Around line 51-52: Remove the redundant error assertion: the call to
helpers.SetUpSyslogHealthMonitor(ctx, t, client, nil) returns (testNodeName,
syslogPod, stopChan) and not an error, so delete the require.NoError(t, err,
"failed to set up syslog health monitor") line (which references an
uninitialized err); ensure any later references to err in this test are handled
or removed as well.
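The first two items above ask for a length check before indexing a slice. A minimal sketch of that guard, written without a test framework: `condition` and `checkFirstType` are hypothetical names for this illustration, and in the real test the same effect would come from an `assert` or `require` emptiness check placed before the index access.

```go
package main

import "fmt"

// condition is a hypothetical stand-in for a node condition or Kubernetes event.
type condition struct {
	Type string
}

// checkFirstType validates the first element's Type, reporting an error
// instead of panicking when the slice is empty.
func checkFirstType(conds []condition, want string) error {
	if len(conds) == 0 {
		return fmt.Errorf("expected at least one condition, got none")
	}
	if conds[0].Type != want {
		return fmt.Errorf("condition type = %q, want %q", conds[0].Type, want)
	}
	return nil
}

func main() {
	// An empty slice yields a readable failure rather than an index-out-of-range panic.
	fmt.Println(checkFirstType(nil, "SysLogsXIDError"))
	fmt.Println(checkFirstType([]condition{{Type: "SysLogsXIDError"}}, "SysLogsXIDError"))
}
```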
Nitpick comments (5)
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)
1502-1524: Consider consolidating or differentiating duplicate test scenarios. This test case ("STORE_ONLY non fatal event should not create Kubernetes event") appears to duplicate the scenario at lines 1423-1442. Both test `STORE_ONLY` with `IsFatal: false` expecting no node conditions and no Kubernetes events. If you intend to test different `CheckName` values, consider clarifying this in the test name or description; otherwise, consolidate to reduce redundancy.
data-models/protobufs/health_event.proto (1)
78-78: Add field-level documentation for processingStrategy. While the enum itself is well-documented, the field lacks a comment explaining its purpose and usage context within the HealthEvent message.
Suggested documentation
   BehaviourOverrides quarantineOverrides = 14;
   BehaviourOverrides drainOverrides = 15;
+  // Directs how downstream modules (platform connectors, event stores) should handle this event.
+  // Defaults to UNSPECIFIED if not explicitly set.
   ProcessingStrategy processingStrategy = 16;
Based on coding guidelines: "Include comprehensive comments for all fields in Protocol Buffer messages."
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)
33-58: ProcessingStrategy correctly threaded through constructor. The `processingStrategy` parameter is properly captured and stored in the `XIDHandler` struct. The initialization follows the same pattern as other configuration fields.
Consider whether `processingStrategy` should be validated in the constructor to reject `ProcessingStrategy_UNSPECIFIED` (the proto default value of 0). While validation currently occurs in `main.go` before constructors are called, adding validation here would provide defensive programming and catch configuration errors at the point of construction, consistent with best practices for enum parameters that must have meaningful values.
tests/helpers/healthevent.go (1)
31-31: Consider using the enum type for type safety. The `ProcessingStrategy` field is declared as `int` rather than `protos.ProcessingStrategy`. While this may be intentional for JSON marshaling flexibility, using the enum type would provide stronger type safety and prevent invalid values at compile time.
Suggested refactor for type safety
If JSON marshaling works correctly with the enum type, consider:
 type HealthEventTemplate struct {
 	Version             int                  `json:"version"`
 	Agent               string               `json:"agent"`
 	ComponentClass      string               `json:"componentClass,omitempty"`
 	CheckName           string               `json:"checkName"`
 	IsFatal             bool                 `json:"isFatal"`
 	IsHealthy           bool                 `json:"isHealthy"`
 	Message             string               `json:"message"`
 	RecommendedAction   int                  `json:"recommendedAction,omitempty"`
 	ErrorCode           []string             `json:"errorCode,omitempty"`
 	EntitiesImpacted    []EntityImpacted     `json:"entitiesImpacted,omitempty"`
 	Metadata            map[string]string    `json:"metadata,omitempty"`
 	QuarantineOverrides *QuarantineOverrides `json:"quarantineOverrides,omitempty"`
 	NodeName            string               `json:"nodeName"`
-	ProcessingStrategy  int                  `json:"processingStrategy,omitempty"`
+	ProcessingStrategy  protos.ProcessingStrategy `json:"processingStrategy,omitempty"`
 }
And update the constructor:
 func NewHealthEvent(nodeName string) *HealthEventTemplate {
 	return &HealthEventTemplate{
 		Version:        1,
 		Agent:          "gpu-health-monitor",
 		ComponentClass: "GPU",
 		CheckName:      "GpuXidError",
 		IsFatal:        true,
 		IsHealthy:      false,
 		NodeName:       nodeName,
 		EntitiesImpacted: []EntityImpacted{
 			{
 				EntityType:  "GPU",
 				EntityValue: "0",
 			},
 		},
-		ProcessingStrategy: int(protos.ProcessingStrategy_EXECUTE_REMEDIATION),
+		ProcessingStrategy: protos.ProcessingStrategy_EXECUTE_REMEDIATION,
 	}
 }
And update the fluent setter signature:
-func (h *HealthEventTemplate) WithProcessingStrategy(strategy int) *HealthEventTemplate {
+func (h *HealthEventTemplate) WithProcessingStrategy(strategy protos.ProcessingStrategy) *HealthEventTemplate {
 	h.ProcessingStrategy = strategy
 	return h
 }
Also applies to: 49-49, 76-76
tests/fault_quarantine_test.go (1)
334-346: Teardown only resets `SysLogsXIDError` - consider resetting `GpuPowerWatch` as well. The teardown sends a healthy event for `SysLogsXIDError` but doesn't explicitly clear the `GpuPowerWatch` state. While `TeardownQuarantineTest` may handle general cleanup, explicitly resetting both check types would make the test more self-contained and prevent potential state leakage to subsequent tests.
Suggested improvement
 feature.Teardown(func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
 	event := helpers.NewHealthEvent(testCtx.NodeName).
 		WithErrorCode("79").
 		WithHealthy(true).
 		WithAgent(helpers.SYSLOG_HEALTH_MONITOR_AGENT).
 		WithCheckName("SysLogsXIDError")
 	helpers.SendHealthEvent(ctx, t, event)
 
+	// Also reset GpuPowerWatch state
+	powerEvent := helpers.NewHealthEvent(testCtx.NodeName).
+		WithErrorCode("DCGM_FR_CLOCK_THROTTLE_POWER").
+		WithHealthy(true).
+		WithCheckName("GpuPowerWatch")
+	helpers.SendHealthEvent(ctx, t, powerEvent)
+
 	return helpers.TeardownQuarantineTest(ctx, t, c)
 })
Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Files ignored due to path filters (1)
- data-models/pkg/protos/health_event.pb.go is excluded by `!**/*.pb.go`
Files selected for processing (35)
- data-models/protobufs/health_event.proto
- distros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpl
- distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml
- fault-quarantine/pkg/evaluator/rule_evaluator_test.go
- fault-quarantine/pkg/initializer/init.go
- health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
- health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
- health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
- health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
- health-monitors/syslog-health-monitor/main.go
- health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go
- health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
- health-monitors/syslog-health-monitor/pkg/gpufallen/types.go
- health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go
- health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
- health-monitors/syslog-health-monitor/pkg/sxid/types.go
- health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go
- health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
- health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go
- health-monitors/syslog-health-monitor/pkg/xid/types.go
- health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
- health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
- platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
- store-client/pkg/datastore/providers/postgresql/sql_filter_builder.go
- tests/data/fatal-health-event.json
- tests/data/healthy-event.json
- tests/data/managed-by-nvsentinel-configmap.yaml
- tests/data/unsupported-health-event.json
- tests/event_exporter_test.go
- tests/fault_quarantine_test.go
- tests/helpers/event_exporter.go
- tests/helpers/healthevent.go
- tests/helpers/kube.go
- tests/helpers/syslog-health-monitor.go
- tests/syslog_health_monitor_test.go
Files skipped from review as they are similar to previous changes (9)
- health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
- health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go
- distros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpl
- fault-quarantine/pkg/evaluator/rule_evaluator_test.go
- tests/helpers/syslog-health-monitor.go
- health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go
- fault-quarantine/pkg/initializer/init.go
- distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml
- tests/event_exporter_test.go
Additional context used
Path-based instructions (4)
data-models/protobufs/**/*.proto
CodeRabbit inference engine (.github/copilot-instructions.md)
data-models/protobufs/**/*.proto: Define Protocol Buffer messages in `data-models/protobufs/` directory
Use semantic versioning for breaking changes in Protocol Buffer messages
Include comprehensive comments for all fields in Protocol Buffer messages
Files:
data-models/protobufs/health_event.proto
**/*.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via `log/slog` in Go code
Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as `synced` over `ok` for cache sync checks
Use `client-go` for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
- health-monitors/syslog-health-monitor/main.go
- tests/helpers/healthevent.go
- health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
- tests/fault_quarantine_test.go
- platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
- health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go
- health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
- health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
- health-monitors/syslog-health-monitor/pkg/sxid/types.go
- tests/helpers/event_exporter.go
- health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
- health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
- health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
- tests/helpers/kube.go
- store-client/pkg/datastore/providers/postgresql/sql_filter_builder.go
- health-monitors/syslog-health-monitor/pkg/gpufallen/types.go
- tests/syslog_health_monitor_test.go
- health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go
- health-monitors/syslog-health-monitor/pkg/xid/types.go
**/*_test.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Use `envtest` for testing Kubernetes controllers instead of fake clients
Use `testify/assert` and `testify/require` for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Files:
- health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
- tests/fault_quarantine_test.go
- platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
- health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
- health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
- health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
- tests/syslog_health_monitor_test.go
**/*.py
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code
Files:
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
Learnings (14)
Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Applied to files:
- health-monitors/syslog-health-monitor/main.go
- health-monitors/syslog-health-monitor/pkg/sxid/types.go
- health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
- health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
- tests/syslog_health_monitor_test.go
- health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go
- health-monitors/syslog-health-monitor/pkg/xid/types.go
Learning: 2025-12-22T16:16:31.660Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.
Applied to files:
tests/helpers/healthevent.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
- tests/helpers/healthevent.go
- tests/fault_quarantine_test.go
- health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
- health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
- tests/syslog_health_monitor_test.go
Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
- health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
- tests/fault_quarantine_test.go
- platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
- health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
- health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
- health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
- tests/syslog_health_monitor_test.go
Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".
Applied to files:
tests/fault_quarantine_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
- tests/fault_quarantine_test.go
- tests/syslog_health_monitor_test.go
Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.
Applied to files:
- platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
- health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go
- health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
- tests/helpers/event_exporter.go
- health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.
Applied to files:
tests/data/managed-by-nvsentinel-configmap.yaml
Learning: 2025-12-12T07:41:27.339Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2190-2251
Timestamp: 2025-12-12T07:41:27.339Z
Learning: In tests/data/health-events-analyzer-config.yaml, the XID74Reg2Bit13Set rule intentionally omits the time window filter; tests should verify only the register bit pattern (bit 13 in REG2) on the incoming XID 74 event and should not rely on historical events or counts of repeats. If adding similar rules elsewhere, apply the same pattern and document that the time window filter is unnecessary for single-event bit checks.
Applied to files:
tests/data/managed-by-nvsentinel-configmap.yaml
Learning: 2025-12-12T07:36:26.109Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:26.109Z
Learning: In NVSentinel health-events analyzer config files (e.g., tests/data/health-events-analyzer-config.yaml and similar), XID 74 errors are guaranteed to include a GPU_UUID entry within healthevent.entitiesimpacted. Therefore, when writing XID 74-specific rules that filter for GPU_UUID entities, you can skip null checks (no $ifNull) for GPU_UUID. Ensure tests relying on this assumption are updated accordingly and document the guarantee in the rule descriptions.
Applied to files:
tests/data/managed-by-nvsentinel-configmap.yaml
Learning: 2025-12-12T07:38:37.023Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2025-2187
Timestamp: 2025-12-12T07:38:37.023Z
Learning: In NVSentinel, XID 74 errors always include an NVLINK entry in healthevent.entitiesimpacted, so null-checking with $ifNull is unnecessary when filtering for NVLINK entities in XID 74-specific rules. Apply this rule to YAML test fixtures under tests/ data (e.g., tests/data/health-events-analyzer-config.yaml) and any similar health-event configuration tests. If applying in code, ensure downstream filters rely on the presence of NVLINK in entitiesimpacted for XID 74 only, but continue to guard other fields and XIDs with appropriate null checks.
Applied to files:
tests/data/managed-by-nvsentinel-configmap.yaml
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Each health monitor service should implement the gRPC service from `data-models/protobufs/`
Applied to files:
health-monitors/syslog-health-monitor/pkg/sxid/types.go
Learning: 2025-12-23T05:02:22.108Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: store-client/pkg/client/postgresql_pipeline_builder.go:119-132
Timestamp: 2025-12-23T05:02:22.108Z
Learning: In the NVSentinel codebase, protobuf fields stored in MongoDB should use lowercase field names (e.g., processingstrategy, componentclass, checkname). Ensure pipeline filters and queries that access protobuf fields in the database consistently use lowercase field names in the store-client package, avoiding camelCase mappings for MongoDB reads/writes.
Applied to files:
store-client/pkg/datastore/providers/postgresql/sql_filter_builder.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Applied to files:
tests/syslog_health_monitor_test.go
Code graph analysis (10)
health-monitors/syslog-health-monitor/main.go (1)
- data-models/pkg/protos/health_event.pb.go (1): ProcessingStrategy_value(58-62)
tests/helpers/healthevent.go (2)
- health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1): ProcessingStrategy(14-18)
- data-models/pkg/protos/health_event.pb.go (5): ProcessingStrategy(43-43), ProcessingStrategy(75-77), ProcessingStrategy(79-81), ProcessingStrategy(88-90), ProcessingStrategy_EXECUTE_REMEDIATION(47-47)
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go (1)
- data-models/pkg/protos/health_event.pb.go (2): ProcessingStrategy_STORE_ONLY(48-48), ProcessingStrategy_EXECUTE_REMEDIATION(47-47)
tests/fault_quarantine_test.go (5)
- tests/helpers/fault_quarantine.go (3): QuarantineTestContext(51-54), SetupQuarantineTest(107-112), TeardownQuarantineTest(181-242)
- tests/helpers/kube.go (3): SetNodeManagedByNVSentinel(1389-1408), WaitForNodeConditionWithCheckName(1662-1706), WaitForNodeEvent(349-385)
- tests/helpers/healthevent.go (1): NewHealthEvent(61-78)
- tests/helpers/health_events_analyzer.go (1): SYSLOG_HEALTH_MONITOR_AGENT(38-38)
- data-models/pkg/protos/health_event.pb.go (2): ProcessingStrategy_STORE_ONLY(48-48), ProcessingStrategy_EXECUTE_REMEDIATION(47-47)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (2)
- health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1): ProcessingStrategy(14-18)
- data-models/pkg/protos/health_event.pb.go (5): ProcessingStrategy(43-43), ProcessingStrategy(75-77), ProcessingStrategy(79-81), ProcessingStrategy(88-90), ProcessingStrategy_EXECUTE_REMEDIATION(47-47)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go (2)
- health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1): ProcessingStrategy(14-18)
- data-models/pkg/protos/health_event.pb.go (5): ProcessingStrategy(43-43), ProcessingStrategy(75-77), ProcessingStrategy(79-81), ProcessingStrategy(88-90), ProcessingStrategy_EXECUTE_REMEDIATION(47-47)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (3)
- health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1): ProcessingStrategy(14-18)
- data-models/pkg/protos/health_event.pb.go (4): ProcessingStrategy(43-43), ProcessingStrategy(75-77), ProcessingStrategy(79-81), ProcessingStrategy(88-90)
- health-monitors/syslog-health-monitor/pkg/xid/types.go (1): XIDHandler(29-39)
tests/syslog_health_monitor_test.go (2)
- tests/helpers/syslog-health-monitor.go (3): SetUpSyslogHealthMonitor(33-71), StubJournalHTTPPort(27-27), TearDownSyslogHealthMonitor(74-113)
- tests/helpers/kube.go (3): InjectSyslogMessages(1820-1843), PortForwardPod(1614-1658), EnsureNodeConditionNotPresent(1797-1818)
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go (3)
- health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1): ProcessingStrategy(14-18)
- data-models/pkg/protos/health_event.pb.go (4): ProcessingStrategy(43-43), ProcessingStrategy(75-77), ProcessingStrategy(79-81), ProcessingStrategy(88-90)
- health-monitors/syslog-health-monitor/pkg/sxid/types.go (1): SXIDHandler(29-36)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
- data-models/pkg/protos/health_event.pb.go (4): ProcessingStrategy(43-43), ProcessingStrategy(75-77), ProcessingStrategy(79-81), ProcessingStrategy(88-90)
Additional comments (57)
tests/data/fatal-health-event.json (1)
20-20: LGTM! Processing strategy addition is appropriate for fatal events.
The addition of `processingStrategy: 1` (EXECUTE_REMEDIATION) aligns well with the fatal nature of this test event, as fatal health events would typically require remediation actions.
store-client/pkg/datastore/providers/postgresql/sql_filter_builder.go (1)
404-404: LGTM! Field mapping follows established conventions.
The new `processingstrategy` → `processingStrategy` mapping correctly handles the case conversion between MongoDB's lowercase bson field names and PostgreSQL's camelCase JSON field names, consistent with the existing pattern and aligns with the project's field naming conventions.
Based on learnings, this ensures consistency with how protobuf fields are stored and queried across different data stores.
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)
1550-1559: LGTM on test setup and isolation.
Good use of local context, clientset, and ring buffer per test case to ensure test isolation. The deferred `close(stopCh)` ensures proper cleanup.
tests/data/managed-by-nvsentinel-configmap.yaml (1)
43-59: LGTM! The new rule-set structure is correct and properly mirrors the existing GPU health monitor rule.
Using the same taint and cordon settings for both `gpu-health-monitor` and `syslog-health-monitor` XID errors ensures consistent node quarantine behavior regardless of the detection source. The syslog-health-monitor component emits HealthEvents with `agent == 'syslog-health-monitor'` and `checkName == 'SysLogsXIDError'` as expected (extensively verified in the test suite).
health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go (3)
243-246: LGTM: ProcessingStrategy correctly set for maintenance events.
The test expectations properly include `ProcessingStrategy: pb.ProcessingStrategy_EXECUTE_REMEDIATION` for CSP maintenance events, which aligns with the intended behavior where these events should trigger actual remediation workflows.
272-274: LGTM: Healthy event test updated correctly.
The ProcessingStrategy field is appropriately set to EXECUTE_REMEDIATION for healthy/recovery events, maintaining consistency with the quarantine event behavior.
336-338: LGTM: Unknown action case handles ProcessingStrategy consistently.
Even for events with unknown recommended actions (which default to NONE), the ProcessingStrategy is correctly set to EXECUTE_REMEDIATION, ensuring the event still participates in the remediation workflow.
data-models/protobufs/health_event.proto (1)
32-39: LGTM: Well-documented ProcessingStrategy enum.
The enum definition clearly documents the behavioral distinction between `EXECUTE_REMEDIATION` (modifies cluster state) and `STORE_ONLY` (observability-only). The three-value design (including UNSPECIFIED) follows protobuf best practices.
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
14-18: LGTM: Generated Python type stubs correctly reflect proto changes.
The auto-generated Python type stub properly includes the new `ProcessingStrategy` enum and integrates it into the `HealthEvent` message with appropriate type annotations. The structure is consistent with the protobuf definition.
Also applies to: 32-34, 80-80, 106-106, 122-122, 140-140
health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
20-20: LGTM: ProcessingStrategy field added cleanly to XIDHandler.
The import and struct field addition follow Go conventions. The unexported field appropriately stores the processing strategy configuration for the handler, consistent with other configuration fields like `nodeName` and `checkName`.
Also applies to: 34-34
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)
189-189: LGTM: ProcessingStrategy propagated to HealthEvent.
The handler's `processingStrategy` is correctly assigned to the generated `HealthEvent`, completing the data flow from configuration to emitted events.
tests/helpers/healthevent.go (1)
155-158: LGTM: Fluent setter follows existing pattern.
The `WithProcessingStrategy` method correctly follows the established fluent builder pattern used by other methods in this struct.
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (1)
362-364: LGTM: TODO properly references tracking PR.
The hardcoded `ProcessingStrategy_EXECUTE_REMEDIATION` value is acceptable as an interim implementation, and the TODO comment appropriately references PR #641 for making it configurable via the config file.
tests/data/unsupported-health-event.json (1)
20-20: LGTM: Test data includes ProcessingStrategy field.
The `processingStrategy` value of 1 corresponds to `EXECUTE_REMEDIATION` and aligns with the new ProcessingStrategy enum support.
tests/helpers/event_exporter.go (1)
221-226: LGTM: Validation logic correctly extended.
The `ValidateCloudEvent` function signature and implementation are correctly updated to validate the `processingStrategy` field in CloudEvent payloads, following the existing validation pattern.
Also applies to: 245-245
health-monitors/syslog-health-monitor/pkg/sxid/types.go (1)
20-20: LGTM: Proper use of enum type for ProcessingStrategy.
The `processingStrategy` field correctly uses the `pb.ProcessingStrategy` enum type rather than `int`, providing type safety for production code. The import alias `pb` follows Go conventions.
Also applies to: 34-34
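The typed strategy field and its propagation from handler to event, as discussed in the comments above, can be illustrated with a small standalone sketch. Everything here is a stand-in and not the repository's code: `ProcessingStrategy`, `strategyValue`, `parseStrategy`, `handler`, and `healthEvent` are hypothetical names, and the numeric value for STORE_ONLY and the key name for the unspecified value are assumptions (the review only confirms that 1 corresponds to EXECUTE_REMEDIATION).

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// ProcessingStrategy is a hypothetical stand-in for the generated enum type.
type ProcessingStrategy int32

const (
	StrategyUnspecified        ProcessingStrategy = 0 // assumed zero value
	StrategyExecuteRemediation ProcessingStrategy = 1 // matches "processingStrategy: 1" in the test data
	StrategyStoreOnly          ProcessingStrategy = 2 // assumed value
)

// strategyValue imitates the name-to-number map that protoc-gen-go emits
// for an enum. The key names are assumptions for this sketch.
var strategyValue = map[string]int32{
	"UNSPECIFIED":         0,
	"EXECUTE_REMEDIATION": 1,
	"STORE_ONLY":          2,
}

// parseStrategy validates a flag string against the name map and returns
// the typed enum value, or an error listing the allowed names.
func parseStrategy(name string) (ProcessingStrategy, error) {
	v, ok := strategyValue[name]
	if !ok {
		allowed := make([]string, 0, len(strategyValue))
		for k := range strategyValue {
			allowed = append(allowed, k)
		}
		sort.Strings(allowed)
		return StrategyUnspecified, fmt.Errorf("invalid processing strategy %q (allowed: %s)",
			name, strings.Join(allowed, ", "))
	}
	return ProcessingStrategy(v), nil
}

// healthEvent is a hypothetical, trimmed-down event payload.
type healthEvent struct {
	CheckName          string
	ProcessingStrategy ProcessingStrategy
}

// handler keeps the strategy as an unexported, typed configuration field.
type handler struct {
	checkName          string
	processingStrategy ProcessingStrategy
}

func newHandler(checkName string, s ProcessingStrategy) *handler {
	return &handler{checkName: checkName, processingStrategy: s}
}

// newEvent copies the configured strategy onto every emitted event.
func (h *handler) newEvent() healthEvent {
	return healthEvent{CheckName: h.checkName, ProcessingStrategy: h.processingStrategy}
}

func main() {
	s, err := parseStrategy("STORE_ONLY")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%+v\n", newHandler("SysLogsXIDError", s).newEvent())
}
```

Because the field is a named type rather than a bare `int`, an arbitrary integer cannot be passed to `newHandler` without an explicit conversion, which is the type-safety benefit the comment refers to.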
health-monitors/syslog-health-monitor/main.go (3)
70-72: LGTM!
The new `--processing-strategy` flag is well-defined with a sensible default (EXECUTE_REMEDIATION) and clear documentation of allowed values.
164-171: LGTM!
The validation logic correctly uses the protobuf-generated `ProcessingStrategy_value` map to validate the flag input, provides a clear error message for invalid values, and logs the configured strategy for observability.
173-184: LGTM!
The `processingStrategy` is correctly passed to `NewSyslogMonitor`, threading the configuration through to the monitor instance.
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)
1-51: Generated protobuf code - no manual review needed.
This file is auto-generated by the protocol buffer compiler. The changes correctly reflect the addition of the `ProcessingStrategy` enum and the new `processingStrategy` field in `HealthEvent`. Ensure this file is regenerated from the source `.proto` file rather than manually edited.
tests/fault_quarantine_test.go (4)
ProcessingStrategyenum and the newprocessingStrategyfield inHealthEvent. Ensure this file is regenerated from the source.protofile rather than manually edited.tests/fault_quarantine_test.go (4)
26-32: LGTM!Import for the
protospackage is correctly added to accessProcessingStrategyenum values.
234-251: LGTM!Test setup correctly initializes the quarantine test context and enables NVSentinel management on the node before running assertions.
253-289: LGTM!Good coverage for
STORE_ONLYbehavior - the test correctly verifies that events withSTORE_ONLYprocessing strategy do not trigger quarantine actions (no node condition, no cordoning, no annotation).
291-332: LGTM!Good coverage for
EXECUTE_REMEDIATIONbehavior - the test correctly verifies that events withEXECUTE_REMEDIATIONprocessing strategy trigger the expected quarantine actions (node condition present, cordoned with annotation).health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go (2)
30-42: LGTM!The constructor correctly accepts the new
processingStrategyparameter and initializes the handler field. The signature change aligns with the pattern used in other handlers (XID, GPUFallen).
95-110: LGTM!The
ProcessingStrategyis correctly propagated to theHealthEventpayload, ensuring the strategy configured at initialization flows through to emitted events.health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go (7)
309-317: LGTM!
Test correctly verifies that `NewSyslogMonitor` accepts and stores the `processingStrategy` parameter. The assertion on line 317 confirms the field is properly set.
329-334: LGTM!
Test correctly verifies `NewSyslogMonitorWithFactory` with a different strategy (EXECUTE_REMEDIATION), providing coverage for both factory variants.
391-403: LGTM!
`TestJournalProcessingLogic` updated to include the `processingStrategy` parameter.
494-507: LGTM!
`TestJournalStateManagement` updated to include the `processingStrategy` parameter in both monitor instances.
574-586: LGTM!
`TestBootIDChangeHandling` updated to include the `processingStrategy` parameter.
624-637: LGTM!
`TestRunMultipleChecks` updated to include the `processingStrategy` parameter.
665-679: LGTM!
`TestGPUFallenOffHandlerInitialization` updated to include the `processingStrategy` parameter.
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (8)
88-91: LGTM!
`TestDetermineFatality` correctly updated to include the `processingStrategy` parameter.
141-148: LGTM!
Table-driven test case for NVRM GPU Map Line correctly includes `processingStrategy`.
156-195: Good test coverage for `STORE_ONLY` processing strategy.
The "Valid XID Message" test case correctly uses `STORE_ONLY` and validates that the `ProcessingStrategy` is propagated to the event. This provides coverage for the non-default strategy path.
200-232: LGTM!
"Valid XID with GPU UUID" test case correctly validates `EXECUTE_REMEDIATION` propagation.
237-280: LGTM!
Error path test cases correctly updated with `processingStrategy` parameter.
306-349: LGTM!
`TestCreateHealthEventFromResponse` correctly validates that `ProcessingStrategy` is included in the generated health event.
351-379: LGTM!
"XID Event with GPU UUID" test case validates `ProcessingStrategy` propagation with entity enrichment.
427-446: LGTM!
`TestNewXIDHandler` table-driven tests correctly updated with `processingStrategy` parameter.
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go (4)
17-25: LGTM!
New imports for `os`, `filepath`, and protobuf package are correctly added to support the new test functionality.
44-56: LGTM!
`TestNewSXIDHandler` correctly updated to include the `processingStrategy` parameter.
143-196: Excellent test coverage for SXID handler with valid topology.
This test validates the complete flow: metadata loading, SXID parsing, GPU resolution via NVSwitch link topology, and `ProcessingStrategy` propagation. The temporary metadata file approach is appropriate for isolated testing.
One minor observation: the test hardcodes entity indices (3, 4) for GPU assertions. If the entity order changes in the handler, these assertions would break. Consider using a helper to find entities by type, or document why these indices are expected.
198-249: LGTM!
`TestProcessLine` table-driven test correctly updated to include `processingStrategy` parameter and validate its propagation in the event.
tests/syslog_health_monitor_test.go (1)
362-422: LGTM! Well-structured STORE_ONLY strategy test.
The new test correctly:
- Configures the syslog monitor with STORE_ONLY strategy during setup
- Verifies that XID errors do not trigger node conditions or cordoning
- Restores EXECUTE_REMEDIATION strategy during teardown to avoid affecting subsequent tests
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go (3)
41-58: LGTM! Processing strategy parameter correctly threaded.
The `processingStrategy` parameter is properly added to `NewSyslogMonitor` and forwarded to `NewSyslogMonitorWithFactory`.
63-151: LGTM! Processing strategy initialization is correct.
The `processingStrategy` parameter is:
- Added to the function signature
- Stored in the `SyslogMonitor` struct (line 96)
- Passed to all handler constructors (lines 110, 120, 130)
This correctly threads the strategy through the monitor initialization.
792-820: LGTM! Processing strategy correctly set on health events.
Line 813 correctly assigns `sm.processingStrategy` to `event.ProcessingStrategy`, ensuring the strategy is propagated to all health events generated by the monitor.
tests/helpers/kube.go (6)
387-409: LGTM!
The function correctly uses `require.Never` to assert that a specified event never appears on the node during the timeout period. The logic and error handling are appropriate for a negative assertion in tests.
2232-2273: LGTM!
The rollout wait logic correctly verifies that all DaemonSet pods are scheduled, updated, and ready before proceeding. The progress logging is helpful for test debugging.
2353-2383: LGTM!
The function correctly handles both `--flag=value` and `--flag value` argument styles, including proper bounds checking when inserting values.
2385-2407: LGTM!
The function correctly updates existing arguments in place or appends new ones, handling both value and boolean flag styles appropriately.
2409-2433: LGTM!
The function correctly removes arguments in both `--flag=value` and `--flag value` styles, properly handling the deletion of both the flag and its associated value when applicable.
2435-2479: LGTM!
The function correctly waits for a DaemonSet pod to be fully ready on a worker node, with proper validation that the pod is not being deleted, is running, and has all readiness checks passing.
health-monitors/syslog-health-monitor/pkg/gpufallen/types.go (2)
52-52: LGTM!
The `processingStrategy` field is well-integrated into the struct. The field naming follows Go conventions, and its placement before the mutex fields is appropriate for configuration data that's set during initialization.
22-23: LGTM! The import alias `pb` for the protos package follows Go conventions and is correctly structured.
0a1dc9a to
1df5558
Compare
Actionable comments posted: 1
Fix all issues with AI agents
In @platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go:
- Line 1509: Two test cases in k8s_platform_connector_test.go set ErrorCode to a
slice containing an empty string (ErrorCode: []string{""}) which should be an
empty slice; update both instances to use an empty slice literal (ErrorCode:
[]string{}) so they match other tests and semantic intent. Locate the ErrorCode
field in the failing test case structs (the occurrences near the existing DCGM
error-code tests) and replace []string{""} with []string{} in both places.
Nitpick comments (6)
tests/helpers/kube.go (3)
387-409: Consider failing fast on persistent API errors.
If `GetNodeEvents` fails repeatedly due to API issues, `require.Never` will continue checking until timeout, masking the real error. Consider tracking consecutive failures and calling `t.Fatalf` after a threshold.
Proposed enhancement
 func EnsureNodeEventNotPresent(ctx context.Context, t *testing.T, c klient.Client, nodeName string, eventType, eventReason string) {
 	t.Helper()
+	consecutiveErrors := 0
+	maxConsecutiveErrors := 3
+
 	require.Never(t, func() bool {
 		events, err := GetNodeEvents(ctx, c, nodeName, eventType)
 		if err != nil {
 			t.Logf("failed to get events for node %s: %v", nodeName, err)
+			consecutiveErrors++
+			if consecutiveErrors >= maxConsecutiveErrors {
+				t.Fatalf("failed to get events %d times in a row, likely API issue", consecutiveErrors)
+			}
 			return false
 		}
+		consecutiveErrors = 0
 		for _, event := range events.Items {
 			if event.Type == eventType && event.Reason == eventReason {
 				t.Logf("node %s has event %v", nodeName, event)
 				return true
 			}
 		}
 		t.Logf("node %s does not have event %v", nodeName, eventType)
 		return false
 	}, NeverWaitTimeout, WaitInterval, "node %s should not have event %v", nodeName, eventType)
 }
2317-2318: Clarify or remove the hard-coded sleep.
The 10-second sleep after rollout completion seems arbitrary, since `waitForDaemonSetRollout` already verifies all pods are updated and ready. If this addresses a specific race condition (e.g., waiting for containers to initialize), document it. Otherwise, consider removing it.
2362-2392: Consider simplifying the nested append.
Line 2384 uses nested `append` calls which create two allocations. For test helper code this is acceptable, but it could be slightly more efficient and clearer.
Alternative approach:
 // Match --flag or --flag value style
 if existingArg == flag {
 	if value != "" {
 		if j+1 < len(container.Args) && !strings.HasPrefix(container.Args[j+1], "-") {
 			container.Args[j+1] = value
 		} else {
-			container.Args = append(container.Args[:j+1], append([]string{value}, container.Args[j+1:]...)...)
+			// Insert value after flag
+			container.Args = append(container.Args, "")
+			copy(container.Args[j+2:], container.Args[j+1:])
+			container.Args[j+1] = value
 		}
 	}
 	return true
 }
tests/helpers/syslog-health-monitor.go (3)
15-15: Add package-level godoc comment.
Per Go conventions and coding guidelines, packages should have a package-level documentation comment explaining their purpose.
Suggested documentation:
+// Package helpers provides test utilities for setting up and tearing down
+// test environments for NVSentinel health monitors.
 package helpers
Based on coding guidelines, package-level godoc is required for all Go packages.
32-34: Follow Go godoc conventions for exported function comments.
Function comments for exported functions should start with the function name according to Go documentation conventions.
Proposed fix:
-// helper function to set up syslog health monitor and port forward to it
+// SetUpSyslogHealthMonitor sets up the syslog health monitor test environment by
+// optionally updating DaemonSet args, injecting metadata, establishing port-forwarding,
+// and configuring node labels. Returns the test node name, syslog pod, and stop channel.
 func SetUpSyslogHealthMonitor(ctx context.Context, t *testing.T, client klient.Client, args map[string]string) (string, *v1.Pod, chan struct{}) {
Based on coding guidelines, function comments are required for all exported Go functions and should follow standard Go conventions.
73-76: Follow Go godoc conventions for exported function comments.
Function comments for exported functions should start with the function name and clearly describe what the function does.
Proposed fix:
-// helper function to roll back syslog health monitor daemonset and stop the port forward
+// TearDownSyslogHealthMonitor tears down the syslog health monitor test environment by
+// stopping port-forwarding, optionally restoring DaemonSet args, restarting the pod,
+// cleaning up metadata, and removing node labels.
 func TearDownSyslogHealthMonitor(ctx context.Context, t *testing.T, client klient.Client, nodeName string, stopChan chan struct{}, args map[string]string, podName string) {
Based on coding guidelines, function comments are required for all exported Go functions and should follow standard Go conventions.
Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Files selected for processing (4)
- platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
- tests/helpers/kube.go
- tests/helpers/syslog-health-monitor.go
- tests/syslog_health_monitor_test.go
Additional context used
Path-based instructions (2)
**/*.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via `log/slog` in Go code
Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as `synced` over `ok` for cache sync checks
Use `client-go` for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
- platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
- tests/helpers/syslog-health-monitor.go
- tests/syslog_health_monitor_test.go
- tests/helpers/kube.go
**/*_test.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Use `envtest` for testing Kubernetes controllers instead of fake clients
Use `testify/assert` and `testify/require` for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Files:
- platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
- tests/syslog_health_monitor_test.go
Learnings (8)
Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:40.921Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.
Applied to files:
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
- platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
- tests/syslog_health_monitor_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Applied to files:
tests/syslog_health_monitor_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/syslog_health_monitor_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
tests/syslog_health_monitor_test.go
Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Applied to files:
tests/syslog_health_monitor_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on kata.enabled label
Applied to files:
tests/helpers/kube.go
Code graph analysis (2)
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (4)
data-models/pkg/protos/health_event.pb.go (17)
HealthEvent(263-283), HealthEvent(296-296), HealthEvent(311-313), Entity(211-217), Entity(230-230), Entity(245-247), RecommendedAction(92-92), RecommendedAction(142-144), RecommendedAction(146-148), RecommendedAction(155-157), ProcessingStrategy(43-43), ProcessingStrategy(75-77), ProcessingStrategy(79-81), ProcessingStrategy(88-90), HealthEvents(159-165), HealthEvents(178-178), HealthEvents(193-195)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (5)
HealthEvent(63-141), Entity(55-61), RecommendedAction(20-30), ProcessingStrategy(14-18), HealthEvents(45-53)
platform-connectors/pkg/connectors/kubernetes/k8s_connector.go (1)
NewK8sConnector(47-58)
platform-connectors/pkg/pipeline/factory.go (1)
Create(32-39)
tests/syslog_health_monitor_test.go (2)
tests/helpers/syslog-health-monitor.go (2)
SetUpSyslogHealthMonitor(33-71), StubJournalHTTPPort(27-27)
tests/helpers/kube.go (2)
InjectSyslogMessages(1820-1843), PortForwardPod(1614-1658)
Additional comments (9)
tests/helpers/kube.go (5)
2232-2273: LGTM! The rollout verification logic correctly checks that all DaemonSet pods are scheduled, updated, and ready before proceeding.
2323-2360: LGTM! Container validation is properly implemented, and the function correctly handles the retry-on-conflict pattern.
2394-2416: LGTM! The function correctly updates existing arguments or appends new ones.
2418-2442: LGTM! The function correctly removes arguments in both `--flag=value` and `--flag value` formats. The `break` after removal ensures only the first occurrence is removed, which is appropriate for typical container args.
2444-2488: LGTM! The function properly validates the pod is from the current rollout, not being deleted, running, and fully ready before returning it.
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (3)
1606-1614: Previous concern properly addressed. The code now uses `require.NotEmpty` before accessing `nvsentinelConditions[0]`, which correctly prevents the index-out-of-range panic that was flagged in earlier reviews. The test will fail with a clear message if the expected condition is not found.
1622-1630: Previous concern properly addressed. The code now uses `require.NotEmpty` before accessing `events.Items[0]`, which correctly prevents the index-out-of-range panic that was flagged in earlier reviews. The test will fail with a clear message if the expected event is not found.
1391-1635: Well-structured test for processing strategy validation. The test provides comprehensive coverage of the new `ProcessingStrategy` field:
- STORE_ONLY events correctly bypass node condition and Kubernetes event creation
- EXECUTE_REMEDIATION events properly create node conditions (fatal) or Kubernetes events (non-fatal)
- Mixed strategy batches are handled correctly
The test follows Go best practices with table-driven structure, proper test isolation via dedicated clientsets, and clear assertions using testify (consistent with existing patterns in this file).
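The behaviour summarised in the bullets above can be sketched as a small decision function. This is only an illustration under stated assumptions: the `HealthEvent` struct, the `clusterAction` helper, and the enum values other than the names `STORE_ONLY` and `EXECUTE_REMEDIATION` are hypothetical and are not taken from the connector's actual code.

```go
package main

import "fmt"

// ProcessingStrategy is a stand-in for the strategy enum discussed in the
// review. The concrete numeric values below are assumptions for this sketch.
type ProcessingStrategy int

const (
	Unspecified ProcessingStrategy = iota
	ExecuteRemediation
	StoreOnly
)

// HealthEvent is a hypothetical, trimmed-down event used only here.
type HealthEvent struct {
	CheckName string
	IsFatal   bool
	Strategy  ProcessingStrategy
}

// clusterAction reports what a connector would write to the cluster:
// "none" for STORE_ONLY events, "condition" for fatal events, and
// "event" for non-fatal events. Unspecified is treated like
// EXECUTE_REMEDIATION, which is an assumption of this sketch.
func clusterAction(e HealthEvent) string {
	if e.Strategy == StoreOnly {
		return "none"
	}
	if e.IsFatal {
		return "condition"
	}
	return "event"
}

func main() {
	batch := []HealthEvent{
		{CheckName: "check-a", IsFatal: true, Strategy: StoreOnly},
		{CheckName: "check-b", IsFatal: true, Strategy: ExecuteRemediation},
		{CheckName: "check-c", IsFatal: false, Strategy: ExecuteRemediation},
	}
	for _, e := range batch {
		fmt.Printf("%s -> %s\n", e.CheckName, clusterAction(e))
	}
}
```

A mixed batch, as in the last bullet, is then just a slice in which each event is classified independently.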
tests/syslog_health_monitor_test.go (1)
361-421: Well-structured test for STORE_ONLY strategy. The new test correctly validates STORE_ONLY behavior by:
- Setting up the monitor with `--processing-strategy: STORE_ONLY`
- Verifying no node conditions are created after XID injection
- Verifying no node cordoning occurs
- Restoring EXECUTE_REMEDIATION strategy in teardown for proper cleanup
This follows the established test patterns and properly exercises the new processing strategy feature.
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
1df5558 to e5a8295
Actionable comments posted: 1
Fix all issues with AI agents
In @tests/helpers/kube.go:
- Around line 2326-2366: The RemoveDaemonSetArgs function uses
require.NoError(t, err, ...) which fails the test immediately and prevents
returning the error to the caller; change this to propagate the error like
UpdateDaemonSetArgs by replacing the require.NoError call with standard error
handling (if err != nil { t.Logf("failed to remove args from daemonset %s/%s:
%v", NVSentinelNamespace, daemonsetName, err); return err }) so the function
returns the error to callers while preserving the existing logs and rollout wait
behavior; keep the function signature as RemoveDaemonSetArgs(... ) error and
ensure callers handle the returned error.
Nitpick comments (3)
tests/helpers/kube.go (2)
387-409: Consider improving log message specificity.
Line 405 logs only `eventType`, but the function also receives `eventReason` as a parameter. Including both in the log message would provide more context for debugging.
Proposed improvement:
-		t.Logf("node %s does not have event %v", nodeName, eventType)
+		t.Logf("node %s does not have event type=%s reason=%s", nodeName, eventType, eventReason)
2320-2321: Consider replacing fixed sleep with condition-based wait.
The hardcoded 10-second sleep after rollout completion could be replaced with polling for the specific condition you're waiting for (e.g., pod readiness, specific container state). This would make tests more reliable and potentially faster.
tests/syslog_health_monitor_test.go (1)
361-421: LGTM: Well-structured STORE_ONLY strategy test.
The test correctly verifies that STORE_ONLY mode prevents both condition creation and node cordoning when XID errors are injected. The setup/teardown properly configure and restore the processing strategy.
Optional: Make assess description more concise and complete
The description could mention both aspects being verified (condition and cordoning) more concisely:
-	feature.Assess("Inject XID errors and verify no node condition is created when running in STORE_ONLY strategy", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
+	feature.Assess("Verify STORE_ONLY strategy prevents condition creation and cordoning", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Files selected for processing (4)
- platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
- tests/helpers/kube.go
- tests/helpers/syslog-health-monitor.go
- tests/syslog_health_monitor_test.go
Files skipped from review as they are similar to previous changes (2)
- tests/helpers/syslog-health-monitor.go
- platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
Additional context used
Path-based instructions (2)
**/*.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via `log/slog` in Go code
Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as `synced` over `ok` for cache sync checks
Use `client-go` for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
- tests/syslog_health_monitor_test.go
- tests/helpers/kube.go
**/*_test.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Use `envtest` for testing Kubernetes controllers instead of fake clients
Use `testify/assert` and `testify/require` for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Files:
- tests/syslog_health_monitor_test.go
Learnings (6)
Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Applied to files:
tests/syslog_health_monitor_test.go
Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
tests/syslog_health_monitor_test.go
Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Applied to files:
tests/syslog_health_monitor_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/syslog_health_monitor_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
tests/syslog_health_monitor_test.go
Code graph analysis (1)
tests/syslog_health_monitor_test.go (2)
tests/helpers/syslog-health-monitor.go (3)
SetUpSyslogHealthMonitor(33-71), StubJournalHTTPPort(27-27), TearDownSyslogHealthMonitor(74-113)
tests/helpers/kube.go (3)
InjectSyslogMessages(1820-1843), PortForwardPod(1614-1658), EnsureNodeConditionNotPresent(1797-1818)
Additional comments (6)
tests/helpers/kube.go (3)
2232-2273: LGTM! The rollout verification logic correctly checks all necessary DaemonSet status fields (DesiredNumberScheduled, UpdatedNumberScheduled, NumberReady) to ensure complete rollout.
2368-2448: LGTM! The container argument manipulation helpers correctly handle multiple argument styles (`--flag=value`, `--flag value`, `--flag`) with proper slice operations for updates, insertions, and removals.
2450-2494: LGTM! The function properly validates that the DaemonSet pod is running, ready, and not being deleted, ensuring tests interact with stable pods from the current rollout generation.
tests/syslog_health_monitor_test.go (3)
25-39: LGTM: Clean imports and context keys. The helper import and updated context key names improve code clarity and maintainability.
42-173: LGTM: Well-refactored setup and teardown. The test now uses centralized helper functions consistently, improving maintainability. The context-based value passing follows the e2e-framework pattern correctly.
290-359: LGTM: Consistent refactoring pattern. The SXID detection test follows the same centralized setup/teardown pattern as the XID detection test, ensuring consistency across the test suite.
e5a8295 to 0752eb5
Merging this branch will increase overall coverage
Coverage by file
Changed files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.
Changed unit test files
Actionable comments posted: 1
Fix all issues with AI agents
In @tests/syslog_health_monitor_test.go:
- Around line 405-418: The args map passed to TearDownSyslogHealthMonitor uses a
non-empty value for "--processing-strategy" even though removeArgsFromContainer
(and RemoveDaemonSetArgs) only inspects map keys; change the map entry to
"--processing-strategy": "" (empty string) or add an inline comment next to the
map literal documenting that only keys are used by
removeArgsFromContainer/RemoveDaemonSetArgs so values are ignored, so readers
aren't misled by a non-empty value.
Nitpick comments (4)
tests/helpers/kube.go (3)
2281-2293: Unused variable `originalDaemonSet`.
The variable `originalDaemonSet` is assigned on line 2292 but never used. This appears to be leftover code, possibly from a previous implementation that stored the original state for rollback.
Remove unused variable:
 func UpdateDaemonSetArgs(ctx context.Context, t *testing.T, client klient.Client, daemonsetName string, containerName string, args map[string]string) error {
 	t.Helper()

-	var originalDaemonSet *appsv1.DaemonSet
-
 	t.Logf("Updating daemonset %s/%s with args %v", NVSentinelNamespace, daemonsetName, args)

 	err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
 		daemonSet := &appsv1.DaemonSet{}
 		if err := client.Resources().Get(ctx, daemonsetName, NVSentinelNamespace, daemonSet); err != nil {
 			return err
 		}

-		if originalDaemonSet == nil {
-			originalDaemonSet = daemonSet.DeepCopy()
-		}
-
 		containers := daemonSet.Spec.Template.Spec.Containers
2321-2322: Magic sleep after rollout.
The 10-second hardcoded sleep after rollout completion is a code smell. If pods need additional time after becoming "Ready", consider either:
- Adding a specific readiness check for the functionality you're waiting for
- Extracting this as a named constant with documentation explaining why it's needed
2452-2496: Parameter `daemonsetName` is underutilized.
The `daemonsetName` parameter is only used in the error message on line 2492. The function doesn't verify that the found pod actually belongs to the specified DaemonSet - it relies solely on `podNamePattern` matching.
This could be intentional if the pattern is specific enough, but consider either:
- Adding validation that the pod's owner reference matches the DaemonSet
- Removing `daemonsetName` if it's truly not needed
tests/helpers/syslog-health-monitor.go (1)
73-113: Consider adding error handling for cleanup operations.
The teardown function logs warnings for errors during cleanup (lines 89, 111) but uses `require.NoError` for `RemoveDaemonSetArgs` (line 82). This inconsistency could cause test failures during teardown if the arg removal fails, while other cleanup steps are more lenient.
For consistency in teardown, consider either:
- Making all cleanup steps warn-only to ensure full cleanup attempts
- Or keeping `require.NoError` only for critical restoration steps
Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Files selected for processing (4)
- platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
- tests/helpers/kube.go
- tests/helpers/syslog-health-monitor.go
- tests/syslog_health_monitor_test.go
Files skipped from review as they are similar to previous changes (1)
- platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
Additional context used
Path-based instructions (2)
**/*.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via `log/slog` in Go code
Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as `synced` over `ok` for cache sync checks
Use `client-go` for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
- tests/syslog_health_monitor_test.go
- tests/helpers/syslog-health-monitor.go
- tests/helpers/kube.go
**/*_test.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Use `envtest` for testing Kubernetes controllers instead of fake clients
Use `testify/assert` and `testify/require` for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Files:
- tests/syslog_health_monitor_test.go
Learnings (7)
Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Applied to files:
tests/syslog_health_monitor_test.go
Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
tests/syslog_health_monitor_test.go
Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Applied to files:
tests/syslog_health_monitor_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/syslog_health_monitor_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
tests/syslog_health_monitor_test.go
Learning: 2025-12-22T16:16:31.660Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.
Applied to files:
tests/helpers/kube.go
Code graph analysis (1)
tests/syslog_health_monitor_test.go (3)
tests/helpers/syslog-health-monitor.go (3)
SetUpSyslogHealthMonitor(33-71), StubJournalHTTPPort(27-27), TearDownSyslogHealthMonitor(74-113)
tests/helpers/kube.go (2)
InjectSyslogMessages(1820-1843), PortForwardPod(1614-1658)
tests/helpers/fault_quarantine.go (1)
AssertQuarantineState(315-382)
Additional comments (15)
tests/helpers/kube.go (5)
387-409: LGTM - Clean implementation of negative event assertion. The function correctly uses `require.Never` to assert that a specific event type/reason never appears on a node within the timeout period. The error handling on line 393-396 appropriately returns `false` (meaning "condition not met, keep checking") rather than failing the test.
2232-2273: LGTM - Robust rollout wait implementation. The function correctly checks all necessary conditions for a complete DaemonSet rollout:
- `DesiredNumberScheduled > 0`
- `UpdatedNumberScheduled == DesiredNumberScheduled`
- `NumberReady == DesiredNumberScheduled`
Good logging at each step for debugging.
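The three rollout conditions listed above amount to a single predicate. The sketch below uses a hypothetical local `daemonSetStatus` struct whose field names mirror the status fields named in the review; it stands in for the Kubernetes type so the example stays self-contained, and it is not the repository's `waitForDaemonSetRollout`.

```go
package main

import "fmt"

// daemonSetStatus is a local stand-in holding only the three counters
// the review mentions. Hypothetical type for this sketch.
type daemonSetStatus struct {
	DesiredNumberScheduled int32
	UpdatedNumberScheduled int32
	NumberReady            int32
}

// rolloutComplete reports whether every desired pod has been updated and
// is ready. The first check guards against a DaemonSet that has not yet
// scheduled anything, where 0 == 0 would otherwise look complete.
func rolloutComplete(s daemonSetStatus) bool {
	return s.DesiredNumberScheduled > 0 &&
		s.UpdatedNumberScheduled == s.DesiredNumberScheduled &&
		s.NumberReady == s.DesiredNumberScheduled
}

func main() {
	fmt.Println(rolloutComplete(daemonSetStatus{3, 3, 3})) // all pods updated and ready
	fmt.Println(rolloutComplete(daemonSetStatus{3, 3, 2})) // one pod not ready yet
	fmt.Println(rolloutComplete(daemonSetStatus{0, 0, 0})) // nothing scheduled
}
```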
2370-2400: Slice insertion logic is correct but complex. The slice manipulation on lines 2391-2393 correctly inserts a value after the flag when the next argument is another flag (starts with `-`). The logic handles both `--flag=value` and `--flag value` styles appropriately.
2402-2424: LGTM - Clean implementation of arg setting.The function correctly:
- Attempts to update existing args via
tryUpdateExistingArg- Falls back to appending new args if not found
- Handles both
--flag=valueand boolean flag styles
2426-2450: LGTM - Correct arg removal implementation.The function properly handles both
--flag=valueand--flag valuestyles when removing arguments. Breaking after modification avoids issues with concurrent slice modification during iteration.tests/syslog_health_monitor_test.go (7)
36-39: LGTM - Clear context key definitions.The context keys follow Go naming conventions and provide clear identifiers for passing test state between setup, assess, and teardown phases.
47-58: LGTM - Centralized setup pattern.The setup correctly uses the new
helpers.SetUpSyslogHealthMonitorhelper, storing node name, pod name, and stop channel in context for use in assess and teardown phases.
159-170: LGTM - Centralized teardown pattern.The teardown correctly retrieves context values and delegates to
helpers.TearDownSyslogHealthMonitor. Passingnilfor args indicates no DaemonSet argument restoration is needed.
195-203: LGTM - Consistent port configuration.Using
helpers.StubJournalHTTPPortfor both local and remote ports ensures consistency across tests and centralizes the port configuration.
295-359: LGTM - Consistent setup/teardown pattern. The SXID detection test follows the same centralized setup/teardown pattern as the XID detection test, maintaining consistency across the test suite.
361-379: Well-structured STORE_ONLY strategy test. The test correctly:
- Follows naming conventions per coding guidelines
- Sets up with `--processing-strategy=STORE_ONLY` argument
- Stores context values for teardown
Good use of the centralized setup helper with custom arguments.
381-403: LGTM - Correct STORE_ONLY behavior validation. The assess phase correctly verifies that when running in STORE_ONLY mode:
- XID messages are injected
- No node condition is created (`EnsureNodeConditionNotPresent`)
- Node is not quarantined/cordoned (`AssertQuarantineState`)
This aligns with the expected behavior where STORE_ONLY events should be stored but not trigger remediation.
tests/helpers/syslog-health-monitor.go (3)
26-30: LGTM - Clear constant definitions. The constants properly define the port and resource names used by the syslog health monitor test helpers, making them easy to reference consistently across tests.
32-71: LGTM - Well-structured setup helper. The function provides a clean setup sequence:
- Optionally updates DaemonSet args for test configuration
- Locates the syslog health monitor pod on a worker node
- Injects test metadata
- Establishes port-forwarding
- Configures the node label
Good use of `require.NoError` for critical setup steps and returning all necessary values for teardown.
91-101: No changes needed - the condition status check logic is correct. The code at lines 93-100 properly checks for `condition.Status == v1.ConditionFalse` with reason `"SysLogsXIDErrorIsHealthy"` to verify the error has been cleared. This is consistent with the codebase pattern:
- `Status=True` with reason `"IsNotHealthy"` indicates an active error
- `Status=False` with reason `"IsHealthy"` indicates the error has been resolved and the condition is no longer active
The check is semantically correct and follows established conventions throughout the test suite.
0752eb5 to
7952551
Compare
Actionable comments posted: 3
Fix all issues with AI agents
In @distros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpl:
- Around line 103-104: The argument value for the processing strategy is
unquoted and may break YAML parsing; update the template where the args list
includes the "--processing-strategy" entry to wrap {{
$root.Values.processingStrategy }} in quotes (i.e. change the value reference
used with the "--processing-strategy" arg to a quoted string) so it matches
other quoted arguments and ensures YAML safety.
In @platform-connectors/pkg/connectors/kubernetes/process_node_events.go:
- Around line 347-372: The createK8sEvent function can panic if
healthEvent.GeneratedTimestamp is nil; update createK8sEvent in K8sConnector to
defensively check healthEvent.GeneratedTimestamp before calling AsTime() and use
a fallback (e.g. metav1.Now() or zero time) when nil, assign the resulting
metav1.Time to a local variable and reuse it for FirstTimestamp and
LastTimestamp so both fields are safe and consistent.
In @tests/syslog_health_monitor_test.go:
- Around line 405-416: The teardown currently calls
helpers.TearDownSyslogHealthMonitor with
map["--processing-strategy":"EXECUTE_REMEDIATION"], but setup used "STORE_ONLY",
so RemoveDaemonSetArgs won't remove the real flag; change the
TearDownSyslogHealthMonitor call to either pass nil for the args map (to skip
arg removal) or pass "--processing-strategy":"STORE_ONLY" to match the setup;
update the invocation that supplies the args map (the call to
TearDownSyslogHealthMonitor) accordingly so cleanup actually removes the
argument.
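The timestamp fallback suggested for `createK8sEvent` can be sketched as follows. This is an illustration under stated assumptions: `timestamp` is a hypothetical stand-in whose `AsTime` dereferences its receiver, it is not the protobuf type used in the PR, and `eventTime` is an assumed helper name.

```go
package main

import (
	"fmt"
	"time"
)

// timestamp is a hypothetical stand-in for the event's GeneratedTimestamp
// field; it is not the protobuf type used in the PR.
type timestamp struct {
	t time.Time
}

// AsTime dereferences the receiver, so calling it on nil would panic.
func (ts *timestamp) AsTime() time.Time { return ts.t }

// eventTime returns the event's timestamp, or fallback when none is set.
func eventTime(ts *timestamp, fallback time.Time) time.Time {
	if ts == nil {
		return fallback
	}
	return ts.AsTime()
}

func main() {
	now := time.Now()
	// Resolve the time once and reuse it for both fields so that the
	// first and last timestamps stay consistent.
	resolved := eventTime(nil, now)
	first, last := resolved, resolved
	fmt.Println(first.Equal(now), last.Equal(first))
}
```

Resolving the value into one local variable, as the review item suggests, also guarantees that both timestamp fields agree even when the fallback is the current time.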
Nitpick comments (9)
tests/fault_quarantine_test.go (1)
327-336: Consider explicitly setting `processingStrategy` in teardown for clarity. The healthy event relies on the default `EXECUTE_REMEDIATION` strategy to clear the quarantine state. While this works correctly (per project convention), being explicit improves test readability and makes the intent clearer.
Optional improvement
 feature.Teardown(func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
 	event := helpers.NewHealthEvent(testCtx.NodeName).
 		WithErrorCode("79").
 		WithHealthy(true).
 		WithAgent(helpers.SYSLOG_HEALTH_MONITOR_AGENT).
-		WithCheckName("SysLogsXIDError")
+		WithCheckName("SysLogsXIDError").
+		WithProcessingStrategy(int(protos.ProcessingStrategy_EXECUTE_REMEDIATION))
 	helpers.SendHealthEvent(ctx, t, event)
tests/helpers/syslog-health-monitor.go (1)
91-101: Clarify the condition check semantics in the comment. The comment says "condition should be cleared" but the code checks `condition.Status == v1.ConditionFalse`, which verifies the condition exists with a healthy status rather than being removed. Consider updating the comment or assertion message to match the actual check.
Suggested clarification
 	} else {
-		t.Logf("Waiting for SysLogsXIDError condition to be cleared from node %s", nodeName)
+		t.Logf("Waiting for SysLogsXIDError condition to become healthy on node %s", nodeName)
 		require.Eventually(t, func() bool {
 			condition, err := CheckNodeConditionExists(ctx, client, nodeName, "SysLogsXIDError", "SysLogsXIDErrorIsHealthy")
 			if err != nil {
 				t.Logf("Failed to check node condition: %v", err)
 				return false
 			}
 			return condition != nil && condition.Status == v1.ConditionFalse
-		}, EventuallyWaitTimeout, WaitInterval, "SysLogsXIDError condition should be cleared")
+		}, EventuallyWaitTimeout, WaitInterval, "SysLogsXIDError condition should transition to healthy")
 	}
tests/helpers/kube.go (3)
2281-2293: Remove unused `originalDaemonSet` variable. The variable `originalDaemonSet` is assigned on line 2292 but never used afterward. This appears to be leftover code from a potential restore/rollback feature that wasn't implemented.
Proposed fix
 func UpdateDaemonSetArgs(ctx context.Context, t *testing.T, client klient.Client, daemonsetName string, containerName string, args map[string]string) error {
 	t.Helper()
-	var originalDaemonSet *appsv1.DaemonSet
-
 	t.Logf("Updating daemonset %s/%s with args %v", NVSentinelNamespace, daemonsetName, args)
 	err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
 		daemonSet := &appsv1.DaemonSet{}
 		if err := client.Resources().Get(ctx, daemonsetName, NVSentinelNamespace, daemonSet); err != nil {
 			return err
 		}
-		if originalDaemonSet == nil {
-			originalDaemonSet = daemonSet.DeepCopy()
-		}
-
 		containers := daemonSet.Spec.Template.Spec.Containers
2321-2322: Consider removing or documenting the hardcoded sleep. The `waitForDaemonSetRollout` already waits until `NumberReady == DesiredNumberScheduled`, which means pods should be ready. The additional 10-second sleep may be unnecessary, or it may indicate that additional initialization happens after pod readiness that the rollout status does not capture.
If this sleep is necessary for container initialization or process startup, consider adding a comment explaining why, or replace it with an explicit readiness check.
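One way to replace a fixed sleep with an explicit readiness check is a small polling loop, sketched below. `waitUntil` is a hypothetical helper and the readiness probe in `main` is a stand-in; in the test suite itself, `require.Eventually` (already used by the helpers quoted in this review) serves the same purpose.

```go
package main

import (
	"fmt"
	"time"
)

// waitUntil is a hypothetical polling helper: it calls ready every
// interval until ready returns true or the timeout would be exceeded.
func waitUntil(ready func() bool, timeout, interval time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if ready() {
			return true
		}
		// Give up if the next poll would land past the deadline.
		if time.Now().Add(interval).After(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

func main() {
	calls := 0
	// Stand-in readiness probe that succeeds on the third poll.
	ok := waitUntil(func() bool { calls++; return calls >= 3 }, time.Second, 10*time.Millisecond)
	fmt.Println(ok, calls)
}
```

Unlike a fixed 10-second sleep, the loop returns as soon as the condition holds and fails visibly when it never does.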
2327-2368: Inconsistent error handling - `require.NoError` vs returning error. The function signature returns `error`, but line 2360 uses `require.NoError` which will fail the test immediately if an error occurs. This means the function can never return a non-nil error in practice. Either:
- Remove the error return and always use `require.NoError`, or
- Return the error and let callers decide how to handle it (consistent with `UpdateDaemonSetArgs`)
Option 1: Align with UpdateDaemonSetArgs pattern
 	err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
 		// ... existing code ...
 	})
-	require.NoError(t, err, "failed to remove args from daemonset %s/%s", NVSentinelNamespace, daemonsetName)
+	if err != nil {
+		return err
+	}
 	t.Logf("Waiting for daemonset %s/%s rollout to complete after restoration", NVSentinelNamespace, daemonsetName)
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)
1422-1442: Consider consolidating near-duplicate test cases. Test cases "STORE_ONLY non-fatal event should not create Kubernetes event" (lines 1423-1442) and "STORE_ONLY non fatal event should not create Kubernetes event" (lines 1503-1524) appear to test very similar scenarios with only minor differences (different CheckName and ErrorCode).
Consider consolidating these into a single parameterized test case or making the distinction between them clearer in the descriptions if they're testing meaningfully different code paths.
Also applies to: 1502-1524
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go (1)
388-435: Test uses EXECUTE_REMEDIATION but doesn't assert strategy on emitted events. The `TestJournalProcessingLogic` test creates a handler with `EXECUTE_REMEDIATION` but the `mockHandler.ProcessLine` (lines 444-469) doesn't set `ProcessingStrategy` on the returned event. Consider updating the mock to propagate the strategy for more comprehensive testing.
Suggested enhancement for mockHandler
 type mockHandler struct {
 	nodeName string
 	defaultAgentName string
 	defaultComponentClass string
 	checkName string
+	processingStrategy pb.ProcessingStrategy
 }
 func (mh *mockHandler) ProcessLine(message string) (*pb.HealthEvents, error) {
 	if !strings.Contains(message, "sxid123") {
 		return nil, nil
 	}
 	event := &pb.HealthEvent{
 		Version: 1,
 		Agent: mh.defaultAgentName,
 		CheckName: mh.checkName,
 		ComponentClass: mh.defaultComponentClass,
 		GeneratedTimestamp: timestamppb.New(time.Now()),
 		EntitiesImpacted: []*pb.Entity{
 			{EntityType: "GPU", EntityValue: "44"},
 		},
 		Message: "TestMessage",
 		IsFatal: true,
 		IsHealthy: false,
 		NodeName: mh.nodeName,
 		RecommendedAction: pb.RecommendedAction_RESTART_BM,
 		ErrorCode: []string{"123"},
+		ProcessingStrategy: mh.processingStrategy,
 	}
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go (1)
191-196: Fragile entity assertions using hardcoded array indices. The assertions at lines 192-195 assume specific positions in the `EntitiesImpacted` array (`[3]` and `[4]`). If the order of entities changes in the implementation, these tests will fail even if the correct entities are present. Consider using a pattern similar to `TestProcessLine` in `gpufallen_handler_test.go` (lines 124-136) which iterates through entities to find them by type rather than assuming positions.
Suggested refactor for robust entity assertions
-	// Verify GPU entity
-	assert.Equal(t, "GPU", event.EntitiesImpacted[3].EntityType)
-	assert.Equal(t, "1", event.EntitiesImpacted[3].EntityValue)
-	assert.Equal(t, "GPU_UUID", event.EntitiesImpacted[4].EntityType)
-	assert.Equal(t, "GPU-aaaabbbb-cccc-dddd-eeee-ffffffffffff", event.EntitiesImpacted[4].EntityValue)
+	// Verify GPU entities by type rather than position
+	var hasGPU, hasGPUUUID bool
+	for _, entity := range event.EntitiesImpacted {
+		switch entity.EntityType {
+		case "GPU":
+			hasGPU = true
+			assert.Equal(t, "1", entity.EntityValue)
+		case "GPU_UUID":
+			hasGPUUUID = true
+			assert.Equal(t, "GPU-aaaabbbb-cccc-dddd-eeee-ffffffffffff", entity.EntityValue)
+		}
+	}
+	assert.True(t, hasGPU, "Should have GPU entity")
+	assert.True(t, hasGPUUUID, "Should have GPU_UUID entity")
platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)
377-388: Consider simplifying unused `nodeConditions` slice construction. The `nodeConditions` slice is constructed but only used to check if it's non-empty (line 390). The actual node condition updates are performed inside `updateNodeConditions` using the raw `processableEvents`. You could simplify by checking whether any event qualifies instead of building a full slice.
Optional simplification
-	var nodeConditions []corev1.NodeCondition
-
-	for _, healthEvent := range processableEvents {
-		if healthEvent.IsHealthy || healthEvent.IsFatal {
-			nodeConditions = append(nodeConditions, corev1.NodeCondition{
-				Type: corev1.NodeConditionType(healthEvent.CheckName),
-				LastHeartbeatTime: metav1.NewTime(healthEvent.GeneratedTimestamp.AsTime()),
-				LastTransitionTime: metav1.NewTime(healthEvent.GeneratedTimestamp.AsTime()),
-				Message: r.fetchHealthEventMessage(healthEvent),
-			})
-		}
-	}
-
-	if len(nodeConditions) > 0 {
+	hasConditionUpdates := false
+	for _, healthEvent := range processableEvents {
+		if healthEvent.IsHealthy || healthEvent.IsFatal {
+			hasConditionUpdates = true
+			break
+		}
+	}
+
+	if hasConditionUpdates {
Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Files ignored due to path filters (1)
- data-models/pkg/protos/health_event.pb.go is excluded by `!**/*.pb.go`
Files selected for processing (32)
- data-models/protobufs/health_event.proto
- distros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpl
- distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml
- docs/designs/025-processing-strategy-for-health-checks.md
- health-monitors/syslog-health-monitor/main.go
- health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go
- health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
- health-monitors/syslog-health-monitor/pkg/gpufallen/types.go
- health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go
- health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
- health-monitors/syslog-health-monitor/pkg/sxid/types.go
- health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go
- health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
- health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go
- health-monitors/syslog-health-monitor/pkg/xid/types.go
- health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
- health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
- platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
- platform-connectors/pkg/connectors/kubernetes/process_node_events.go
- platform-connectors/pkg/server/platform_connector_server.go
- platform-connectors/pkg/server/platform_connector_server_test.go
- store-client/pkg/client/mongodb_pipeline_builder.go
- store-client/pkg/client/pipeline_builder.go
- store-client/pkg/client/postgresql_pipeline_builder.go
- tests/event_exporter_test.go
- tests/fault_quarantine_test.go
- tests/helpers/event_exporter.go
- tests/helpers/fault_quarantine.go
- tests/helpers/healthevent.go
- tests/helpers/kube.go
- tests/helpers/syslog-health-monitor.go
- tests/syslog_health_monitor_test.go
Files skipped from review as they are similar to previous changes (8)
- health-monitors/syslog-health-monitor/pkg/gpufallen/types.go
- health-monitors/syslog-health-monitor/main.go
- health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go
- tests/helpers/healthevent.go
- health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
- health-monitors/syslog-health-monitor/pkg/sxid/types.go
- health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
- distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml
Additional context used
Path-based instructions (3)
data-models/protobufs/**/*.proto
CodeRabbit inference engine (.github/copilot-instructions.md)
data-models/protobufs/**/*.proto: Define Protocol Buffer messages in `data-models/protobufs/` directory
Use semantic versioning for breaking changes in Protocol Buffer messages
Include comprehensive comments for all fields in Protocol Buffer messages
Files:
data-models/protobufs/health_event.proto
**/*.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via `log/slog` in Go code
Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as `synced` over `ok` for cache sync checks
Use `client-go` for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
- platform-connectors/pkg/server/platform_connector_server.go
- platform-connectors/pkg/server/platform_connector_server_test.go
- health-monitors/syslog-health-monitor/pkg/xid/types.go
- tests/helpers/fault_quarantine.go
- health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go
- store-client/pkg/client/pipeline_builder.go
- health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go
- store-client/pkg/client/postgresql_pipeline_builder.go
- health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
- health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go
- tests/helpers/event_exporter.go
- store-client/pkg/client/mongodb_pipeline_builder.go
- platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
- tests/helpers/kube.go
- tests/helpers/syslog-health-monitor.go
- tests/syslog_health_monitor_test.go
- platform-connectors/pkg/connectors/kubernetes/process_node_events.go
- tests/event_exporter_test.go
- health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
- tests/fault_quarantine_test.go
- health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
**/*_test.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Use `envtest` for testing Kubernetes controllers instead of fake clients
Use `testify/assert` and `testify/require` for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Files:
- platform-connectors/pkg/server/platform_connector_server_test.go
- health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
- platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
- tests/syslog_health_monitor_test.go
- tests/event_exporter_test.go
- health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
- tests/fault_quarantine_test.go
- health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
Learnings (14)
Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
Applied to files:
- data-models/protobufs/health_event.proto
- platform-connectors/pkg/server/platform_connector_server.go
- health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
- health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go
- tests/fault_quarantine_test.go
- docs/designs/025-processing-strategy-for-health-checks.md
Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.
Applied to files:
- data-models/protobufs/health_event.proto
- platform-connectors/pkg/server/platform_connector_server_test.go
- health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go
- tests/helpers/event_exporter.go
- platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
- platform-connectors/pkg/connectors/kubernetes/process_node_events.go
- docs/designs/025-processing-strategy-for-health-checks.md
Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
- platform-connectors/pkg/server/platform_connector_server_test.go
- health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
- platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
- tests/syslog_health_monitor_test.go
- tests/event_exporter_test.go
- health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
- tests/fault_quarantine_test.go
- health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Applied to files:
- health-monitors/syslog-health-monitor/pkg/xid/types.go
- tests/syslog_health_monitor_test.go
Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".
Applied to files:
- tests/helpers/fault_quarantine.go
- tests/fault_quarantine_test.go
Learning: 2025-12-23T05:02:22.108Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: store-client/pkg/client/postgresql_pipeline_builder.go:119-132
Timestamp: 2025-12-23T05:02:22.108Z
Learning: In the NVSentinel codebase, protobuf fields stored in MongoDB should use lowercase field names (e.g., processingstrategy, componentclass, checkname). Ensure pipeline filters and queries that access protobuf fields in the database consistently use lowercase field names in the store-client package, avoiding camelCase mappings for MongoDB reads/writes.
Applied to files:
- store-client/pkg/client/pipeline_builder.go
- store-client/pkg/client/postgresql_pipeline_builder.go
- store-client/pkg/client/mongodb_pipeline_builder.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
- health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
- tests/syslog_health_monitor_test.go
- tests/event_exporter_test.go
- tests/fault_quarantine_test.go
Learning: 2025-12-22T16:16:31.660Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.
Applied to files:
tests/helpers/kube.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Applied to files:
tests/syslog_health_monitor_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
- tests/syslog_health_monitor_test.go
- tests/event_exporter_test.go
- tests/fault_quarantine_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods
Applied to files:
tests/event_exporter_test.go
Learning: 2025-11-04T06:31:02.147Z
Learnt from: Gyan172004
Repo: NVIDIA/NVSentinel PR: 223
File: platform-connectors/pkg/nodemetadata/processor.go:0-0
Timestamp: 2025-11-04T06:31:02.147Z
Learning: In platform-connectors/pkg/nodemetadata/processor.go, the NewProcessor function does not perform a nil check on the config parameter because the caller is expected to guarantee a non-nil config is provided.
Applied to files:
docs/designs/025-processing-strategy-for-health-checks.md
Learning: 2025-11-10T10:25:19.443Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.
Applied to files:
docs/designs/025-processing-strategy-for-health-checks.md
Code graph analysis (14)
platform-connectors/pkg/server/platform_connector_server.go (2)
data-models/pkg/protos/health_event.pb.go (6)
ProcessingStrategy(44-44), ProcessingStrategy(76-78), ProcessingStrategy(80-82), ProcessingStrategy(89-91), ProcessingStrategy_UNSPECIFIED(47-47), ProcessingStrategy_EXECUTE_REMEDIATION(48-48)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-18)
platform-connectors/pkg/server/platform_connector_server_test.go (1)
data-models/pkg/protos/health_event.pb.go (10)
ProcessingStrategy(44-44), ProcessingStrategy(76-78), ProcessingStrategy(80-82), ProcessingStrategy(89-91), ProcessingStrategy_UNSPECIFIED(47-47), ProcessingStrategy_EXECUTE_REMEDIATION(48-48), ProcessingStrategy_STORE_ONLY(49-49), HealthEvent(264-284), HealthEvent(297-297), HealthEvent(312-314)
tests/helpers/fault_quarantine.go (1)
tests/helpers/kube.go (1)
NVSentinelNamespace(64-64)
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go (3)
data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(44-44), ProcessingStrategy(76-78), ProcessingStrategy(80-82), ProcessingStrategy(89-91)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-18)
health-monitors/syslog-health-monitor/pkg/sxid/types.go (1)
SXIDHandler(29-36)
store-client/pkg/client/pipeline_builder.go (1)
store-client/pkg/client/mongodb_client.go (1)
BuildNonFatalUnhealthyInsertsPipeline(296-299)
store-client/pkg/client/postgresql_pipeline_builder.go (2)
store-client/pkg/datastore/types.go (4)
ToPipeline(161-163), D(131-133), E(126-128), A(136-138)
data-models/pkg/protos/health_event.pb.go (1)
ProcessingStrategy_EXECUTE_REMEDIATION(48-48)
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go (2)
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go (1)
NewSXIDHandler(30-42)
data-models/pkg/protos/health_event.pb.go (2)
ProcessingStrategy_EXECUTE_REMEDIATION(48-48), RecommendedAction_CONTACT_SUPPORT(98-98)
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go (5)
data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(44-44), ProcessingStrategy(76-78), ProcessingStrategy(80-82), ProcessingStrategy(89-91)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-18)
health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go (1)
SyslogMonitor(47-67)
health-monitors/syslog-health-monitor/pkg/syslog-monitor/journal_real.go (1)
GetDefaultJournalFactory(104-106)
health-monitors/syslog-health-monitor/pkg/syslog-monitor/journal_stub.go (1)
GetDefaultJournalFactory(210-254)
store-client/pkg/client/mongodb_pipeline_builder.go (2)
store-client/pkg/datastore/types.go (4)
ToPipeline(161-163), D(131-133), E(126-128), A(136-138)
data-models/pkg/protos/health_event.pb.go (1)
ProcessingStrategy_EXECUTE_REMEDIATION(48-48)
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)
data-models/pkg/protos/health_event.pb.go (21)
HealthEvent(264-284), HealthEvent(297-297), HealthEvent(312-314), Entity(212-218), Entity(231-231), Entity(246-248), RecommendedAction(93-93), RecommendedAction(143-145), RecommendedAction(147-149), RecommendedAction(156-158), RecommendedAction_CONTACT_SUPPORT(98-98), ProcessingStrategy(44-44), ProcessingStrategy(76-78), ProcessingStrategy(80-82), ProcessingStrategy(89-91), ProcessingStrategy_STORE_ONLY(49-49), RecommendedAction_NONE(96-96), ProcessingStrategy_EXECUTE_REMEDIATION(48-48), HealthEvents(160-166), HealthEvents(179-179), HealthEvents(194-196)
platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)
data-models/pkg/protos/health_event.pb.go (10)
HealthEvents(160-166), HealthEvents(179-179), HealthEvents(194-196), HealthEvent(264-284), HealthEvent(297-297), HealthEvent(312-314), ProcessingStrategy(44-44), ProcessingStrategy(76-78), ProcessingStrategy(80-82), ProcessingStrategy(89-91)
tests/event_exporter_test.go (1)
tests/helpers/event_exporter.go (1)
ValidateCloudEvent(221-247)
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go (1)
data-models/pkg/protos/health_event.pb.go (2)
ProcessingStrategy_STORE_ONLY(49-49), ProcessingStrategy_EXECUTE_REMEDIATION(48-48)
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go (1)
data-models/pkg/protos/health_event.pb.go (2)
ProcessingStrategy_STORE_ONLY(49-49), ProcessingStrategy_EXECUTE_REMEDIATION(48-48)
markdownlint-cli2 (0.18.1)
docs/designs/025-processing-strategy-for-health-checks.md
350-350: Hard tabs, Column: 1 (MD010, no-hard-tabs)
351-351: Hard tabs, Column: 1 (MD010, no-hard-tabs)
353-353: Hard tabs, Column: 1 (MD010, no-hard-tabs)
355-355: Hard tabs, Column: 1 (MD010, no-hard-tabs)
356-356: Hard tabs, Column: 1 (MD010, no-hard-tabs)
357-357: Hard tabs, Column: 1 (MD010, no-hard-tabs)
358-358: Hard tabs, Column: 1 (MD010, no-hard-tabs)
359-359: Hard tabs, Column: 1 (MD010, no-hard-tabs)
360-360: Hard tabs, Column: 1 (MD010, no-hard-tabs)
362-362: Hard tabs, Column: 1 (MD010, no-hard-tabs)
372-372: Hard tabs, Column: 1 (MD010, no-hard-tabs)
373-373: Hard tabs, Column: 1 (MD010, no-hard-tabs)
374-374: Hard tabs, Column: 1 (MD010, no-hard-tabs)
383-383: Hard tabs, Column: 1 (MD010, no-hard-tabs)
436-436: Hard tabs, Column: 1 (MD010, no-hard-tabs)
437-437: Hard tabs, Column: 1 (MD010, no-hard-tabs)
438-438: Hard tabs, Column: 1 (MD010, no-hard-tabs)
439-439: Hard tabs, Column: 1 (MD010, no-hard-tabs)
440-440: Hard tabs, Column: 1 (MD010, no-hard-tabs)
441-441: Hard tabs, Column: 1 (MD010, no-hard-tabs)
442-442: Hard tabs, Column: 1 (MD010, no-hard-tabs)
443-443: Hard tabs, Column: 1 (MD010, no-hard-tabs)
444-444: Hard tabs, Column: 1 (MD010, no-hard-tabs)
445-445: Hard tabs, Column: 1 (MD010, no-hard-tabs)
446-446: Hard tabs, Column: 1 (MD010, no-hard-tabs)
447-447: Hard tabs, Column: 1 (MD010, no-hard-tabs)
448-448: Hard tabs, Column: 1 (MD010, no-hard-tabs)
449-449: Hard tabs, Column: 1 (MD010, no-hard-tabs)
457-457: Hard tabs, Column: 1 (MD010, no-hard-tabs)
458-458: Hard tabs, Column: 1 (MD010, no-hard-tabs)
459-459: Hard tabs, Column: 1 (MD010, no-hard-tabs)
460-460: Hard tabs, Column: 1 (MD010, no-hard-tabs)
461-461: Hard tabs, Column: 1 (MD010, no-hard-tabs)
462-462: Hard tabs, Column: 1 (MD010, no-hard-tabs)
463-463: Hard tabs, Column: 1 (MD010, no-hard-tabs)
464-464: Hard tabs, Column: 1 (MD010, no-hard-tabs)
465-465: Hard tabs, Column: 1 (MD010, no-hard-tabs)
466-466: Hard tabs, Column: 1 (MD010, no-hard-tabs)
467-467: Hard tabs, Column: 1 (MD010, no-hard-tabs)
484-484: Hard tabs, Column: 1 (MD010, no-hard-tabs)
485-485: Hard tabs, Column: 1 (MD010, no-hard-tabs)
486-486: Hard tabs, Column: 1 (MD010, no-hard-tabs)
487-487: Hard tabs, Column: 1 (MD010, no-hard-tabs)
488-488: Hard tabs, Column: 1 (MD010, no-hard-tabs)
489-489: Hard tabs, Column: 1 (MD010, no-hard-tabs)
490-490: Hard tabs, Column: 1 (MD010, no-hard-tabs)
491-491: Hard tabs, Column: 1 (MD010, no-hard-tabs)
492-492: Hard tabs, Column: 1 (MD010, no-hard-tabs)
493-493: Hard tabs, Column: 1 (MD010, no-hard-tabs)
494-494: Hard tabs, Column: 1 (MD010, no-hard-tabs)
495-495: Hard tabs, Column: 1 (MD010, no-hard-tabs)
496-496: Hard tabs, Column: 1 (MD010, no-hard-tabs)
497-497: Hard tabs, Column: 1 (MD010, no-hard-tabs)
505-505: Hard tabs, Column: 1 (MD010, no-hard-tabs)
506-506: Hard tabs, Column: 1 (MD010, no-hard-tabs)
507-507: Hard tabs, Column: 1 (MD010, no-hard-tabs)
508-508: Hard tabs, Column: 1 (MD010, no-hard-tabs)
509-509: Hard tabs, Column: 1 (MD010, no-hard-tabs)
510-510: Hard tabs
Column: 1
(MD010, no-hard-tabs)
511-511: Hard tabs
Column: 1
(MD010, no-hard-tabs)
512-512: Hard tabs
Column: 1
(MD010, no-hard-tabs)
513-513: Hard tabs
Column: 1
(MD010, no-hard-tabs)
514-514: Hard tabs
Column: 1
(MD010, no-hard-tabs)
515-515: Hard tabs
Column: 1
(MD010, no-hard-tabs)
594-594: Hard tabs
Column: 1
(MD010, no-hard-tabs)
595-595: Hard tabs
Column: 1
(MD010, no-hard-tabs)
596-596: Hard tabs
Column: 1
(MD010, no-hard-tabs)
597-597: Hard tabs
Column: 1
(MD010, no-hard-tabs)
598-598: Hard tabs
Column: 1
(MD010, no-hard-tabs)
599-599: Hard tabs
Column: 1
(MD010, no-hard-tabs)
600-600: Hard tabs
Column: 1
(MD010, no-hard-tabs)
601-601: Hard tabs
Column: 1
(MD010, no-hard-tabs)
602-602: Hard tabs
Column: 1
(MD010, no-hard-tabs)
603-603: Hard tabs
Column: 1
(MD010, no-hard-tabs)
604-604: Hard tabs
Column: 1
(MD010, no-hard-tabs)
605-605: Hard tabs
Column: 1
(MD010, no-hard-tabs)
Additional comments (41)
tests/helpers/fault_quarantine.go (1)
141-145: LGTM! The conditional guard is well-placed. The ConfigMap backup at lines 136-139 still executes unconditionally, ensuring teardown restoration works correctly regardless of whether a new ConfigMap was applied. This cleanly supports tests that need the setup infrastructure (node selection, deployment options) without requiring a custom ConfigMap, aligning with the STORE_ONLY processing strategy testing scenarios.
health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
20-21: LGTM! The new `processingStrategy` field is appropriately added as an unexported field alongside the existing handler configuration. The import alias `pb` follows the project's convention for protobuf imports. Also applies to: 34-34
health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go (1)
53-53: LGTM! The `processingStrategy` field is properly added to the `SyslogMonitor` struct, following the same pattern as other configuration fields and consistent with the handler structs.
tests/fault_quarantine_test.go (3)
234-244: LGTM! Well-structured test for processing strategy behavior. The test properly validates the behavioral difference between `STORE_ONLY` (observability-only) and `EXECUTE_REMEDIATION` (full quarantine actions) processing strategies using the e2e-framework features pattern.
246-282: Good negative test coverage for STORE_ONLY strategy. The test correctly verifies that `STORE_ONLY` events are recorded without triggering node conditions, events, or quarantine actions, validating the observability-only behavior.
284-325: Good positive test coverage for EXECUTE_REMEDIATION strategy. The test correctly verifies that `EXECUTE_REMEDIATION` events trigger node conditions, events, and quarantine actions as expected.
tests/helpers/syslog-health-monitor.go (1)
32-71: LGTM! Well-structured setup helper. The helper properly orchestrates the setup sequence: updating DaemonSet args, locating the pod, injecting metadata, establishing port-forwarding, and configuring node labels. The function returns all necessary handles for teardown.
tests/helpers/kube.go (5)
387-409: LGTM - Well-structured event absence assertion. The function correctly uses `require.Never` to assert that a node event with specific type and reason never appears within the timeout period. The helper pattern matches existing functions like `WaitForNoRebootNodeCR`.
2232-2273: LGTM - Correct DaemonSet rollout verification. The function properly checks all three rollout completion conditions: desired pods scheduled, all pods updated, and all pods ready. The unexported status is appropriate for an internal helper.
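The three rollout conditions named in that comment can be sketched roughly as follows. This is an illustration only, not the PR's helper: `daemonSetStatus` is a local stand-in that borrows three field names from the Kubernetes `DaemonSetStatus` type, `rolloutComplete` is a hypothetical name, and the "at least one desired pod" guard is an added assumption.

```go
package main

import "fmt"

// daemonSetStatus is a local stand-in for the three counters of the
// Kubernetes apps/v1 DaemonSetStatus that a rollout check would read.
// It exists only so this sketch compiles without client-go.
type daemonSetStatus struct {
	DesiredNumberScheduled int32
	UpdatedNumberScheduled int32
	NumberReady            int32
}

// rolloutComplete (hypothetical name) reports whether every desired pod
// has been updated to the current template and is ready. Requiring at
// least one desired pod is an assumption of this sketch.
func rolloutComplete(s daemonSetStatus) bool {
	return s.DesiredNumberScheduled > 0 &&
		s.UpdatedNumberScheduled == s.DesiredNumberScheduled &&
		s.NumberReady == s.DesiredNumberScheduled
}

func main() {
	// All pods updated and ready.
	fmt.Println(rolloutComplete(daemonSetStatus{3, 3, 3}))
	// One pod still running the old template.
	fmt.Println(rolloutComplete(daemonSetStatus{3, 2, 3}))
}
```

A check of this shape is what lets a caller poll until the rollout settles instead of sleeping for a fixed interval.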
2372-2400: LGTM - Handles both arg styles correctly. The function properly handles both `--flag=value` and `--flag value` argument styles, with appropriate logic for updating existing values.
2402-2450: LGTM - Arg manipulation helpers are well-implemented. Both functions correctly handle argument manipulation with proper handling of both `--flag=value` and `--flag value` patterns.
2452-2496: LGTM - Robust pod retrieval with proper readiness checks. The function correctly waits for a pod that is not being deleted, is in Running phase, and has all containers ready. The delegation to `GetPodOnWorkerNode` for initial retrieval is appropriate.
docs/designs/025-processing-strategy-for-health-checks.md (3)
32-40: LGTM - Clear enum definition with appropriate values. The ProcessingStrategy enum is well-defined with:
- `UNSPECIFIED = 0` as proto3 default
- `EXECUTE_REMEDIATION = 1` as the active processing mode
- `STORE_ONLY = 2` for observability-only mode
The descriptions in lines 33-35 clearly explain each value's purpose.
342-364: LGTM - Normalization strategy is sound. The normalization of `UNSPECIFIED` to `EXECUTE_REMEDIATION` early in the gRPC handler ensures consistent behavior for legacy or custom monitors that don't set this field. This provides a safe default while maintaining backward compatibility.
582-610: LGTM - Backward compatibility approach is well-designed. The use of `$or` to match both `EXECUTE_REMEDIATION` and missing `processingstrategy` field ensures historical events created before this feature are treated correctly. The comment on line 582-583 clearly explains the rationale.
data-models/protobufs/health_event.proto (2)
32-40: LGTM - Well-documented enum with appropriate values. The ProcessingStrategy enum follows proto3 best practices:
- `UNSPECIFIED = 0` as the default value
- Clear documentation for each value
- Explains the normalization behavior for UNSPECIFIED
79-79: LGTM - Field addition is wire-compatible. Field number 16 is correctly chosen (sequentially after `drainOverrides = 15`). The new field is backward-compatible since proto3 treats missing fields as the default value (0 = UNSPECIFIED), which the platform connector normalizes to EXECUTE_REMEDIATION.
platform-connectors/pkg/server/platform_connector_server.go (1)
57-62: LGTM - Correct normalization placement. The normalization loop is correctly placed before both pipeline processing and ring buffer enqueueing. This ensures:
- All events stored in the database have an explicit strategy
- All downstream consumers receive normalized events
- Custom monitors without the field get safe defaults
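As a rough illustration of the normalization step described above, the sketch below rewrites unset strategies in place before anything else sees the events. The types are local stand-ins for the generated protobuf code (only the numeric enum values come from the review), and `normalizeStrategies` is a hypothetical name.

```go
package main

import "fmt"

// ProcessingStrategy is a local stand-in for the generated protobuf enum.
// The numeric values follow the enum described in the review.
type ProcessingStrategy int32

const (
	Unspecified        ProcessingStrategy = 0
	ExecuteRemediation ProcessingStrategy = 1
	StoreOnly          ProcessingStrategy = 2
)

// HealthEvent is a trimmed stand-in carrying only the fields this sketch needs.
type HealthEvent struct {
	CheckName          string
	ProcessingStrategy ProcessingStrategy
}

// normalizeStrategies (hypothetical name) mutates the events in place so that
// every event leaving this function has an explicit strategy. Events that
// already name a strategy are left untouched.
func normalizeStrategies(events []*HealthEvent) {
	for _, e := range events {
		if e.ProcessingStrategy == Unspecified {
			e.ProcessingStrategy = ExecuteRemediation
		}
	}
}

func main() {
	events := []*HealthEvent{
		{CheckName: "legacy-monitor"}, // strategy never set by the sender
		{CheckName: "observer", ProcessingStrategy: StoreOnly},
	}
	normalizeStrategies(events)
	for _, e := range events {
		fmt.Println(e.CheckName, e.ProcessingStrategy)
	}
}
```

Running the loop once at the entry point means storage and downstream consumers never need their own handling for the zero value.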
platform-connectors/pkg/server/platform_connector_server_test.go (1)
25-67: Well-structured table-driven test covering all normalization scenarios. The test correctly validates:
- UNSPECIFIED → EXECUTE_REMEDIATION normalization
- EXECUTE_REMEDIATION preserved
- STORE_ONLY preserved
The test verifies in-place mutation of the `healthEvents` object, which matches the implementation behavior.
One consideration: The retrieved learning from this repository suggests avoiding `testify` for simple equality checks. However, since testify is already established in the library context for this package, this usage is acceptable.
tests/event_exporter_test.go (2)
25-26: LGTM! Import placement is correct with standard library imports grouped separately.
85-85: LGTM! The `ValidateCloudEvent` call correctly includes the expected `"EXECUTE_REMEDIATION"` processing strategy, which aligns with the default behavior for health events that should trigger remediation.
tests/helpers/event_exporter.go (1)
220-246: LGTM! The updated `ValidateCloudEvent` function correctly adds the `expectedProcessingStrategy` parameter and validates it against `healthEvent["processingStrategy"]`. The signature extension follows the existing parameter ordering pattern.
store-client/pkg/client/pipeline_builder.go (1)
35-47: LGTM! The new interface methods are well-documented and follow the established naming convention. The comments clearly explain:
- `BuildProcessableHealthEventInsertsPipeline` is for fault-quarantine to ignore STORE_ONLY events
- `BuildProcessableNonFatalUnhealthyInsertsPipeline` is for health-events-analyzer pattern analysis
Both methods appropriately extend the `PipelineBuilder` interface for the new processing strategy filtering.
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go (2)
95-110: LGTM! The `ProcessingStrategy` field is correctly set on the `HealthEvent` from the handler's stored value. This ensures the strategy propagates through the health event pipeline for downstream filtering.
30-42: LGTM! The constructor correctly accepts and stores the `processingStrategy` parameter. All callers, including the test cases and the production caller in `syslogmonitor.go`, have been properly updated to pass the new parameter.
store-client/pkg/client/mongodb_pipeline_builder.go (3)
17-21: LGTM! Import of `protos` package added correctly for accessing `ProcessingStrategy_EXECUTE_REMEDIATION` constant.
129-156: LGTM with same consideration as above. The implementation correctly extends `BuildNonFatalUnhealthyInsertsPipeline` with processing strategy filtering while maintaining the existing `agent` and `ishealthy` conditions. The `$or` clause for backward compatibility follows the same pattern as `BuildProcessableHealthEventInsertsPipeline`.
87-113: The pipeline code is correct as written. UNSPECIFIED enum values will never appear in MongoDB because the platform-connector normalizes them to EXECUTE_REMEDIATION before storage (see `platform-connectors/pkg/server/platform_connector_server.go` lines 57-61). Therefore, the `$or` condition correctly handles only the two cases that can exist in the database:
- EXECUTE_REMEDIATION (1) for new events
- Missing/null field for old events created before upgrade
No additional conditions are needed; `$exists: false` is the appropriate check for backward compatibility.
Likely an incorrect or invalid review comment.
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go (2)
165-191: Good test coverage for ProcessingStrategy propagation. The test correctly validates that the `ProcessingStrategy` passed to `NewGPUFallenHandler` is properly propagated to the generated `HealthEvent`. The table-driven approach with `validateEvent` callbacks allows for flexible event validation.
221-242: LGTM - Cross-validation of handler strategy with event output. The test at line 241 properly validates that the handler's configured `processingStrategy` is correctly propagated to the generated event, providing good coverage for the strategy propagation path.
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)
1550-1635: Well-structured test with proper isolation. The test correctly:
- Creates isolated fake clients per test case
- Properly filters NVSentinel-specific conditions from standard K8s conditions
- Validates both node conditions and Kubernetes events based on ProcessingStrategy
- Uses descriptive logging for debugging
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go (1)
309-334: LGTM - Good coverage for ProcessingStrategy initialization. The test correctly validates that:
- `NewSyslogMonitor` accepts and stores `ProcessingStrategy_STORE_ONLY`
- `NewSyslogMonitorWithFactory` accepts and stores `ProcessingStrategy_EXECUTE_REMEDIATION`
Both constructor variants are tested with different strategies, providing good coverage for the new parameter.
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go (1)
143-196: Good addition of topology-aware SXID test with ProcessingStrategy. This new test provides valuable coverage for:
- SXID processing with valid NVSwitch topology metadata
- ProcessingStrategy propagation to events
- GPU entity lookup from NVLink topology
The temporary file setup and cleanup via `t.TempDir()` is clean and follows best practices.
tests/syslog_health_monitor_test.go (1)
361-420: Well-structured E2E test for STORE_ONLY strategy. The test correctly:
- Sets up the syslog monitor with `--processing-strategy: STORE_ONLY`
- Injects XID messages and verifies no node condition is created
- Verifies the node is not cordoned (quarantine state unchanged)
- Resets to `EXECUTE_REMEDIATION` in teardown to restore the default behavior for other tests
The use of `helpers.EnsureNodeConditionNotPresent` and `helpers.AssertQuarantineState` provides clear negative assertions.
platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)
327-345: LGTM! Clean filtering implementation for STORE_ONLY events. The function correctly filters out `STORE_ONLY` events with appropriate logging for observability. The log includes relevant context (node, checkName, agent) for debugging.
store-client/pkg/client/postgresql_pipeline_builder.go (2)
119-146: LGTM! Well-documented backward-compatible pipeline. The pipeline correctly:
- Uses lowercase field name `processingstrategy` (per learnings for MongoDB/PostgreSQL consistency)
- Implements backward compatibility via `$or` for missing/null fields
- Matches the `EXECUTE_REMEDIATION` strategy using the protobuf enum value
- Clear documentation explains the upgrade path
163-192: LGTM! Consistent implementation with clear documentation. The pipeline correctly extends `BuildNonFatalUnhealthyInsertsPipeline` with processing strategy filtering while maintaining:
- Same backward compatibility pattern
- Consistent field naming
- Clear documentation about health-events-analyzer use case
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go (4)
40-58: LGTM! Clean parameter threading through constructor. The `processingStrategy` parameter is correctly added to `NewSyslogMonitor` and forwarded to `NewSyslogMonitorWithFactory`. The parameter ordering is consistent with the existing pattern.
90-104: LGTM! Correct storage and propagation of `processingStrategy`. The `processingStrategy` is properly stored in the `SyslogMonitor` struct and will be available for health event generation.
106-141: LGTM! Consistent propagation to all handler types. All three handler types (XID, SXID, GPUFallen) correctly receive the `processingStrategy` parameter, ensuring consistent event generation across all check types.
792-820: LGTM! Correct inclusion of `ProcessingStrategy` in health events. The `ProcessingStrategy` field is correctly set from the monitor's stored strategy, ensuring all events generated by `prepareHealthEventWithAction` carry the configured processing strategy.
distros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpl
Merging this branch will increase overall coverage
Coverage by file
Changed files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.
Changed unit test files
Signed-off-by: Tanisha goyal <[email protected]>
7952551 to d8972e9
Actionable comments posted: 2
Fix all issues with AI agents
In @tests/helpers/kube.go:
- Around line 2275-2325: Remove the dead variable and the hard-coded sleep:
delete the unused originalDaemonSet declaration and the assignment to
originalDaemonSet inside UpdateDaemonSetArgs (the DeepCopy where
originalDaemonSet is set) since it is never read, and remove the explicit
time.Sleep(10 * time.Second) after waitForDaemonSetRollout; rely on
waitForDaemonSetRollout to ensure pods are ready (no changes needed to
setArgsOnContainer or the client update logic).
Nitpick comments (5)
health-monitors/syslog-health-monitor/main.go (1)
164-171: Consider rejecting `UNSPECIFIED` as a valid input. The validation uses `pb.ProcessingStrategy_value` which includes `UNSPECIFIED` (value 0), but the flag description only lists `EXECUTE_REMEDIATION` and `STORE_ONLY` as valid values. If a user passes `--processing-strategy=UNSPECIFIED`, it would be accepted silently.
Suggested fix to explicitly reject UNSPECIFIED
 value, ok := pb.ProcessingStrategy_value[*processingStrategyFlag]
 if !ok {
     return fmt.Errorf("unexpected processingStrategy value: %q", *processingStrategyFlag)
 }
+if value == int32(pb.ProcessingStrategy_UNSPECIFIED) {
+    return fmt.Errorf("processingStrategy cannot be UNSPECIFIED, use EXECUTE_REMEDIATION or STORE_ONLY")
+}
 slog.Info("Event handling strategy configured", "processingStrategy", *processingStrategyFlag)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (1)
427-446: `TestNewXIDHandler` updated for new constructor signature. The test correctly passes the `processingStrategy` parameter. Consider adding an assertion to verify `handler.processingStrategy` matches the input value for completeness.
Optional: Add assertion for processingStrategy field
     assert.NotNil(t, handler.pciToGPUUUID)
     assert.NotNil(t, handler.parser)
     assert.NotNil(t, handler.metadataReader)
+    assert.Equal(t, pb.ProcessingStrategy_EXECUTE_REMEDIATION, handler.processingStrategy)
 }
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go (1)
63-75: Consider using an options pattern for constructor readability. The `NewSyslogMonitorWithFactory` function now has 11 parameters, which can be challenging to maintain. A functional options pattern or a config struct could improve readability and make future additions easier.
Example: Config struct pattern
type SyslogMonitorConfig struct {
    NodeName              string
    Checks                []CheckDefinition
    PcClient              pb.PlatformConnectorClient
    DefaultAgentName      string
    DefaultComponentClass string
    PollingInterval       string
    StateFilePath         string
    JournalFactory        JournalFactory
    XidAnalyserEndpoint   string
    MetadataPath          string
    ProcessingStrategy    pb.ProcessingStrategy
}

func NewSyslogMonitorWithConfig(cfg SyslogMonitorConfig) (*SyslogMonitor, error) {
    // ...
}
This is a nice-to-have improvement that could be addressed in a follow-up refactor.
tests/helpers/kube.go (2)
2452-2496: Parameter `daemonsetName` is unused for filtering. The `daemonsetName` parameter is only used in the error message (line 2492) but not for actually filtering or validating that the found pod belongs to the specified DaemonSet. This could lead to returning pods from other DaemonSets if `podNamePattern` is not specific enough.
Consider either:
- Using `daemonsetName` to verify the pod's owner reference, or
- Removing the parameter if `podNamePattern` is sufficient for filtering.
Option: Verify pod ownership (if validation is needed)
 func GetDaemonSetPodOnWorkerNode(ctx context.Context, t *testing.T, client klient.Client, daemonsetName string, podNamePattern string) (*v1.Pod, error) {
     t.Helper()
     var resultPod *v1.Pod
     require.Eventually(t, func() bool {
         // Get the pod
         pod, err := GetPodOnWorkerNode(ctx, t, client, NVSentinelNamespace, podNamePattern)
         if err != nil {
             t.Logf("Failed to get pod: %v", err)
             return false
         }
+        // Verify pod belongs to the expected DaemonSet
+        belongsToDaemonSet := false
+        for _, ownerRef := range pod.OwnerReferences {
+            if ownerRef.Kind == "DaemonSet" && ownerRef.Name == daemonsetName {
+                belongsToDaemonSet = true
+                break
+            }
+        }
+        if !belongsToDaemonSet {
+            t.Logf("Pod %s does not belong to DaemonSet %s", pod.Name, daemonsetName)
+            return false
+        }
+
         // Verify pod is not being deleted
2327-2330: Missing godoc comment for exported function. Per coding guidelines, exported Go functions require function comments. `RemoveDaemonSetArgs` is missing a godoc comment explaining its purpose.
Add godoc comment
+// RemoveDaemonSetArgs removes the specified arguments from a DaemonSet container
+// and waits for the rollout to complete.
 func RemoveDaemonSetArgs(ctx context.Context, t *testing.T, client klient.Client, daemonsetName string, containerName string, args map[string]string,
 ) error {
Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Files ignored due to path filters (1)
- `data-models/pkg/protos/health_event.pb.go` is excluded by `!**/*.pb.go`
Files selected for processing (21)
- distros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpl
- distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml
- health-monitors/syslog-health-monitor/main.go
- health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go
- health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
- health-monitors/syslog-health-monitor/pkg/gpufallen/types.go
- health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go
- health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
- health-monitors/syslog-health-monitor/pkg/sxid/types.go
- health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go
- health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
- health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go
- health-monitors/syslog-health-monitor/pkg/xid/types.go
- health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
- health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
- platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
- tests/event_exporter_test.go
- tests/helpers/event_exporter.go
- tests/helpers/kube.go
- tests/helpers/syslog-health-monitor.go
- tests/syslog_health_monitor_test.go
Files skipped from review as they are similar to previous changes (7)
- health-monitors/syslog-health-monitor/pkg/xid/types.go
- platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
- health-monitors/syslog-health-monitor/pkg/gpufallen/types.go
- health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go
- tests/helpers/syslog-health-monitor.go
- health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
- distros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpl
Additional context used
Path-based instructions (3)
**/*.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via `log/slog` in Go code
Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as `synced` over `ok` for cache sync checks
Use `client-go` for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
tests/event_exporter_test.go
tests/syslog_health_monitor_test.go
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
health-monitors/syslog-health-monitor/main.go
tests/helpers/event_exporter.go
tests/helpers/kube.go
health-monitors/syslog-health-monitor/pkg/sxid/types.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go
**/*_test.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Use `envtest` for testing Kubernetes controllers instead of fake clients
Use `testify/assert` and `testify/require` for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Files:
tests/event_exporter_test.go
tests/syslog_health_monitor_test.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
**/values.yaml
CodeRabbit inference engine (.github/copilot-instructions.md)
**/values.yaml: Document all values in Helm chart `values.yaml` with inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable
Files:
distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml
Learnings (14)
Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/event_exporter_test.go
tests/syslog_health_monitor_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
tests/event_exporter_test.go
tests/syslog_health_monitor_test.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
tests/event_exporter_test.go
tests/syslog_health_monitor_test.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods
Applied to files:
tests/event_exporter_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Applied to files:
tests/syslog_health_monitor_test.go
Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Applied to files:
tests/syslog_health_monitor_test.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
health-monitors/syslog-health-monitor/main.go
health-monitors/syslog-health-monitor/pkg/sxid/types.go
Learning: 2026-01-12T05:13:19.057Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:347-372
Timestamp: 2026-01-12T05:13:19.057Z
Learning: In the NVSentinel codebase, all HealthEvent creation paths across health monitors (syslog-health-monitor, kubernetes-object-monitor, csp-health-monitor) consistently set GeneratedTimestamp to timestamppb.New(time.Now()), ensuring it is never nil in production flows.
Applied to files:
tests/syslog_health_monitor_test.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
health-monitors/syslog-health-monitor/pkg/sxid/types.go
Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
Applied to files:
distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go
Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.
Applied to files:
distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml
tests/helpers/event_exporter.go
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/daemonset*.yaml : Explain DaemonSet variant selection logic in Helm chart documentation
Applied to files:
- tests/helpers/kube.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on kata.enabled label
Applied to files:
- tests/helpers/kube.go
Learning: 2025-12-22T16:16:31.660Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.
Applied to files:
- tests/helpers/kube.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Each health monitor service should implement the gRPC service from `data-models/protobufs/`
Applied to files:
- health-monitors/syslog-health-monitor/pkg/sxid/types.go
Code graph analysis (8)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (2)
- data-models/pkg/protos/health_event.pb.go (4): ProcessingStrategy(45-45), ProcessingStrategy(77-79), ProcessingStrategy(81-83), ProcessingStrategy(90-92)
- health-monitors/syslog-health-monitor/pkg/xid/types.go (1): XIDHandler(29-39)
tests/event_exporter_test.go (1)
- tests/helpers/event_exporter.go (1): ValidateCloudEvent(221-247)
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go (3)
- data-models/pkg/protos/health_event.pb.go (4): ProcessingStrategy(45-45), ProcessingStrategy(77-79), ProcessingStrategy(81-83), ProcessingStrategy(90-92)
- health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1): ProcessingStrategy(14-18)
- health-monitors/syslog-health-monitor/pkg/sxid/types.go (1): SXIDHandler(29-36)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (4)
- data-models/pkg/protos/health_event.pb.go (6): ProcessingStrategy_EXECUTE_REMEDIATION(49-49), ProcessingStrategy_STORE_ONLY(50-50), ProcessingStrategy(45-45), ProcessingStrategy(77-79), ProcessingStrategy(81-83), ProcessingStrategy(90-92)
- health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1): NewXIDHandler(33-58)
- health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1): ProcessingStrategy(14-18)
- health-monitors/syslog-health-monitor/pkg/xid/types.go (1): XIDHandler(29-39)
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go (1)
- data-models/pkg/protos/health_event.pb.go (2): ProcessingStrategy_STORE_ONLY(50-50), ProcessingStrategy_EXECUTE_REMEDIATION(49-49)
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go (1)
- data-models/pkg/protos/health_event.pb.go (2): ProcessingStrategy_STORE_ONLY(50-50), ProcessingStrategy_EXECUTE_REMEDIATION(49-49)
health-monitors/syslog-health-monitor/main.go (1)
- data-models/pkg/protos/health_event.pb.go (1): ProcessingStrategy_value(60-64)
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go (3)
- data-models/pkg/protos/health_event.pb.go (4): ProcessingStrategy(45-45), ProcessingStrategy(77-79), ProcessingStrategy(81-83), ProcessingStrategy(90-92)
- health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1): ProcessingStrategy(14-18)
- health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go (1): SyslogMonitor(47-67)
Additional comments (30)
distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml (1)
53-59: LGTM! Well-documented configuration option.
The new `processingStrategy` value is properly documented with inline comments explaining both valid values and their behavioral differences. This follows the Helm chart documentation guidelines.
tests/event_exporter_test.go (1)
85-85: LGTM!
The updated call correctly passes `"EXECUTE_REMEDIATION"` to validate the expected processing strategy in the CloudEvent, aligning with the new helper function signature and the default strategy.
tests/helpers/event_exporter.go (1)
220-246: LGTM!
The helper function is properly extended to validate the `processingStrategy` field in CloudEvents. The assertion at line 245 correctly uses `require.Equal` for exact string matching, consistent with the other field validations in this function.
tests/syslog_health_monitor_test.go (2)
361-420: LGTM! Good test coverage for STORE_ONLY strategy.
The test correctly validates the STORE_ONLY behavior:
- Setup with `--processing-strategy: STORE_ONLY`
- Verifies no node condition is created after XID injection
- Confirms node quarantine state is unchanged
- Teardown uses `EXECUTE_REMEDIATION` to ensure proper cleanup (since STORE_ONLY won't modify cluster state)
The test follows the descriptive naming format per coding guidelines.
159-170: Consistent teardown pattern applied.
The centralized `TearDownSyslogHealthMonitor` helper improves test maintainability. The pattern of extracting context values and passing them to the helper is applied consistently across the updated tests.
health-monitors/syslog-health-monitor/pkg/sxid/types.go (1)
29-36: LGTM!
The `processingStrategy` field is correctly added as an unexported field using the protobuf enum type `pb.ProcessingStrategy`. This follows the existing struct pattern and ensures type safety for the processing strategy propagation.
health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go (1)
46-67: LGTM!
The `processingStrategy` field is correctly added to the `SyslogMonitor` struct, positioned logically with other configuration fields. The field follows the same pattern as other handler types in the PR.
health-monitors/syslog-health-monitor/main.go (2)
70-71: LGTM! Clear flag definition.
The flag description correctly documents the two valid user-facing values (`EXECUTE_REMEDIATION` or `STORE_ONLY`) with an appropriate default.
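As a rough illustration of the flag handling praised above, the sketch below validates a `--processing-strategy` value against a name-to-enum map before it would be handed to the monitor. The `ProcessingStrategy` type, the `strategyByName` map, and `parseStrategy` are hypothetical stand-ins for the generated `pb.ProcessingStrategy` enum and its `ProcessingStrategy_value` map; the numeric value used for `STORE_ONLY` is an assumption, not taken from the proto.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// ProcessingStrategy is a hypothetical stand-in for pb.ProcessingStrategy.
type ProcessingStrategy int32

const (
	// ExecuteRemediation uses value 1, as mentioned in the review learnings.
	ExecuteRemediation ProcessingStrategy = 1
	// StoreOnly uses value 2; this number is an assumption for the sketch.
	StoreOnly ProcessingStrategy = 2
)

// strategyByName mimics the role of the generated ProcessingStrategy_value map.
var strategyByName = map[string]ProcessingStrategy{
	"EXECUTE_REMEDIATION": ExecuteRemediation,
	"STORE_ONLY":          StoreOnly,
}

// parseStrategy converts a user-facing name into the enum value and rejects
// anything that is not one of the two documented names.
func parseStrategy(name string) (ProcessingStrategy, error) {
	value, ok := strategyByName[name]
	if !ok {
		return 0, fmt.Errorf("invalid processing strategy %q: want EXECUTE_REMEDIATION or STORE_ONLY", name)
	}
	return value, nil
}

func main() {
	fs := flag.NewFlagSet("syslog-health-monitor", flag.ContinueOnError)
	name := fs.String("processing-strategy", "EXECUTE_REMEDIATION",
		"event processing strategy: EXECUTE_REMEDIATION or STORE_ONLY")
	if err := fs.Parse(os.Args[1:]); err != nil {
		os.Exit(2)
	}

	strategy, err := parseStrategy(*name)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("processing strategy: %s (%d)\n", *name, strategy)
}
```

Failing fast on an unknown name keeps a typo in the Helm value from silently falling back to a default strategy.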
173-184: LGTM! Proper wiring to NewSyslogMonitor.
The validated and converted `processingStrategy` is correctly passed to the `NewSyslogMonitor` constructor, completing the configuration flow from CLI flag to monitor initialization.
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go (2)
30-42: Constructor update for `processingStrategy` looks correct.
The new `processingStrategy` parameter is properly accepted and stored in the handler struct. The signature change is consistent with the pattern used across other handlers (XID, GPUFallen).
95-110: ProcessingStrategy correctly propagated to HealthEvent.
The `processingStrategy` field is properly included in the `HealthEvent` construction, ensuring events carry the configured strategy for downstream processing.
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (2)
33-57: Constructor properly updated with `processingStrategy` parameter.
The new parameter is correctly positioned, stored in the handler struct, and follows the same pattern as the SXID handler. The implementation is consistent across all handlers in this PR.
175-190: HealthEvent correctly includes ProcessingStrategy.
The `processingStrategy` is properly propagated to the health event, ensuring consistent behavior with other handlers.
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (3)
153-195: Good test coverage for `ProcessingStrategy` propagation.
The test correctly uses `pb.ProcessingStrategy_STORE_ONLY` and validates that the generated event carries the same value (Line 194). This ensures the strategy is properly propagated through the handler.
197-232: Test validates `EXECUTE_REMEDIATION` strategy.
This test case complements the previous one by testing `pb.ProcessingStrategy_EXECUTE_REMEDIATION`, providing coverage for both primary strategy values.
306-350: `TestCreateHealthEventFromResponse` validates strategy in health event construction.
The test verifies that `ProcessingStrategy` is included in the generated health event. The assertion on Line 348 confirms the expected behavior.
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go (3)
165-190: TestProcessLine correctly validates ProcessingStrategy.
The test uses `pb.ProcessingStrategy_STORE_ONLY` and verifies the strategy is propagated to the generated event (Line 185). Good coverage of the happy path.
221-242: Direct field access for ProcessingStrategy assertion.
Line 241 uses `handler2.processingStrategy` directly, which is valid since this is a test within the same package. This approach correctly validates that the event's ProcessingStrategy matches the handler's configured value.
193-219: XID tracking tests updated with ProcessingStrategy.
The XID tracking tests correctly pass the `processingStrategy` parameter to the handler constructor. The test logic for suppression behavior remains intact.
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go (3)
309-317: Test validates ProcessingStrategy storage in SyslogMonitor.
Line 317 correctly asserts that the `processingStrategy` is stored in the monitor. This ensures the value is properly passed through and retained.
329-334: NewSyslogMonitorWithFactory tested with EXECUTE_REMEDIATION.
Good coverage of both constructor paths with different strategy values, ensuring both `NewSyslogMonitor` and `NewSyslogMonitorWithFactory` correctly handle the new parameter.
391-403: TestJournalProcessingLogic updated with processingStrategy.
The test correctly passes the strategy to the factory constructor. The test focuses on journal cursor handling, which is orthogonal to the strategy propagation.
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go (4)
40-57: NewSyslogMonitor correctly accepts and forwards processingStrategy.
The parameter is properly added and forwarded to `NewSyslogMonitorWithFactory`. The implementation maintains backward compatibility with the factory pattern.
63-104: NewSyslogMonitorWithFactory stores and propagates processingStrategy.
The strategy is correctly stored in the `SyslogMonitor` struct (Line 96) and will be used when creating handler instances. The initialization is consistent with existing patterns.
106-141: Handler creation consistently passes processingStrategy.
All three handler types (XID, SXID, GPUFallen) receive the `processingStrategy` parameter, ensuring consistent behavior across all check types. The error handling for handler initialization is appropriate.
792-820: prepareHealthEventWithAction includes ProcessingStrategy in health events.
Line 813 correctly sets `ProcessingStrategy: sm.processingStrategy` in the generated health event. This ensures events created via this path (e.g., healthy events after reboot on Line 312) carry the configured strategy.
Based on learnings, healthy events with `EXECUTE_REMEDIATION` are valid when the Fault Quarantine Manager needs to act on them to clear previous fault states.
tests/helpers/kube.go (4)
2232-2273: LGTM!
The `waitForDaemonSetRollout` function correctly waits for a DaemonSet rollout to complete by verifying that all desired pods are scheduled, updated, and ready. The logging is helpful for debugging test failures.
2370-2400: LGTM - logic is correct.
The function properly handles both `--flag=value` and `--flag value` argument styles. The slice manipulation on line 2392 is complex but correct.
2402-2424: LGTM!
The function correctly sets or updates container arguments, handling both new and existing flags.
2426-2450: LGTM!
The function correctly removes container arguments, handling both `--flag=value` and `--flag value` styles.
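To make the two argument styles concrete, here is a minimal sketch of setting and removing a single flag in a container argument list. The names `setArg` and `removeArg` are hypothetical and simplified to one flag at a time; the actual helpers in tests/helpers/kube.go take a `map[string]string` and may be structured differently. The sketch also assumes that a token not starting with `-` is the value of the preceding flag.

```go
package main

import (
	"fmt"
	"strings"
)

// setArg returns a copy of args in which flag carries value. It rewrites an
// existing "--flag=value" or "--flag value" entry in place and appends
// "--flag=value" when the flag is absent.
func setArg(args []string, flag, value string) []string {
	out := make([]string, 0, len(args)+1)
	found := false
	for i := 0; i < len(args); i++ {
		arg := args[i]
		switch {
		case strings.HasPrefix(arg, flag+"="):
			out = append(out, flag+"="+value)
			found = true
		case arg == flag:
			out = append(out, flag, value)
			found = true
			// Skip the old value if the next token is not another flag.
			if i+1 < len(args) && !strings.HasPrefix(args[i+1], "-") {
				i++
			}
		default:
			out = append(out, arg)
		}
	}
	if !found {
		out = append(out, flag+"="+value)
	}
	return out
}

// removeArg returns a copy of args without flag, in either style.
func removeArg(args []string, flag string) []string {
	out := make([]string, 0, len(args))
	for i := 0; i < len(args); i++ {
		arg := args[i]
		switch {
		case strings.HasPrefix(arg, flag+"="):
			// Drop the combined "--flag=value" token.
		case arg == flag:
			// Drop the flag and its separate value token, if any.
			if i+1 < len(args) && !strings.HasPrefix(args[i+1], "-") {
				i++
			}
		default:
			out = append(out, arg)
		}
	}
	return out
}

func main() {
	args := []string{"--verbose", "--processing-strategy", "EXECUTE_REMEDIATION"}
	args = setArg(args, "--processing-strategy", "STORE_ONLY")
	fmt.Println(args)
	fmt.Println(removeArg(args, "--processing-strategy"))
}
```

Building a new slice rather than splicing in place avoids the index arithmetic the review calls "complex but correct".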
CodeQL Analysis: Found 1 security alert(s).
Signed-off-by: Tanisha goyal <[email protected]>
d8972e9 to 49d8021
Actionable comments posted: 1
Fix all issues with AI agents
In @tests/helpers/syslog-health-monitor.go:
- Line 15: Add a package-level godoc comment for the helpers package by
inserting a comment like "// Package helpers ...", placed immediately before the
"package helpers" declaration (you can add it in
tests/helpers/syslog-health-monitor.go or create a new doc.go); ensure the
comment briefly describes the package purpose and follows Go doc convention
starting with "Package helpers".
Nitpick comments (5)
tests/helpers/kube.go (4)
2315-2316: Consider replacing fixed sleep with a more deterministic wait.
The 10-second sleep after rollout completion is a workaround that may cause flaky tests or unnecessary delays. Consider polling for pod readiness or container startup instead.
However, since this is a test helper and the rollout wait already ensures pods are ready, this may be acceptable if it addresses timing issues with the syslog health monitor initialization.
2321-2360: Inconsistent error handling between `UpdateDaemonSetArgs` and `RemoveDaemonSetArgs`.
`UpdateDaemonSetArgs` returns an `error`, while `RemoveDaemonSetArgs` uses `require.NoError` directly and returns nothing. This inconsistency makes the API harder to use and understand.
Consider aligning the signatures for consistency:
Suggested fix
-func RemoveDaemonSetArgs(ctx context.Context, t *testing.T, client klient.Client,
+func RemoveDaemonSetArgs(ctx context.Context, t *testing.T, client klient.Client,
 	daemonsetName string, containerName string, args map[string]string,
-) {
+) error {
 	t.Helper()
 	t.Logf("Removing args %v from daemonset %s/%s", args, NVSentinelNamespace, daemonsetName)
 	err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
 		// ... existing code ...
 	})
-	require.NoError(t, err, "failed to remove args from daemonset %s/%s", NVSentinelNamespace, daemonsetName)
+	if err != nil {
+		return fmt.Errorf("failed to remove args from daemonset %s/%s: %w", NVSentinelNamespace, daemonsetName, err)
+	}
 	t.Logf("Waiting for daemonset %s/%s rollout to complete after restoration", NVSentinelNamespace, daemonsetName)
 	waitForDaemonSetRollout(ctx, t, client, daemonsetName)
 	t.Log("DaemonSet restored successfully")
+	return nil
 }
2418-2442: Parameter type could be simplified.
The `args` parameter is `map[string]string` but only the keys are used (values are ignored via `range args`). Consider using `[]string` for clarity, or document that only keys matter.
This is a minor inconsistency with `setArgsOnContainer`, which uses both keys and values.
2444-2488: Unused `daemonsetName` parameter.
The `daemonsetName` parameter is only used in the error message at line 2484. The actual pod lookup uses `podNamePattern` via `GetPodOnWorkerNode`. Consider either:
- Removing the unused parameter
- Using it to verify the pod belongs to the expected DaemonSet via owner references
Option 2: Verify DaemonSet ownership
// After getting the pod, verify it belongs to the expected DaemonSet
for _, ownerRef := range pod.OwnerReferences {
	if ownerRef.Kind == "DaemonSet" && ownerRef.Name == daemonsetName {
		// Pod belongs to expected DaemonSet
		break
	}
}
tests/helpers/syslog-health-monitor.go (1)
32-34: Function comments should follow Go conventions.
Per coding guidelines, function comments are required for all exported Go functions, and each comment should start with the function name.
Suggested fix
-// helper function to set up syslog health monitor and port forward to it
+// SetUpSyslogHealthMonitor sets up the syslog health monitor for tests.
+// If args is provided, it updates the DaemonSet with the specified arguments.
+// Returns the test node name, the syslog pod, and a stop channel for the port-forward.
 func SetUpSyslogHealthMonitor(ctx context.Context, t *testing.T, client klient.Client, args map[string]string) (string, *v1.Pod, chan struct{}) {
-// helper function to roll back syslog health monitor daemonset and stop the port forward
+// TearDownSyslogHealthMonitor cleans up the syslog health monitor test environment.
+// It stops the port-forward, removes DaemonSet args if provided, restarts the pod,
+// cleans up metadata, and removes the ManagedByNVSentinel label.
 func TearDownSyslogHealthMonitor(ctx context.Context, t *testing.T, client klient.Client,
Also applies to: 73-76
Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Files selected for processing (4)
- platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
- tests/helpers/kube.go
- tests/helpers/syslog-health-monitor.go
- tests/syslog_health_monitor_test.go
Additional context used
Path-based instructions (2)
**/*.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via `log/slog` in Go code
Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as `synced` over `ok` for cache sync checks
Use `client-go` for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
- tests/syslog_health_monitor_test.go
- tests/helpers/syslog-health-monitor.go
- tests/helpers/kube.go
- platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
**/*_test.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Use `envtest` for testing Kubernetes controllers instead of fake clients
Use `testify/assert` and `testify/require` for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Files:
- tests/syslog_health_monitor_test.go
- platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
Learnings (11)
Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Applied to files:
- tests/syslog_health_monitor_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
- tests/syslog_health_monitor_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
- tests/syslog_health_monitor_test.go
Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
- tests/syslog_health_monitor_test.go
- platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Applied to files:
- tests/syslog_health_monitor_test.go
Learning: 2026-01-12T05:13:19.057Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:347-372
Timestamp: 2026-01-12T05:13:19.057Z
Learning: In the NVSentinel codebase, all HealthEvent creation paths across health monitors (syslog-health-monitor, kubernetes-object-monitor, csp-health-monitor) consistently set GeneratedTimestamp to timestamppb.New(time.Now()), ensuring it is never nil in production flows.
Applied to files:
- tests/syslog_health_monitor_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on kata.enabled label
Applied to files:
- tests/helpers/kube.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/daemonset*.yaml : Explain DaemonSet variant selection logic in Helm chart documentation
Applied to files:
- tests/helpers/kube.go
Learning: 2025-12-22T16:16:31.660Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.
Applied to files:
- tests/helpers/kube.go
Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.
Applied to files:
- platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
Code graph analysis (2)
tests/syslog_health_monitor_test.go (3)
- tests/helpers/syslog-health-monitor.go (2): SetUpSyslogHealthMonitor(33-71), StubJournalHTTPPort(27-27)
- tests/helpers/kube.go (2): InjectSyslogMessages(1820-1843), PortForwardPod(1614-1658)
- tests/helpers/fault_quarantine.go (2): AssertQuarantineState(317-384), QuarantineAssertion(56-60)
tests/helpers/syslog-health-monitor.go (3)
- tests/helpers/kube.go (8): UpdateDaemonSetArgs(2276-2319), GetDaemonSetPodOnWorkerNode(2444-2488), NVSentinelNamespace(64-64), PortForwardPod(1614-1658), RemoveDaemonSetArgs(2321-2360), EventuallyWaitTimeout(61-61), WaitInterval(63-63), RemoveNodeManagedByNVSentinelLabel(1411-1425)
- tests/helpers/metadata.go (3): CreateTestMetadata(59-108), InjectMetadata(110-181), DeleteMetadata(183-229)
- commons/pkg/auditlogger/auditlogger.go (1): Log(114-134)
Additional comments (14)
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (2)
1606-1614: LGTM! Good defensive coding pattern.
Using `require.NotEmpty` before accessing `nvsentinelConditions[0]` prevents potential index-out-of-bounds panics if the slice is unexpectedly empty. The `require` (vs `assert`) correctly stops test execution immediately, preventing the subsequent `[0]` access. Using `assert.Empty` for the negative case is cleaner and more idiomatic.
1622-1630: LGTM! Consistent safe access pattern.
Same defensive pattern correctly applied to Kubernetes events assertions: `require.NotEmpty` guards the subsequent `events.Items[0]` access.
tests/helpers/kube.go (3)
2232-2273: LGTM! Well-structured DaemonSet rollout helper.
The `waitForDaemonSetRollout` function correctly checks all necessary conditions for a complete rollout: DesiredNumberScheduled > 0, UpdatedNumberScheduled matches desired, and NumberReady matches desired. Good progress logging for debugging.
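The three rollout conditions listed above can be written as a single predicate. The sketch below is an illustration only: `daemonSetStatus` is a hypothetical local struct that mirrors the three relevant counters, and `rolloutComplete` is not the helper from tests/helpers/kube.go, which also polls the API server and logs progress.

```go
package main

import "fmt"

// daemonSetStatus is a hypothetical, trimmed-down mirror of the DaemonSet
// status counters that the rollout check relies on.
type daemonSetStatus struct {
	DesiredNumberScheduled int32
	UpdatedNumberScheduled int32
	NumberReady            int32
}

// rolloutComplete reports whether every desired pod has been updated to the
// current template and is ready. A desired count of zero is treated as "not
// yet rolled out" so that an unscheduled DaemonSet does not pass by accident.
func rolloutComplete(s daemonSetStatus) bool {
	return s.DesiredNumberScheduled > 0 &&
		s.UpdatedNumberScheduled == s.DesiredNumberScheduled &&
		s.NumberReady == s.DesiredNumberScheduled
}

func main() {
	inProgress := daemonSetStatus{DesiredNumberScheduled: 3, UpdatedNumberScheduled: 3, NumberReady: 2}
	done := daemonSetStatus{DesiredNumberScheduled: 3, UpdatedNumberScheduled: 3, NumberReady: 3}
	fmt.Println(rolloutComplete(inProgress), rolloutComplete(done))
}
```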
2362-2392: LGTM! Comprehensive argument update logic.
The function correctly handles both `--flag=value` and `--flag value` argument styles. The slice manipulation for inserting a value after a standalone flag is correct.
2394-2416: LGTM!
Clear logic for setting container arguments with proper handling of both existing and new arguments.
tests/syslog_health_monitor_test.go (5)
25-26: LGTM!
Import correctly references the helpers package with appropriate alias.
47-58: LGTM!
Setup correctly uses the new helper function with `nil` args (default strategy) and properly stores context values for teardown.
184-215: LGTM!
This test intentionally uses manual setup instead of `SetUpSyslogHealthMonitor` because it specifically tests behavior without metadata injection. The setup correctly:
- Gets the syslog pod directly
- Sets up port-forward manually
- Sets `ManagedByNVSentinel=false` without injecting metadata
361-420: LGTM! Well-structured test for STORE_ONLY strategy.
The test correctly:
- Sets up with `--processing-strategy=STORE_ONLY` via helper args
- Injects XID errors that would normally trigger conditions
- Verifies no node condition is created (`EnsureNodeConditionNotPresent`)
- Verifies node was not cordoned (`AssertQuarantineState` with `ExpectCordoned: false`)
- Tears down by removing the same args that were added
This validates that the STORE_ONLY processing strategy correctly prevents remediation actions.
159-170: LGTM!
Teardown correctly passes `nil` for args since the setup didn't modify DaemonSet arguments. The helper will skip the `RemoveDaemonSetArgs` call.
tests/helpers/syslog-health-monitor.go (4)
26-30: LGTM!
Constants are well-defined and match the expected DaemonSet and container names. `StubJournalHTTPPort = 9091` is correctly exported for use in tests.
32-71: LGTM! Well-designed setup helper.
The function:
- Conditionally updates DaemonSet args (for strategy testing)
- Gets the pod from the current rollout on a worker node
- Injects test metadata for GPU/NVSwitch topology
- Establishes port-forward for syslog message injection
- Sets the ManagedByNVSentinel label
The return values provide all necessary context for tests and teardown.
103-111: LGTM! Resilient teardown error handling.
The teardown appropriately uses warning logs instead of failing assertions for cleanup steps. This ensures:
- Metadata is cleaned up even if pod deletion failed
- Label removal attempts even if earlier steps had issues
- Test teardown doesn't mask actual test failures with cleanup errors
91-100: The condition check logic is correct and follows the established pattern throughout the codebase.
When the monitor detects XID errors, it sets the condition with `Reason: "SysLogsXIDErrorIsNotHealthy"` and `Status: ConditionTrue`. When errors are cleared (as in the teardown after pod restart), the condition changes to `Reason: "SysLogsXIDErrorIsHealthy"` and `Status: ConditionFalse`. The code correctly waits for this healthy state transition, as confirmed by the same pattern used in smoke_test.go and other health monitor implementations.
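The healthy-state check described above amounts to looking for a condition whose reason and status both indicate recovery. A minimal sketch follows, assuming a hypothetical local `nodeCondition` struct in place of the Kubernetes node condition type and a hypothetical `xidErrorCleared` helper; the two reason strings come from the review comment, and the status is written as the plain string `"False"`.

```go
package main

import "fmt"

// nodeCondition is a hypothetical, trimmed-down stand-in for a Kubernetes
// node condition; only the fields needed for the check are included.
type nodeCondition struct {
	Status string
	Reason string
}

// xidErrorCleared reports whether any condition shows the healthy XID state:
// reason "SysLogsXIDErrorIsHealthy" together with status "False".
func xidErrorCleared(conditions []nodeCondition) bool {
	for _, c := range conditions {
		if c.Reason == "SysLogsXIDErrorIsHealthy" && c.Status == "False" {
			return true
		}
	}
	return false
}

func main() {
	faulty := []nodeCondition{{Status: "True", Reason: "SysLogsXIDErrorIsNotHealthy"}}
	healthy := []nodeCondition{{Status: "False", Reason: "SysLogsXIDErrorIsHealthy"}}
	fmt.Println(xidErrorCleared(faulty), xidErrorCleared(healthy))
}
```

In a teardown helper, a predicate like this would typically be polled until it returns true or a timeout expires.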
CodeQL Analysis: Found 1 security alert(s).
Signed-off-by: Tanisha goyal <[email protected]>
Summary
Type of Change
Component(s) Affected
Testing
Checklist
Testing
`nvs-dgxc-k8s-oci-lhr-dev3` by installing nvsentinel from `main` changes
Summary by CodeRabbit