feat: add event handling strategy changes in gpu health monitor #611
Conversation
Walkthrough

Adds a `processingStrategy` option (EXECUTE_REMEDIATION | STORE_ONLY) exposed in Helm values and DaemonSet args, a CLI flag with runtime validation, propagated into PlatformConnectorEventProcessor and embedded in emitted HealthEvent messages; tests and Go test helpers updated to exercise the store-only path.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Helm as Helm (values.yaml)
participant K8s as Kubernetes (DaemonSet)
participant Pod as Container (gpu-health-monitor)
participant CLI as CLI parser
participant Proc as PlatformConnectorEventProcessor
participant Sink as Platform Connector / Event Sink
Helm->>K8s: render `processingStrategy` into DaemonSet args
K8s->>Pod: start container with `--processing-strategy`
Pod->>CLI: parse args and validate enum
CLI->>Proc: construct processor with processing_strategy
Proc->>Sink: emit HealthEvent { processingStrategy: ... }
```
Estimated code review effort: 4 (Complex) | ~45 minutes

Suggested reviewers
Pre-merge checks: 2 passed, 1 failed (warning).

Tip: You can configure your own custom pre-merge checks in the settings.

Finishing touches
Recent review details

Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Files selected for processing (2)
Files skipped from review as they are similar to previous changes (2)

Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it is a critical failure.

golangci-lint (2.5.0): level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"
Actionable comments posted: 3
Nitpick comments (5)
platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)
325-343: Add godoc for the exported function.

The function `filterProcessableEvents` is exported (starts with uppercase) but lacks a proper godoc comment. Per coding guidelines, function comments are required for all exported Go functions. The comment on line 325 is present but should follow godoc format.

Also, consider nil-safety: if `healthEvents` or `healthEvents.Events` is nil, this could panic.

Suggested improvement

```diff
-// filterProcessableEvents filters out STORE_ONLY events that should not create node conditions or K8s events.
-func filterProcessableEvents(healthEvents *protos.HealthEvents) []*protos.HealthEvent {
+// filterProcessableEvents filters out STORE_ONLY events that should not create node conditions or K8s events.
+// It returns only events with processing strategy other than STORE_ONLY.
+func filterProcessableEvents(healthEvents *protos.HealthEvents) []*protos.HealthEvent {
+	if healthEvents == nil || len(healthEvents.Events) == 0 {
+		return nil
+	}
+
 	var processableEvents []*protos.HealthEvent
 	for _, healthEvent := range healthEvents.Events {
```

tests/gpu_health_monitor_test.go (1)
502-510: Teardown clears "Memory" error that was never injected.

The setup only injects an Inforom error (field 84), but the teardown `clearCommands` includes both Inforom and Memory errors. While this is harmless (clearing a non-existent error is a no-op), it adds unnecessary operations. Consider matching the teardown to what was actually injected.

Suggested simplification

```diff
 clearCommands := []struct {
 	name      string
 	fieldID   string
 	value     string
 	condition string
 }{
 	{"Inforom", "84", "1", "GpuInforomWatch"},
-	{"Memory", "395", "0", "GpuMemWatch"},
 }
```

health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py (1)
125-132: Consider logging the strategy name instead of the integer value.

Line 132 logs `processing_strategy_value`, which is the integer enum value (e.g., 0 or 1). For better readability in logs, consider logging the original string name or using `ProcessingStrategy.Name()`.

Suggested improvement

```diff
- log.info(f"Event handling strategy configured to: {processing_strategy_value}")
+ log.info(f"Event handling strategy configured to: {processing_strategy}")
```

This logs the human-readable string (e.g., EXECUTE_REMEDIATION) instead of the integer value.

tests/helpers/kube.go (2)
2306-2307: Hardcoded sleep after rollout completion is fragile.

The 10-second sleep after `waitForDaemonSetRollout` adds unnecessary delay. The rollout wait already ensures pods are ready. If additional startup time is genuinely needed, consider polling for a specific readiness indicator instead.

Suggested improvement

Remove the hardcoded sleep or replace it with a condition-based wait if there's a specific startup behavior to await:

```diff
 	t.Logf("Waiting for daemonset %s/%s rollout to complete", NVSentinelNamespace, daemonsetName)
 	waitForDaemonSetRollout(ctx, t, client, daemonsetName)
-	t.Logf("Waiting 10 seconds for daemonset pods to start")
-	time.Sleep(10 * time.Second)
-
 	return originalDaemonSet, nil
```
2270-2272: Add godoc comments for exported functions.

Per coding guidelines, exported Go functions require function comments. The new exported functions `UpdateDaemonSetProcessingStrategy`, `RestoreDaemonSet`, and `GetDaemonSetPodOnWorkerNode` are missing godoc comments.

Suggested improvement

```diff
+// UpdateDaemonSetProcessingStrategy updates the specified container in a DaemonSet to use
+// STORE_ONLY processing strategy, waits for rollout completion, and returns the original DaemonSet.
 func UpdateDaemonSetProcessingStrategy(ctx context.Context, t *testing.T, client klient.Client,
 	daemonsetName string, containerName string) (*appsv1.DaemonSet, error) {
```

```diff
+// RestoreDaemonSet restores a DaemonSet's containers to their original state and waits for rollout.
 func RestoreDaemonSet(ctx context.Context, t *testing.T, client klient.Client,
 	originalDaemonSet *appsv1.DaemonSet, daemonsetName string,
 ) error {
```

```diff
+// GetDaemonSetPodOnWorkerNode returns a running, ready pod from the DaemonSet on a real worker node.
 func GetDaemonSetPodOnWorkerNode(ctx context.Context, t *testing.T, client klient.Client,
 	daemonsetName string, podNamePattern string) (*v1.Pod, error) {
```

As per coding guidelines, function comments are required for all exported Go functions.
Also applies to: 2312-2314, 2346-2348
Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Files ignored due to path filters (1)

data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go

Files selected for processing (23)

data-models/protobufs/health_event.proto
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml
event-exporter/pkg/transformer/cloudevents.go
event-exporter/pkg/transformer/cloudevents_test.go
fault-quarantine/pkg/evaluator/rule_evaluator_test.go
fault-quarantine/pkg/initializer/init.go
health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py
health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
platform-connectors/pkg/connectors/kubernetes/process_node_events.go
store-client/pkg/client/mongodb_pipeline_builder.go
store-client/pkg/client/pipeline_builder.go
store-client/pkg/client/pipeline_builder_test.go
store-client/pkg/client/postgresql_pipeline_builder.go
tests/event_exporter_test.go
tests/gpu_health_monitor_test.go
tests/helpers/healthevent.go
tests/helpers/kube.go
Additional context used

Path-based instructions (7)
**/*.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
fault-quarantine/pkg/initializer/init.go
store-client/pkg/client/mongodb_pipeline_builder.go
store-client/pkg/client/postgresql_pipeline_builder.go
store-client/pkg/client/pipeline_builder.go
store-client/pkg/client/pipeline_builder_test.go
tests/helpers/healthevent.go
fault-quarantine/pkg/evaluator/rule_evaluator_test.go
tests/event_exporter_test.go
platform-connectors/pkg/connectors/kubernetes/process_node_events.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
event-exporter/pkg/transformer/cloudevents_test.go
tests/helpers/kube.go
event-exporter/pkg/transformer/cloudevents.go
tests/gpu_health_monitor_test.go
data-models/protobufs/**/*.proto
CodeRabbit inference engine (.github/copilot-instructions.md)

data-models/protobufs/**/*.proto: Define Protocol Buffer messages in data-models/protobufs/ directory
Use semantic versioning for breaking changes in Protocol Buffer messages
Include comprehensive comments for all fields in Protocol Buffer messages
Files:
data-models/protobufs/health_event.proto
**/*_test.go
CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior
Files:
store-client/pkg/client/pipeline_builder_test.go
fault-quarantine/pkg/evaluator/rule_evaluator_test.go
tests/event_exporter_test.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
event-exporter/pkg/transformer/cloudevents_test.go
tests/gpu_health_monitor_test.go
**/values.yaml
CodeRabbit inference engine (.github/copilot-instructions.md)

**/values.yaml: Document all values in Helm chart values.yaml with inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable
Files:
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml
**/daemonset*.yaml
CodeRabbit inference engine (.github/copilot-instructions.md)
Explain DaemonSet variant selection logic in Helm chart documentation
Files:
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml
distros/kubernetes/**/*daemonset*.yaml
CodeRabbit inference engine (.github/copilot-instructions.md)

distros/kubernetes/**/*daemonset*.yaml: Separate DaemonSets should be created for kata vs regular nodes using nodeAffinity based on kata.enabled label
Regular node DaemonSets should use /var/log volume mount for file-based logs
Kata node DaemonSets should use /run/log/journal and /var/log/journal volume mounts for systemd journal
Files:
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml
**/*.py
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code
Files:
health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py
Learnings (6)
π Learning: 2025-11-10T10:25:19.443Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.
Applied to files:
fault-quarantine/pkg/initializer/init.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/event_exporter_test.gotests/gpu_health_monitor_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
tests/event_exporter_test.gotests/gpu_health_monitor_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `testify/assert` and `testify/require` for assertions in Go tests
Applied to files:
tests/event_exporter_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods
Applied to files:
tests/event_exporter_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/daemonset*.yaml : Explain DaemonSet variant selection logic in Helm chart documentation
Applied to files:
tests/helpers/kube.go
Code graph analysis (10)
store-client/pkg/client/mongodb_pipeline_builder.go (2)
store-client/pkg/datastore/types.go (4)
ToPipeline(161-163)D(131-133)E(126-128)A(136-138)data-models/pkg/protos/health_event.pb.go (1)
ProcessingStrategy_EXECUTE_REMEDIATION(46-46)
store-client/pkg/client/postgresql_pipeline_builder.go (2)
store-client/pkg/datastore/types.go (4)
ToPipeline(161-163)D(131-133)E(126-128)A(136-138)data-models/pkg/protos/health_event.pb.go (1)
ProcessingStrategy_EXECUTE_REMEDIATION(46-46)
store-client/pkg/client/pipeline_builder_test.go (3)
store-client/pkg/client/pipeline_builder.go (1)
PipelineBuilder(26-47)store-client/pkg/client/mongodb_pipeline_builder.go (1)
NewMongoDBPipelineBuilder(29-31)store-client/pkg/client/postgresql_pipeline_builder.go (1)
NewPostgreSQLPipelineBuilder(29-31)
tests/helpers/healthevent.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-17)data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)
tests/event_exporter_test.go (1)
tests/helpers/event_exporter.go (1)
ValidateCloudEvent(221-245)
platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)
data-models/pkg/protos/health_event.pb.go (11)
HealthEvents(156-162)HealthEvents(175-175)HealthEvents(190-192)HealthEvent(260-280)HealthEvent(293-293)HealthEvent(308-310)ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)ProcessingStrategy_STORE_ONLY(47-47)
event-exporter/pkg/transformer/cloudevents_test.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-17)data-models/pkg/protos/health_event.pb.go (5)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)ProcessingStrategy_STORE_ONLY(47-47)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)
event-exporter/pkg/transformer/cloudevents.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-17)data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)
tests/gpu_health_monitor_test.go (3)
tests/helpers/kube.go (10)
UpdateDaemonSetProcessingStrategy(2271-2310)GetDaemonSetPodOnWorkerNode(2346-2390)NVSentinelNamespace(64-64)SetNodeManagedByNVSentinel(1365-1384)ExecInPod(1539-1574)EnsureNodeConditionNotPresent(1773-1794)RestoreDaemonSet(2312-2344)CheckNodeConditionExists(1259-1276)EventuallyWaitTimeout(61-61)WaitInterval(63-63)tests/helpers/metadata.go (3)
CreateTestMetadata(59-108)InjectMetadata(110-181)DeleteMetadata(183-229)tests/helpers/fault_quarantine.go (2)
AssertQuarantineState(315-382)QuarantineAssertion(56-60)
Additional comments (44)
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (2)
1391-1506: Well-structured test suite for processing strategy.

The test suite properly validates the new ProcessingStrategy filtering behavior with good coverage:
- STORE_ONLY events (fatal and non-fatal) correctly don't create node conditions or K8s events
- EXECUTE_REMEDIATION events properly create node conditions
- Mixed strategy scenarios validate selective processing
The table-driven approach and per-test isolation with local clientsets align well with the coding guidelines.
1508-1587: Test implementation looks correct.The test harness correctly:
- Creates isolated environments per test case
- Filters standard K8s node conditions (NodeReady, NodeMemoryPressure, etc.) when counting NVSentinel conditions
- Validates both the presence/absence of node conditions and K8s events
One minor observation: the description field is logged at the end but could be incorporated into the assertion failure messages for better debugging context, though the current approach is acceptable.

platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)
372-416: Implementation correctly integrates filtering.

The processHealthEvents function properly:

- Filters events using filterProcessableEvents first
- Uses only processableEvents for both node condition updates and K8s event creation
- Maintains the existing logic for fatal vs non-fatal event handling

One edge case to consider: if processableEvents is empty (all events are STORE_ONLY), the function returns early without errors, which appears to be the intended behavior.
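To make the filter-then-process flow concrete, here is a compressed sketch. The applyNodeConditions and emitK8sEvents functions are hypothetical placeholders standing in for the real node-condition and event logic; only the generated protos types and the filtering idea come from the code under review.

```go
package connector

import (
	"fmt"

	"github.com/nvidia/nvsentinel/data-models/pkg/protos"
)

// Hypothetical stand-ins for the real node-condition and K8s event logic.
func applyNodeConditions(events []*protos.HealthEvent) error { return nil }
func emitK8sEvents(events []*protos.HealthEvent) error       { return nil }

// filterProcessableEvents mirrors the reviewed helper: keep everything that is
// not STORE_ONLY, guarding against a nil input.
func filterProcessableEvents(healthEvents *protos.HealthEvents) []*protos.HealthEvent {
	if healthEvents == nil {
		return nil
	}
	var processable []*protos.HealthEvent
	for _, ev := range healthEvents.Events {
		if ev.GetProcessingStrategy() != protos.ProcessingStrategy_STORE_ONLY {
			processable = append(processable, ev)
		}
	}
	return processable
}

// handle shows the filter-then-process flow: if every event was STORE_ONLY,
// return early; otherwise apply node conditions and emit events.
func handle(events *protos.HealthEvents) error {
	processable := filterProcessableEvents(events)
	if len(processable) == 0 {
		return nil // all events are observe-only
	}
	if err := applyNodeConditions(processable); err != nil {
		return fmt.Errorf("applying node conditions: %w", err)
	}
	return emitK8sEvents(processable)
}
```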
event-exporter/pkg/transformer/cloudevents.go (1)

66-66: Correctly propagates processingStrategy to CloudEvent.

The addition of processingStrategy to the CloudEvent data payload is consistent with how other enum fields (e.g., recommendedAction on line 61) are handled, using .String() for serialization.

fault-quarantine/pkg/evaluator/rule_evaluator_test.go (1)
263-263: Test correctly updated for new processingStrategy field.

The expected map now includes processingStrategy: float64(0), which correctly reflects:

- The default enum value EXECUTE_REMEDIATION = 0 when no explicit value is set
- The JSON unmarshaling behavior where numbers become float64 in interface{}
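A minimal standalone example (not taken from the test itself) of why the number arrives as float64:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// encoding/json decodes every JSON number into float64 when the target is an
// interface{}/any, which is why the expected map uses float64(0) for the
// default EXECUTE_REMEDIATION enum value.
func main() {
	var decoded map[string]any
	_ = json.Unmarshal([]byte(`{"processingStrategy": 0}`), &decoded)

	v := decoded["processingStrategy"]
	fmt.Printf("%T %v\n", v, v) // prints: float64 0
}
```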
store-client/pkg/client/pipeline_builder_test.go (1)

69-86: Test follows established patterns correctly.

The new test TestProcessableHealthEventInsertsPipeline is well-structured and consistent with the existing test patterns in this file:

- Uses table-driven tests for both MongoDB and PostgreSQL implementations
- Properly validates pipeline is non-nil, non-empty, and has exactly 1 stage
- Uses require for critical assertions and assert for validations

As per coding guidelines, consider using a more descriptive test name format like TestPipelineBuilder_ProcessableHealthEventInserts to align with TestFunctionName_Scenario_ExpectedBehavior.

fault-quarantine/pkg/initializer/init.go (1)
66-66: LGTM! Pipeline switch correctly filters for processable events.

The change from BuildAllHealthEventInsertsPipeline() to BuildProcessableHealthEventInsertsPipeline() correctly ensures that fault-quarantine only processes health events with processingStrategy=EXECUTE_REMEDIATION, ignoring observability-only (STORE_ONLY) events.

distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml (1)
61-62: LGTM! Processing strategy argument properly configured.

The --processing-strategy argument is correctly added with the value sourced from .Values.processingStrategy and properly quoted. This aligns with the PR objectives to enable configurable event handling strategy.

distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml (1)
57-62: LGTM! Well-documented configuration option.

The processingStrategy configuration is clearly documented with:
- Valid values (EXECUTE_REMEDIATION, STORE_ONLY)
- Default value that maintains backward compatibility
- Clear explanations of each mode's behavior
This follows the coding guidelines for Helm chart documentation.
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml (1)
61-62: LGTM! Consistent with DCGM 3.x template.

The --processing-strategy argument is added consistently with the DCGM 3.x DaemonSet template, ensuring uniform behavior across DCGM versions.

store-client/pkg/client/postgresql_pipeline_builder.go (2)
19-19: LGTM! Import added for ProcessingStrategy enum.

The import of github.com/nvidia/nvsentinel/data-models/pkg/protos is necessary to reference the ProcessingStrategy_EXECUTE_REMEDIATION enum value used in the new pipeline.

119-132: LGTM! Pipeline correctly filters for processable events.

The new BuildProcessableHealthEventInsertsPipeline() method:

- Follows the established pipeline pattern from BuildAllHealthEventInsertsPipeline()
- Correctly filters INSERT operations where processingStrategy equals EXECUTE_REMEDIATION
- Uses the appropriate int32 cast for the protobuf enum value

This enables PostgreSQL change streams to ignore observability-only events (STORE_ONLY).
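Conceptually, the filter boils down to matching inserts whose strategy equals the enum's integer value. The tiny sketch below shows only the int32 cast of the generated constant; the plain-map shape and the document field path are assumptions for illustration, not the builders' actual pipeline types.

```go
package main

import (
	"fmt"

	"github.com/nvidia/nvsentinel/data-models/pkg/protos"
)

// Illustrative only: casting the generated enum constant to int32 yields the
// wire value (0) that a change-stream match stage would compare against. The
// field path used here is an assumption, not the repository's actual schema.
func main() {
	filter := map[string]any{
		"operationType": "insert",
		"fullDocument.healthevent.processingStrategy": int32(protos.ProcessingStrategy_EXECUTE_REMEDIATION),
	}
	fmt.Println(filter)
}
```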
event-exporter/pkg/transformer/cloudevents_test.go (2)

69-69: LGTM! Test correctly initializes ProcessingStrategy.

The test event properly sets ProcessingStrategy: pb.ProcessingStrategy_STORE_ONLY to validate that the new field is handled correctly during CloudEvent transformation.

106-108: LGTM! Test validates ProcessingStrategy propagation.

The assertion correctly verifies that the processingStrategy field appears in the CloudEvent payload with the expected string value "STORE_ONLY", ensuring proper transformation from the protobuf enum.

store-client/pkg/client/pipeline_builder.go (1)
35-38: LGTM! Interface extension well-documented.

The new BuildProcessableHealthEventInsertsPipeline() method:
- Is clearly documented with its purpose and use case
- Explains the filtering behavior (processingStrategy=EXECUTE_REMEDIATION)
- References the consumer (fault-quarantine)
- Follows the documentation pattern of other interface methods
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (2)
14-17: LGTM! ProcessingStrategy enum properly defined.The new
ProcessingStrategyenum class withEXECUTE_REMEDIATIONandSTORE_ONLYmembers is correctly structured according to Python protobuf stub conventions.
31-32: LGTM! HealthEvent ProcessingStrategy integration complete.The
processingStrategyfield is properly integrated into theHealthEventmessage with:
- Module-level enum constants (lines 31-32)
- Field in
__slots__(line 78)- Field number constant (line 104)
- Type annotation (line 120)
__init__parameter with correct typing (line 138)This generated stub correctly reflects the protobuf schema changes.
Also applies to: 78-78, 104-104, 120-120, 138-138
tests/event_exporter_test.go (1)
25-26: Import addition looks correct.

The import of "tests/helpers" is properly added and used throughout the test file for helper functions.

data-models/protobufs/health_event.proto (2)
32-38: Well-designed enum with appropriate default value.

Using EXECUTE_REMEDIATION = 0 as the default is correct design: existing clients that don't set the field will get the expected remediation behavior. The comments clearly explain each strategy's semantics, which aligns with the coding guidelines for comprehensive field documentation in Protocol Buffer messages.
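A small sketch of what that default means in practice for the generated Go types (the comments show expected results, not captured program output):

```go
package main

import (
	"fmt"

	"github.com/nvidia/nvsentinel/data-models/pkg/protos"
)

// A HealthEvent that never sets processingStrategy still reports the enum's
// zero value, EXECUTE_REMEDIATION, so older producers keep the remediation
// behavior without any code change.
func main() {
	event := &protos.HealthEvent{}
	fmt.Println(event.GetProcessingStrategy())                                                  // EXECUTE_REMEDIATION
	fmt.Println(event.GetProcessingStrategy() == protos.ProcessingStrategy_EXECUTE_REMEDIATION) // true
}
```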
77-77: Field addition looks correct.

The new processingStrategy field at position 16 maintains backward compatibility and follows the existing field numbering sequence.

store-client/pkg/client/mongodb_pipeline_builder.go (2)
87-100: Pipeline implementation is correct.

The method follows existing patterns and correctly filters for EXECUTE_REMEDIATION events using the protobuf enum's integer value. This ensures only events intended for remediation are processed by downstream consumers.

19-19: Import addition is appropriate.

The protos import is required to reference the ProcessingStrategy_EXECUTE_REMEDIATION constant.

tests/helpers/healthevent.go (2)
48-48: Field addition follows existing patterns.

The ProcessingStrategy field with omitempty JSON tag is consistent with other optional fields in the struct. The default zero value maps to EXECUTE_REMEDIATION, which is the expected default behavior.

153-156: Builder method follows established conventions.

The WithProcessingStrategy method maintains consistency with other builder methods in the file.

tests/gpu_health_monitor_test.go (2)
414-463: Test structure and setup are well-organized.The test properly:
- Updates the DaemonSet to use STORE_ONLY strategy
- Waits for rollout completion
- Injects a GPU error to trigger event generation
- Stores context values for teardown
The use of
UpdateDaemonSetProcessingStrategyandGetDaemonSetPodOnWorkerNodehelpers keeps the test readable.
465-481: Test assertions correctly validate STORE_ONLY behavior.The test verifies that with
STORE_ONLYprocessing strategy:
- Node conditions are NOT applied
- Node is NOT cordoned
- No quarantine annotation is present
This validates the core feature that STORE_ONLY events are observed but don't modify cluster state.
health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py (5)
107-108: Processing strategy parameter correctly added to test initialization.The test now passes
platformconnector_pb2.STORE_ONLYto the processor, enabling verification of the new processing strategy flow.
301-302: Good assertion for processingStrategy propagation.Verifying that
nvlink_failure_event.processingStrategy == platformconnector_pb2.STORE_ONLYconfirms the strategy is correctly propagated through the event pipeline to the generated HealthEvent.
523-524: Test uses EXECUTE_REMEDIATION for connectivity restored scenario.This is appropriateβthe connectivity restored test uses
EXECUTE_REMEDIATIONwhich tests the alternative processing path, providing coverage for both enum values.
549-549: Assertion validates EXECUTE_REMEDIATION propagation.This confirms the test covers the
EXECUTE_REMEDIATIONstrategy path in the restored connectivity scenario.
493-493: Comprehensive assertion for DCGM connectivity failure event.The test verifies all expected fields including the new
processingStrategyfield, ensuring the complete event structure is validated.health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)
1-51: LGTM - Generated protobuf code.This is an auto-generated file from the protobuf compiler. The changes correctly reflect the addition of the
ProcessingStrategyenum and theprocessingStrategyfield (field number 16) to theHealthEventmessage, as indicated by the updated serialized descriptor and adjusted byte ranges.health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py (3)
24-24: LGTM - Import alias consistent with existing usage.

The alias platformconnector_pb2 for health_event_pb2 maintains consistency with the existing import pattern in platform_connector.py.

74-80: LGTM - CLI option well-defined.

The new --processing-strategy option is appropriately configured with a sensible default (EXECUTE_REMEDIATION) and a clear help string describing the valid values.

28-51: LGTM - Parameter threading through event processor initialization.

The processing_strategy parameter is correctly added to the _init_event_processor signature and passed through to PlatformConnectorEventProcessor.

tests/helpers/kube.go (4)
2208-2249: LGTM - DaemonSet rollout wait logic.

The rollout status checks correctly verify that all desired pods are scheduled, updated, and ready before considering the rollout complete.

2251-2268: LGTM - Container argument update handles both formats.

The helper correctly handles both --processing-strategy=VALUE and --processing-strategy VALUE argument formats, and appends the argument if not present.
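As an illustration of the two argument styles being handled, a generic upsert over a container's args could look like the sketch below; it is a hypothetical stand-in, not the helper in tests/helpers/kube.go.

```go
package main

import (
	"fmt"
	"strings"
)

// upsertArg sketches the behavior described above: rewrite an existing
// "--flag=value" entry, rewrite the element following a bare "--flag",
// or append the flag when it is absent.
func upsertArg(args []string, flag, value string) []string {
	for i, arg := range args {
		if strings.HasPrefix(arg, flag+"=") {
			args[i] = flag + "=" + value // --flag=value style
			return args
		}
		if arg == flag {
			if i+1 < len(args) {
				args[i+1] = value // --flag value style
				return args
			}
			return append(args, value) // bare flag at the end: add its value
		}
	}
	return append(args, flag+"="+value) // flag not present yet
}

func main() {
	args := []string{"--log-level=info", "--processing-strategy", "EXECUTE_REMEDIATION"}
	fmt.Println(upsertArg(args, "--processing-strategy", "STORE_ONLY"))
	// [--log-level=info --processing-strategy STORE_ONLY]
}
```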
2312-2344: LGTM - RestoreDaemonSet correctly handles nil input.

The function appropriately handles the case where originalDaemonSet is nil by logging and returning early.

2346-2390: LGTM - GetDaemonSetPodOnWorkerNode with proper readiness checks.

The function correctly verifies the pod is not being deleted, is in Running phase, and passes all readiness conditions before returning.
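The readiness checks described here can be summarized in a small predicate like the following sketch, using plain client-go core/v1 types; it is illustrative only and not the actual helper.

```go
// Illustrative sketch only, using plain client-go core/v1 types.
package helpers

import (
	corev1 "k8s.io/api/core/v1"
)

// podReadyForTest reports whether a DaemonSet pod from the current rollout is
// usable for the test: it is not being deleted, has reached the Running phase,
// and its Ready condition is True.
func podReadyForTest(pod *corev1.Pod) bool {
	if pod.DeletionTimestamp != nil {
		return false // terminating pod, likely left over from the previous rollout
	}
	if pod.Status.Phase != corev1.PodRunning {
		return false
	}
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}
```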
health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py (5)
41-66: LGTM - Processing strategy parameter added to constructor.The
processing_strategyparameter is correctly typed withplatformconnector_pb2.ProcessingStrategyand stored as an instance variable for use in event creation.
106-121: LGTM - Processing strategy propagated in connectivity restored events.The
processingStrategyfield is correctly included in theHealthEventfor DCGM connectivity restoration.
206-223: LGTM - Processing strategy propagated in health failure events.The
processingStrategyfield is correctly included when creating health events for GPU failures.
270-287: LGTM - Processing strategy propagated in health recovery events.The
processingStrategyfield is correctly included when creating health events indicating GPU recovery.
366-381: LGTM - Processing strategy propagated in connectivity failure events.The
processingStrategyfield is correctly included in theHealthEventfor DCGM connectivity failures.
Merging this branch will decrease overall coverage
Coverage by file
Changed files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
Actionable comments posted: 0
Nitpick comments (5)
tests/helpers/event_exporter.go (1)
221-226: Add function documentation per coding guidelines.

The exported function ValidateCloudEvent lacks a godoc comment. As per coding guidelines, "Function comments required for all exported Go functions." Please add documentation explaining the function's purpose and parameters.

Suggested documentation

```diff
+// ValidateCloudEvent validates that a CloudEvent has the expected structure and content.
+// It checks the CloudEvent spec version, type, source, and validates the embedded HealthEvent
+// fields including node name, message, check name, error code, and processing strategy.
 func ValidateCloudEvent(
 	t *testing.T,
 	event map[string]any,
 	expectedNodeName, expectedMessage, expectedCheckName, expectedErrorCode string,
 	expectedProcessingStrategy string,
 ) {
```

Based on coding guidelines: "Function comments required for all exported Go functions"
tests/helpers/kube.go (4)
2257-2269: Unused variable originalDaemonSet.

The originalDaemonSet variable is assigned on line 2268 but never used. This appears to be dead code, possibly leftover from a previous implementation that intended to restore the original state.

Proposed fix
err := retry.RetryOnConflict(retry.DefaultRetry, func() error { daemonSet := &appsv1.DaemonSet{} if err := client.Resources().Get(ctx, daemonsetName, NVSentinelNamespace, daemonSet); err != nil { return err } - if originalDaemonSet == nil { - originalDaemonSet = daemonSet.DeepCopy() - } - containers := daemonSet.Spec.Template.Spec.Containers
2288-2289: Consider removing redundant sleep.The 10-second sleep appears redundant since
waitForDaemonSetRolloutalready waits until all pods are ready (NumberReady == DesiredNumberScheduled). If there's a specific edge case requiring this delay (e.g., waiting for readiness probes to stabilize), consider documenting it; otherwise, this could be removed.
2294-2327: Inconsistent error handling pattern.This function mixes two error handling approaches: it returns an
errorbut also callsrequire.NoErroron line 2319 which will fail the test immediately. This is inconsistent withUpdateDaemonSetArgswhich only returns errors.If the function returns an error, callers should handle it. Using
require.NoErrormakes the return value meaningless since the test fails before returning.π Proposed fix - Option A: Return error consistently (like UpdateDaemonSetArgs)
err := retry.RetryOnConflict(retry.DefaultRetry, func() error { daemonSet := &appsv1.DaemonSet{} if err := client.Resources().Get(ctx, daemonsetName, NVSentinelNamespace, daemonSet); err != nil { return err } containers := daemonSet.Spec.Template.Spec.Containers for i := range containers { if containers[i].Name == containerName { removeArgsFromContainer(&containers[i], args) break } } return client.Resources().Update(ctx, daemonSet) }) - require.NoError(t, err, "failed to remove args from daemonset %s/%s", NVSentinelNamespace, daemonsetName) + if err != nil { + return fmt.Errorf("failed to remove args from daemonset %s/%s: %w", NVSentinelNamespace, daemonsetName, err) + }
2409-2453: ParameterdaemonsetNameis only used in error message, not for pod selection.The function signature accepts
daemonsetNamebut it's only used in the error message on line 2449. The actual pod selection usespodNamePatternviaGetPodOnWorkerNode. This could be misleading since the function name implies it validates the pod belongs to the specified DaemonSet.Consider either:
- Removing the
daemonsetNameparameter if pattern matching is sufficient- Adding validation that the pod is actually owned by the specified DaemonSet
π Option 1: Remove unused parameter
-func GetDaemonSetPodOnWorkerNode(ctx context.Context, t *testing.T, client klient.Client, - daemonsetName string, podNamePattern string) (*v1.Pod, error) { +func GetDaemonSetPodOnWorkerNode(ctx context.Context, t *testing.T, client klient.Client, + podNamePattern string) (*v1.Pod, error) { t.Helper() var resultPod *v1.Pod require.Eventually(t, func() bool { // Get the pod pod, err := GetPodOnWorkerNode(ctx, t, client, NVSentinelNamespace, podNamePattern) if err != nil { t.Logf("Failed to get pod: %v", err) return false } ... }, EventuallyWaitTimeout, WaitInterval, "daemonset pod from current rollout should be running and ready") if resultPod == nil { - return nil, fmt.Errorf("failed to get ready pod for daemonset %s", daemonsetName) + return nil, fmt.Errorf("failed to get ready pod matching pattern %s", podNamePattern) } return resultPod, nil }
Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
π Files selected for processing (3)
tests/gpu_health_monitor_test.gotests/helpers/event_exporter.gotests/helpers/kube.go
π§ Files skipped from review as they are similar to previous changes (1)
- tests/gpu_health_monitor_test.go
π§° Additional context used
π Path-based instructions (1)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
tests/helpers/event_exporter.gotests/helpers/kube.go
π§ Learnings (1)
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/daemonset*.yaml : Explain DaemonSet variant selection logic in Helm chart documentation
Applied to files:
tests/helpers/kube.go
Additional comments (5)
tests/helpers/event_exporter.go (1)
245-245: LGTM! Validation follows the established pattern.

The validation of processingStrategy correctly follows the same pattern as other field validations in this function, using require.Equal consistently.

tests/helpers/kube.go (4)

2208-2249: LGTM!

The waitForDaemonSetRollout function correctly polls the DaemonSet status and follows the same patterns as the existing WaitForDeploymentRollout function. The rollout completion checks for DesiredNumberScheduled, UpdatedNumberScheduled, and NumberReady are appropriate.

2329-2357: LGTM!

The tryUpdateExistingArg helper correctly handles both --flag=value and --flag value argument styles. The slice manipulation for inserting a value after a flag is correct.

2359-2381: LGTM!

The function correctly sets container arguments, leveraging tryUpdateExistingArg to handle existing args and appending new ones as needed.

2383-2407: LGTM!

The function correctly handles removal of both --flag=value and --flag value style arguments, appropriately breaking after modification to avoid issues with slice iteration.
Force-pushed from 8398875 to 232c636.
Actionable comments posted: 2
Nitpick comments (8)
tests/helpers/healthevent.go (1)
48-48: Consider using the protobuf enum type for type safety.

The ProcessingStrategy field uses int instead of protos.ProcessingStrategy. While this provides flexibility in tests, it loses type safety and may allow invalid values.

Consider whether type safety is valuable here:

Option to use the enum type
If you prefer compile-time type checking:
+import "github.com/nvidia/nvsentinel/data-models/pkg/protos" + type HealthEventTemplate struct { Version int `json:"version"` Agent string `json:"agent"` ComponentClass string `json:"componentClass,omitempty"` CheckName string `json:"checkName"` IsFatal bool `json:"isFatal"` IsHealthy bool `json:"isHealthy"` Message string `json:"message"` RecommendedAction int `json:"recommendedAction,omitempty"` ErrorCode []string `json:"errorCode,omitempty"` EntitiesImpacted []EntityImpacted `json:"entitiesImpacted,omitempty"` Metadata map[string]string `json:"metadata,omitempty"` QuarantineOverrides *QuarantineOverrides `json:"quarantineOverrides,omitempty"` NodeName string `json:"nodeName"` - ProcessingStrategy int `json:"processingStrategy,omitempty"` + ProcessingStrategy protos.ProcessingStrategy `json:"processingStrategy,omitempty"` }Then update the builder:
-func (h *HealthEventTemplate) WithProcessingStrategy(strategy int) *HealthEventTemplate { +func (h *HealthEventTemplate) WithProcessingStrategy(strategy protos.ProcessingStrategy) *HealthEventTemplate { h.ProcessingStrategy = strategy return h }tests/helpers/kube.go (1)
2312-2313: Consider if the 10-second sleep is necessary.

After waitForDaemonSetRollout completes, all pods are confirmed updated and ready. The additional 10-second sleep may be unnecessary unless there's a specific stabilization requirement not covered by the readiness checks.

If the sleep is for pod initialization beyond readiness, consider adding a comment explaining why. Otherwise, this delay might be removable:

```diff
 	t.Logf("Waiting for daemonset %s/%s rollout to complete", NVSentinelNamespace, daemonsetName)
 	waitForDaemonSetRollout(ctx, t, client, daemonsetName)
-	t.Logf("Waiting 10 seconds for daemonset pods to start")
-	time.Sleep(10 * time.Second)
-
 	return nil
 }
```

docs/postgresql-schema.sql (1)
106-109: Consider adding constraints for data integrity.

The processing_strategy column is nullable and has no constraints. Consider adding:

- A CHECK constraint to ensure only valid enum values are stored
- A NOT NULL constraint with a default value if every health event should have a strategy

Option to add constraints

If you want to enforce valid values at the database level:

```diff
     -- Metadata
     created_at TIMESTAMPTZ DEFAULT NOW(),
-    updated_at TIMESTAMPTZ DEFAULT NOW(),
+    updated_at TIMESTAMPTZ DEFAULT NOW() NOT NULL,
 
     -- Event handling strategy
-    processing_strategy VARCHAR(50)
+    processing_strategy VARCHAR(50) NOT NULL DEFAULT 'EXECUTE_REMEDIATION'
+        CHECK (processing_strategy IN ('EXECUTE_REMEDIATION', 'STORE_ONLY'))
 );
```

This prevents invalid values and ensures consistency, but reduces flexibility if new enum values are added later without a migration.
tests/gpu_health_monitor_test.go (1)
488-489: Use defined constants instead of hardcoded strings.

Lines 488-489 use hardcoded strings "gpu-health-monitor-dcgm-4.x" and "gpu-health-monitor" instead of the constants GPUHealthMonitorDaemonSetName and GPUHealthMonitorContainerName defined at lines 42-43.

Suggested fix

```diff
-	err = helpers.RemoveDaemonSetArgs(ctx, t, client, "gpu-health-monitor-dcgm-4.x", "gpu-health-monitor", map[string]string{
+	err = helpers.RemoveDaemonSetArgs(ctx, t, client, GPUHealthMonitorDaemonSetName, GPUHealthMonitorContainerName, map[string]string{
 		"--processing-strategy": "EXECUTE_REMEDIATION"})
```

tests/platform-connector_test.go (2)
28-32: Remove unused struct fields.

ConfigMapBackup and TestNamespace fields are defined but never used in the test. Consider removing them to keep the code clean.

Suggested fix

```diff
 type PlatformConnectorTestContext struct {
 	NodeName        string
-	ConfigMapBackup []byte
-	TestNamespace   string
 }
```
98-101: Teardown sends healthy event but doesn't verify cleanup of STORE_ONLY events.The teardown only sends a healthy event. Consider verifying that any state from the STORE_ONLY test cases is properly cleaned up, or add a comment explaining why no cleanup verification is needed.
health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py (1)
125-132: Good error handling for invalid processing strategy.

The validation correctly:

- Uses protobuf's Value() method to validate and convert the string
- Catches ValueError for invalid inputs
- Logs all valid options to help users correct their configuration
- Exits with code 1 on invalid input
processing_strategy_valuewhich is the integer enum value. Consider logging the string name for better readability.π Optional: Log the strategy name for better readability
- log.info(f"Event handling strategy configured to: {processing_strategy_value}") + log.info(f"Event handling strategy configured to: {processing_strategy}")health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py (1)
125-135: Consider adding processingStrategy assertion to first test case.The
test_health_event_occurredtest at lines 125-135 verifies event properties but doesn't assert onprocessingStrategy. While other tests cover this, adding an assertion here would ensure complete coverage.π Suggested addition
if event.checkName == "GpuInforomWatch" and event.isHealthy == False: assert event.errorCode[0] == "DCGM_FR_CORRUPT_INFOROM" assert event.entitiesImpacted[0].entityValue == "0" assert event.recommendedAction == platformconnector_pb2.RecommendedAction.COMPONENT_RESET + assert event.processingStrategy == platformconnector_pb2.STORE_ONLY else:
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
π Files selected for processing (17)
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yamldistros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yamldistros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yamldistros/kubernetes/nvsentinel/values-tilt-postgresql.yamldocs/postgresql-schema.sqlfault-quarantine/pkg/evaluator/rule_evaluator_test.gofault-quarantine/pkg/initializer/init.gohealth-monitors/gpu-health-monitor/gpu_health_monitor/cli.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.pytests/event_exporter_test.gotests/fault_quarantine_test.gotests/gpu_health_monitor_test.gotests/helpers/event_exporter.gotests/helpers/healthevent.gotests/helpers/kube.gotests/platform-connector_test.go
π§ Files skipped from review as they are similar to previous changes (4)
- fault-quarantine/pkg/initializer/init.go
- tests/event_exporter_test.go
- distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml
- distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml
π§° Additional context used
π Path-based instructions (4)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
tests/helpers/event_exporter.gofault-quarantine/pkg/evaluator/rule_evaluator_test.gotests/helpers/healthevent.gotests/gpu_health_monitor_test.gotests/helpers/kube.gotests/fault_quarantine_test.gotests/platform-connector_test.go
**/values.yaml
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/values.yaml: Document all values in Helm chartvalues.yamlwith inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable
Files:
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
fault-quarantine/pkg/evaluator/rule_evaluator_test.gotests/gpu_health_monitor_test.gotests/fault_quarantine_test.gotests/platform-connector_test.go
**/*.py
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code
Files:
health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/cli.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
π§ Learnings (8)
π Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.
Applied to files:
tests/helpers/event_exporter.gofault-quarantine/pkg/evaluator/rule_evaluator_test.gotests/platform-connector_test.gohealth-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
π Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
fault-quarantine/pkg/evaluator/rule_evaluator_test.gotests/gpu_health_monitor_test.gotests/fault_quarantine_test.gotests/platform-connector_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
tests/gpu_health_monitor_test.gotests/fault_quarantine_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/gpu_health_monitor_test.gotests/fault_quarantine_test.gotests/platform-connector_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on kata.enabled label
Applied to files:
tests/helpers/kube.go
π Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".
Applied to files:
tests/fault_quarantine_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Applied to files:
tests/platform-connector_test.go
π Learning: 2025-11-04T06:31:02.147Z
Learnt from: Gyan172004
Repo: NVIDIA/NVSentinel PR: 223
File: platform-connectors/pkg/nodemetadata/processor.go:0-0
Timestamp: 2025-11-04T06:31:02.147Z
Learning: In platform-connectors/pkg/nodemetadata/processor.go, the NewProcessor function does not perform a nil check on the config parameter because the caller is expected to guarantee a non-nil config is provided.
Applied to files:
tests/platform-connector_test.go
𧬠Code graph analysis (2)
tests/helpers/healthevent.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-17)data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)
tests/fault_quarantine_test.go (4)
tests/helpers/fault_quarantine.go (4)
QuarantineTestContext(51-54)SetupQuarantineTest(107-112)AssertQuarantineState(315-382)QuarantineAssertion(56-60)tests/helpers/kube.go (1)
SetNodeManagedByNVSentinel(1389-1408)tests/helpers/healthevent.go (3)
NewHealthEvent(60-76)SendHealthEvent(263-275)SendHealthyEvent(277-287)data-models/pkg/protos/health_event.pb.go (2)
ProcessingStrategy_STORE_ONLY(47-47)ProcessingStrategy_EXECUTE_REMEDIATION(46-46)
Additional comments (28)
tests/helpers/kube.go (5)
387-409: LGTM!The
EnsureNodeEventNotPresenthelper correctly usesrequire.Neverto assert that a specific event type and reason combination never appears on a node during the test window. The implementation properly queries events and checks both type and reason.
2232-2273: LGTM!The
waitForDaemonSetRolloutfunction correctly implements DaemonSet rollout verification by checking that all desired pods are scheduled, updated, and ready. The logic matcheskubectl rollout statusbehavior and includes helpful progress logging.
2318-2351: LGTM!The
RemoveDaemonSetArgsfunction correctly removes specified arguments from a DaemonSet container and waits for the rollout to complete. Usesretry.RetryOnConflictappropriately without error wrapping.
2353-2431: LGTM!The argument manipulation helpers (
tryUpdateExistingArg,setArgsOnContainer,removeArgsFromContainer) correctly handle various command-line argument formats:
--flag=valuestyle--flag valuestyle (separate entries)--flagstyle (boolean flags)The logic properly preserves argument order and handles edge cases like updating existing args vs adding new ones.
2433-2477: LGTM!The
GetDaemonSetPodOnWorkerNodehelper correctly retrieves a ready DaemonSet pod with proper validation:
- Verifies pod is not being deleted (
DeletionTimestamp == nil)- Confirms pod is in
Runningphase- Checks pod readiness conditions
- Uses
require.Eventuallyfor reliable pollingThis defensive checking improves test stability by ensuring the pod from the current rollout is fully operational.
fault-quarantine/pkg/evaluator/rule_evaluator_test.go (1)
263-263: LGTM!The test correctly updates the expected map to include the new
processingStrategyfield with valuefloat64(0), matching the default enum valueProcessingStrategy_EXECUTE_REMEDIATION. Usingfloat64is correct for JSON unmarshaling behavior.distros/kubernetes/nvsentinel/values-tilt-postgresql.yaml (1)
219-222: LGTM!The schema addition matches the canonical source in
docs/postgresql-schema.sql. The file header correctly documents the sync process usingmake update-helm-postgres-schemaandmake validate-postgres-schema.tests/helpers/event_exporter.go (2)
245-245: LGTM!The assertion correctly validates that the
processingStrategyfield in the CloudEvent data matches the expected value. The comparison works properly since the JSON-unmarshaled value will be a string.
221-226: All callers ofValidateCloudEventhave been properly updated. The function call attests/event_exporter_test.go:85correctly passes all 7 required parameters, including the newexpectedProcessingStrategyparameter ("EXECUTE_REMEDIATION"). No outdated calls remain in the codebase.tests/fault_quarantine_test.go (4)
233-250: LGTM! The test setup correctly:
- Uses the existing `SetupQuarantineTest` helper
- Sets the node as managed by NVSentinel (required for quarantine logic)
- Follows the established pattern from other tests in this file

252-268: LGTM! The STORE_ONLY assessment correctly validates that events with `ProcessingStrategy_STORE_ONLY`:
- Do NOT cause the node to be cordoned
- Do NOT add quarantine annotations

This properly tests the observability-only behavior where events are stored but don't modify cluster state.

270-286: LGTM! The EXECUTE_REMEDIATION assessment correctly validates that events with `ProcessingStrategy_EXECUTE_REMEDIATION`:
- DO cause the node to be cordoned
- DO add quarantine annotations

This properly tests the normal remediation behavior where the system takes corrective actions.

288-295: LGTM! The teardown correctly:
- Sends a healthy event to clear the quarantine state from the EXECUTE_REMEDIATION assessment
- Uses `TeardownQuarantineTest` to restore the original configuration and clean up test resources

This ensures proper test isolation and cleanup. A sketch of how such events are constructed follows.
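To make the two assessments concrete, here is roughly how such events are built with the test helpers named in this PR. This is a sketch only: the helper import path and exact call shapes are assumptions, not verified signatures.

```go
package tests

import (
	"context"
	"testing"

	protos "github.com/nvidia/nvsentinel/data-models/pkg/protos"
	"github.com/nvidia/nvsentinel/tests/helpers" // assumed import path for the e2e test helpers
)

// buildStrategyEvents sends one STORE_ONLY and one EXECUTE_REMEDIATION event for
// the same node and check, mirroring the two assessments described above.
func buildStrategyEvents(ctx context.Context, t *testing.T, nodeName string) {
	storeOnly := helpers.NewHealthEvent(nodeName).
		WithErrorCode("79").
		WithCheckName("SysLogsXIDError").
		WithAgent(helpers.SYSLOG_HEALTH_MONITOR_AGENT).
		WithProcessingStrategy(int(protos.ProcessingStrategy_STORE_ONLY))
	// Expectation: node stays uncordoned and gets no quarantine annotation.
	helpers.SendHealthEvent(ctx, t, storeOnly)

	remediate := helpers.NewHealthEvent(nodeName).
		WithErrorCode("79").
		WithCheckName("SysLogsXIDError").
		WithAgent(helpers.SYSLOG_HEALTH_MONITOR_AGENT).
		WithProcessingStrategy(int(protos.ProcessingStrategy_EXECUTE_REMEDIATION))
	// Expectation: node is cordoned and annotated by fault-quarantine.
	helpers.SendHealthEvent(ctx, t, remediate)
}
```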
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml (1)
57-62: LGTM! Clear documentation and sensible default. The `processingStrategy` configuration option is well-documented with:
- Clear valid values: `EXECUTE_REMEDIATION`, `STORE_ONLY`
- Sensible default: `EXECUTE_REMEDIATION` (maintains backward compatibility)
- Clear behavior description for each mode

Both DaemonSet templates (`daemonset-dcgm-3.x.yaml` and `daemonset-dcgm-4.x.yaml`) correctly reference this value using `{{ .Values.processingStrategy | quote }}`.
tests/gpu_health_monitor_test.go (3)
34-48: LGTM on constants organization. The constants are well-organized with exported names for reuse. The separation of DCGM-related constants from GPU health monitor constants improves readability.

413-462: Test setup and error injection logic is well-structured. The test correctly:
- Configures the DaemonSet with the `STORE_ONLY` strategy
- Waits for the pod to be ready
- Injects test metadata and sets the node label
- Injects a DCGM Inforom error to trigger the health monitor

The flow aligns with the PR objective of verifying STORE_ONLY events are stored without triggering remediation.

464-480: LGTM on assess phase. The assertions correctly verify that:
- Node conditions are not applied when using the STORE_ONLY strategy
- The node is not cordoned

This validates the expected behavior of the STORE_ONLY processing strategy.
tests/platform-connector_test.go (1)
53-96: Test thoroughly covers both processing strategies. The assess phase correctly validates:
- STORE_ONLY events don't apply node conditions or emit events
- EXECUTE_REMEDIATION events do apply conditions and events

The test uses both fatal (ERRORCODE_79) and non-fatal (ERRORCODE_31) error codes to cover different scenarios. A table-driven sketch of this expectation follows.
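As an aside for readers new to the table-driven style, a minimal, self-contained sketch of the behavioural expectation (the `shouldModifyClusterState` stand-in is hypothetical, not the actual code under test):

```go
package connector_test

import (
	"testing"

	pb "github.com/nvidia/nvsentinel/data-models/pkg/protos"
)

// shouldModifyClusterState is a hypothetical stand-in for the behaviour under
// test: only EXECUTE_REMEDIATION events may touch node conditions and events.
func shouldModifyClusterState(s pb.ProcessingStrategy) bool {
	return s == pb.ProcessingStrategy_EXECUTE_REMEDIATION
}

func TestProcessingStrategy_ClusterStateMutation(t *testing.T) {
	cases := []struct {
		name     string
		strategy pb.ProcessingStrategy
		want     bool
	}{
		{"STORE_ONLY is observability-only", pb.ProcessingStrategy_STORE_ONLY, false},
		{"EXECUTE_REMEDIATION modifies cluster state", pb.ProcessingStrategy_EXECUTE_REMEDIATION, true},
	}

	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			if got := shouldModifyClusterState(tc.strategy); got != tc.want {
				t.Errorf("shouldModifyClusterState(%v) = %v, want %v", tc.strategy, got, tc.want)
			}
		})
	}
}
```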
health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py (2)
74-80: CLI option for processing strategy is well-designed. The option:
- Has a sensible default (`EXECUTE_REMEDIATION`) for backward compatibility
- Provides clear help text describing valid values
- Is marked as optional

37-51: Function signature updated correctly with type hint. The `processing_strategy` parameter is properly typed with `platformconnector_pb2.ProcessingStrategy` and passed through to the `PlatformConnectorEventProcessor` constructor. As per coding guidelines, type hints are required for all functions in Python code.
107-108: LGTM on test update for processing_strategy parameter.The test correctly passes
platformconnector_pb2.STORE_ONLYas the processing strategy to the event processor constructor.
301-302: Good assertion on processingStrategy propagation.The test verifies that the
processingStrategyfield on the emittedHealthEventmatches the strategy configured in the processor (STORE_ONLY).
523-549: Test correctly verifies EXECUTE_REMEDIATION strategy propagation.This test case uses
EXECUTE_REMEDIATIONstrategy and verifies the restored event has the correctprocessingStrategyfield. Good coverage of both strategy values across different test cases.health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py (5)
51-66: Processing strategy parameter and storage are correctly implemented.The constructor:
- Accepts the typed
processing_strategyparameter- Stores it as
self._processing_strategyfollowing Python naming conventions for protected attributesAs per coding guidelines, type hints are required for all functions in Python code, which is satisfied here.
106-121: processingStrategy correctly added to connectivity restored event.The
clear_dcgm_connectivity_failuremethod properly includesprocessingStrategy=self._processing_strategyin the HealthEvent message.
206-223: processingStrategy correctly added to health event for entity failures.The HealthEvent created when entity failures are detected properly includes the processing strategy.
270-287: processingStrategy correctly added to healthy status events.The HealthEvent created for healthy (PASS) status properly includes the processing strategy.
366-381: processingStrategy correctly added to DCGM connectivity failure event.The
dcgm_connectivity_failedmethod properly includesprocessingStrategy=self._processing_strategyin the HealthEvent message. All four HealthEvent creation sites now consistently include the processing strategy.
232c636 to
9c0336d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py (1)
42-66: Add return type hints to all methods and a parameter type hint to the error_code parameter. Per the coding guidelines requiring "Type hints required for all functions in Python code," several methods in the class are missing return type annotations:
- `clear_dcgm_connectivity_failure()`
- `health_event_occurred()`
- `get_recommended_action_from_dcgm_error_map()` (also missing a type hint for the `error_code` parameter)
- `send_health_event_with_retries()`
- `dcgm_connectivity_failed()`

The `processing_strategy` parameter integration itself is correct: all instantiations provide the parameter, and it is properly used in HealthEvent creations. However, the class must be updated to fully comply with type hint requirements.
π€ Fix all issues with AI agents
In @docs/designs/025-processing-strategy-for-health-checks.md:
- Around line 598-608: The pipeline example uses string values for
healthevent.processingstrategy; change those comparisons to use the integer
constant used in the implementation by replacing "EXECUTE_REMEDIATION" with
int32(protos.ProcessingStrategy_EXECUTE_REMEDIATION) (and ensure any other
processingstrategy comparisons follow the same pattern), keeping the rest of the
pipeline (including the $exists false branch) intact; update the pipeline
variable and any example snippets referencing healthevent.processingstrategy
accordingly.
In
@platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go:
- Around line 1620-1626: The test accesses events.Items[0] without ensuring
events.Items is non-empty which can panic; update the test in the block using
tc.expectKubernetesEvents to first assert or require that len(events.Items) > 0
(or use assert.NotEmpty/require.NotEmpty) before reading events.Items[0], then
compare events.Items[0].Type to tc.expectedEventType; similarly, if you assert
non-empty use the appropriate testing helper (require.* if you want to stop on
failure) so the subsequent access is safe.
- Around line 1606-1612: The test currently accesses nvsentinelConditions[0]
when tc.expectNodeConditions is true, which can panic if the slice is empty;
update the assertion to first assert that len(nvsentinelConditions) > 0 (e.g.,
assert.Greater(t, len(nvsentinelConditions), 0, ...)) before comparing
string(nvsentinelConditions[0].Type) to tc.expectedConditionType so the test
fails cleanly rather than panicking; ensure messages reference
nvsentinelConditions and tc.expectedConditionType for clarity.
π§Ή Nitpick comments (10)
tests/helpers/kube.go (4)
387-409: Missing function comment for exported function. Per coding guidelines, all exported Go functions require function comments. Also, the log message on line 405 could be more precise by including the `eventReason`.

Suggested improvement:

```diff
+// EnsureNodeEventNotPresent asserts that a node does NOT have an event with the specified type and reason
+// within the NeverWaitTimeout period.
 func EnsureNodeEventNotPresent(ctx context.Context, t *testing.T, c klient.Client,
 	nodeName string, eventType, eventReason string) {
 	t.Helper()
 	// ... existing code ...
-	t.Logf("node %s does not have event %v", nodeName, eventType)
+	t.Logf("node %s does not have event type=%s reason=%s", nodeName, eventType, eventReason)
```
2275-2316: Unused variable and hardcoded sleep.
- `originalDaemonSet` (line 2281) is assigned but never used - appears to be dead code.
- The hardcoded `time.Sleep(10 * time.Second)` on line 2313 is a code smell. If rollout is complete, pods should already be ready. Consider removing this or documenting why it's necessary.
- Missing function comment for exported function (per coding guidelines).

Suggested fix:

```diff
+// UpdateDaemonSetArgs updates the specified container's arguments in a DaemonSet
+// and waits for the rollout to complete.
 func UpdateDaemonSetArgs(ctx context.Context, t *testing.T, client klient.Client,
 	daemonsetName string, containerName string, args map[string]string) error {
 	t.Helper()

-	var originalDaemonSet *appsv1.DaemonSet
-
 	t.Logf("Updating daemonset %s/%s with args %v", NVSentinelNamespace, daemonsetName, args)

 	err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
 		daemonSet := &appsv1.DaemonSet{}
 		if err := client.Resources().Get(ctx, daemonsetName, NVSentinelNamespace, daemonSet); err != nil {
 			return err
 		}

-		if originalDaemonSet == nil {
-			originalDaemonSet = daemonSet.DeepCopy()
-		}
-
 		containers := daemonSet.Spec.Template.Spec.Containers
 		// ... rest of function
 	})

 	// ...
 	waitForDaemonSetRollout(ctx, t, client, daemonsetName)

-	t.Logf("Waiting 10 seconds for daemonset pods to start")
-	time.Sleep(10 * time.Second)
-
 	return nil
 }
```
2318-2351: Inconsistent error handling pattern. The function returns `error` but uses `require.NoError` on line 2343, which will fail the test immediately if an error occurs. This makes the error return value unreachable on failure. Either:
- Remove the error return and use `require.NoError` consistently (like `waitForDaemonSetRollout`)
- Or return the error and let the caller handle it

Also missing function comment for exported function.

Option 1: Remove error return (consistent with test helper pattern):

```diff
-func RemoveDaemonSetArgs(ctx context.Context, t *testing.T, client klient.Client,
+// RemoveDaemonSetArgs removes the specified arguments from a DaemonSet container
+// and waits for the rollout to complete.
+func RemoveDaemonSetArgs(ctx context.Context, t *testing.T, client klient.Client,
 	daemonsetName string, containerName string, args map[string]string,
-) error {
+) {
 	t.Helper()
 	// ... existing code ...
 	require.NoError(t, err, "failed to remove args from daemonset %s/%s", NVSentinelNamespace, daemonsetName)
 	// ...
 	t.Log("DaemonSet restored successfully")
-
-	return nil
 }
```
2433-2477: Unused parameter and missing function comment. The `daemonsetName` parameter is unused in the function body (only appears in the error message on line 2473). Either:
- Use it to verify the pod belongs to the correct DaemonSet via owner references
- Or remove it from the signature

Also missing function comment for exported function.

Suggested fix (verify ownership):

```diff
+// GetDaemonSetPodOnWorkerNode returns a running, ready pod from the specified DaemonSet
+// that is scheduled on a real worker node.
 func GetDaemonSetPodOnWorkerNode(ctx context.Context, t *testing.T, client klient.Client,
 	daemonsetName string, podNamePattern string) (*v1.Pod, error) {
 	t.Helper()

 	var resultPod *v1.Pod

 	require.Eventually(t, func() bool {
 		// Get the pod
 		pod, err := GetPodOnWorkerNode(ctx, t, client, NVSentinelNamespace, podNamePattern)
 		if err != nil {
 			t.Logf("Failed to get pod: %v", err)
 			return false
 		}

+		// Verify pod belongs to the expected DaemonSet
+		belongsToDaemonSet := false
+		for _, ref := range pod.OwnerReferences {
+			if ref.Kind == "DaemonSet" && ref.Name == daemonsetName {
+				belongsToDaemonSet = true
+				break
+			}
+		}
+		if !belongsToDaemonSet {
+			t.Logf("Pod %s does not belong to daemonset %s", pod.Name, daemonsetName)
+			return false
+		}
+
 		// Verify pod is not being deleted
 		// ... rest of function
```

data-models/protobufs/health_event.proto (1)
79-79: Add a field comment for processingStrategy. Per coding guidelines, Protocol Buffer messages should include comprehensive comments for all fields. While the enum itself is well-documented, the field on HealthEvent lacks a comment explaining its purpose.

π Suggested comment:

```diff
   BehaviourOverrides drainOverrides = 15;
-  ProcessingStrategy processingStrategy = 16;
+  // Processing strategy defines how downstream modules should handle this event.
+  // STORE_ONLY events are for observability only and should not modify cluster state.
+  ProcessingStrategy processingStrategy = 16;
 }
```

platform-connectors/pkg/server/platform_connector_server_test.go (1)
22-22: Consider using the standard testing package instead of testify. Based on learnings from this repository, testify should be avoided for simple equality/inequality checks. These assertions are straightforward and could use the standard testing package.

♻️ Proposed refactor using standard testing:

```diff
 import (
 	"context"
 	"testing"

 	pb "github.com/nvidia/nvsentinel/data-models/pkg/protos"
-	"github.com/stretchr/testify/assert"
 )
```

Then update the assertions:

```diff
-	assert.NoError(t, err)
-	assert.Equal(t, tt.expectedStrategy, healthEvents.Events[0].ProcessingStrategy)
+	if err != nil {
+		t.Errorf("unexpected error: %v", err)
+	}
+	if healthEvents.Events[0].ProcessingStrategy != tt.expectedStrategy {
+		t.Errorf("ProcessingStrategy = %v, want %v", healthEvents.Events[0].ProcessingStrategy, tt.expectedStrategy)
+	}
```

tests/fault_quarantine_test.go (1)
327-336: Consider explicitly setting `ProcessingStrategy` for the teardown healthy event. The teardown sends a healthy event to clear the quarantine state but doesn't specify a `ProcessingStrategy`. Based on the AI summary, the platform connector normalizes `UNSPECIFIED` to `EXECUTE_REMEDIATION`, so this should work correctly. However, for consistency and explicit intent, consider adding `.WithProcessingStrategy(int(protos.ProcessingStrategy_EXECUTE_REMEDIATION))` to ensure the healthy event clears the fault state as intended. Based on learnings, healthy events can legitimately use `EXECUTE_REMEDIATION` when the Fault Quarantine Manager needs to act on them to clear previous fault states.

♻️ Suggested improvement:

```diff
 feature.Teardown(func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
 	event := helpers.NewHealthEvent(testCtx.NodeName).
 		WithErrorCode("79").
 		WithHealthy(true).
 		WithAgent(helpers.SYSLOG_HEALTH_MONITOR_AGENT).
-		WithCheckName("SysLogsXIDError")
+		WithCheckName("SysLogsXIDError").
+		WithProcessingStrategy(int(protos.ProcessingStrategy_EXECUTE_REMEDIATION))

 	helpers.SendHealthEvent(ctx, t, event)

 	return helpers.TeardownQuarantineTest(ctx, t, c)
 })
```

tests/gpu_health_monitor_test.go (1)
732-733: Use the newly defined constants instead of hardcoded strings. The teardown uses hardcoded strings for the DaemonSet and container names, but the constants `GPUHealthMonitorDaemonSetName` and `GPUHealthMonitorContainerName` were defined at the top of the file specifically for this purpose.

♻️ Proposed fix:

```diff
-	err = helpers.RemoveDaemonSetArgs(ctx, t, client, "gpu-health-monitor-dcgm-4.x", "gpu-health-monitor", map[string]string{
+	err = helpers.RemoveDaemonSetArgs(ctx, t, client, GPUHealthMonitorDaemonSetName, GPUHealthMonitorContainerName, map[string]string{
```

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)
1502-1524: Test case appears to duplicate case 2. This test case ("STORE_ONLY non fatal event should not create Kubernetes event") at lines 1502-1524 is very similar to the test case at lines 1422-1442 ("STORE_ONLY non-fatal event should not create Kubernetes event"). Both test that STORE_ONLY non-fatal events don't create Kubernetes events.
If this is intentional for additional coverage with different error codes, consider making the distinction clearer. Otherwise, consider removing the duplicate.
health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py (1)
125-132: Consider logging the strategy name instead of the numeric value. Line 132 logs `processing_strategy_value`, which is an integer (e.g., `1` for `EXECUTE_REMEDIATION`). For better readability in logs, consider logging the original string or the enum name.

♻️ Proposed improvement:

```diff
-    log.info(f"Event handling strategy configured to: {processing_strategy_value}")
+    log.info(f"Event handling strategy configured to: {processing_strategy}")
```

Or, to show both:

```python
log.info(f"Event handling strategy configured to: {processing_strategy} ({processing_strategy_value})")
```
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
β Files ignored due to path filters (1)
data-models/pkg/protos/health_event.pb.gois excluded by!**/*.pb.go
π Files selected for processing (27)
data-models/protobufs/health_event.protodistros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yamldistros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yamldistros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yamldocs/designs/025-processing-strategy-for-health-checks.mdfault-quarantine/pkg/initializer/init.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.gohealth-monitors/gpu-health-monitor/gpu_health_monitor/cli.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyihealth-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.pyplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.goplatform-connectors/pkg/connectors/kubernetes/process_node_events.goplatform-connectors/pkg/server/platform_connector_server.goplatform-connectors/pkg/server/platform_connector_server_test.gostore-client/pkg/client/mongodb_pipeline_builder.gostore-client/pkg/client/pipeline_builder.gostore-client/pkg/client/postgresql_pipeline_builder.gostore-client/pkg/datastore/providers/postgresql/sql_filter_builder.gotests/event_exporter_test.gotests/fault_quarantine_test.gotests/gpu_health_monitor_test.gotests/helpers/event_exporter.gotests/helpers/fault_quarantine.gotests/helpers/kube.go
π§ Files skipped from review as they are similar to previous changes (4)
- tests/helpers/event_exporter.go
- store-client/pkg/client/postgresql_pipeline_builder.go
- tests/event_exporter_test.go
- distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml
π§° Additional context used
π Path-based instructions (7)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
platform-connectors/pkg/server/platform_connector_server.gofault-quarantine/pkg/initializer/init.gotests/helpers/fault_quarantine.goplatform-connectors/pkg/server/platform_connector_server_test.gostore-client/pkg/datastore/providers/postgresql/sql_filter_builder.gostore-client/pkg/client/pipeline_builder.gotests/gpu_health_monitor_test.gostore-client/pkg/client/mongodb_pipeline_builder.goplatform-connectors/pkg/connectors/kubernetes/process_node_events.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.goplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger.gotests/fault_quarantine_test.gotests/helpers/kube.go
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
platform-connectors/pkg/server/platform_connector_server_test.gotests/gpu_health_monitor_test.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.goplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.gotests/fault_quarantine_test.go
**/*.py
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code
Files:
health-monitors/gpu-health-monitor/gpu_health_monitor/cli.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
**/values.yaml
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/values.yaml: Document all values in Helm chartvalues.yamlwith inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable
Files:
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml
data-models/protobufs/**/*.proto
π CodeRabbit inference engine (.github/copilot-instructions.md)
data-models/protobufs/**/*.proto: Define Protocol Buffer messages indata-models/protobufs/directory
Use semantic versioning for breaking changes in Protocol Buffer messages
Include comprehensive comments for all fields in Protocol Buffer messages
Files:
data-models/protobufs/health_event.proto
**/daemonset*.yaml
π CodeRabbit inference engine (.github/copilot-instructions.md)
Explain DaemonSet variant selection logic in Helm chart documentation
Files:
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml
distros/kubernetes/**/*daemonset*.yaml
π CodeRabbit inference engine (.github/copilot-instructions.md)
distros/kubernetes/**/*daemonset*.yaml: Separate DaemonSets should be created for kata vs regular nodes usingnodeAffinitybased on kata.enabled label
Regular node DaemonSets should use/var/logvolume mount for file-based logs
Kata node DaemonSets should use/run/log/journaland/var/log/journalvolume mounts for systemd journal
Files:
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml
π§ Learnings (12)
π Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
π Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
Applied to files:
platform-connectors/pkg/server/platform_connector_server.godistros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yamlhealth-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger.gotests/fault_quarantine_test.godata-models/protobufs/health_event.protodocs/designs/025-processing-strategy-for-health-checks.mdhealth-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
π Learning: 2025-11-10T10:25:19.443Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.
Applied to files:
platform-connectors/pkg/server/platform_connector_server.gofault-quarantine/pkg/initializer/init.godocs/designs/025-processing-strategy-for-health-checks.md
π Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".
Applied to files:
tests/helpers/fault_quarantine.gotests/gpu_health_monitor_test.gotests/fault_quarantine_test.go
π Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.
Applied to files:
platform-connectors/pkg/server/platform_connector_server_test.goplatform-connectors/pkg/connectors/kubernetes/process_node_events.goplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger.godata-models/protobufs/health_event.protodocs/designs/025-processing-strategy-for-health-checks.mdhealth-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
π Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
platform-connectors/pkg/server/platform_connector_server_test.gotests/gpu_health_monitor_test.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.goplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.gotests/fault_quarantine_test.go
π Learning: 2025-12-23T05:02:22.108Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: store-client/pkg/client/postgresql_pipeline_builder.go:119-132
Timestamp: 2025-12-23T05:02:22.108Z
Learning: In the NVSentinel codebase, protobuf fields stored in MongoDB should use lowercase field names (e.g., processingstrategy, componentclass, checkname). Ensure pipeline filters and queries that access protobuf fields in the database consistently use lowercase field names in the store-client package, avoiding camelCase mappings for MongoDB reads/writes.
Applied to files:
store-client/pkg/datastore/providers/postgresql/sql_filter_builder.gostore-client/pkg/client/pipeline_builder.gostore-client/pkg/client/mongodb_pipeline_builder.go
π Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.
Applied to files:
platform-connectors/pkg/connectors/kubernetes/process_node_events.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/fault_quarantine_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
tests/fault_quarantine_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on kata.enabled label
Applied to files:
tests/helpers/kube.go
π Learning: 2025-11-04T06:31:02.147Z
Learnt from: Gyan172004
Repo: NVIDIA/NVSentinel PR: 223
File: platform-connectors/pkg/nodemetadata/processor.go:0-0
Timestamp: 2025-11-04T06:31:02.147Z
Learning: In platform-connectors/pkg/nodemetadata/processor.go, the NewProcessor function does not perform a nil check on the config parameter because the caller is expected to guarantee a non-nil config is provided.
Applied to files:
docs/designs/025-processing-strategy-for-health-checks.md
𧬠Code graph analysis (10)
platform-connectors/pkg/server/platform_connector_server.go (1)
data-models/pkg/protos/health_event.pb.go (6)
ProcessingStrategy(44-44)ProcessingStrategy(76-78)ProcessingStrategy(80-82)ProcessingStrategy(89-91)ProcessingStrategy_UNSPECIFIED(47-47)ProcessingStrategy_EXECUTE_REMEDIATION(48-48)
tests/helpers/fault_quarantine.go (1)
tests/helpers/kube.go (1)
NVSentinelNamespace(64-64)
platform-connectors/pkg/server/platform_connector_server_test.go (1)
data-models/pkg/protos/health_event.pb.go (10)
ProcessingStrategy(44-44)ProcessingStrategy(76-78)ProcessingStrategy(80-82)ProcessingStrategy(89-91)ProcessingStrategy_UNSPECIFIED(47-47)ProcessingStrategy_EXECUTE_REMEDIATION(48-48)ProcessingStrategy_STORE_ONLY(49-49)HealthEvent(264-284)HealthEvent(297-297)HealthEvent(312-314)
store-client/pkg/client/pipeline_builder.go (1)
store-client/pkg/client/mongodb_client.go (1)
BuildNonFatalUnhealthyInsertsPipeline(296-299)
tests/gpu_health_monitor_test.go (2)
tests/helpers/kube.go (7)
UpdateDaemonSetArgs(2276-2316)GetDaemonSetPodOnWorkerNode(2433-2477)NVSentinelNamespace(64-64)ExecInPod(1563-1598)EnsureNodeConditionNotPresent(1797-1818)RemoveDaemonSetArgs(2318-2351)RemoveNodeManagedByNVSentinelLabel(1411-1425)tests/helpers/metadata.go (1)
CreateTestMetadata(59-108)
store-client/pkg/client/mongodb_pipeline_builder.go (2)
store-client/pkg/datastore/types.go (4)
ToPipeline(161-163)D(131-133)E(126-128)A(136-138)data-models/pkg/protos/health_event.pb.go (1)
ProcessingStrategy_EXECUTE_REMEDIATION(48-48)
platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)
data-models/pkg/protos/health_event.pb.go (7)
HealthEvent(264-284)HealthEvent(297-297)HealthEvent(312-314)ProcessingStrategy(44-44)ProcessingStrategy(76-78)ProcessingStrategy(80-82)ProcessingStrategy(89-91)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-18)data-models/pkg/protos/health_event.pb.go (5)
ProcessingStrategy(44-44)ProcessingStrategy(76-78)ProcessingStrategy(80-82)ProcessingStrategy(89-91)ProcessingStrategy_EXECUTE_REMEDIATION(48-48)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-18)data-models/pkg/protos/health_event.pb.go (5)
ProcessingStrategy(44-44)ProcessingStrategy(76-78)ProcessingStrategy(80-82)ProcessingStrategy(89-91)ProcessingStrategy_EXECUTE_REMEDIATION(48-48)
health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-18)data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(44-44)ProcessingStrategy(76-78)ProcessingStrategy(80-82)ProcessingStrategy(89-91)
πͺ markdownlint-cli2 (0.18.1)
docs/designs/025-processing-strategy-for-health-checks.md
Hard tabs at column 1 (MD010, no-hard-tabs) reported on lines 350, 351, 353, 355-360, 362, 372-374, 383, 436-449, 457-467, 484-497, 505-515, and 594-605.
π Additional comments (46)
tests/helpers/kube.go (4)
2232-2273: LGTM! The `waitForDaemonSetRollout` function correctly verifies DaemonSet rollout by checking the `DesiredNumberScheduled`, `UpdatedNumberScheduled`, and `NumberReady` status fields. Good use of `t.Helper()` and comprehensive logging.
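For context, a minimal sketch of that status comparison (illustrative only; the real helper also polls with a timeout and logs progress):

```go
package helpers

import (
	appsv1 "k8s.io/api/apps/v1"
)

// daemonSetRolledOut mirrors the kubectl-rollout-status style check described
// above: every desired pod is scheduled, updated to the new template, and ready.
func daemonSetRolledOut(ds *appsv1.DaemonSet) bool {
	desired := ds.Status.DesiredNumberScheduled

	return desired > 0 &&
		ds.Status.UpdatedNumberScheduled == desired &&
		ds.Status.NumberReady == desired
}
```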
2353-2381: LGTM! The function correctly handles multiple argument styles (`--flag=value`, `--flag`, `--flag value`). The slice manipulation for inserting values is correct.
2383-2405: LGTM! The function correctly updates existing arguments or appends new ones. Good use of the helper function `tryUpdateExistingArg` for deduplication.
2407-2431: LGTM! The function correctly handles removal of arguments in multiple formats. The use of `break` after slice modification prevents index corruption issues.
store-client/pkg/datastore/providers/postgresql/sql_filter_builder.go (1)
404-404: LGTM! The field name mapping is consistent with the existing pattern and correctly maps the lowercase MongoDB bson field name to the camelCase PostgreSQL JSON field name. Based on learnings, this aligns with the requirement to use lowercase field names for protobuf fields stored in MongoDB.
tests/helpers/fault_quarantine.go (1)
141-145: LGTM! The conditional guard allows tests to skip custom configmap application while still backing up the existing configuration. This enables more flexible test setup for scenarios like `TestFaultQuarantineWithProcessingStrategy`, where the default configuration is sufficient.
data-models/protobufs/health_event.proto (1)
32-40: LGTM! The enum is well-designed following proto3 conventions with `UNSPECIFIED=0` as the default. The comments clearly document the normalization behavior (platform-connector defaults UNSPECIFIED to EXECUTE_REMEDIATION) and the semantic distinction between the strategies.
store-client/pkg/client/mongodb_pipeline_builder.go (2)
87-113: LGTM! The pipeline correctly uses the lowercase field name `processingstrategy` for MongoDB queries, aligning with the codebase convention. The backward compatibility approach using `$or` to match both `EXECUTE_REMEDIATION` and missing fields ensures historical events without the field are still processed correctly.
BuildNonFatalUnhealthyInsertsPipelinepattern with processingStrategy filtering. The approach maintains consistency withBuildProcessableHealthEventInsertsPipelineand preserves the agent exclusion filter for health-events-analyzer.docs/designs/025-processing-strategy-for-health-checks.md (2)
102-104: LGTM!The design document clearly specifies the enum values and their semantics. The
UNSPECIFIED=0default with normalization toEXECUTE_REMEDIATIONis a sound design choice that ensures backward compatibility with custom monitors that don't set the field.
582-583: LGTM!The backward compatibility explanation is clear and accurately describes why the
$orpattern is needed for the health-events-analyzer queries but not for other modules that only process newly inserted events (which will always have the field set by platform-connector normalization).platform-connectors/pkg/server/platform_connector_server.go (1)
57-62: LGTM! Sensible default for backward compatibility.The normalization of
UNSPECIFIEDtoEXECUTE_REMEDIATIONensures backward compatibility with custom monitors that don't explicitly setprocessingStrategy. The mutation occurs before pipeline processing and ring buffer enqueue, ensuring all downstream consumers see the normalized value.health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (1)
362-364: Appropriate default with documented follow-up.Hardcoding
EXECUTE_REMEDIATIONis correct for CSP maintenance events since they need to trigger quarantine/recovery workflows. The TODO with a specific PR reference (#641) properly tracks the planned configurability enhancement.platform-connectors/pkg/server/platform_connector_server_test.go (1)
25-67: Well-structured table-driven test with good coverage.The test correctly validates all three
ProcessingStrategyenum values and properly verifies the in-place mutation behavior. The table-driven approach aligns with coding guidelines.distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml (1)
61-62: No action required. TheprocessingStrategyvalue is properly defined in values.yaml with a sensible default (EXECUTE_REMEDIATION), with documentation explaining the available modes. The implementation is correct.fault-quarantine/pkg/initializer/init.go (1)
66-66: Pipeline change correctly filters health events by processing strategy. The switch toBuildProcessableHealthEventInsertsPipeline()is intentional and well-designed. The function filters to only process health event inserts withEXECUTE_REMEDIATIONstrategy or missing strategy field (for backward compatibility with pre-upgrade events), excludingSTORE_ONLYevents. Both MongoDB and PostgreSQL implementations include proper backward compatibility handling for events created before the upgrade or from custom monitors.tests/fault_quarantine_test.go (3)
234-244: LGTM! Well-structured test setup for ProcessingStrategy validation.The test function follows the established e2e-framework pattern and correctly initializes the test context with an empty config file path, which appears intentional for testing default behavior.
246-282: LGTM! Comprehensive STORE_ONLY validation.The test correctly verifies that events with
STORE_ONLYprocessing strategy do not trigger node conditions, node events, or quarantine state changes. Good coverage of both fatal (SysLogsXIDError) and non-fatal (GpuPowerWatch) event types.
284-325: LGTM! Proper EXECUTE_REMEDIATION behavior validation.The test correctly verifies that events with
EXECUTE_REMEDIATIONprocessing strategy trigger appropriate node conditions, node events, and quarantine state. The assertions cover the expected cluster state modifications.health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go (3)
243-246: LGTM! Correct ProcessingStrategy expectation for quarantine events.The test correctly expects
ProcessingStrategy_EXECUTE_REMEDIATIONfor maintenance events mapped to health events. This aligns with the intended behavior where CSP maintenance events should trigger remediation actions.
272-274: LGTM! Correct ProcessingStrategy for healthy events.The healthy event expectation correctly includes
ProcessingStrategy_EXECUTE_REMEDIATION. Based on learnings, healthy events can legitimately useEXECUTE_REMEDIATIONwhen the Fault Quarantine Manager needs to act on them to clear previous fault states.
336-338: LGTM! Consistent ProcessingStrategy for unknown action fallback.The test correctly expects
ProcessingStrategy_EXECUTE_REMEDIATIONeven when theRecommendedActionis unknown. This maintains consistent behavior across all maintenance event mappings.health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py (6)
99-108: LGTM! Correct initialization with processing_strategy parameter.The test correctly initializes
PlatformConnectorEventProcessorwithSTORE_ONLYprocessing strategy.
301-302: LGTM! Proper verification of processingStrategy propagation.The assertion correctly verifies that the
processingStrategyis propagated through to the emitted health events.
421-421: LGTM! Consistent processingStrategy verification for multiple GPUs.Both GPU events are correctly verified to have the expected
STORE_ONLYprocessing strategy.Also applies to: 436-436
493-493: LGTM! processingStrategy verified for DCGM connectivity failure events.The assertion correctly verifies that system-level connectivity failure events also carry the configured processing strategy.
523-524: LGTM! Correct use of EXECUTE_REMEDIATION for connectivity restored events.This test correctly uses
EXECUTE_REMEDIATIONfor the connectivity restored scenario. Based on learnings, healthy events can legitimately useEXECUTE_REMEDIATIONwhen the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources.Also applies to: 549-549
595-604: LGTM! Consistent processing_strategy for retry and cache cleanup test.The test correctly initializes the processor with
STORE_ONLYand focuses on retry/cache behavior rather than processing strategy propagation, which is appropriately tested elsewhere.distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml (1)
56-62: LGTM! Well-documented configuration for processing strategy.The
processingStrategyoption is well-documented with valid values and clear semantics. The default toEXECUTE_REMEDIATIONmaintains backward compatibility. Both daemonset templates (DCGM 3.x and 4.x) correctly consume this value via the--processing-strategycommand-line flag. The inline comments follow the coding guidelines for Helm chart documentation.store-client/pkg/client/pipeline_builder.go (1)
35-46: LGTM! Well-documented interface extensions for processable event pipelines.The new interface methods are clearly documented with their purpose and intended consumers. Both implementations are present in the MongoDB and PostgreSQL builders, and they're correctly used via the
GetPipelineBuilder()interface pattern. The naming convention follows the established pattern consistently.tests/gpu_health_monitor_test.go (3)
41-48: LGTM! New constants and context keys for STORE_ONLY test.The new constants and context keys are well-defined and follow the existing patterns in the file.
665-706: LGTM! Test setup correctly configures STORE_ONLY strategy and injects error.The setup properly updates the DaemonSet with the
--processing-strategy STORE_ONLYflag, waits for rollout, and injects a test error to verify that the cluster state remains unaffected.
708-724: LGTM! Assess correctly validates STORE_ONLY behavior.The assertions properly verify that STORE_ONLY events don't create node conditions or cordon the node.
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)
1-51: Auto-generated protobuf code - no review needed.This file is auto-generated by the protocol buffer compiler as indicated by the header comment. The changes correctly reflect the addition of the
ProcessingStrategyenum and field to the HealthEvent message, with properly adjusted serialized offsets.platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)
1550-1630: LGTM! Well-structured table-driven test for ProcessingStrategy behavior.The test properly validates that STORE_ONLY events don't modify cluster state while EXECUTE_REMEDIATION events do. Each test case uses isolated resources, ensuring test independence.
platform-connectors/pkg/connectors/kubernetes/process_node_events.go (3)
327-345: LGTM! Clean implementation of event filtering.The
filterProcessableEventsfunction correctly filters out STORE_ONLY events and provides appropriate logging for skipped events. This aligns with the PR's goal of allowing events to be stored without triggering remediation.
347-372: LGTM! Well-encapsulated Kubernetes event creation.The
createK8sEventfunction properly encapsulates the event creation logic. TheEvent.Typebeing set tohealthEvent.CheckNameis confirmed as an intentional design choice for NVSentinel, based on learnings.
374-418: LGTM! Correct integration of filtering and event creation.The
processHealthEventsfunction properly usesfilterProcessableEventsto ensure only actionable events modify cluster state. The logic correctly handles emptyprocessableEventsby skipping node condition updates and event creation.health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py (2)
24-24: LGTM! Import alias for ProcessingStrategy usage.The import alias
platformconnector_pb2provides clear access to theProcessingStrategyenum used throughout the CLI.
37-50: LGTM! Proper type hint for the new parameter.The
processing_strategyparameter has appropriate type annotation usingplatformconnector_pb2.ProcessingStrategy, maintaining consistency with other parameters in the function.health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (2)
14-18: LGTM - ProcessingStrategy enum properly defined.The new enum follows standard protobuf stub patterns with the three expected values (UNSPECIFIED, EXECUTE_REMEDIATION, STORE_ONLY) and corresponding module-level constants.
Also applies to: 32-34
63-141: LGTM - HealthEvent correctly extended with processingStrategy field.The field is properly added to
__slots__, field numbers, type annotations, and__init__signature. The_Optional[_Union[ProcessingStrategy, str]]type correctly allows both enum values and string representations per protobuf conventions.health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py (4)
106-121: LGTM - processingStrategy correctly propagated to connectivity restoration event.The strategy is properly included in the HealthEvent for clearing DCGM connectivity failures. Based on learnings, this correctly supports healthy events that may use EXECUTE_REMEDIATION to clear previous fault states.
206-223: LGTM - processingStrategy correctly propagated to failure health events.The strategy is properly included when creating HealthEvent instances for GPU failures with entity impacts.
270-287: LGTM - processingStrategy correctly propagated to healthy state events.The strategy is properly included when creating HealthEvent instances for healthy GPU states.
366-381: LGTM - processingStrategy correctly propagated to DCGM connectivity failure events.All four HealthEvent creation points in this file (
clear_dcgm_connectivity_failure,health_event_occurredfailure path,health_event_occurredhealthy path, anddcgm_connectivity_failed) are consistently updated to include the processing strategy field.
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
Show resolved
Hide resolved
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
Show resolved
Hide resolved
π‘οΈ CodeQL Analysisπ¨ Found 1 security alert(s) π View details |
9c0336d to
b616378
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
π€ Fix all issues with AI agents
In @tests/gpu_health_monitor_test.go:
- Around line 726-734: In the feature.Teardown closure, add a nil/type-safe
check when retrieving nodeName from context (replace
ctx.Value(keyNodeName).(string) with a guarded retrieval that handles nil or
wrong-type and fails the test with require.NotNil/require.IsType or similar) to
avoid a panic if setup failed; also replace the hardcoded
"gpu-health-monitor-dcgm-4.x" and "gpu-health-monitor" arguments passed to
helpers.RemoveDaemonSetArgs with the defined constants
GPUHealthMonitorDaemonSetName and GPUHealthMonitorContainerName so the teardown
uses the canonical names.
- Line 47: The constant keyOriginalDaemonSet is declared but never used; either
delete the keyOriginalDaemonSet declaration to remove dead code, or if it's
intentionally reserved for future tests, keep it and add a TODO comment on the
declaration (e.g., "// TODO: keep for future original DaemonSet comparison") so
its presence is documented.
π§Ή Nitpick comments (8)
health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py (1)
99-108: Consider using keyword arguments for theprocessing_strategyparameter.For maintainability and clarity, using keyword arguments (as done in later tests like
test_dcgm_connectivity_failed) is preferable to positional arguments, especially with 8 parameters.β»οΈ Suggested improvement
platform_connector_test = platform_connector.PlatformConnectorEventProcessor( - socket_path, - node_name, - exit, - dcgm_errors_info_dict, - "statefile", - dcgm_health_conditions_categorization_mapping_config, - "/tmp/test_metadata.json", - platformconnector_pb2.STORE_ONLY, + socket_path=socket_path, + node_name=node_name, + exit=exit, + dcgm_errors_info_dict=dcgm_errors_info_dict, + state_file_path="statefile", + dcgm_health_conditions_categorization_mapping_config=dcgm_health_conditions_categorization_mapping_config, + metadata_path="/tmp/test_metadata.json", + processing_strategy=platformconnector_pb2.STORE_ONLY, )health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py (1)
125-132: Consider improving log readability and filtering UNSPECIFIED.Two observations:
Log message readability: Line 132 logs the integer enum value. Consider logging the strategy name for better operator experience.
UNSPECIFIED in valid options:
ProcessingStrategy.keys()includesUNSPECIFIED, which users could technically pass. Consider whether this should be excluded from valid options.β»οΈ Suggested improvement
try: processing_strategy_value = platformconnector_pb2.ProcessingStrategy.Value(processing_strategy) + if processing_strategy_value == platformconnector_pb2.UNSPECIFIED: + log.fatal("UNSPECIFIED is not a valid processing_strategy. Use EXECUTE_REMEDIATION or STORE_ONLY.") + sys.exit(1) except ValueError: - valid_strategies = list(platformconnector_pb2.ProcessingStrategy.keys()) + valid_strategies = [k for k in platformconnector_pb2.ProcessingStrategy.keys() if k != "UNSPECIFIED"] log.fatal(f"Invalid processing_strategy '{processing_strategy}'. " f"Valid options are: {valid_strategies}") sys.exit(1) - log.info(f"Event handling strategy configured to: {processing_strategy_value}") + log.info(f"Event handling strategy configured to: {processing_strategy}")tests/helpers/kube.go (4)
2281-2293: Unused variableoriginalDaemonSet.The
originalDaemonSetvariable is assigned but never used. This appears to be dead code, possibly leftover from a previous implementation that planned to restore the original state.β»οΈ Suggested fix
- var originalDaemonSet *appsv1.DaemonSet - t.Logf("Updating daemonset %s/%s with args %v", NVSentinelNamespace, daemonsetName, args) err := retry.RetryOnConflict(retry.DefaultRetry, func() error { daemonSet := &appsv1.DaemonSet{} if err := client.Resources().Get(ctx, daemonsetName, NVSentinelNamespace, daemonSet); err != nil { return err } - if originalDaemonSet == nil { - originalDaemonSet = daemonSet.DeepCopy() - } - containers := daemonSet.Spec.Template.Spec.Containers
2305-2315: Consider removing the unnecessary sleep and wrapping the error with context. The 10-second sleep after `waitForDaemonSetRollout` may be unnecessary, since the rollout wait already ensures pods are ready. If additional stabilization time is needed, consider documenting why or using a named constant. Per coding guidelines, errors should be wrapped with context using `fmt.Errorf`.

Suggested fix:

```diff
 	if err != nil {
-		return err
+		return fmt.Errorf("failed to update daemonset %s/%s args: %w", NVSentinelNamespace, daemonsetName, err)
 	}

 	t.Logf("Waiting for daemonset %s/%s rollout to complete", NVSentinelNamespace, daemonsetName)
 	waitForDaemonSetRollout(ctx, t, client, daemonsetName)

-	t.Logf("Waiting 10 seconds for daemonset pods to start")
-	time.Sleep(10 * time.Second)
-
 	return nil
```
2343-2350: Inconsistent error handling pattern. The function uses `require.NoError`, which will panic on error, but then returns `nil` unconditionally. This is inconsistent with `UpdateDaemonSetArgs`, which returns errors to the caller. Consider using consistent error handling.

Option 1: Return the error, as `UpdateDaemonSetArgs` does:

```diff
 	})
-	require.NoError(t, err, "failed to remove args from daemonset %s/%s", NVSentinelNamespace, daemonsetName)
+	if err != nil {
+		return fmt.Errorf("failed to remove args from daemonset %s/%s: %w", NVSentinelNamespace, daemonsetName, err)
+	}

 	t.Logf("Waiting for daemonset %s/%s rollout to complete after restoration", NVSentinelNamespace, daemonsetName)
```
2433-2477: Parameter `daemonsetName` is not used for filtering. The `daemonsetName` parameter is only used in the error message, not for actual pod filtering. The function finds pods by `podNamePattern` alone via `GetPodOnWorkerNode`. If the intent is to verify that the pod belongs to the specific DaemonSet, consider adding owner-reference validation. Otherwise, consider removing the parameter or documenting that it is for logging only.

Option: remove the unused parameter, or add owner verification.

If the parameter is only for logging:

```diff
-func GetDaemonSetPodOnWorkerNode(ctx context.Context, t *testing.T, client klient.Client,
-	daemonsetName string, podNamePattern string) (*v1.Pod, error) {
+func GetDaemonSetPodOnWorkerNode(ctx context.Context, t *testing.T, client klient.Client,
+	podNamePattern string) (*v1.Pod, error) {
```

Or add validation that the pod is owned by the DaemonSet:

```go
// After getting the pod, verify ownership
for _, ownerRef := range pod.OwnerReferences {
	if ownerRef.Kind == "DaemonSet" && ownerRef.Name == daemonsetName {
		// Pod belongs to expected DaemonSet
		break
	}
}
```

tests/gpu_health_monitor_test.go (2)
669-673: Use constants consistently for DaemonSet and container names. The constants `GPUHealthMonitorDaemonSetName` and `GPUHealthMonitorContainerName` are defined but not consistently used. Line 669 correctly uses the constants, but line 673 duplicates the DaemonSet name string.

Suggested fix:

```diff
-	gpuHealthMonitorPod, err := helpers.GetDaemonSetPodOnWorkerNode(ctx, t, client, GPUHealthMonitorDaemonSetName, "gpu-health-monitor-dcgm-4.x")
+	gpuHealthMonitorPod, err := helpers.GetDaemonSetPodOnWorkerNode(ctx, t, client, GPUHealthMonitorDaemonSetName, GPUHealthMonitorDaemonSetName)
```
691-694: Remove redundant context retrieval in setup. The variables `nodeName` and `podName` are retrieved from context immediately after being stored, while `testNodeName` and `gpuHealthMonitorPodName` are still in scope. This is unnecessary indirection.

Suggested simplification:

```diff
 	ctx = context.WithValue(ctx, keyNodeName, testNodeName)
 	ctx = context.WithValue(ctx, keyGpuHealthMonitorPodName, gpuHealthMonitorPodName)

 	restConfig := client.RESTConfig()
-	nodeName := ctx.Value(keyNodeName).(string)
-	podName := ctx.Value(keyGpuHealthMonitorPodName).(string)
-
-	t.Logf("Injecting Inforom error on node %s", nodeName)
+	t.Logf("Injecting Inforom error on node %s", testNodeName)
 	cmd := []string{"/bin/sh", "-c", fmt.Sprintf("dcgmi test --host %s:%s --inject --gpuid 0 -f 84 -v 0", dcgmServiceHost, dcgmServicePort)}
-	stdout, stderr, execErr := helpers.ExecInPod(ctx, restConfig, helpers.NVSentinelNamespace, podName, "", cmd)
+	stdout, stderr, execErr := helpers.ExecInPod(ctx, restConfig, helpers.NVSentinelNamespace, gpuHealthMonitorPodName, "", cmd)
```
Review details

Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

Files ignored due to path filters (1)
- data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go

Files selected for processing (10)
- distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml
- distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml
- distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml
- health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py
- health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
- health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py
- tests/event_exporter_test.go
- tests/gpu_health_monitor_test.go
- tests/helpers/event_exporter.go
- tests/helpers/kube.go
π§ Files skipped from review as they are similar to previous changes (4)
- distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml
- tests/event_exporter_test.go
- distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml
- distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml
π§° Additional context used
π Path-based instructions (3)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
tests/helpers/event_exporter.gotests/gpu_health_monitor_test.gotests/helpers/kube.go
**/*.py
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code
Files:
health-monitors/gpu-health-monitor/gpu_health_monitor/cli.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
tests/gpu_health_monitor_test.go
π§ Learnings (4)
π Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
π Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.
Applied to files:
tests/helpers/event_exporter.gohealth-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
π Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
tests/gpu_health_monitor_test.go
π Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
Applied to files:
health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
𧬠Code graph analysis (1)
health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py (2)
data-models/pkg/protos/health_event.pb.go (4)
- ProcessingStrategy (45-45), ProcessingStrategy (77-79), ProcessingStrategy (81-83), ProcessingStrategy (90-92)

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
- ProcessingStrategy (14-18)
Additional comments (14)
tests/helpers/event_exporter.go (2)
220-226: LGTM - function signature updated correctly for processing strategy validation. The new `expectedProcessingStrategy` parameter follows the established pattern. Consider updating the function's godoc comment to document all parameters, per the coding guideline that exported functions require function comments.

245-245: Assertion correctly validates the processingStrategy field. The new assertion follows the existing pattern and validates that the `healthEvent` contains the expected processing strategy. This aligns with the PR's objective of adding event handling strategy changes.

health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py (1)
301-302: Test coverage for processingStrategy propagation looks good. The assertions correctly verify that:
- Failure events propagate the configured STORE_ONLY strategy
- Connectivity-restored events propagate EXECUTE_REMEDIATION (valid per the learning that healthy events can use EXECUTE_REMEDIATION when the Fault Quarantine Manager needs to clear previous fault states or update cluster resources to reflect the healthy status)

Also applies to: 421-421, 436-436, 493-493, 549-549
health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py (2)
41-66: Constructor changes are well-structured. The new `processing_strategy` parameter:
- Has a proper type hint (`platformconnector_pb2.ProcessingStrategy`) per coding guidelines
- Uses the underscore-prefix convention for internal state
- Is stored for consistent propagation to all emitted events

106-121: Consistent propagation of processingStrategy across all event construction sites. The `processingStrategy` field is correctly added to all four HealthEvent construction paths:
- clear_dcgm_connectivity_failure (Line 120)
- health_event_occurred - failure branch (Line 221)
- health_event_occurred - healthy branch (Line 285)
- dcgm_connectivity_failed (Line 380)

Also applies to: 206-222, 270-286, 366-381
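For reference, a minimal sketch of the pattern these comments verify: the strategy chosen at start-up is kept on the processor and stamped onto every event it builds. The types and field names below are illustrative stand-ins, not the repository's generated proto or Python classes.

```go
package sketch

// ProcessingStrategy stands in for the proto enum; the real values come from
// the generated health_event protos.
type ProcessingStrategy int

const (
	ExecuteRemediation ProcessingStrategy = iota + 1
	StoreOnly
)

// HealthEvent is a stand-in for the generated message type.
type HealthEvent struct {
	CheckName          string
	IsHealthy          bool
	ProcessingStrategy ProcessingStrategy
}

// eventProcessor stores the configured strategy once and applies it to every
// event, regardless of whether the event reports a failure or a recovery.
type eventProcessor struct {
	strategy ProcessingStrategy
}

func (p *eventProcessor) newEvent(checkName string, healthy bool) HealthEvent {
	return HealthEvent{
		CheckName:          checkName,
		IsHealthy:          healthy,
		ProcessingStrategy: p.strategy, // same field on failure, healthy, and connectivity events
	}
}
```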
health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py (2)
74-80: CLI option addition looks correct. The `--processing-strategy` option is well-defined with:
- A sensible default (EXECUTE_REMEDIATION)
- Clear help text describing the two main options
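For illustration only, here is a minimal Go sketch of the same default-plus-validation pattern; the actual implementation lives in cli.py and uses Click, so the helper name and error text below are assumptions.

```go
package cliflags

import (
	"flag"
	"fmt"
)

// validStrategies lists the accepted names; EXECUTE_REMEDIATION stays the
// default so deployments that omit the flag keep the existing behaviour.
var validStrategies = map[string]bool{
	"EXECUTE_REMEDIATION": true,
	"STORE_ONLY":          true,
}

// parseProcessingStrategy parses and validates the strategy flag from args.
func parseProcessingStrategy(args []string) (string, error) {
	fs := flag.NewFlagSet("gpu-health-monitor", flag.ContinueOnError)
	strategy := fs.String("processing-strategy", "EXECUTE_REMEDIATION",
		"Event processing strategy: EXECUTE_REMEDIATION or STORE_ONLY")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	if !validStrategies[*strategy] {
		return "", fmt.Errorf("invalid processing-strategy %q; valid options: EXECUTE_REMEDIATION, STORE_ONLY", *strategy)
	}
	return *strategy, nil
}
```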
28-51: Function signature update is correct. The `_init_event_processor` function:
- Has a proper type hint for the new parameter (`platformconnector_pb2.ProcessingStrategy`) per coding guidelines
- Correctly propagates the value to `PlatformConnectorEventProcessor`

tests/helpers/kube.go (4)
2232-2273: LGTM! The `waitForDaemonSetRollout` function correctly checks all necessary DaemonSet status conditions (DesiredNumberScheduled, UpdatedNumberScheduled, NumberReady) and follows the same pattern as the existing `WaitForDeploymentRollout` function.
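As a point of comparison, here is a minimal, self-contained sketch of such a rollout wait using client-go directly. It is not the repository's helper (which works with the e2e-framework's klient.Client); the polling interval, timeout, and function name are assumptions.

```go
package kubewait

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForDaemonSetRollout polls the DaemonSet status until every desired pod
// is scheduled on the updated revision and reports Ready.
func waitForDaemonSetRollout(ctx context.Context, cs kubernetes.Interface, namespace, name string) error {
	return wait.PollUntilContextTimeout(ctx, 5*time.Second, 5*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			ds, err := cs.AppsV1().DaemonSets(namespace).Get(ctx, name, metav1.GetOptions{})
			if err != nil {
				return false, err
			}
			desired := ds.Status.DesiredNumberScheduled
			done := desired > 0 &&
				ds.Status.UpdatedNumberScheduled == desired &&
				ds.Status.NumberReady == desired
			if !done {
				fmt.Printf("rollout in progress: desired=%d updated=%d ready=%d\n",
					desired, ds.Status.UpdatedNumberScheduled, ds.Status.NumberReady)
			}
			return done, nil
		})
}
```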
2353-2381: LGTM! The `tryUpdateExistingArg` helper correctly handles both `--flag=value` and `--flag value` argument styles, including the case where a value needs to be inserted after an existing flag.
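For readers unfamiliar with the two styles, a standalone sketch of the same idea is below; it is not the repository's `tryUpdateExistingArg`, and the helper name is an assumption.

```go
package argutil

import "strings"

// upsertArg updates flag within args, handling both "--flag=value" and
// "--flag value" styles, and appends "--flag=value" when the flag is absent.
func upsertArg(args []string, flag, value string) []string {
	for i, a := range args {
		switch {
		case strings.HasPrefix(a, flag+"="):
			// "--flag=value" style: rewrite the single element in place.
			args[i] = flag + "=" + value
			return args
		case a == flag:
			// "--flag value" style: the value lives in the next element.
			if i+1 < len(args) {
				args[i+1] = value
			} else {
				args = append(args, value)
			}
			return args
		}
	}
	// Flag not present yet: append it in the "--flag=value" form.
	return append(args, flag+"="+value)
}
```

For example, `upsertArg([]string{"--log-level=info"}, "--processing-strategy", "STORE_ONLY")` yields `[--log-level=info --processing-strategy=STORE_ONLY]`.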
2383-2405: LGTM! The function correctly sets container arguments, using `tryUpdateExistingArg` for updates and appending new arguments when not found.

2407-2431: LGTM! The function correctly removes container arguments, handling both `--flag=value` and `--flag value` styles. The early `break` after removal ensures safe slice modification.

tests/gpu_health_monitor_test.go (3)
708-724: Assess phase looks correct for STORE_ONLY verification. The test correctly validates that in STORE_ONLY mode:
- Node conditions are not applied
- Node is not cordoned
This aligns with the expected behavior where events are stored but no cluster modifications occur.
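A minimal sketch of those two checks with client-go is shown below; it is not the repository's `EnsureNodeConditionNotPresent`/`AssertQuarantineState` helpers, and the function name and condition type are assumptions.

```go
package storeonly

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// assertNoRemediation verifies the two STORE_ONLY invariants: the node carries
// no condition of the given type and is not cordoned.
func assertNoRemediation(ctx context.Context, cs kubernetes.Interface, nodeName string, conditionType corev1.NodeConditionType) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("get node %s: %w", nodeName, err)
	}
	for _, cond := range node.Status.Conditions {
		if cond.Type == conditionType {
			return fmt.Errorf("unexpected node condition %q on %s", conditionType, nodeName)
		}
	}
	if node.Spec.Unschedulable {
		return fmt.Errorf("node %s is cordoned, which STORE_ONLY mode should not do", nodeName)
	}
	return nil
}
```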
745-749: Potential issue: the pod may have been replaced after DaemonSet args restoration. After `RemoveDaemonSetArgs` restores the DaemonSet configuration at line 732, the pod referenced by `podName` may have been replaced by a new pod. Executing cleanup commands on this pod could fail if the pod no longer exists or is in a terminating state. Consider either:
- Re-fetching the current pod after the DaemonSet update
- Adding error handling for the exec command (currently errors are silently ignored with `_, _, _ =`)

The current silent error handling (`_, _, _ =`) does provide resilience, but the cleanup may not actually execute. Verify whether this is acceptable for test reliability.
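If the test opts to re-fetch, a small sketch of looking up the current pod before exec is below; it lists pods with client-go and is not the repository's helper, so the function name and name-prefix matching are assumptions.

```go
package podutil

import (
	"context"
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// currentPodWithPrefix re-lists pods and returns a Running, non-terminating pod
// whose name starts with prefix, so cleanup targets the pod that exists after
// the DaemonSet rollout rather than a stale name captured earlier.
func currentPodWithPrefix(ctx context.Context, cs kubernetes.Interface, namespace, prefix string) (*corev1.Pod, error) {
	pods, err := cs.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, fmt.Errorf("list pods in %s: %w", namespace, err)
	}
	for i := range pods.Items {
		p := &pods.Items[i]
		if strings.HasPrefix(p.Name, prefix) &&
			p.Status.Phase == corev1.PodRunning &&
			p.DeletionTimestamp == nil {
			return p, nil
		}
	}
	return nil, fmt.Errorf("no running pod with prefix %q in %s", prefix, namespace)
}
```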
657-664: Test structure follows established patterns. The test properly uses the `features.New` pattern with labels consistent with the other tests in this file. The setup/assess/teardown structure is well-organized.
Actionable comments posted: 2
Fix all issues with AI agents
In @tests/gpu_health_monitor_test.go:
- Around line 732-734: Replace the hardcoded DaemonSet and container name
strings in the RemoveDaemonSetArgs call with the existing constants
GPUHealthMonitorDaemonSetName and GPUHealthMonitorContainerName, and align the
arg map value to match the setup phase by using "STORE_ONLY" (or an empty
string) for the "--processing-strategy" value so itβs semantically consistent;
update the call to RemoveDaemonSetArgs(ctx, t, client,
GPUHealthMonitorDaemonSetName, GPUHealthMonitorContainerName,
map[string]string{"--processing-strategy": "STORE_ONLY"}) (or use "" instead of
"STORE_ONLY") to make intent clear.
In @tests/helpers/kube.go:
- Around line 2275-2316: In UpdateDaemonSetArgs remove the unused
originalDaemonSet variable and its DeepCopy assignment, add explicit validation
that the target container was found/updated (e.g., track a bool like
foundContainer when iterating containers and return an error if false) so the
function fails instead of silently succeeding, and eliminate the hardcoded
time.Sleep(10 * time.Second) by relying on waitForDaemonSetRollout or making the
post-rollout wait configurable (e.g., add an optional delay parameter or use
context-based waiting) so no arbitrary 10s sleep remains.
Nitpick comments (4)
tests/helpers/kube.go (2)
2318-2351: Inconsistent error handling pattern. The function signature returns `error`, but line 2343 uses `require.NoError`, which will panic on failure. This is inconsistent with `UpdateDaemonSetArgs`, which returns the error to the caller. Either:
- Remove the `require.NoError` and return the error (consistent with `UpdateDaemonSetArgs`), or
- Change the return type to match the behavior (return nothing, always panic on error)

Also consider adding container validation similar to the suggestion for `UpdateDaemonSetArgs`.

Option 1: Return the error consistently:

```diff
 	err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
 		// ...
 	})
-	require.NoError(t, err, "failed to remove args from daemonset %s/%s", NVSentinelNamespace, daemonsetName)
+	if err != nil {
+		return fmt.Errorf("failed to remove args from daemonset %s/%s: %w", NVSentinelNamespace, daemonsetName, err)
+	}
```
2433-2477: Parameter `daemonsetName` is not used for filtering. The `daemonsetName` parameter is only used in the error message (line 2473) and doesn't actually verify that the found pod belongs to the specified DaemonSet. The filtering relies solely on `podNamePattern`, which could lead to incorrect assumptions by callers. Consider either:
- Using `daemonsetName` to verify ownership via labels (e.g., checking `ownerReferences`), or
- Removing the parameter if `podNamePattern` is sufficient, or
- Adding a comment documenting that `podNamePattern` is responsible for filtering

Option: add ownership verification:

```diff
 func GetDaemonSetPodOnWorkerNode(ctx context.Context, t *testing.T, client klient.Client,
 	daemonsetName string, podNamePattern string) (*v1.Pod, error) {
 	t.Helper()

 	var resultPod *v1.Pod

 	require.Eventually(t, func() bool {
 		// Get the pod
 		pod, err := GetPodOnWorkerNode(ctx, t, client, NVSentinelNamespace, podNamePattern)
 		if err != nil {
 			t.Logf("Failed to get pod: %v", err)
 			return false
 		}

+		// Verify pod belongs to the expected DaemonSet
+		belongsToDaemonSet := false
+		for _, ownerRef := range pod.OwnerReferences {
+			if ownerRef.Kind == "DaemonSet" && strings.HasPrefix(ownerRef.Name, daemonsetName) {
+				belongsToDaemonSet = true
+				break
+			}
+		}
+		if !belongsToDaemonSet {
+			t.Logf("Pod %s does not belong to daemonset %s", pod.Name, daemonsetName)
+			return false
+		}
+
 		// Verify pod is not being deleted
```

tests/gpu_health_monitor_test.go (2)
46-48: Unused context key `keyOriginalDaemonSet`. The context key `keyOriginalDaemonSet` is declared but never used anywhere in the test. Either remove it or implement the intended functionality to store/restore the original DaemonSet state.

Suggested fix:

```diff
 const (
 	keyGpuHealthMonitorPodName contextKey = "gpuHealthMonitorPodName"
-	keyOriginalDaemonSet       contextKey = "originalDaemonSet"
 )
```
691-694: Redundant context value extraction. The variables `testNodeName` and `gpuHealthMonitorPodName` are already in scope from lines 677-678; extracting them again from context is unnecessary.

Suggested fix:

```diff
 	ctx = context.WithValue(ctx, keyNodeName, testNodeName)
 	ctx = context.WithValue(ctx, keyGpuHealthMonitorPodName, gpuHealthMonitorPodName)

 	restConfig := client.RESTConfig()
-	nodeName := ctx.Value(keyNodeName).(string)
-	podName := ctx.Value(keyGpuHealthMonitorPodName).(string)
-
-	t.Logf("Injecting Inforom error on node %s", nodeName)
+	t.Logf("Injecting Inforom error on node %s", testNodeName)
 	cmd := []string{"/bin/sh", "-c", fmt.Sprintf("dcgmi test --host %s:%s --inject --gpuid 0 -f 84 -v 0", dcgmServiceHost, dcgmServicePort)}
-	stdout, stderr, execErr := helpers.ExecInPod(ctx, restConfig, helpers.NVSentinelNamespace, podName, "", cmd)
+	stdout, stderr, execErr := helpers.ExecInPod(ctx, restConfig, helpers.NVSentinelNamespace, gpuHealthMonitorPodName, "", cmd)
```
Review details

Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

Files selected for processing (3)
- tests/gpu_health_monitor_test.go
- tests/helpers/event_exporter.go
- tests/helpers/kube.go
π§ Files skipped from review as they are similar to previous changes (1)
- tests/helpers/event_exporter.go
π§° Additional context used
π Path-based instructions (2)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
tests/gpu_health_monitor_test.gotests/helpers/kube.go
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
tests/gpu_health_monitor_test.go
π§ Learnings (3)
π Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
π Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
tests/gpu_health_monitor_test.go
π Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".
Applied to files:
tests/gpu_health_monitor_test.go
𧬠Code graph analysis (1)
tests/gpu_health_monitor_test.go (1)
tests/helpers/kube.go (7)
- UpdateDaemonSetArgs (2276-2316), GetDaemonSetPodOnWorkerNode (2433-2477), NVSentinelNamespace (64-64), SetNodeManagedByNVSentinel (1389-1408), ExecInPod (1563-1598), RemoveDaemonSetArgs (2318-2351), RemoveNodeManagedByNVSentinelLabel (1411-1425)
Additional comments (6)
tests/helpers/kube.go (4)
2232-2273: LGTM with a minor observation. The function correctly validates the DaemonSet rollout by checking DesiredNumberScheduled, UpdatedNumberScheduled, and NumberReady, and the logging gives good visibility into rollout progress. Consider adding an ObservedGeneration check (similar to `WaitForDeploymentRollout` at line 1020) to ensure the DaemonSet controller has processed the latest spec changes, though this is optional for test helpers.
2353-2381: LGTM! The function correctly handles multiple argument styles (`--flag=value`, `--flag value`, and boolean `--flag`). The slice insertion logic at line 2373 correctly preserves existing arguments while inserting the new value.

2383-2405: LGTM! The function follows the same pattern as `setEnvVarsOnContainer` and correctly handles both updating existing arguments and appending new ones.

2407-2431: LGTM! The function correctly handles removal of arguments in both `--flag=value` and `--flag value` formats. The immediate `break` after slice modification avoids index issues during iteration.

tests/gpu_health_monitor_test.go (2)
657-765: Test structure and coverage for store-only mode look good. The test correctly validates that in STORE_ONLY mode:
- Node conditions are not applied (EnsureNodeConditionNotPresent)
- The node is not cordoned (AssertQuarantineState)

The setup/assess/teardown pattern follows the established conventions in the file.
34-43: LGTM on the new constants. The exported constants `GPUHealthMonitorContainerName` and `GPUHealthMonitorDaemonSetName` are well-named and useful for test consistency.
Actionable comments posted: 1
Fix all issues with AI agents
In `@tests/gpu_health_monitor_test.go`:
- Around line 728-744: The teardown handler inside feature.Teardown reads
ctx.Value(keyNodeName) and ctx.Value(keyGpuHealthMonitorOriginalArgs) and casts
them directly, which can panic if those keys are unset; add nil checks like the
existing podNameVal check: retrieve nodeNameVal and originalArgsVal, if either
is nil log a message (e.g., "Skipping teardown: nodeName/originalArgs not set")
and return ctx, otherwise cast to string and []string respectively before
calling helpers.RestoreDaemonSetArgs and proceeding; ensure you reference
keyNodeName and keyGpuHealthMonitorOriginalArgs and keep behavior consistent
with the podNameVal guard.
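A minimal sketch of the guarded lookup the prompt asks for is below, assuming string-valued context keys; the key values and helper names are taken from the review text and may differ from the repository.

```go
package teardown

import (
	"context"
	"testing"
)

type contextKey string

const (
	keyNodeName                     contextKey = "nodeName"
	keyGpuHealthMonitorOriginalArgs contextKey = "gpuHealthMonitorOriginalArgs"
)

// guardedTeardownValues returns the stored node name and original args, or
// ok=false when either key was never set (for example because Setup failed
// early), so the caller can skip restoration instead of panicking on a nil
// type assertion.
func guardedTeardownValues(ctx context.Context, t *testing.T) (nodeName string, originalArgs []string, ok bool) {
	nodeNameVal := ctx.Value(keyNodeName)
	originalArgsVal := ctx.Value(keyGpuHealthMonitorOriginalArgs)
	if nodeNameVal == nil || originalArgsVal == nil {
		t.Log("Skipping teardown: nodeName or original args not set in context")
		return "", nil, false
	}
	return nodeNameVal.(string), originalArgsVal.([]string), true
}
```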
Nitpick comments (3)
tests/gpu_health_monitor_test.go (1)
45-49: Remove the unused constant `keyOriginalDaemonSet`. The constant is declared at line 47 but never used anywhere in the codebase; the test uses `keyGpuHealthMonitorOriginalArgs` instead. Remove this dead code or add a TODO comment if it is reserved for future use.

health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py (2)
74-80: Consider using `click.Choice` for built-in validation and auto-completion. The current approach with a string type and manual validation works, but `click.Choice` would provide better UX with automatic validation and shell auto-completion support.

Optional enhancement:

```diff
 @click.option(
     "--processing-strategy",
-    type=str,
+    type=click.Choice(["EXECUTE_REMEDIATION", "STORE_ONLY"], case_sensitive=True),
     default="EXECUTE_REMEDIATION",
     help="Event processing strategy: EXECUTE_REMEDIATION or STORE_ONLY",
     required=False,
 )
```

Note: this would require adjusting the validation logic at lines 125-130, since Click would handle invalid input.
125-132: The log message could be more user-friendly. Line 132 logs the enum integer value (e.g., `1`) rather than the human-readable name. Consider logging the original string for clarity.

Proposed improvement:

```diff
-    log.info(f"Event handling strategy configured to: {processing_strategy_value}")
+    log.info(f"Event handling strategy configured to: {processing_strategy}")
```

This will log "Event handling strategy configured to: EXECUTE_REMEDIATION" instead of "Event handling strategy configured to: 1".
Review details

Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

Files selected for processing (7)
- distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml
- distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml
- distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml
- health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py
- health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
- health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py
- tests/gpu_health_monitor_test.go

Files skipped from review as they are similar to previous changes (1)
- distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml
π§° Additional context used
π Path-based instructions (6)
**/*.py
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code
Files:
health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/cli.py
**/values.yaml
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/values.yaml: Document all values in Helm chartvalues.yamlwith inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable
Files:
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
tests/gpu_health_monitor_test.go
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
tests/gpu_health_monitor_test.go
**/daemonset*.yaml
π CodeRabbit inference engine (.github/copilot-instructions.md)
Explain DaemonSet variant selection logic in Helm chart documentation
Files:
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml
distros/kubernetes/**/*daemonset*.yaml
π CodeRabbit inference engine (.github/copilot-instructions.md)
distros/kubernetes/**/*daemonset*.yaml: Separate DaemonSets should be created for kata vs regular nodes usingnodeAffinitybased on kata.enabled label
Regular node DaemonSets should use/var/logvolume mount for file-based logs
Kata node DaemonSets should use/run/log/journaland/var/log/journalvolume mounts for systemd journal
Files:
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml
π§ Learnings (8)
π Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: health-monitors/syslog-health-monitor/main.go:164-172
Timestamp: 2026-01-14T06:30:15.804Z
Learning: In NVSentinel's syslog-health-monitor, the processing strategy flag accepts UNSPECIFIED from configuration, and platform_connector normalizes any UNSPECIFIED value to EXECUTE_REMEDIATION to maintain consistency with the default execution mode. This normalization happens in platform_connector_server.go around lines 59-60.
π Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
Applied to files:
health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.pydistros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml
π Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.
Applied to files:
health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
π Learning: 2026-01-14T06:30:15.804Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: health-monitors/syslog-health-monitor/main.go:164-172
Timestamp: 2026-01-14T06:30:15.804Z
Learning: In NVSentinel's syslog-health-monitor, the processing strategy flag accepts UNSPECIFIED from configuration, and platform_connector normalizes any UNSPECIFIED value to EXECUTE_REMEDIATION to maintain consistency with the default execution mode. This normalization happens in platform_connector_server.go around lines 59-60.
Applied to files:
health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.pydistros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yamlhealth-monitors/gpu-health-monitor/gpu_health_monitor/cli.py
π Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
tests/gpu_health_monitor_test.go
π Learning: 2026-01-14T02:33:00.058Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 689
File: janitor/pkg/controller/rebootnode_controller_test.go:371-436
Timestamp: 2026-01-14T02:33:00.058Z
Learning: In the NVSentinel janitor controller tests, tests that demonstrate original bugs or issues that were fixed by a PR should be kept for posterity, even if they reference removed functionality like MaxRebootRetries or RetryCount fields. These historical test cases serve as documentation of what problem was being solved.
Applied to files:
tests/gpu_health_monitor_test.go
π Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".
Applied to files:
tests/gpu_health_monitor_test.go
π Learning: 2026-01-12T05:13:24.947Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:347-372
Timestamp: 2026-01-12T05:13:24.947Z
Learning: In the NVSentinel codebase, all HealthEvent creation paths across health monitors (syslog-health-monitor, kubernetes-object-monitor, csp-health-monitor) consistently set GeneratedTimestamp to timestamppb.New(time.Now()), ensuring it is never nil in production flows.
Applied to files:
tests/gpu_health_monitor_test.go
Code graph analysis (2)

health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py (2)
- data-models/pkg/protos/health_event.pb.go (4): ProcessingStrategy (44-44), ProcessingStrategy (76-78), ProcessingStrategy (80-82), ProcessingStrategy (89-91)
- health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1): ProcessingStrategy (14-18)

tests/gpu_health_monitor_test.go (4)
- tests/helpers/kube.go (8): UpdateDaemonSetArgs (2300-2346), GetDaemonSetPodOnWorkerNode (2450-2494), NVSentinelNamespace (64-64), SetNodeManagedByNVSentinel (1401-1420), ExecInPod (1575-1610), EnsureNodeConditionNotPresent (1811-1832), RestoreDaemonSetArgs (2350-2392), RemoveNodeManagedByNVSentinelLabel (1423-1437)
- tests/helpers/metadata.go (3): CreateTestMetadata (59-108), InjectMetadata (110-181), DeleteMetadata (183-229)
- commons/pkg/auditlogger/auditlogger.go (1): Log (114-134)
- tests/helpers/fault_quarantine.go (2): AssertQuarantineState (317-384), QuarantineAssertion (56-60)
β° Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: ko-build-test (health-monitors/kubernetes-object-monitor, .)
- GitHub Check: ko-build-test (labeler, .)
- GitHub Check: ko-build-test (fault-remediation, .)
- GitHub Check: container-build-test (syslog-health-monitor, make -C health-monitors/syslog-health-monitor docker...
- GitHub Check: container-build-test (gpu-health-monitor-dcgm4, make -C health-monitors/gpu-health-monitor docker...
- GitHub Check: container-build-test (file-server-cleanup, make -C log-collector docker-build-file-server-cleanup)
- GitHub Check: container-build-test (gpu-health-monitor-dcgm3, make -C health-monitors/gpu-health-monitor docker...
- GitHub Check: container-build-test (log-collector, make -C log-collector docker-build-log-collector)
- GitHub Check: E2E Tests (ARM64 + PostgreSQL)
- GitHub Check: E2E Tests (ARM64 + MongoDB)
- GitHub Check: E2E Tests (AMD64 + PostgreSQL)
- GitHub Check: E2E Tests (AMD64 + MongoDB)
- GitHub Check: CodeQL PR Analysis
- GitHub Check: simple-lint (scripts, make -C scripts lint, Run shellcheck on scripts)
- GitHub Check: modules-lint-test (fault-quarantine)
- GitHub Check: modules-lint-test (labeler)
- GitHub Check: modules-lint-test (platform-connectors)
- GitHub Check: modules-lint-test (node-drainer)
- GitHub Check: health-monitors-lint-test (csp-health-monitor)
- GitHub Check: health-monitors-lint-test (kubernetes-object-monitor)
Additional comments (19)
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml (1)
56-62: Well-documented processingStrategy configuration. The inline comments clearly explain both modes and their behavior. The default EXECUTE_REMEDIATION aligns with the platform connector's normalization behavior (UNSPECIFIED → EXECUTE_REMEDIATION) and, based on learnings, is consistent with how syslog-health-monitor handles the same flag. Consider adding a note that UNSPECIFIED is also accepted but will be normalized to EXECUTE_REMEDIATION by the platform connector, for completeness.

distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml (1)
61-62: LGTM! The `--processing-strategy` argument is correctly added and sourced from `.Values.processingStrategy`. The quoting ensures proper string handling.

tests/gpu_health_monitor_test.go (2)
658-708: Test setup properly configures STORE_ONLY mode and injects errors. The setup correctly:
- Updates the DaemonSet args to use the STORE_ONLY strategy
- Retrieves the pod after rollout
- Injects test metadata and the error
- Stores the original args for restoration
One observation: The error injection happens in setup (lines 698-705) rather than in an Assess step. This is fine for this test since you're verifying the absence of cluster changes, but consider documenting this design choice with a comment.
710-726: Assess step correctly validates STORE_ONLY behavior. The assertions properly verify that:
- The node condition GpuInforomWatch is not applied
- The node is not cordoned and has no quarantine annotation
This aligns with the PR objective of verifying that STORE_ONLY mode persists/exports events without modifying cluster resources.
health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py (5)
41-66: Clean implementation of the processing_strategy parameter. The new `processing_strategy` parameter is properly typed with `platformconnector_pb2.ProcessingStrategy` and stored as an instance variable. This follows PEP 8 and the coding guidelines requirement for type hints.

106-121: processingStrategy correctly propagated in clear_dcgm_connectivity_failure. The HealthEvent includes `processingStrategy=self._processing_strategy`, ensuring connectivity-restoration events respect the configured strategy.

206-223: processingStrategy correctly propagated in health_event_occurred (failure path). The HealthEvent for GPU failures includes the processing strategy field.

270-287: processingStrategy correctly propagated in health_event_occurred (healthy path). The HealthEvent for healthy status includes the processing strategy field, maintaining consistency across all event types.

366-381: processingStrategy correctly propagated in dcgm_connectivity_failed. The DCGM connectivity failure HealthEvent includes the processing strategy field, completing coverage of all HealthEvent creation paths.
health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py (6)
99-108: Test correctly updated to include the processing_strategy parameter. The `PlatformConnectorEventProcessor` instantiation now includes `platformconnector_pb2.STORE_ONLY` as the processing_strategy argument, aligning with the updated constructor signature.

301-302: Good assertion for processingStrategy propagation. Verifying that `nvlink_failure_event.processingStrategy == platformconnector_pb2.STORE_ONLY` confirms the field is correctly propagated through the HealthEvent pipeline.

421-436: Assertions verify processingStrategy for multi-GPU events. Both GPU 0 and GPU 1 events are checked for the correct processingStrategy value, ensuring consistent propagation across multiple entities.

493-494: DCGM connectivity failure event correctly asserts processingStrategy. The test verifies the processingStrategy field is set correctly for connectivity failure events.

523-549: Good coverage of the EXECUTE_REMEDIATION strategy. This test uses EXECUTE_REMEDIATION (line 523) and verifies the restored event carries this strategy (line 549), providing coverage for both strategy values in the test suite.

595-604: Retry test correctly updated with processing_strategy. The cache cleanup and retry test includes the processing_strategy parameter, maintaining consistency with other test cases.
health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py (4)
24-24: LGTM! The alias `platformconnector_pb2` clearly indicates the module's purpose and is used consistently throughout the file.

28-51: LGTM! The function signature correctly includes the new `processing_strategy` parameter with proper type hints per coding guidelines, and the parameter is properly propagated to `PlatformConnectorEventProcessor`.

81-91: LGTM! The new `processing_strategy` parameter is correctly added to the CLI function signature, consistent with the existing parameter style.

137-150: LGTM! The `processing_strategy_value` is correctly propagated to the event processor initialization, matching the expected protobuf enum type.