@ivelichkovich
Contributor

@ivelichkovich ivelichkovich commented Dec 11, 2025

Summary

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • [ X ] 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • [ X ] Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

  • New Features
    • Multi‑template remediation with templated maintenance resources, per‑node owner references, and orchestrated log‑collector jobs; public node annotation & remediation client APIs.
  • Improvements
    • Unified controller startup with HTTP auditing, consistent health/readiness probes, exported Prometheus metrics, and tightened RBAC watches/updates.
  • Chores
    • Simplified startup flags and Helm chart defaults; removed legacy startup paths and conditional runtime flags.
  • Tests
    • Extensive new and updated unit and e2e tests covering remediation, annotation, and log‑collector workflows.

✏️ Tip: You can customize this high-level summary in your review settings.

@copy-pr-bot

copy-pr-bot bot commented Dec 11, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Dec 11, 2025

Walkthrough

Unifies startup onto a controller-runtime Manager, registers core/v1 and batch/v1 schemes, adds HTTP auditing, consolidates initialization into a single controller-runtime flow, and splits/removes legacy reconciler internals into new annotation, remediation, crstatus, events, and metrics packages with accompanying tests and templates.

Changes

Cohort: Controller runtime & startup
File(s): fault-remediation/main.go, distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml, distros/kubernetes/nvsentinel/values*.yaml
Summary: Remove dual init path and ctrlRuntimeEnabled flag; add setupCtrlRuntimeManagement single flow creating controller-runtime Manager, register core/v1 and batch/v1 schemes, wrap HTTP transport with auditing, configure health/metrics/leader-election and probe args.

Cohort: Annotation package
File(s): fault-remediation/pkg/annotation/annotation_interface.go, fault-remediation/pkg/annotation/annotation.go, fault-remediation/pkg/annotation/annotation_test.go
Summary: New AnnotationKey, RemediationStateAnnotation, EquivalenceGroupState, NodeAnnotationManagerInterface and NodeAnnotationManager with Get/Update/Clear/RemoveGroup methods; unit tests using fake controller-runtime client.

Cohort: Remediation package
File(s): fault-remediation/pkg/remediation/fault_remediation_client_interface.go, fault-remediation/pkg/remediation/remediation.go, fault-remediation/pkg/remediation/remediation_test.go, fault-remediation/pkg/remediation/templates/*
Summary: New FaultRemediationClient and TemplateData; multi-template loading/rendering, action/template selection, ownerRef handling, create unstructured CRs, update node annotations, orchestrate log-collector Job lifecycle (create/reuse/poll/metrics), support dry-run; templates and tests added.

Cohort: CR status checking
File(s): fault-remediation/pkg/crstatus/checker.go, fault-remediation/pkg/crstatus/crstatus_interface.go, fault-remediation/pkg/crstatus/crstatus_test.go
Summary: Replace dynamic client & RESTMapper with controller-runtime client.Client; use unstructured.Unstructured with GVK + client.Get; add CRStatusCheckerInterface and adjust constructor/signatures.

Cohort: Initializer
File(s): fault-remediation/pkg/initializer/init.go
Summary: InitializationParams now includes *rest.Config; InitializeAll updated to accept controller-runtime client; create RemediationClient and StateManager from manager config/client and wire reconciler with manager client.

Cohort: Reconciler & tests
File(s): fault-remediation/pkg/reconciler/reconciler.go, fault-remediation/pkg/reconciler/*_test.go, fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Summary: Reconciler refactored to use public Config, annotation.NodeAnnotationManagerInterface, remediation.FaultRemediationClientInterface, and crstatus interface; event types moved to events; method signatures adjusted (many now return ctrl.Result, error); tests updated to use public interfaces and metrics exports.

Cohort: Events & metrics
File(s): fault-remediation/pkg/events/health_event.go, fault-remediation/pkg/metrics/metrics.go
Summary: Add HealthEventDoc / HealthEventData wrappers; move metrics into metrics package and export metric symbols (CamelCase).

Cohort: Removed legacy reconciler internals & tests
File(s): fault-remediation/pkg/reconciler/annotation.go (deleted), fault-remediation/pkg/reconciler/remediation.go (deleted), related tests removed
Summary: Remove old reconciler-embedded annotation and remediation implementations and their tests; functionality replaced by new annotation and remediation packages.

Cohort: Infra & misc
File(s): .gitignore, distros/.../templates/clusterrole.yaml, tilt/csp-api-mock/Tiltfile, fault-remediation/go.mod, .github/workflows/e2e-test.yml, tests/helpers/kube.go, distros/.../files/log-collector-job.yaml, tests/Makefile
Summary: .idea/ ignore simplified; RBAC verbs expanded (watch/update); tilt host port changed; golang.org/x/sync moved to indirect require; e2e debug artifact for jobs added; job label added; test runner limited to a single test; small test logging tweak.

Sequence Diagram(s)

sequenceDiagram
  participant Manager as Manager
  participant Reconciler as Reconciler
  participant Remediation as RemediationClient
  participant K8s as K8sAPI
  participant Datastore as Datastore

  Manager->>Reconciler: deliver watcher event (EventWithToken)
  Reconciler->>Datastore: read/ack event (parse EventWithToken)
  Reconciler->>Remediation: CreateMaintenanceResource(ctx, HealthEventData)
  Remediation->>K8s: render template & create Unstructured CR (with OwnerReference)
  K8s-->>Remediation: 201 / AlreadyExists / error
  Remediation->>K8s: patch node annotation (remediation state)
  Remediation->>K8s: create/reconcile LogCollector Job
  K8s-->>Remediation: Job status updates (complete/failed/timeout)
  Reconciler-->>Manager: return ctrl.Result (requeue/success)

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

🐰 I hopped through code with nimble paws,
Templates stitched and annotations stored,
A Manager woke and jobs took flight,
Logs whispered secrets through the night,
A rabbit's patch, systems restored.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
Check name: Docstring Coverage
Status: ⚠️ Warning
Explanation: Docstring coverage is 24.24% which is insufficient. The required threshold is 80.00%.
Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
Check name: Description Check
Status: ✅ Passed
Explanation: Check skipped - CodeRabbit’s high-level summary is enabled.

Check name: Title check
Status: ✅ Passed
Explanation: The title clearly and concisely summarizes the main change: migrating remediation business logic to use controller-runtime.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Collaborator

@lalitadithya lalitadithya left a comment

overall, I like the direction! Let's keep pushing forward on this

@ivelichkovich ivelichkovich force-pushed the remediationerrors branch 4 times, most recently from 5d99015 to f0f60b1 Compare January 6, 2026 02:01
@ivelichkovich ivelichkovich marked this pull request as ready for review January 6, 2026 02:01
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 9

Fix all issues with AI Agents 🤖
In @.idea/NVSentinel.iml:
- Around line 1-4: The repository is tracking IDE files .idea/NVSentinel.iml and
.idea/vcs.xml; update .gitignore to add a blanket ignore for .idea/ (or
uncomment/add the “.idea/” entry) so .iml and VCS configs are excluded, then
remove the tracked files from Git with git rm --cached .idea/NVSentinel.iml
.idea/vcs.xml and commit the change to stop committing IDE-specific config
moving forward.

In @fault-remediation/main.go:
- Around line 171-182: The retry loop around
components.FaultRemediationReconciler.Reconcile does not check gCtx cancellation
and can block shutdown; modify the loop to respect gCtx.Done() by exiting early
when the context is cancelled (check gCtx.Err() or select on gCtx.Done() before
each retry and before sleeping), return or break out of the loop when cancelled,
and avoid sleeping unconditionally by using a context-aware wait (select on
time.After(components.FaultRemediationReconciler.Config.UpdateRetryDelay) and
gCtx.Done()). Ensure you still log the last error but stop retrying if gCtx is
cancelled.

In @fault-remediation/pkg/annotation/annotation.go:
- Around line 66-74: GetRemediationState can fail leaving node nil which later
causes a panic when node.DeepCopy() is called; update the error path in
GetRemediationState handling so that when err != nil you either return the error
immediately or create a safe non-nil node placeholder before continuing (e.g.,
instantiate a new corev1.Node or ensure callers check for nil), and ensure
callers of RemediationStateAnnotation logic (where node.DeepCopy() is invoked)
only call DeepCopy on a non-nil node; reference GetRemediationState,
RemediationStateAnnotation, and node.DeepCopy to locate and fix the nil check
and handling.

In @fault-remediation/pkg/crstatus/checker.go:
- Around line 58-66: The ObjectKey for the GET only sets Name and omits
Namespace, which breaks lookups for namespaced CRs; update the client.ObjectKey
creation to include the CR namespace (e.g., client.ObjectKey{Name: crName,
Namespace: maintenanceResource.Namespace} or the appropriate local variable
holding the namespace) before calling c.client.Get, and ensure the warning log
(slog.Warn) also includes the namespace for clearer diagnostics; if the resource
is cluster-scoped, allow the Namespace to be empty when constructing the
ObjectKey.

In @fault-remediation/pkg/remediation/deprecated_remediation.go:
- Around line 266-293: The code calls AnnotationManager.UpdateRemediationState
twice: first using crName and then again using actualCRName after computing
actualCRName := createdCR.GetName(); remove the redundant first update (the
block that uses crName) and keep only the second update that sets the annotation
to actualCRName; ensure the logic still checks group :=
common.GetRemediationGroupForAction(healthEvent.RecommendedAction) and
c.AnnotationManager != nil before calling
AnnotationManager.UpdateRemediationState so the annotation is updated once with
the real CR name (refer to createdCR.GetName, crName, actualCRName, and
AnnotationManager.UpdateRemediationState to locate the code).

In @fault-remediation/pkg/remediation/remediation_test.go:
- Around line 100-127: Two test cases in remediation_test.go use the same name
"Successful rebootnode creation", causing ambiguous test output; change the
`name` field for one or both cases to be unique (e.g. include dryRun state) so
they read distinct names like "Successful rebootnode creation - non-dry-run" and
"Successful rebootnode creation - dry-run"; update the `name` values inside the
table-driven test entries that contain `nodeName: "test-node-1"` and `dryRun:
false`/`true` (the test case structs shown in the diff) to avoid duplicate test
case names.

In @fault-remediation/pkg/remediation/remediation.go:
- Around line 304-317: Between checking existingJobs.Items and calling
c.client.Create(ctx, job) there is a race: another reconcile may create the job
causing Create to return an IsAlreadyExists error; update the creation path in
the function handling existingJobs/ Create so that after err :=
c.client.Create(ctx, job) you check for apierrors.IsAlreadyExists(err) (or the
equivalent IsAlreadyExists helper your codebase uses) and treat that case as
success by returning ctrl.Result{RequeueAfter: 10 * time.Second}, nil (or
proceed as if job was created), while still returning other errors unchanged;
reference symbols: existingJobs.Items, c.client.Create, job, and
apierrors.IsAlreadyExists.
- Around line 183-208: The first annotation update block duplicates the later,
authoritative update and uses the provisional crName; remove the initial call
that checks common.GetRemediationGroupForAction(healthEvent.RecommendedAction)
and calls c.annotationManager.UpdateRemediationState with crName, and keep only
the second update after obtaining actualCRName from maintenance.GetName so the
node annotation is set once with the real CR name (use the existing
healthEvent.NodeName, group and actualCRName with
c.annotationManager.UpdateRemediationState).
🧹 Nitpick comments (15)
.idea/vcs.xml (1)

1-6: IDE configuration files should be gitignored, not committed.

The .idea/vcs.xml file is an IntelliJ IDEA configuration artifact that is local to each developer's environment and should not be committed to version control. These files are typically auto-generated by the IDE, vary across developers, and create unnecessary noise and merge conflicts.

Add .idea/ (or at minimum .idea/vcs.xml) to .gitignore instead of committing this file. If the file was committed inadvertently, consider removing it from the repository with git rm --cached .idea/vcs.xml.

fault-remediation/pkg/annotation/annotation.go (2)

40-40: TODO comment should reference an issue.

Per coding guidelines, TODO comments should reference issues for tracking.


51-57: Silent error recovery may mask annotation data corruption.

Returning an empty state when JSON unmarshaling fails is defensive, but it silently discards potentially corrupted data. Consider returning the error to allow callers to decide how to handle it, or at minimum add a metric/alert for this condition.
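One way to surface the failure instead of hiding it, sketched with `encoding/json` only. The struct shapes, JSON field names, and the `parseState` helper are assumptions made for illustration and are not taken from the PR; only the idea of returning the unmarshal error to the caller comes from the comment above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// groupState and remediationState are hypothetical shapes for the annotation
// payload; the real field names may differ.
type groupState struct {
	MaintenanceCR string `json:"maintenanceCR"`
}

type remediationState struct {
	EquivalenceGroups map[string]groupState `json:"equivalenceGroups"`
}

// parseState returns an empty state when the annotation is absent, and an
// error (rather than an empty state) when the stored JSON cannot be decoded.
func parseState(annotations map[string]string, key string) (remediationState, error) {
	state := remediationState{EquivalenceGroups: map[string]groupState{}}

	raw, ok := annotations[key]
	if !ok || raw == "" {
		return state, nil
	}
	if err := json.Unmarshal([]byte(raw), &state); err != nil {
		return remediationState{}, fmt.Errorf("parse annotation %q: %w", key, err)
	}
	if state.EquivalenceGroups == nil {
		// A JSON null would otherwise leave a nil map behind.
		state.EquivalenceGroups = map[string]groupState{}
	}
	return state, nil
}

func main() {
	const key = "latestFaultRemediationState"

	good := map[string]string{key: `{"equivalenceGroups":{"restart":{"maintenanceCR":"cr-1"}}}`}
	state, err := parseState(good, key)
	fmt.Println(state.EquivalenceGroups["restart"].MaintenanceCR, err)

	corrupt := map[string]string{key: `{not json`}
	_, err = parseState(corrupt, key)
	fmt.Println(err != nil)
}
```

With this shape the caller decides whether a corrupted annotation should block remediation, be overwritten, or only be counted in a metric.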

fault-remediation/pkg/remediation/fault_remediation_client_interface.go (2)

29-34: Add godoc comment for exported interface.

Per coding guidelines, exported functions and types require documentation comments. The interface methods would benefit from brief descriptions of their behavior and return semantics.

🔎 Proposed fix
+// FaultRemediationClientInterface defines the contract for fault remediation operations
+// including maintenance resource creation, log collection, and access to annotation/status components.
 type FaultRemediationClientInterface interface {
 	CreateMaintenanceResource(ctx context.Context, healthEventData *events.HealthEventData) (string, error)
 	RunLogCollectorJob(ctx context.Context, nodeName string, eventId string) (ctrl.Result, error)
 	GetAnnotationManager() annotation.NodeAnnotationManagerInterface
 	GetStatusChecker() crstatus.CRStatusCheckerInterface
 }

31-31: Use eventID instead of eventId for Go naming conventions.

Go convention for acronyms in identifiers is to use all caps (e.g., eventID, httpURL).

fault-remediation/pkg/crstatus/crstatus_interface.go (1)

7-9: Add godoc and use named parameters for clarity.

The interface lacks documentation. The unnamed string parameter is unclear - is it a CR name, node name, or identifier? Named parameters improve readability and self-documentation.

🔎 Proposed fix
+// CRStatusCheckerInterface defines the contract for checking CR status
+// to determine if creation should be skipped (e.g., when remediation is in progress).
 type CRStatusCheckerInterface interface {
-	ShouldSkipCRCreation(context.Context, string) bool
+	ShouldSkipCRCreation(ctx context.Context, crName string) bool
 }
fault-remediation/pkg/crstatus/crstatus_test.go (1)

120-212: Consider consolidating duplicate test cases.

TestCheckConditionCtrlRuntime duplicates the exact same test cases as TestCheckCondition. Consider extracting the shared test cases into a variable and reusing them, or using a parameterized approach that tests both checker types.

🔎 Example consolidation
var checkConditionTestCases = []struct {
    name     string
    cr       *unstructured.Unstructured
    expected bool
}{
    // ... shared test cases
}

func TestCheckCondition(t *testing.T) {
    cfg := &config.MaintenanceResource{CompleteConditionType: "Completed"}
    checker := NewCRStatusChecker(nil, nil, cfg, false)
    for _, tt := range checkConditionTestCases {
        t.Run(tt.name, func(t *testing.T) {
            assert.Equal(t, tt.expected, checker.checkCondition(tt.cr))
        })
    }
}

func TestCheckConditionCtrlRuntime(t *testing.T) {
    cfg := &config.MaintenanceResource{CompleteConditionType: "Completed"}
    checker := NewCtrlRuntimeCRStatusChecker(nil, cfg, false)
    for _, tt := range checkConditionTestCases {
        t.Run(tt.name, func(t *testing.T) {
            assert.Equal(t, tt.expected, checker.checkCondition(tt.cr))
        })
    }
}
fault-remediation/pkg/annotation/annotation_interface.go (1)

10-13: Consider adding a domain prefix to the annotation key.

Kubernetes best practices recommend using a domain prefix for custom annotations (e.g., nvsentinel.nvidia.com/latestFaultRemediationState) to avoid collisions with other tools and clearly indicate ownership.

🔎 Proposed fix
 const (
 	// AnnotationKey is the key for the node annotation that tracks remediation state
-	AnnotationKey = "latestFaultRemediationState"
+	AnnotationKey = "nvsentinel.nvidia.com/latestFaultRemediationState"
 )
fault-remediation/pkg/events/health_event.go (1)

1-3: Add package-level documentation.

Per coding guidelines, package-level godoc is required for all Go packages. Consider adding a brief description of what the events package provides.

Suggested documentation
+// Package events provides health event data types for fault remediation workflows.
 package events
 
 import "github.com/nvidia/nvsentinel/data-models/pkg/model"
fault-remediation/pkg/crstatus/deprecated_checker.go (3)

15-15: Missing package-level documentation.

Per coding guidelines, package-level godoc is required for all Go packages. Add a package comment describing the purpose of this package.

🔎 Suggested fix
+// Package crstatus provides functionality for checking the status of Custom Resources
+// to determine whether maintenance operations should be skipped based on existing CR state.
 package crstatus

50-75: Consider returning error for REST mapping failures instead of silently allowing creation.

When RESTMapping fails (lines 62-66), the method logs an error but returns false, which allows CR creation to proceed. This could mask configuration issues. For the PR's goal of "throw errors to trigger retries," consider propagating this error.


77-91: Clarify the return value semantics.

When status or conditions are not found (lines 79-86), the method returns true (meaning "skip creation"). This can look counterintuitive, although skipping creation is correct when the CR exists but has no status yet. A brief comment explaining this logic would improve maintainability.

fault-remediation/pkg/remediation/remediation.go (1)

362-387: Potential nil pointer dereference when checking job annotations.

Line 367 checks job.Annotations != nil && job.Annotations[...]. For a job that was just created, Annotations may be nil; in that case the annotation is absent and the metrics are recorded, so the nil check itself is correct. The real issue is ordering: if the annotation update fails (lines 375-378), the function returns false, err, and a retry can record the metrics a second time. Consider recording the metrics only after the annotation update succeeds.

fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)

356-358: TODO comments indicate incomplete state transition handling.

Multiple TODO comments (lines 356-357, 418-419, 476-478, 507-508, 559-561) indicate that error handling for state transitions is being ignored. These should be tracked as follow-up work to ensure proper state management.

Do you want me to open a new issue to track these TODO items for proper state transition error handling?

fault-remediation/pkg/remediation/deprecated_remediation.go (1)

379-391: Missing labels on Job template spec.

Labels are set on job.Labels (line 384) but not on job.Spec.Template.Labels. This could affect label-based job selection in some scenarios, though the current List with MatchingLabels queries Job objects directly.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fd4466e and f0f60b1.

📒 Files selected for processing (24)
  • .idea/NVSentinel.iml
  • .idea/vcs.xml
  • fault-remediation/main.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/annotation/deprecated_annotation.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/crstatus/deprecated_checker.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/fault_remediation_client_interface.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
  • fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
💤 Files with no reviewable changes (1)
  • fault-remediation/pkg/reconciler/fault_remediation_client_interface.go
🧰 Additional context used
πŸ““ Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/annotation/deprecated_annotation.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/main.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/crstatus/deprecated_checker.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
🧠 Learnings (16)
πŸ“š Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.

Applied to files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
📚 Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/annotation/deprecated_annotation.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.

Applied to files:

  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules

Applied to files:

  • fault-remediation/pkg/initializer/init.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use structured logging via `log/slog` in Go code

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-10-29T12:40:29.621Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 143
File: fault-quarantine-module/pkg/informer/k8s_client.go:52-67
Timestamp: 2025-10-29T12:40:29.621Z
Learning: The clientcmd.BuildConfigFromFlags function in k8s.io/client-go/tools/clientcmd automatically handles in-cluster configuration as a fallback. When both masterUrl and kubeconfigPath parameters are empty strings, it internally attempts rest.InClusterConfig() before falling back to default config loading rules. No explicit in-cluster fallback logic is needed when using this function.

Applied to files:

  • fault-remediation/main.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
📚 Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use separate informers for different Kubernetes resource types

Applied to files:

  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods

Applied to files:

  • fault-remediation/pkg/remediation/deprecated_remediation.go
🧬 Code graph analysis (9)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (4)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventData (11-14)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • NodeAnnotationManagerInterface (16-21)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
  • CRStatusCheckerInterface (7-9)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (18-24)
fault-remediation/pkg/initializer/init.go (3)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (2)
  • FaultRemediationClientInterface (29-34)
  • TemplateData (37-44)
fault-remediation/pkg/remediation/remediation.go (1)
  • NewCtrlRuntimeRemediationClient (47-95)
commons/pkg/statemanager/statemanager.go (1)
  • StateManager (197-200)
fault-remediation/pkg/annotation/deprecated_annotation.go (1)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (24-26)
  • AnnotationKey (12-12)
  • EquivalenceGroupState (29-32)
fault-remediation/pkg/annotation/annotation.go (1)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (24-26)
  • AnnotationKey (12-12)
  • EquivalenceGroupState (29-32)
fault-remediation/main.go (3)
fault-remediation/pkg/initializer/init.go (2)
  • InitializationParams (38-44)
  • InitializeAll (51-165)
fault-remediation/pkg/reconciler/reconciler.go (1)
  • FaultRemediationReconciler (61-69)
commons/pkg/auditlogger/roundtripper.go (1)
  • NewAuditingRoundTripper (42-47)
fault-remediation/pkg/crstatus/crstatus_test.go (2)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (18-24)
fault-remediation/pkg/crstatus/checker.go (1)
  • NewCtrlRuntimeCRStatusChecker (34-44)
fault-remediation/pkg/crstatus/deprecated_checker.go (1)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (18-24)
fault-remediation/pkg/crstatus/checker.go (1)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (18-24)
fault-remediation/pkg/remediation/deprecated_remediation.go (8)
fault-remediation/pkg/config/config.go (1)
  • Template (27-30)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • NodeAnnotationManagerInterface (16-21)
fault-remediation/pkg/crstatus/deprecated_checker.go (2)
  • CRStatusChecker (29-34)
  • NewCRStatusChecker (36-48)
fault-remediation/pkg/annotation/deprecated_annotation.go (1)
  • NewNodeAnnotationManager (38-42)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
  • CRStatusCheckerInterface (7-9)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventData (11-14)
fault-remediation/pkg/common/equivalence_groups.go (1)
  • GetRemediationGroupForAction (35-45)
fault-remediation/pkg/metrics/metrics.go (3)
  • LogCollectorErrors (86-92)
  • LogCollectorJobs (71-77)
  • LogCollectorJobDuration (78-85)
🪛 YAMLlint (1.37.1)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml

[error] 15-15: syntax error: expected , but found ''

(syntax)

🔇 Additional comments (42)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml (1)

15-20: LGTM - Template structure is correct.

The Go template syntax with {{.ApiGroup}}/{{.Version}} is appropriate for template rendering. The YAMLlint error is expected since YAML linters cannot parse Go template placeholders until they are rendered.

Consider whether a namespace field should be added to metadata if this is a namespaced resource, or document that it's cluster-scoped.

fault-remediation/pkg/remediation/templates/log-collector-job.yaml (1)

16-32: Hardcoded test values - clarify if this is a test fixture or production template.

The namespace "test" and image "test:test" appear to be placeholder values. If this template is intended for production use (like rebootnode-template.yaml), these should be Go template placeholders (e.g., {{.Namespace}}, {{.Image}}). If this is purely a test fixture, consider moving it to a test data directory or adding a comment to clarify its purpose.

fault-remediation/pkg/annotation/annotation_interface.go (1)

15-32: LGTM - Clean interface and type definitions.

The interface is well-designed with clear method signatures. The data structures use appropriate JSON tags for serialization. Returning *corev1.Node alongside the state from GetRemediationState is a pragmatic choice to avoid redundant API calls in callers.

fault-remediation/pkg/events/health_event.go (1)

5-14: LGTM on struct definitions.

The two types appropriately separate JSON and BSON serialization concerns. Consider adding a comment for HealthEventDoc similar to the one on HealthEventData for consistency.

fault-remediation/main.go (2)

194-197: LGTM on auditing round tripper integration.

The HTTP transport is correctly wrapped with the auditing round tripper before manager creation, ensuring all Kubernetes API calls are audited.


228-239: LGTM on initialization and cleanup flow.

Components are properly initialized with the manager's client, and cleanup is correctly deferred to ensure datastore resources are released on exit.

fault-remediation/pkg/initializer/init.go (2)

58-60: LGTM on validation logic.

The guard correctly ensures a ctrl-runtime client is provided when ctrl-runtime mode is enabled, preventing runtime errors from missing dependencies.


88-112: LGTM on dual-mode client initialization.

The branching logic cleanly separates ctrl-runtime and Kubernetes client initialization paths, with appropriate error handling for each.

fault-remediation/pkg/remediation/remediation_test.go (2)

26-81: LGTM on client creation tests.

Good table-driven test coverage for template validation scenarios, including file existence checks and dry-run mode configuration.


217-413: LGTM on log collector job tests.

Comprehensive coverage of job lifecycle scenarios including creation, completion, failure, timeout, and duplicate job handling. The assertions properly verify both error conditions and expected job counts.

fault-remediation/pkg/metrics/metrics.go (2)

15-21: LGTM on package refactor.

Moving metrics to a dedicated package with exported identifiers enables cleaner separation of concerns and allows multiple packages to record metrics.


29-92: LGTM on exported metric variables.

Metrics are correctly exported while preserving the Prometheus metric names for backward compatibility with existing dashboards and alerts.

fault-remediation/pkg/annotation/deprecated_annotation.go (2)

82-125: LGTM on GetRemediationState signature change.

Returning the fetched node enables callers to avoid redundant API calls when they need both the annotation state and node object. The retry logic with isRetryableError properly handles transient failures.


44-60: LGTM on patchNodeWithRetry implementation.

The retry logic correctly uses exponential backoff via retry.DefaultRetry and appropriately logs retryable errors before continuing.

fault-remediation/pkg/reconciler/reconciler.go (5)

66-67: LGTM on public Config field and interface-based annotation manager.

Making Config public enables access from main.go for retry configuration. Using annotation.NodeAnnotationManagerInterface improves testability and supports both legacy and ctrl-runtime implementations.


170-193: LGTM on runLogCollector refactor.

Returning ctrl.Result and error allows proper propagation of requeue requests and errors for retry handling, aligning with the ctrl-runtime pattern.


237-244: Good use of errors.Join for combining errors.

Using errors.Join to combine createMaintenanceResourceError and label update errors ensures both failures are visible in logs and upstream error handling.


406-410: Verify error handling change aligns with retry intent.

Line 409 now returns the error instead of continuing, which will trigger retries. Per the coding guidelines, within retry.RetryOnConflict blocks, errors should not be wrapped to preserve retry behavior. However, this is outside such a block, so the unwrapped error is appropriate here.


438-441: Error propagation on RemoveGroupFromState failure.

Returning an error here will trigger retries when annotation cleanup fails. This is the intended behavior per the PR discussion to "throw error to trigger retry." The empty CR name return prevents false positive CR existence checks on retry.

fault-remediation/pkg/crstatus/deprecated_checker.go (2)

29-48: LGTM!

The struct definition and constructor follow Go conventions with proper field initialization. The use of *restmapper.DeferredDiscoveryRESTMapper aligns with the k8s.io/client-go patterns for dynamic resource mapping.


93-112: LGTM!

The findConditionStatus and isTerminal methods correctly implement the condition-checking logic. Terminal states ("True" or "False") properly indicate completion, while empty or "Unknown" states allow for CR creation retry.
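
As a rough illustration of the two helpers described above, the sketch below works on plain maps rather than unstructured.Unstructured so that it stays stdlib-only. The function bodies are an assumption about the shape of the logic, not a copy of the PR's code.

```go
package main

import "fmt"

// findConditionStatus returns the "status" of the first condition whose
// "type" equals conditionType, or "" when no such condition exists.
func findConditionStatus(conditions []any, conditionType string) string {
	for _, c := range conditions {
		cond, ok := c.(map[string]any)
		if !ok {
			continue // skip malformed entries
		}

		if t, _ := cond["type"].(string); t == conditionType {
			status, _ := cond["status"].(string)
			return status
		}
	}

	return ""
}

// isTerminal reports whether a condition status is final. "True" and
// "False" are terminal; "" and "Unknown" are not.
func isTerminal(status string) bool {
	return status == "True" || status == "False"
}

func main() {
	conditions := []any{
		map[string]any{"type": "Ready", "status": "Unknown"},
		map[string]any{"type": "Completed", "status": "True"},
	}

	status := findConditionStatus(conditions, "Completed")
	fmt.Println(status, isTerminal(status))
	fmt.Println(isTerminal(findConditionStatus(conditions, "Ready")))
}
```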

fault-remediation/pkg/remediation/deprecated_remediation_test.go (5)

15-40: LGTM!

The package rename to remediation and import updates align with the PR's restructuring. The test file properly imports the new events package for HealthEventData.


197-251: Inconsistent capitalization in test strings.

Lines 198, 200, 210, 217, and 227 use inconsistent capitalization for "Config" (e.g., "in-cluster Config" vs typical "config"). This appears intentional per AI summary, but verify this matches actual error messages from the Kubernetes client libraries.


306-344: LGTM!

The test setup correctly uses the renamed public fields (Clientset, KubeClient, RestMapper, DryRunMode, Template, TemplateData) and properly configures the mock client for testing CR creation.


346-369: LGTM!

The test correctly uses events.HealthEventData and validates the updated CreateMaintenanceResource signature returning (string, error). The assertion logic properly handles both success and failure cases.


373-418: LGTM!

Tests for RunLogCollectorJob correctly use the updated signature with eventId parameter and handle the (interface{}, error) return pattern.

fault-remediation/pkg/crstatus/checker.go (2)

28-44: LGTM!

The struct and constructor properly implement the controller-runtime based CR status checker with appropriate field initialization.


71-105: LGTM!

The condition-checking methods are identical to the deprecated version, maintaining behavioral parity between the two implementations.

fault-remediation/pkg/remediation/remediation.go (2)

38-95: LGTM!

The CtrlRuntimeRemediationClient struct and constructor are well-structured. Template loading, dry-run configuration, and dependency initialization (annotation manager, status checker) are properly handled with appropriate error checking.


437-478: LGTM!

The timeout checking logic with configurable LOG_COLLECTOR_TIMEOUT environment variable and proper fallback to default is well implemented. The annotation-based guard against duplicate metrics recording is a good pattern.

fault-remediation/pkg/reconciler/reconciler_e2e_test.go (3)

208-217: LGTM!

The controller-runtime Manager setup with envtest follows best practices. Disabling metrics binding with BindAddress: "0" is appropriate for tests.


302-315: LGTM!

The createTestRemediationClient helper properly constructs remediation.TemplateData and uses NewCtrlRuntimeRemediationClient with the controller-runtime client.


880-888: LGTM!

The metrics assertions correctly use the dedicated metrics package constants and verify that events are properly counted across different status types (created, skipped).

fault-remediation/pkg/remediation/deprecated_remediation.go (3)

62-74: LGTM!

The struct field exports align with the API surface changes documented in the AI summary. The nodeExistsFunc override for testing is a good pattern.


165-171: LGTM!

The accessor methods properly return the interface types, enabling dependency injection and testing.


506-518: Intentional nil error return for non-fatal log collector failures.

The code explicitly returns nil error for timeout (line 509) and job completion/failure (line 517) to allow reconciliation to continue. This aligns with the PR description's intent. The slog.Error calls ensure visibility into these issues.

fault-remediation/pkg/reconciler/reconciler_test.go (6)

40-66: LGTM!

The MockK8sClient properly implements the updated interface signatures with events.HealthEventData, ctrl.Result, and the accessor methods returning interface types.


102-134: LGTM!

The MockNodeAnnotationManager correctly implements the updated GetRemediationState signature returning (*annotation.RemediationStateAnnotation, *corev1.Node, error) and uses the annotation package types.


190-234: LGTM!

The TestNewReconciler test properly uses table-driven testing and validates both dry-run enabled and disabled scenarios with the updated return signature.


336-390: LGTM!

The TestPerformRemediationWithSuccess test correctly validates the success path with the updated API, including the HealthEventDoc conversion and CR name assertion.


601-666: LGTM!

The TestRunLogCollectorJobErrorScenarios test properly validates the ctrl.Result return pattern including the requeue scenario with RequeueAfter.


946-1009: LGTM!

The TestLogCollectorOnlyCalledWhenShouldCreateCR test validates the fix for Issue #441, ensuring log collector is only called when shouldCreateCR is true to prevent duplicate jobs.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 14

🤖 Fix all issues with AI Agents
In @.gitignore:
- Line 143: The .gitignore now contains a blanket `.idea/` entry but still keeps
redundant granular `.idea/*` patterns (e.g., `.idea/mongoSettings.xml`,
`.idea/replstate.xml`, `.idea/**/sonarlint/`) which are unnecessary; remove all
specific `.idea/` file and subdirectory patterns that are subsumed by the
`.idea/` rule so the file is clean and maintainable while keeping the single
`.idea/` line.

In @fault-remediation/main.go:
- Around line 171-182: The retry loop around
components.FaultRemediationReconciler.Reconcile does not observe gCtx
cancellation; modify the loop that uses UpdateMaxRetries and UpdateRetryDelay to
break/return when gCtx is done by selecting on gCtx.Done() before retrying:
after a failed Reconcile, replace the unconditional time.Sleep(...) with a
context-aware wait using select { case
<-time.After(components.FaultRemediationReconciler.Config.UpdateRetryDelay): /*
continue */ case <-gCtx.Done(): /* exit loop/return to allow graceful shutdown
*/ }, and also check <-gCtx.Done() at the top of the retry iteration to avoid
starting another attempt when the context is cancelled.

In @fault-remediation/pkg/annotation/annotation.go:
- Around line 66-74: GetRemediationState may fail and leave `node` empty; do not
swallow that error and continue using `node.DeepCopy()`. In the block handling
`state, node, err := m.GetRemediationState(ctx, nodeName)` remove the fallback
that creates an empty `RemediationStateAnnotation` and instead return the error
immediately (propagate `err`) so callers won't operate on an invalid `node`;
keep `RemediationStateAnnotation` initialization only where a valid node/state
is present and ensure any subsequent use of `node.DeepCopy()` occurs after a
successful GetRemediationState.

In @fault-remediation/pkg/crstatus/checker.go:
- Around line 58-66: The lookup uses client.ObjectKey with only Name, causing
failures for namespaced CRs; include the resource namespace when building the
key so c.client.Get(ctx, key, obj) can find namespaced objects. Update the
ObjectKey construction (used before calling c.client.Get) to set Namespace from
the MaintenanceResource (e.g., maintenanceResource.Namespace or the variable
holding the config.Namespace) while keeping Name as crName, so both namespaced
and cluster-scoped CRs are handled.

In @fault-remediation/pkg/crstatus/deprecated_checker.go:
- Around line 77-91: The checkCondition logic is inverted: in
CRStatusChecker.checkCondition you should treat missing status/conditions as
non-terminal (do not skip) and consider terminal condition statuses as reasons
to skip; change the early returns for failed/unfound unstructured.NestedMap and
NestedSlice to return false, and replace the final return of
"!c.isTerminal(conditionStatus)" with "c.isTerminal(conditionStatus)"; keep
using findConditionStatus and isTerminal to compute the conditionStatus and
determine terminality.

In @fault-remediation/pkg/events/health_event.go:
- Around line 5-8: Add a godoc comment immediately above the exported type
HealthEventDoc that briefly describes what the struct represents (e.g., a
persistent/document representation of a health event), mentions its ID field and
embedded model.HealthEventWithStatus, and any important JSON serialization
behavior; place the comment directly above the HealthEventDoc declaration so it
satisfies Go documentation guidelines.
- Around line 1-3: Add a package-level godoc comment at the top of
health_event.go describing the purpose and responsibilities of the events
package (e.g., what types of events it models and how callers should use it).
Insert a single-line or multi-line comment beginning with "Package events"
immediately above the package events declaration so the package-level
documentation appears in godoc and satisfies the coding guidelines.

In @fault-remediation/pkg/initializer/init.go:
- Line 114: The current slog.Info("Successfully initialized k8s client") is
misleading when UseCtrlRuntime is true because no k8s client is created in that
mode; change the logging around where UseCtrlRuntime is checked (referencing the
UseCtrlRuntime flag/variable and the slog.Info call) to emit a conditional
message: if UseCtrlRuntime is true log something like "Running in
controller-runtime mode; no standalone k8s client initialized", otherwise keep
"Successfully initialized k8s client". Ensure you update only the message logic
near the existing slog.Info invocation so it accurately reflects which path ran.

In @fault-remediation/pkg/metrics/metrics.go:
- Line 30: Update the TODO in metrics.go to reference a tracking issue: replace
the existing "//TODO: evaluate and remove redundant metrics with ctrl-runtime
defaults" with a TODO that includes the issue ID and brief context (e.g. "//
TODO(issue-1234): evaluate and remove redundant metrics with ctrl-runtime
defaults") so the task is traceable; ensure the issue ID matches the repo's
issue tracker and keep the explanatory text unchanged beyond adding the issue
reference.

In @fault-remediation/pkg/remediation/deprecated_remediation.go:
- Around line 266-293: The code updates the node annotation twice: once using
crName and again using actualCRName; remove the first UpdateRemediationState
call (the block that uses crName immediately after computing group :=
common.GetRemediationGroupForAction(...)) so only the subsequent update that
uses actualCRName := createdCR.GetName() remains; ensure you keep the group
computation and the second UpdateRemediationState call (and its error handling)
intact and remove the duplicate slog.Warn/return branch associated with the
first update.

In @fault-remediation/pkg/remediation/remediation.go:
- Around line 310-317: The create path in the reconciler currently calls
c.client.Create(ctx, job) after checking len(existingJobs.Items) == 0 and does
not handle the race where another reconciler created the job and Create returns
an already-exists error; update the c.client.Create error handling to detect
apierrors.IsAlreadyExists(err) (k8s.io/apimachinery/pkg/api/errors) and treat
that case as success by returning ctrl.Result{RequeueAfter: 10 * time.Second},
nil (same behavior as successful create) instead of returning the error, while
still returning real errors unchanged.
- Around line 183-208: The first call that updates the node annotation using the
provisional crName is redundant; remove the initial block that calls
annotationManager.UpdateRemediationState with crName (the block after group :=
common.GetRemediationGroupForAction(healthEvent.RecommendedAction) that uses
crName) and keep only the later update that uses the authoritative actualCRName
from maintenance.GetName(); ensure you still check group != "" and
c.annotationManager != nil and return the existing error handling (slog.Warn and
return err) when the single UpdateRemediationState call fails.

In @fault-remediation/pkg/remediation/templates/log-collector-job.yaml:
- Line 20: Replace the hardcoded namespace value "test" in the
log-collector-job.yaml template with a parameterized Go template variable (e.g.,
change the "namespace: test" entry to use "{{.Namespace}}"), update the
template's expected data model to include Namespace, and ensure any code or
deployment that renders this template (the caller that supplies template data)
passes the Namespace value; this mirrors the approach used in
rebootnode-template.yaml so the template becomes environment-agnostic.
- Line 32: The job template currently hardcodes the placeholder image string
"test:test"; replace this with a real image reference or a templated parameter
so deployments use the correct log-collector image. Update the image field in
the log-collector job template (the line containing "image: test:test") to
reference a configurable variable (e.g., a Helm value or environment variable
like logCollector.image / .Values.logCollector.image) or the intended production
image name, and ensure any charts/values or deployment tooling are updated to
provide that value.
🧹 Nitpick comments (6)
fault-remediation/pkg/events/health_event.go (1)

10-14: Clarify godoc comment to distinguish HealthEventData from HealthEventDoc.

The godoc comment doesn't explain the distinction between HealthEventData (BSON-tagged) and HealthEventDoc (JSON-tagged). Consider documenting the intended use case for each type to improve maintainability.

🔎 Proposed improvement
-// HealthEventData represents health event data with string ID for compatibility
+// HealthEventData represents health event data with BSON "_id,omitempty" tag for MongoDB storage.
+// Use HealthEventDoc for JSON-based representations.
 type HealthEventData struct {
 	ID                          string `bson:"_id,omitempty"`
 	model.HealthEventWithStatus `bson:",inline"`
 }
fault-remediation/pkg/crstatus/crstatus_test.go (1)

120-212: Consider reducing test duplication with a helper function.

TestCheckConditionCtrlRuntime duplicates ~90 lines from TestCheckCondition. Consider extracting a helper function that accepts the checker as a parameter to test both implementations with the same test cases, improving maintainability and ensuring test case parity.

🔎 Proposed refactor
+func testCheckCondition(t *testing.T, checker interface {
+	checkCondition(*unstructured.Unstructured) bool
+}) {
+	tests := []struct {
+		name     string
+		cr       *unstructured.Unstructured
+		expected bool
+	}{
+		{
+			name: "no status returns skip - in progress",
+			cr: &unstructured.Unstructured{
+				Object: map[string]any{
+					"metadata": map[string]any{"name": "test-cr"},
+				},
+			},
+			expected: true,
+		},
+		// ... rest of test cases
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			result := checker.checkCondition(tt.cr)
+			assert.Equal(t, tt.expected, result)
+		})
+	}
+}
+
 func TestCheckCondition(t *testing.T) {
 	cfg := &config.MaintenanceResource{
 		CompleteConditionType: "Completed",
 	}
 	checker := NewCRStatusChecker(nil, nil, cfg, false)
-
-	tests := []struct {
-		// ... test cases
-	}
-
-	for _, tt := range tests {
-		t.Run(tt.name, func(t *testing.T) {
-			result := checker.checkCondition(tt.cr)
-			assert.Equal(t, tt.expected, result)
-		})
-	}
+	testCheckCondition(t, checker)
 }
 
 func TestCheckConditionCtrlRuntime(t *testing.T) {
 	cfg := &config.MaintenanceResource{
 		CompleteConditionType: "Completed",
 	}
 	checker := NewCtrlRuntimeCRStatusChecker(nil, cfg, false)
-
-	tests := []struct {
-		// ... duplicate test cases
-	}
-
-	for _, tt := range tests {
-		t.Run(tt.name, func(t *testing.T) {
-			result := checker.checkCondition(tt.cr)
-			assert.Equal(t, tt.expected, result)
-		})
-	}
+	testCheckCondition(t, checker)
 }
fault-remediation/pkg/initializer/init.go (1)

88-112: Consider extracting TemplateData initialization to reduce duplication.

The TemplateData struct initialization (lines 91-95 and 103-106) is duplicated across both initialization paths. Extract it to a variable before the conditional to improve maintainability.

🔎 Proposed refactor
+	templateData := remediation.TemplateData{
+		TemplateMountPath:   tomlConfig.Template.MountPath,
+		TemplateFileName:    tomlConfig.Template.FileName,
+		MaintenanceResource: tomlConfig.MaintenanceResource,
+	}
+
 	if params.UseCtrlRuntime {
 		remediationClient, err = remediation.NewCtrlRuntimeRemediationClient(
 			ctrlruntimeClient,
-			params.DryRun, remediation.TemplateData{
-				TemplateMountPath:   tomlConfig.Template.MountPath,
-				TemplateFileName:    tomlConfig.Template.FileName,
-				MaintenanceResource: tomlConfig.MaintenanceResource,
-			})
+			params.DryRun,
+			templateData)
 		if err != nil {
 			return nil, fmt.Errorf("error while initializing ctrl runtime client: %w", err)
 		}
 	} else {
 		remediationClient, clientSet, err = remediation.NewK8sClient(
 			params.KubeconfigPath,
 			params.DryRun,
-			remediation.TemplateData{
-				TemplateMountPath:   tomlConfig.Template.MountPath,
-				TemplateFileName:    tomlConfig.Template.FileName,
-				MaintenanceResource: tomlConfig.MaintenanceResource,
-			},
+			templateData,
 		)
 		if err != nil {
 			return nil, fmt.Errorf("error while initializing kubernetes client: %w", err)
 		}
 	}
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (2)

17-54: Consider grouping imports according to Go conventions.

The imports have non-standard ordering with context followed by a blank line, then log. Standard Go convention groups stdlib imports together, then a blank line, then external packages, then a blank line, then internal packages.


1109-1111: Avoid using time.Sleep for synchronization in tests.

Using time.Sleep(500 * time.Millisecond) for waiting on event processing is fragile. Consider using assert.Eventually consistently for better test reliability.

🔎 Suggested approach
-	// Allow time for second event to be processed (should be deduplicated)
-	time.Sleep(500 * time.Millisecond)
+	// Wait for second event to be processed (should be deduplicated)
+	time.Sleep(100 * time.Millisecond) // Brief pause to allow processing

Or better, add an Eventually check for a specific condition that indicates processing completed.

fault-remediation/pkg/remediation/templates/log-collector-job.yaml (1)

28-32: Consider adding resource limits and additional parameterization.

To improve cluster stability and operational flexibility:

  1. Define CPU and memory resource requests and limits for the log-collector container
  2. Consider parameterizing additional fields such as ttlSecondsAfterFinished, serviceAccountName, and container arguments/env vars if they vary across deployments
🔎 Example resource limits
       containers:
         - name: log-collector
           image: {{.Image}}
+          resources:
+            requests:
+              memory: "128Mi"
+              cpu: "100m"
+            limits:
+              memory: "256Mi"
+              cpu: "200m"
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0f60b1 and b05f1bf.

📒 Files selected for processing (23)
  • .gitignore
  • fault-remediation/main.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/annotation/deprecated_annotation.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/crstatus/deprecated_checker.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/fault_remediation_client_interface.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
  • fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
💀 Files with no reviewable changes (1)
  • fault-remediation/pkg/reconciler/fault_remediation_client_interface.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/annotation/annotation_interface.go
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/annotation/deprecated_annotation.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/crstatus/deprecated_checker.go
  • fault-remediation/main.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
🧠 Learnings (17)
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules

Applied to files:

  • fault-remediation/pkg/initializer/init.go
📚 Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.

Applied to files:

  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
📚 Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.

Applied to files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • fault-remediation/pkg/annotation/deprecated_annotation.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
📚 Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use structured logging via `log/slog` in Go code

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-10-29T12:40:29.621Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 143
File: fault-quarantine-module/pkg/informer/k8s_client.go:52-67
Timestamp: 2025-10-29T12:40:29.621Z
Learning: The clientcmd.BuildConfigFromFlags function in k8s.io/client-go/tools/clientcmd automatically handles in-cluster configuration as a fallback. When both masterUrl and kubeconfigPath parameters are empty strings, it internally attempts rest.InClusterConfig() before falling back to default config loading rules. No explicit in-cluster fallback logic is needed when using this function.

Applied to files:

  • fault-remediation/main.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use separate informers for different Kubernetes resource types

Applied to files:

  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods

Applied to files:

  • fault-remediation/pkg/remediation/deprecated_remediation.go
🧬 Code graph analysis (11)
fault-remediation/pkg/initializer/init.go (4)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (2)
  • FaultRemediationClientInterface (29-34)
  • TemplateData (37-44)
fault-remediation/pkg/remediation/remediation.go (1)
  • NewCtrlRuntimeRemediationClient (47-95)
fault-remediation/pkg/config/config.go (2)
  • Template (27-30)
  • MaintenanceResource (18-24)
commons/pkg/statemanager/statemanager.go (2)
  • StateManager (197-200)
  • NewStateManager (206-210)
fault-remediation/pkg/annotation/deprecated_annotation.go (1)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (24-26)
  • AnnotationKey (12-12)
  • EquivalenceGroupState (29-32)
fault-remediation/pkg/crstatus/crstatus_test.go (2)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (18-24)
fault-remediation/pkg/crstatus/checker.go (1)
  • NewCtrlRuntimeCRStatusChecker (34-44)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (8)
health-monitors/kubernetes-object-monitor/pkg/cel/environment.go (1)
  • Environment (30-35)
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
  • NewManager (41-43)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (2)
  • FaultRemediationClientInterface (29-34)
  • TemplateData (37-44)
fault-remediation/pkg/remediation/remediation.go (1)
  • NewCtrlRuntimeRemediationClient (47-95)
commons/pkg/statemanager/statemanager.go (2)
  • StateManager (197-200)
  • RemediatingLabelValue (171-171)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventDoc (5-8)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • AnnotationKey (12-12)
fault-remediation/pkg/metrics/metrics.go (4)
  • TotalEventsReceived (33-38)
  • EventHandlingDuration (62-68)
  • EventsProcessed (39-45)
  • ProcessingErrors (46-52)
fault-remediation/pkg/crstatus/deprecated_checker.go (1)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (18-24)
fault-remediation/pkg/remediation/deprecated_remediation_test.go (4)
fault-remediation/pkg/config/config.go (1)
  • Template (27-30)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)
  • TemplateData (37-44)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventData (11-14)
fault-remediation/pkg/remediation/deprecated_remediation.go (1)
  • FaultRemediationClient (62-74)
fault-remediation/pkg/annotation/annotation.go (1)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (24-26)
  • AnnotationKey (12-12)
  • EquivalenceGroupState (29-32)
fault-remediation/pkg/reconciler/reconciler_test.go (4)
fault-remediation/pkg/events/health_event.go (2)
  • HealthEventData (11-14)
  • HealthEventDoc (5-8)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • NodeAnnotationManagerInterface (16-21)
  • RemediationStateAnnotation (24-26)
  • EquivalenceGroupState (29-32)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
  • CRStatusCheckerInterface (7-9)
data-models/pkg/protos/health_event.pb.go (2)
  • RecommendedAction_RESTART_BM (48-48)
  • RecommendedAction_NONE (43-43)
fault-remediation/pkg/crstatus/checker.go (1)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (18-24)
fault-remediation/pkg/reconciler/reconciler.go (8)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)
  • FaultRemediationClientInterface (29-34)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • NodeAnnotationManagerInterface (16-21)
fault-remediation/pkg/metrics/metrics.go (7)
  • EventHandlingDuration (62-68)
  • TotalEventsReceived (33-38)
  • TotalUnsupportedRemediationActions (53-59)
  • ProcessingErrors (46-52)
  • EventsProcessed (39-45)
  • CRStatusSkipped (26-26)
  • CRStatusCreated (25-25)
commons/pkg/statemanager/statemanager.go (1)
  • StateManager (197-200)
data-models/pkg/protos/health_event.pb.go (1)
  • RecommendedAction_NONE (43-43)
fault-remediation/pkg/events/health_event.go (2)
  • HealthEventDoc (5-8)
  • HealthEventData (11-14)
data-models/pkg/model/health_event_extentions.go (1)
  • HealthEventWithStatus (53-57)
store-client/pkg/client/convenience.go (1)
  • UpdateHealthEventStatus (31-35)
fault-remediation/pkg/remediation/deprecated_remediation.go (6)
fault-remediation/pkg/config/config.go (1)
  • Template (27-30)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • NodeAnnotationManagerInterface (16-21)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
  • CRStatusCheckerInterface (7-9)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventData (11-14)
fault-remediation/pkg/common/equivalence_groups.go (1)
  • GetRemediationGroupForAction (35-45)
fault-remediation/pkg/metrics/metrics.go (2)
  • LogCollectorJobs (71-77)
  • LogCollectorJobDuration (78-85)
🪛 YAMLlint (1.37.1)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml

[error] 15-15: syntax error: expected , but found ''

(syntax)

🔇 Additional comments (29)
.gitignore (1)

143-143: ⚠️ Scope mismatch: .gitignore changes don't align with PR objectives.

This file's changes focus on consolidating IDE configuration patterns, which is unrelated to the PR's stated objectives around remediation retry behavior and error handling. The PR summary describes refactoring fault remediation with dual execution modes and controller-runtime integration, not .gitignore cleanup.

If this .gitignore consolidation is intentional as part of broader housekeeping, clarify this in the PR description. Alternatively, consider submitting it as a separate, focused PR to keep scopes distinct.

fault-remediation/pkg/metrics/metrics.go (1)

33-92: LGTM: Metric variable exports are well-structured.

The renamed metric variables follow Go naming conventions and properly integrate with the controller-runtime metrics registry. The metric definitions (names, help texts, labels) are preserved correctly.

fault-remediation/pkg/annotation/deprecated_annotation.go (3)

83-125: LGTM: GetRemediationState signature change properly implemented.

The updated method signature correctly returns *corev1.Node alongside the remediation state. All return paths (success and error) properly handle the additional return value, and error wrapping follows Go conventions with %w.


131-131: LGTM: Call sites correctly updated for new GetRemediationState signature.

Both UpdateRemediationState and RemoveGroupFromState properly handle the additional *corev1.Node return value by using _ to discard it, which is appropriate since these methods don't require the node reference.

Also applies to: 186-186


45-80: LGTM: Robust retry logic with appropriate error classification.

The retry implementation correctly handles common transient Kubernetes API errors (conflicts, timeouts, rate limiting, service unavailability) with exponential backoff. Error wrapping follows Go conventions with %w to preserve error chains.

fault-remediation/pkg/initializer/init.go (3)

58-60: LGTM: Proper validation of ctrl-runtime client.

The validation correctly ensures that a ctrl-runtime client is provided when UseCtrlRuntime is enabled, preventing nil pointer errors in the ctrl-runtime initialization path.


148-157: LGTM: Reconciler configuration properly wired.

The reconciler configuration correctly integrates the remediation client abstraction, retry settings, and log collector flag. (Note: StateManager nil issue addressed in separate comment.)


152-153: StateManager will panic with nil clientSet in ctrl-runtime mode.

When UseCtrlRuntime is true, clientSet remains nil but is passed to statemanager.NewStateManager(clientSet). StateManager methods directly call clientSet.CoreV1().Nodes().Get() without nil checks, causing a panic at runtime. The TODO acknowledges this but doesn't prevent the runtime error.

Either implement a ctrl-runtime version of StateManager or add nil checks to guard StateManager method calls in ctrl-runtime mode.

Likely an incorrect or invalid review comment.

fault-remediation/pkg/remediation/deprecated_remediation_test.go (1)

15-476: Test updates correctly reflect the refactored API.

The test changes appropriately adapt to the new public API surface, including:

  • Field capitalizations (Clientset, KubeClient, RestMapper, DryRunMode, Template, TemplateData)
  • Updated CreateMaintenanceResource signature returning (string, error) and accepting events.HealthEventData
  • Updated RunLogCollectorJob signature including eventId parameter and returning (ctrl.Result, error)

The test logic and assertions remain sound.

fault-remediation/pkg/remediation/remediation.go (2)

47-95: LGTM: Well-structured constructor with proper validation.

The constructor correctly:

  • Validates template file existence before reading
  • Handles template parsing errors
  • Initializes dry-run mode appropriately
  • Sets up annotation manager and status checker components

322-478: LGTM: Robust log collector status checking with metric guards.

The status checking implementation correctly:

  • Handles complete, failed, and timeout states separately
  • Uses annotation-based guards to prevent duplicate metric recording across reconciliations
  • Configures timeout via environment variable with safe fallback
  • Requeues appropriately when job is still running
fault-remediation/pkg/crstatus/checker.go (2)

28-44: LGTM - Clean refactoring to controller-runtime client.

The struct and constructor are well-structured with the controller-runtime client integration. The simplified field set improves maintainability.


71-85: LGTM - Condition checking logic is correct.

The checkCondition method properly handles missing status/conditions by returning true (allowing CR creation), and correctly delegates to isTerminal for status evaluation.

fault-remediation/pkg/reconciler/reconciler_e2e_test.go (2)

208-217: LGTM - Controller-runtime manager setup is correct.

The manager is properly configured with the test environment config, scheme, and disabled metrics server (BindAddress: "0"). The client is correctly obtained from the manager.


302-315: LGTM - Test remediation client creation properly uses new API.

The createTestRemediationClient correctly constructs remediation.TemplateData and uses remediation.NewCtrlRuntimeRemediationClient with the controller-runtime client.

fault-remediation/pkg/remediation/deprecated_remediation.go (2)

506-517: Verify intentional error suppression for log collector failures.

The code logs errors but returns nil for both timeout and job failure cases. Based on the past review discussion, this is intentional to allow reconciliation to continue. However, consider whether returning ctrl.Result{Requeue: true} might be more appropriate for transient failures.

Confirm that swallowing errors here aligns with the intended behavior discussed in past reviews - allowing remediation to proceed even if log collection fails.


62-74: LGTM - Struct fields properly exposed with consistent naming.

The FaultRemediationClient struct fields are well-organized with clear naming conventions. The nodeExistsFunc allows for test overrides.

fault-remediation/pkg/reconciler/reconciler.go (6)

203-211: Error propagation enables retry behavior as intended.

Returning the error from UpdateNVSentinelStateNodeLabel allows controller-runtime to retry the reconciliation. This aligns with the PR objective.


220-242: Good error aggregation pattern with errors.Join.

The code properly handles the case where CR creation fails by:

  1. Recording the error but continuing to update state
  2. Using errors.Join to combine errors when both operations fail
  3. Returning the CR creation error after state update

This ensures state is updated even on failure while still propagating errors for retry.


406-410: Error propagation from GetRemediationState enables retry.

Previously this may have silently continued; now it returns the error to trigger a retry. The underscore for the unused node return value is appropriate.


438-441: Error propagation from RemoveGroupFromState enables retry.

Returning false, "", err instead of just logging ensures the reconciler will retry on annotation update failures.


328-330: Verify log collector error handling aligns with intended behavior.

The code returns early if err != nil or !result.IsZero(). Per past review discussions, log collector failures should not block remediation. Confirm this matches the current implementation in RunLogCollectorJob.

Based on the deprecated_remediation.go changes, RunLogCollectorJob now returns nil error on timeout/failure, so this early return would only trigger on actual API errors during job creation.


332-344: Good error handling with status propagation.

The pattern of:

  1. Attempting remediation
  2. Updating status regardless of outcome
  3. Joining errors if both fail
  4. Returning the original error

Ensures status is always updated while still triggering retries on failure.

fault-remediation/pkg/reconciler/reconciler_test.go (5)

40-66: LGTM - Mock implementations properly updated for new interfaces.

The MockK8sClient correctly implements the updated FaultRemediationClientInterface with the new return types for CreateMaintenanceResource and RunLogCollectorJob.


102-134: LGTM - MockNodeAnnotationManager implements the new interface correctly.

The mock properly returns the 3-tuple (*RemediationStateAnnotation, *corev1.Node, error) from GetRemediationState, with nil for the node which is appropriate for unit tests.


383-390: LGTM - Test properly constructs HealthEventDoc from HealthEventData.

The conversion pattern from events.HealthEventData to events.HealthEventDoc is correctly applied in the test.


443-446: Test correctly expects error on remediation failure.

The test verifies that performRemediation returns an error when CreateMaintenanceResource fails, confirming the retry behavior.


489-491: Test correctly expects error on label update failure.

The test now expects an error when UpdateNVSentinelStateNodeLabel fails, validating the new retry-on-error behavior introduced by this PR.

fault-remediation/pkg/remediation/templates/rebootnode-template.yaml (1)

1-20: Template is missing required force field in spec.

The RebootNode CRD (janitor.dgxc.nvidia.com/v1alpha1) requires both force (boolean, defaults to false) and nodeName fields in the spec. Your template only includes nodeName, which will cause validation errors when creating RebootNode resources.

Add the missing field:

Required fix
apiVersion: {{.ApiGroup}}/{{.Version}}
kind: RebootNode
metadata:
  name: maintenance-{{.NodeName}}-{{.HealthEventID}}
spec:
  nodeName: {{.NodeName}}
  force: false

Also verify that ApiGroup and Version placeholders are populated with janitor.dgxc.nvidia.com and v1alpha1 respectively during template rendering.

⛔ Skipped due to learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 13

🤖 Fix all issues with AI agents
In @commons/pkg/statemanager/statemanager_test.go:
- Around line 281-302: Remove the dead commented-out test block for
TestUpdateNVSentinelStateNodeLabelWithUpdateFailureCtrlRuntime: delete the
entire commented function (including references to ctrlRuntimeStateManager,
fakeClient, and the UpdateNVSentinelStateNodeLabel calls) from
statemanager_test.go; if the test work is intended later, open a tracking issue
referencing this test name and move any needed notes there instead of leaving
commented code.

In @commons/pkg/statemanager/statemanager.go:
- Around line 310-388: The
ctrlRuntimeStateManager.UpdateNVSentinelStateNodeLabel implementation lacks
retry-on-conflict handling; wrap the Get/Modify/Update sequence in a retry loop
(e.g., retry.OnError with errors.IsConflict) so transient optimistic-concurrency
conflicts are retried, introduce a local nodeModified bool to track whether the
label was actually changed/removed inside the retry closure, and ensure the
function returns that nodeModified and the final error from the retry call; keep
using validateStateTransition, manager.client.Get and manager.client.Update
inside the retry closure and return validationErr only after a successful
update.

In @fault-remediation/main.go:
- Around line 171-182: The retry loop around
components.FaultRemediationReconciler.Reconcile does not respect gCtx
cancellation because it uses time.Sleep; change the backoff to a context-aware
wait by replacing the
time.Sleep(components.FaultRemediationReconciler.Config.UpdateRetryDelay) with a
select that waits on
time.After(components.FaultRemediationReconciler.Config.UpdateRetryDelay) and on
gCtx.Done(), and if gCtx is cancelled return or break out of the loop; ensure
you check gCtx.Done() before each retry and abort retries when the context is
done so graceful shutdown on SIGTERM is honored.

In @fault-remediation/pkg/annotation/annotation.go:
- Around line 66-74: GetRemediationState can fail and currently the code
swallows the error and proceeds with an empty RemediationStateAnnotation while
using the invalid node value returned from the failed call later (e.g.,
node.DeepCopy()), which can cause incorrect behavior; change the handling in the
caller so that when m.GetRemediationState(ctx, nodeName) returns an error you
immediately return that error (or wrap and return it) instead of continuing with
an empty state, ensuring you do not call methods like node.DeepCopy() on the
invalid node; update the function containing this logic to propagate the error
from GetRemediationState rather than creating a placeholder
RemediationStateAnnotation.

In @fault-remediation/pkg/crstatus/checker.go:
- Around line 58-66: The ObjectKey used for c.client.Get is missing the
Namespace, so lookups for namespaced CRs fail; update the key construction (the
client.ObjectKey passed to c.client.Get where obj, gvk and crName are used) to
include the namespace from the MaintenanceResource (e.g., use the resource's
Namespace field or crNamespace variable) so the lookup uses both Name and
Namespace before calling c.client.Get; ensure the Namespace is set only when
non-empty for cluster-scoped resources.

In @fault-remediation/pkg/crstatus/crstatus_interface.go:
- Around line 1-9: The package crstatus lacks a package-level godoc comment; add
a brief package comment immediately above "package crstatus" describing the
package's purpose (e.g., utilities for checking/handling CR status), mention the
exported interface CRStatusCheckerInterface and its method
ShouldSkipCRCreation(context.Context, string) so docs are clear, keeping the
comment concise and in godoc style.

In @fault-remediation/pkg/crstatus/deprecated_checker.go:
- Around line 77-91: The checkCondition method in CRStatusChecker has inverted
skip logic: when status or conditions are missing (in the nested map/slice
checks in checkCondition) it currently returns true but should return false
(treat missing as non-terminal), and the final return should not negate
isTerminal; replace "return !c.isTerminal(conditionStatus)" with "return
c.isTerminal(conditionStatus)"; locate these in checkCondition (calls to
unstructured.NestedMap, unstructured.NestedSlice, findConditionStatus and
isTerminal) and invert those boolean returns accordingly.

In @fault-remediation/pkg/events/health_event.go:
- Around line 5-8: Add a godoc comment immediately above the exported type
HealthEventDoc describing its purpose and fields; mention that it represents a
HealthEvent document with an ID and inlined model.HealthEventWithStatus (so
readers understand the JSON tags and inline embedding). Ensure the comment
starts with "HealthEventDoc" and is a complete sentence per Go conventions.
- Around line 1-3: Add a package-level godoc comment above the package
declaration for package events in health_event.go that briefly documents the
package purpose and intended usage (e.g., what health events are represented and
how consumers should use this package). Ensure the comment is a full sentence
starting with "Package events ..." and sits immediately above the existing
"package events" line so godoc tools pick it up; update any existing top-of-file
comments if present to follow the "Package events ..." convention.

In @fault-remediation/pkg/initializer/init.go:
- Line 117: The log message "Successfully initialized k8s client" is misleading
when UseCtrlRuntime is true; update the logging in the client initialization
(where slog.Info is called) to check the UseCtrlRuntime flag and log a precise
message (e.g., "Successfully initialized ctrl-runtime remediation client" when
UseCtrlRuntime is true, otherwise "Successfully initialized k8s client"),
referencing the UseCtrlRuntime boolean and the existing slog.Info call to locate
the spot to change.
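
One way to express the suggested change is sketched below. The helper clientInitMessage is hypothetical (the comment only asks for a conditional around the existing slog.Info call); the two message strings are taken from the comment itself.

```go
package main

import "log/slog"

// clientInitMessage returns the log message matching the client that was
// actually initialized. Hypothetical helper, not part of the repository.
func clientInitMessage(useCtrlRuntime bool) string {
	if useCtrlRuntime {
		return "Successfully initialized ctrl-runtime remediation client"
	}
	return "Successfully initialized k8s client"
}

func main() {
	// In init.go the flag would come from the UseCtrlRuntime field.
	slog.Info(clientInitMessage(true))
	slog.Info(clientInitMessage(false))
}
```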

In @fault-remediation/pkg/remediation/deprecated_remediation.go:
- Around line 266-293: The code calls UpdateRemediationState twice, first using
crName and then again using actualCRName from createdCR.GetName(); remove the
first redundant update (the block that uses crName) so only the second
UpdateRemediationState call remains, keeping the group lookup via
common.GetRemediationGroupForAction(healthEvent.RecommendedAction) and the
c.AnnotationManager checks intact and returning errors as currently done in the
second block.

In @fault-remediation/pkg/remediation/remediation.go:
- Around line 183-192: Remove the redundant provisional annotation update:
delete the block that calls c.annotationManager.UpdateRemediationState(ctx,
healthEvent.NodeName, group, crName) (the update using the provisional crName)
and its error handling; keep the later authoritative update that uses
actualCRName retrieved from maintenance.GetName() so only the single, final
UpdateRemediationState call remains. Ensure any related variables (crName) are
still set if needed elsewhere, but do not perform the early
UpdateRemediationState call with the provisional name.
🧹 Nitpick comments (9)
fault-remediation/pkg/crstatus/crstatus_test.go (1)

120-212: Consider extracting shared test cases to reduce duplication.

The test logic is correct and comprehensive. However, the test cases are identical to TestCheckCondition (lines 26-118). Consider extracting the shared test cases into a variable or helper function to reduce duplication and improve maintainability.

♻️ Proposed refactor to share test cases
+// Shared test cases for both checker variants
+func getConditionTestCases() []struct {
+	name     string
+	cr       *unstructured.Unstructured
+	expected bool
+} {
+	return []struct {
+		name     string
+		cr       *unstructured.Unstructured
+		expected bool
+	}{
+		{
+			name: "no status returns skip - in progress",
+			cr: &unstructured.Unstructured{
+				Object: map[string]any{
+					"metadata": map[string]any{"name": "test-cr"},
+				},
+			},
+			expected: true,
+		},
+		// ... rest of test cases
+	}
+}
+
 func TestCheckCondition(t *testing.T) {
 	cfg := &config.MaintenanceResource{
 		CompleteConditionType: "Completed",
 	}
 	checker := NewCRStatusChecker(nil, nil, cfg, false)
-
-	tests := []struct {
-		name     string
-		cr       *unstructured.Unstructured
-		expected bool
-	}{
-		// ... test cases
-	}
+	tests := getConditionTestCases()
 
 	for _, tt := range tests {
 		t.Run(tt.name, func(t *testing.T) {
@@ -113,93 +115,7 @@
 
 func TestCheckConditionCtrlRuntime(t *testing.T) {
 	// ... same setup
 	checker := NewCtrlRuntimeCRStatusChecker(nil, cfg, false)
-
-	tests := []struct {
-		name     string
-		cr       *unstructured.Unstructured
-		expected bool
-	}{
-		// ... identical test cases
-	}
+	tests := getConditionTestCases()
 
 	for _, tt := range tests {
 		// ... same loop
commons/pkg/statemanager/statemanager_test.go (1)

17-32: Import ordering does not follow Go conventions.

Standard library imports should be grouped separately from third-party imports. The k8s.io/apimachinery/pkg/types import on line 20 is mixed with the standard library block.

♻️ Suggested fix
 import (
 	"context"
 	"fmt"
-	"k8s.io/apimachinery/pkg/types"
 	"testing"
 
 	"github.com/stretchr/testify/assert"
 	v1 "k8s.io/api/core/v1"
 	"k8s.io/apimachinery/pkg/api/errors"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 	"k8s.io/apimachinery/pkg/runtime"
 	"k8s.io/apimachinery/pkg/runtime/schema"
+	"k8s.io/apimachinery/pkg/types"
 	"k8s.io/client-go/kubernetes/fake"
 	ktesting "k8s.io/client-go/testing"
 	ctrlFake "sigs.k8s.io/controller-runtime/pkg/client/fake"
 )
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (2)

15-15: Missing package-level godoc.

As per coding guidelines, package-level godoc is required for all Go packages.

📝 Proposed fix
+// Package remediation provides interfaces and implementations for fault remediation
+// operations, including maintenance resource creation and log collection.
 package remediation

31-31: Parameter name should use Go acronym convention.

Go convention is to use ID (all caps) for acronyms, not Id.

♻️ Proposed fix
-	RunLogCollectorJob(ctx context.Context, nodeName string, eventId string) (ctrl.Result, error)
+	RunLogCollectorJob(ctx context.Context, nodeName string, eventID string) (ctrl.Result, error)
commons/pkg/statemanager/statemanager.go (1)

144-156: Import ordering does not follow Go conventions.

Standard library and third-party imports are mixed. Group standard library imports together, separated from third-party imports.

♻️ Proposed fix
 import (
 	"context"
 	"fmt"
-	corev1 "k8s.io/api/core/v1"
-	"k8s.io/apimachinery/pkg/types"
 	"log/slog"
-	"sigs.k8s.io/controller-runtime/pkg/client"
 
+	corev1 "k8s.io/api/core/v1"
 	"k8s.io/apimachinery/pkg/api/errors"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/apimachinery/pkg/types"
 	"k8s.io/client-go/kubernetes"
 	"k8s.io/client-go/util/retry"
+	"sigs.k8s.io/controller-runtime/pkg/client"
 )
fault-remediation/pkg/annotation/annotation.go (1)

1-1: Missing package-level godoc.

As per coding guidelines, package-level godoc is required for all Go packages.

📝 Proposed fix
+// Package annotation provides node annotation management for tracking remediation state
+// across fault remediation operations.
 package annotation
fault-remediation/main.go (1)

199-199: TODO comment should reference an issue.

As per coding guidelines, TODO comments in Go code should reference issues for tracking.

📝 Proposed fix
-	//TODO: setup informers for node and job
+	//TODO(#issue_number): setup informers for node and job
fault-remediation/pkg/remediation/deprecated_remediation.go (1)

208-210: Add context when returning template execution error.

Per coding guidelines, wrap errors with context for better traceability.

📝 Suggested enhancement
 	if err = c.Template.Execute(&buf, c.TemplateData); err != nil {
 		slog.Error("Failed to execute maintenance Template", "error", err)
-		return "", err
+		return "", fmt.Errorf("failed to execute maintenance template: %w", err)
 	}
fault-remediation/pkg/reconciler/reconciler.go (1)

203-211: Consider adding context when returning label update error.

Per coding guidelines, wrapping errors with context improves traceability.

📝 Suggested enhancement
 	_, err := r.Config.StateManager.UpdateNVSentinelStateNodeLabel(ctx,
 		healthEventWithStatus.HealthEvent.NodeName,
 		statemanager.RemediatingLabelValue, false)
 	if err != nil {
 		slog.Error("Error updating node label to remediating", "error", err)
 		metrics.ProcessingErrors.WithLabelValues("label_update_error", nodeName).Inc()
 
-		return "", err
+		return "", fmt.Errorf("failed to update node label to remediating: %w", err)
 	}
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b05f1bf and 8788f69.

⛔ Files ignored due to path filters (1)
  • commons/go.sum is excluded by !**/*.sum
📒 Files selected for processing (26)
  • .gitignore
  • commons/go.mod
  • commons/pkg/statemanager/statemanager.go
  • commons/pkg/statemanager/statemanager_test.go
  • fault-remediation/main.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/annotation/deprecated_annotation.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/crstatus/deprecated_checker.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/fault_remediation_client_interface.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
  • fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
💤 Files with no reviewable changes (1)
  • fault-remediation/pkg/reconciler/fault_remediation_client_interface.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • fault-remediation/pkg/remediation/remediation_test.go
  • .gitignore
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
  • fault-remediation/pkg/annotation/annotation_interface.go
🧰 Additional context used
📓 Path-based instructions (3)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • commons/pkg/statemanager/statemanager_test.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/main.go
  • commons/pkg/statemanager/statemanager.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/crstatus/deprecated_checker.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/annotation/deprecated_annotation.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • fault-remediation/pkg/crstatus/crstatus_test.go
  • commons/pkg/statemanager/statemanager_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
**/go.mod

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use go.mod for each service as a separate Go module with semantic import versioning

Files:

  • commons/go.mod
🧠 Learnings (25)
📚 Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.

Applied to files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages

Applied to files:

  • fault-remediation/pkg/events/health_event.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions

Applied to files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/metrics/metrics.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_test.go
  • commons/pkg/statemanager/statemanager_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_test.go
  • commons/pkg/statemanager/statemanager_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2026-01-06T21:31:36.113Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 490
File: janitor-provider/go.mod:70-70
Timestamp: 2026-01-06T21:31:36.113Z
Learning: In janitor-provider/go.mod, the dependency github.com/golang-jwt/jwt/v4 v4.5.1 is a transitive dependency from github.com/nebius/gosdk and cannot be directly upgraded without a replace directive or upstream fix in nebius/gosdk.

Applied to files:

  • commons/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules

Applied to files:

  • commons/go.mod
  • fault-remediation/pkg/initializer/init.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning

Applied to files:

  • commons/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date

Applied to files:

  • commons/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • commons/pkg/statemanager/statemanager_test.go
  • commons/pkg/statemanager/statemanager.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • commons/pkg/statemanager/statemanager_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use structured logging via `log/slog` in Go code

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-10-29T12:40:29.621Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 143
File: fault-quarantine-module/pkg/informer/k8s_client.go:52-67
Timestamp: 2025-10-29T12:40:29.621Z
Learning: The clientcmd.BuildConfigFromFlags function in k8s.io/client-go/tools/clientcmd automatically handles in-cluster configuration as a fallback. When both masterUrl and kubeconfigPath parameters are empty strings, it internally attempts rest.InClusterConfig() before falling back to default config loading rules. No explicit in-cluster fallback logic is needed when using this function.

Applied to files:

  • fault-remediation/main.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.

Applied to files:

  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code

Applied to files:

  • fault-remediation/pkg/metrics/metrics.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use inline comments for complex logic only in Go code

Applied to files:

  • fault-remediation/pkg/metrics/metrics.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use separate informers for different Kubernetes resource types

Applied to files:

  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods

Applied to files:

  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
🧬 Code graph analysis (10)
fault-remediation/pkg/crstatus/crstatus_test.go (2)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (18-24)
fault-remediation/pkg/crstatus/checker.go (1)
  • NewCtrlRuntimeCRStatusChecker (34-44)
commons/pkg/statemanager/statemanager_test.go (1)
commons/pkg/statemanager/statemanager.go (9)
  • QuarantinedLabelValue (166-166)
  • NVSentinelStateLabelKey (159-159)
  • DrainingLabelValue (169-169)
  • NVSentinelStateLabelValue (162-162)
  • DrainSucceededLabelValue (170-170)
  • DrainFailedLabelValue (171-171)
  • RemediatingLabelValue (174-174)
  • RemediationSucceededLabelValue (175-175)
  • RemediationFailedLabelValue (176-176)
fault-remediation/main.go (3)
fault-remediation/pkg/reconciler/reconciler.go (1)
  • FaultRemediationReconciler (61-69)
commons/pkg/auditlogger/roundtripper.go (1)
  • NewAuditingRoundTripper (42-47)
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
  • NewManager (41-43)
fault-remediation/pkg/crstatus/deprecated_checker.go (1)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (18-24)
fault-remediation/pkg/annotation/deprecated_annotation.go (1)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (24-26)
  • AnnotationKey (12-12)
  • EquivalenceGroupState (29-32)
fault-remediation/pkg/remediation/deprecated_remediation_test.go (4)
fault-remediation/pkg/config/config.go (1)
  • Template (27-30)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)
  • TemplateData (37-44)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventData (11-14)
fault-remediation/pkg/remediation/deprecated_remediation.go (1)
  • FaultRemediationClient (62-74)
fault-remediation/pkg/annotation/annotation.go (1)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (24-26)
  • AnnotationKey (12-12)
  • EquivalenceGroupState (29-32)
fault-remediation/pkg/reconciler/reconciler_test.go (4)
fault-remediation/pkg/events/health_event.go (2)
  • HealthEventData (11-14)
  • HealthEventDoc (5-8)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • NodeAnnotationManagerInterface (16-21)
  • RemediationStateAnnotation (24-26)
  • EquivalenceGroupState (29-32)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
  • CRStatusCheckerInterface (7-9)
data-models/pkg/protos/health_event.pb.go (2)
  • RecommendedAction_RESTART_BM (48-48)
  • RecommendedAction_NONE (43-43)
fault-remediation/pkg/reconciler/reconciler.go (8)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)
  • FaultRemediationClientInterface (29-34)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • NodeAnnotationManagerInterface (16-21)
fault-remediation/pkg/metrics/metrics.go (5)
  • EventHandlingDuration (62-68)
  • TotalEventsReceived (33-38)
  • TotalUnsupportedRemediationActions (53-59)
  • ProcessingErrors (46-52)
  • EventsProcessed (39-45)
commons/pkg/statemanager/statemanager.go (1)
  • StateManager (200-203)
data-models/pkg/protos/health_event.pb.go (1)
  • RecommendedAction_NONE (43-43)
fault-remediation/pkg/events/health_event.go (2)
  • HealthEventDoc (5-8)
  • HealthEventData (11-14)
data-models/pkg/model/health_event_extentions.go (1)
  • HealthEventWithStatus (53-57)
store-client/pkg/storewatcher/watch_store.go (1)
  • ChangeStreamWatcher (43-57)
fault-remediation/pkg/crstatus/checker.go (1)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (18-24)
🪛 YAMLlint (1.37.1)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml

[error] 15-15: syntax error: expected , but found ''

(syntax)

🔇 Additional comments (37)
commons/go.mod (1)

26-26: Dual json-patch versions are expected transitive dependencies and require no action.

Both github.com/evanphx/json-patch/v5 v5.9.11 and gopkg.in/evanphx/json-patch.v4 v4.13.0 are indirect (transitive) dependencies pulled in by different upstream packages. The presence of both versions is normal in Go module management when different dependencies require different versions. Neither is directly used in the commons module code, and consolidation is not possible without changing upstream dependencies. The v5.9.11 version is stable and actively maintained.

Likely an incorrect or invalid review comment.

fault-remediation/pkg/crstatus/checker.go (3)

28-44: LGTM!

The type definition and constructor are well-structured, correctly using the controller-runtime client and properly initializing all fields.


71-85: LGTM!

The condition checking logic correctly handles missing status/conditions and delegates to helper methods for status evaluation.


87-106: LGTM!

The helper methods correctly implement condition status lookup and terminal state checking with appropriate type assertions and fallback behavior.

fault-remediation/pkg/remediation/templates/rebootnode-template.yaml (1)

1-20: LGTM! Static analysis error is a false positive.

The template structure is correct. The YAMLlint syntax error on line 15 is a false positive: the file uses Go template placeholders (e.g., {{.ApiGroup}}) that will be rendered before being parsed as YAML.
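
As a rough illustration of that rendering step, the sketch below executes a small template with text/template and produces plain YAML text. Only the {{.ApiGroup}} placeholder comes from the comment above; the other fields, the template body, and the sample values are assumptions and do not reproduce rebootnode-template.yaml.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// templateData holds hypothetical values; only ApiGroup is named in the
// review comment.
type templateData struct {
	ApiGroup string
	Version  string
	NodeName string
}

// rebootTemplate is an illustrative template, not the repository's file.
// A YAML linter run on this raw text may stumble over the {{...}} braces.
const rebootTemplate = `apiVersion: {{.ApiGroup}}/{{.Version}}
kind: RebootNode
metadata:
  name: maintenance-{{.NodeName}}
`

// render substitutes the placeholders and returns the resulting YAML text.
func render(data templateData) (string, error) {
	tmpl, err := template.New("rebootnode").Parse(rebootTemplate)
	if err != nil {
		return "", fmt.Errorf("failed to parse template: %w", err)
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", fmt.Errorf("failed to execute template: %w", err)
	}
	return buf.String(), nil
}

func main() {
	out, err := render(templateData{ApiGroup: "example.com", Version: "v1alpha1", NodeName: "node-a"})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```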

fault-remediation/pkg/events/health_event.go (1)

10-14: LGTM!

The HealthEventData type is properly documented and the struct definition is correct with appropriate BSON tags for MongoDB compatibility.

commons/pkg/statemanager/statemanager_test.go (5)

269-279: LGTM!

Good test coverage for the ctrl-runtime path's Get failure scenario, using the fake client without pre-created objects to simulate a node not found error.


304-330: LGTM!

Comprehensive test for adding a label via the ctrl-runtime path with proper verification of the final node state.


332-361: LGTM!

Proper test for label removal with verification that the label no longer exists on the node.


449-495: LGTM!

Good table-driven test pattern for verifying label removal from all possible states works without validation errors in the ctrl-runtime path.


497-572: LGTM!

Comprehensive state transition test covering both valid and invalid transitions, with proper verification that labels are set even for unexpected transitions.

fault-remediation/pkg/initializer/init.go (3)

51-60: LGTM!

Good defensive validation ensuring the ctrl-runtime client is provided when UseCtrlRuntime is true. This prevents nil pointer panics downstream.


83-115: LGTM!

The dual-mode initialization properly separates the ctrl-runtime and k8s client paths, fixing the previous issue where clientSet could be nil in ctrl-runtime mode. Each path now correctly initializes its required dependencies.


151-160: LGTM!

The reconciler configuration now correctly receives the abstracted RemediationClient and StateManager, enabling both initialization paths to work with the same reconciler interface.

fault-remediation/pkg/metrics/metrics.go (2)

15-27: LGTM!

Clean package reorganization with appropriate exported constants for CR status tracking.


33-92: LGTM!

Metrics properly exported and registered with the controller-runtime metrics registry. The naming follows Prometheus conventions with the fault_remediation_ prefix.

fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)

29-44: LGTM on interface design.

The interface provides a clean abstraction over remediation operations with appropriate accessors for annotation management and status checking. The TemplateData struct properly embeds config.MaintenanceResource for template rendering.

commons/pkg/statemanager/statemanager.go (1)

298-306: LGTM on struct and constructor.

Clean implementation following the same pattern as NewStateManager.

fault-remediation/pkg/annotation/annotation.go (2)

107-132: LGTM!

Clean implementation for clearing the remediation state annotation with proper nil-check and patch-based update.


134-169: LGTM!

Well-structured logic to remove a specific group, with automatic cleanup when no groups remain.

fault-remediation/main.go (3)

194-197: LGTM!

Good integration of the auditing round-tripper wrapper for request auditing in the ctrl-runtime path.


228-244: LGTM!

Proper initialization flow with deferred cleanup for the datastore components in the ctrl-runtime path.


122-136: LGTM!

Clean separation of the non-ctrl-runtime initialization with proper deferred cleanup.

fault-remediation/pkg/annotation/deprecated_annotation.go (2)

46-59: LGTM!

The retry logic with retry.OnError is correctly implemented, and wrapping errors with %w preserves the error chain for retry detection.


83-124: LGTM!

The signature update to return the Node object alongside the state is correctly implemented and aligns with the new interface definition. This enables owner-reference-based operations in remediation flows.

fault-remediation/pkg/remediation/remediation.go (1)

375-377: LGTM: Error handling aligns with PR objectives.

The Update calls at lines 375-377, 413-416, and 465-468 correctly return errors without wrapping, which will trigger retries in the reconciliation loop as intended by the PR title "retry on errors and throw errors to trigger retries."

Also applies to: 413-416, 465-468

fault-remediation/pkg/remediation/deprecated_remediation_test.go (1)

15-418: LGTM!

The test updates correctly reflect the refactoring from private to exported fields in the FaultRemediationClient struct, and the usage of the new events.HealthEventData type.

fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)

1-1380: LGTM!

The e2e test suite is well-structured and follows coding guidelines:

  • Uses envtest for testing Kubernetes controllers as recommended
  • Correctly handles the updated GetRemediationState signature throughout
  • Comprehensive test coverage for various reconciliation scenarios
  • Proper metrics validation
fault-remediation/pkg/remediation/deprecated_remediation.go (2)

365-504: LGTM: Comprehensive metrics instrumentation.

The metrics instrumentation throughout RunLogCollectorJob provides good observability with appropriate labels (error types, node names, job outcomes).


254-264: LGTM: Proper AlreadyExists error handling.

The extraction of AlreadyExists handling into handleCRCreateAlreadyExists improves code organization and ensures consistent annotation updates when a CR already exists.

fault-remediation/pkg/reconciler/reconciler.go (4)

407-441: Verify error handling strategy is intentional.

The function returns different shouldCreateCR values depending on the error type:

  • Line 409: Returns true (allow creation) when GetRemediationState fails
  • Line 440: Returns false (prevent creation) when RemoveGroupFromState fails

This appears intentional (fail-open for read errors, fail-closed for write errors), but could benefit from inline comments explaining the reasoning.


170-193: LGTM: runLogCollector signature updated for controller-runtime integration.

The signature changes enable proper requeue handling via ctrl.Result and improve log collector job labeling with eventUID.


328-344: LGTM: Proper error aggregation with errors.Join.

The error handling ensures status updates are always attempted even when remediation fails, and properly aggregates multiple errors using errors.Join for comprehensive error reporting.


101-352: LGTM: Comprehensive metrics instrumentation.

Metrics are consistently recorded throughout the reconciliation flow with appropriate labels for error types, node names, and status values, providing good observability.

fault-remediation/pkg/reconciler/reconciler_test.go (3)

40-66: LGTM: Mock interfaces updated to match new signatures.

The mock implementations correctly reflect the updated interface methods with events.HealthEventData, ctrl.Result return types, and new annotation/status checker interfaces.


102-134: LGTM: Mock annotation manager correctly implements new interface.

The mock properly returns the expanded 3-tuple from GetRemediationState and uses the correct types from the annotation package.


190-1009: LGTM: Test cases comprehensively updated for new interfaces.

All test cases correctly use events.HealthEventData and events.HealthEventDoc types, mock the new return signatures, and validate the updated error handling and return patterns.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
fault-remediation/pkg/annotation/deprecated_annotation.go (1)

45-60: Return unwrapped errors in retry blocks to preserve retry behavior.

Line 55 wraps the error inside a retry.OnError block. Per coding guidelines, errors should be returned without wrapping within retry blocks to preserve retry behavior, as wrapping can interfere with error type checking.

🔧 Proposed fix
 func (m *NodeAnnotationManager) patchNodeWithRetry(ctx context.Context, nodeName string, patch []byte) error {
-	return retry.OnError(retry.DefaultRetry, isRetryableError, func() error {
+	err := retry.OnError(retry.DefaultRetry, isRetryableError, func() error {
 		_, err := m.kubeClient.CoreV1().Nodes().Patch(ctx, nodeName, types.MergePatchType, patch, metav1.PatchOptions{})
 		if err != nil && isRetryableError(err) {
 			slog.Warn("Retryable error patching node annotation. Retrying...",
 				"node", nodeName,
 				"error", err)
 		}
 
-		if err != nil {
-			return fmt.Errorf("failed to patch node %s: %w", nodeName, err)
-		}
-
-		return nil
+		return err
 	})
+	if err != nil {
+		return fmt.Errorf("failed to patch node %s: %w", nodeName, err)
+	}
+	return nil
 }

Based on coding guidelines.

fault-remediation/pkg/remediation/deprecated_remediation_test.go (1)

225-235: Correct template filename capitalization to match actual file.

The code references "rebootnode-Template.yaml" (capital T), but the actual template file is rebootnode-template.yaml (lowercase). Update line 227 to use the correct lowercase filename.

🤖 Fix all issues with AI agents
In @fault-remediation/pkg/annotation/annotation.go:
- Around line 66-73: The code after the GetRemediationState error return is
dead: remove the unreachable initialization of state (the
RemediationStateAnnotation with EquivalenceGroups) that appears after "return
err"; if your intent was to recover and use a default state instead of returning
an error, assign state = &RemediationStateAnnotation{EquivalenceGroups:
make(map[string]EquivalenceGroupState)} before returning and change the "return
err" to "return nil" (and update callers accordingly); reference symbols:
m.GetRemediationState, state, RemediationStateAnnotation, EquivalenceGroups,
EquivalenceGroupState.

In @fault-remediation/pkg/crstatus/deprecated_checker.go:
- Around line 77-91: In CRStatusChecker.checkCondition, the skip/allow logic is
inverted: when status or conditions are missing you should treat the resource as
non-terminal (do not skip) and when a conditionStatus is terminal you should
skip. Change the two early returns that currently return true to return false
(for the status/conditions-not-found or error cases), and change the final
return from !c.isTerminal(conditionStatus) to c.isTerminal(conditionStatus);
this uses the existing methods findConditionStatus and isTerminal to determine
terminality correctly.

In @fault-remediation/pkg/reconciler/reconciler_test.go:
- Around line 48-50: The local interface named CRStatusCheckerInterface is
unused and duplicates crstatus.CRStatusCheckerInterface used by the mock; remove
the local type definition of CRStatusCheckerInterface from the test file and
ensure all references (e.g., the mock that currently expects
crstatus.CRStatusCheckerInterface) use the package-scoped
crstatus.CRStatusCheckerInterface, updating imports if necessary so the test
only relies on crstatus.CRStatusCheckerInterface.

In @fault-remediation/pkg/remediation/deprecated_remediation.go:
- Around line 266-293: The code performs two redundant annotation updates: first
calling c.AnnotationManager.UpdateRemediationState with crName, then again with
actualCRName after obtaining actualCRName := createdCR.GetName(); remove the
first update block (the if that checks group != "" and calls
UpdateRemediationState with crName) and the redundant reassignment of group
before the second update, leaving only the single UpdateRemediationState call
that uses actualCRName (keep references to common.GetRemediationGroupForAction,
createdCR.GetName, and c.AnnotationManager.UpdateRemediationState) so the node
annotation is set once with the real CR name and the function still returns
actualCRName, nil.

In @fault-remediation/pkg/remediation/remediation_test.go:
- Around line 56-58: The test calls NewCtrlRuntimeRemediationClient with a
TemplateData that sets TemplateFileName to "rebootnode-Template.yaml" which
mismatches the actual file name; change the TemplateData.TemplateFileName value
to "rebootnode-template.yaml" (lowercase "template") so the test loads the
correct template file.

In @fault-remediation/pkg/remediation/remediation.go:
- Around line 356-376: The nil-check logic incorrectly skips metrics when
job.Annotations is nil; change the guard from "job.Annotations != nil &&
job.Annotations[jobMetricsAlreadyCountedAnnotation] != trueStringVal" to
"job.Annotations == nil || job.Annotations[jobMetricsAlreadyCountedAnnotation]
!= trueStringVal" so freshly-created jobs without annotations still enter the
block (you can keep the existing updateJob.Annotations initialization). Apply
this same fix in the analogous checks inside checkLogCollectorFailed and
checkLogCollectorTimedOut so they use the same "annotations == nil ||
annotations[key] != trueStringVal" condition.
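For the annotation guard in the last item, a small sketch shows why the original `!= nil &&` form skips jobs without annotations. It also shows that indexing a nil map is safe in Go, so the nil check can be dropped altogether. The key name and helper functions are hypothetical.

```go
package main

import "fmt"

// countedKey is a hypothetical annotation key.
const countedKey = "example.com/metrics-counted"

// buggyGuard reproduces the original condition: it is false for a nil map,
// so a job with no annotations never has its metrics recorded.
func buggyGuard(annotations map[string]string) bool {
	return annotations != nil && annotations[countedKey] != "true"
}

// fixedGuard is the condition suggested in the review.
func fixedGuard(annotations map[string]string) bool {
	return annotations == nil || annotations[countedKey] != "true"
}

// shortGuard relies on nil-map reads returning the zero value ("").
func shortGuard(annotations map[string]string) bool {
	return annotations[countedKey] != "true"
}

func main() {
	counted := map[string]string{countedKey: "true"}

	fmt.Println(buggyGuard(nil), fixedGuard(nil), shortGuard(nil))
	fmt.Println(buggyGuard(counted), fixedGuard(counted), shortGuard(counted))
}
```

The explicit nil check in `fixedGuard` is redundant but harmless; it may be kept if the team finds it more readable.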
🧹 Nitpick comments (21)
commons/pkg/statemanager/statemanager_test.go (1)

508-508: Use consistent context creation.

This test uses context.TODO() while all other test functions in this file use context.Background() (lines 270, 282, 310, 341, 370, 399, 441). Maintain consistency across the test suite.

♻️ Proposed fix
-		ctx := context.TODO()
+		ctx := context.Background()
commons/pkg/statemanager/statemanager.go (1)

387-387: Prefer explicit nil return for clarity.

At this point in the code, err is guaranteed to be nil (any non-nil error would have triggered an early return at line 374). For clarity and explicitness, consider returning true, nil directly.

♻️ Proposed fix
-	return true, err
+	return true, nil
fault-remediation/pkg/remediation/deprecated_remediation.go (1)

506-517: Consider returning errors instead of swallowing them for proper retry handling.

Lines 506-517 swallow errors from log collector job timeout and failure by returning ctrl.Result{}, nil. While the comments indicate this is intentional to allow reconciliation to continue, completely swallowing errors prevents proper retry/backoff behavior and can mask issues.

Consider returning the error with a requeue request:

return ctrl.Result{Requeue: true, RequeueAfter: 5*time.Minute}, fmt.Errorf(...)

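This allows controller-runtime to apply rate-limited exponential backoff and proper error tracking while still continuing reconciliation. Note that controller-runtime ignores the returned ctrl.Result whenever the error is non-nil, so the RequeueAfter value above takes effect only if nil is returned in place of the error.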

fault-remediation/pkg/crstatus/crstatus_test.go (1)

120-212: Eliminate test duplication with a parameterized helper.

The entire test function TestCheckConditionCtrlRuntime duplicates TestCheckCondition with only the checker initialization differing. This violates DRY and creates a maintenance burden.

♻️ Refactor to use table-driven test with multiple checkers
-func TestCheckConditionCtrlRuntime(t *testing.T) {
-	cfg := &config.MaintenanceResource{
-		CompleteConditionType: "Completed",
-	}
-	checker := NewCtrlRuntimeCRStatusChecker(nil, cfg, false)
-
-	tests := []struct {
-		name     string
-		cr       *unstructured.Unstructured
-		expected bool
-	}{
-		// ... duplicate test cases ...
-	}
-
-	for _, tt := range tests {
-		t.Run(tt.name, func(t *testing.T) {
-			result := checker.checkCondition(tt.cr)
-			assert.Equal(t, tt.expected, result)
-		})
-	}
-}
+func TestCheckCondition(t *testing.T) {
+	cfg := &config.MaintenanceResource{
+		CompleteConditionType: "Completed",
+	}
+
+	checkers := []struct {
+		name    string
+		checker interface{ checkCondition(*unstructured.Unstructured) bool }
+	}{
+		{
+			name:    "K8s",
+			checker: NewCRStatusChecker(nil, nil, cfg, false),
+		},
+		{
+			name:    "CtrlRuntime",
+			checker: NewCtrlRuntimeCRStatusChecker(nil, cfg, false),
+		},
+	}
+
+	tests := []struct {
+		name     string
+		cr       *unstructured.Unstructured
+		expected bool
+	}{
+		// ... single set of test cases ...
+	}
+
+	for _, c := range checkers {
+		t.Run(c.name, func(t *testing.T) {
+			for _, tt := range tests {
+				t.Run(tt.name, func(t *testing.T) {
+					result := c.checker.checkCondition(tt.cr)
+					assert.Equal(t, tt.expected, result)
+				})
+			}
+		})
+	}
+}

Based on coding guidelines: write table-driven tests when testing multiple scenarios in Go.

fault-remediation/pkg/remediation/remediation_test.go (1)

114-126: Make test name more descriptive.

The test name "dry run" is less descriptive than recommended. Consider following the pattern from the past review comment suggestion.

📝 Suggested improvement
 		{
-			name:              "dry run",
+			name:              "Successful rebootnode creation - dry run",
 			nodeName:          "test-node-1",
 			dryRun:            true,
 			recommendedAction: protos.RecommendedAction_RESTART_BM,

Based on coding guidelines: name Go tests descriptively using the format TestFunctionName_Scenario_ExpectedBehavior.

fault-remediation/pkg/initializer/init.go (2)

89-115: Extract TemplateData construction to reduce duplication.

The TemplateData construction (lines 92-96 and 105-109) is duplicated in both initialization branches. This creates a maintenance burden if the template configuration structure changes.

♻️ Extract template data construction
+	templateData := remediation.TemplateData{
+		TemplateMountPath:   tomlConfig.Template.MountPath,
+		TemplateFileName:    tomlConfig.Template.FileName,
+		MaintenanceResource: tomlConfig.MaintenanceResource,
+	}
+
 	if params.UseCtrlRuntime {
 		remediationClient, err = remediation.NewCtrlRuntimeRemediationClient(
 			ctrlruntimeClient,
-			params.DryRun, remediation.TemplateData{
-				TemplateMountPath:   tomlConfig.Template.MountPath,
-				TemplateFileName:    tomlConfig.Template.FileName,
-				MaintenanceResource: tomlConfig.MaintenanceResource,
-			})
+			params.DryRun,
+			templateData)
 		if err != nil {
 			return nil, fmt.Errorf("error while initializing ctrl runtime client: %w", err)
 		}
 		stateManager = statemanager.NewCtrlRuntimeStateManager(ctrlruntimeClient)
 	} else {
 		remediationClient, clientSet, err = remediation.NewK8sClient(
 			params.KubeconfigPath,
 			params.DryRun,
-			remediation.TemplateData{
-				TemplateMountPath:   tomlConfig.Template.MountPath,
-				TemplateFileName:    tomlConfig.Template.FileName,
-				MaintenanceResource: tomlConfig.MaintenanceResource,
-			},
+			templateData,
 		)

117-117: Make log message more informative about initialization mode.

The generic log message doesn't indicate which client type was initialized, making it harder to debug initialization issues.

📝 Proposed improvement
-	slog.Info("Successfully initialized client")
+	if params.UseCtrlRuntime {
+		slog.Info("Successfully initialized ctrl-runtime remediation client")
+	} else {
+		slog.Info("Successfully initialized Kubernetes remediation client")
+	}
fault-remediation/pkg/remediation/remediation.go (1)

357-367: Error from annotation update causes early return without recording metrics.

If c.client.Update(ctx, updateJob) fails at line 364, the function returns false, err without recording the success metric. This means a successful job completion may not be tracked if the annotation update fails. Consider recording metrics before attempting the annotation update.

♻️ Proposed refactor
 	if completeCondition != nil && completeCondition.Status == metav1.ConditionTrue {
 		slog.Info("Log collector job completed successfully", "job", job.Name)
-		// Use job's actual duration instead of custom tracking
-		// reconciliation can be called multiple times so use annotation to make sure we're not duplicate recording metrics
-		if job.Annotations != nil && job.Annotations[jobMetricsAlreadyCountedAnnotation] != trueStringVal {
+		if job.Annotations == nil || job.Annotations[jobMetricsAlreadyCountedAnnotation] != trueStringVal {
+			duration := job.Status.CompletionTime.Sub(job.Status.StartTime.Time).Seconds()
+			metrics.LogCollectorJobs.WithLabelValues(nodeName, "success").Inc()
+			metrics.LogCollectorJobDuration.WithLabelValues(nodeName, "success").Observe(duration)
+
 			updateJob := job.DeepCopy()
 			if updateJob.Annotations == nil {
 				updateJob.Annotations = map[string]string{}
 			}
-
 			updateJob.Annotations[jobMetricsAlreadyCountedAnnotation] = trueStringVal
-
-			err := c.client.Update(ctx, updateJob)
-			if err != nil {
-				return false, err
+			if err := c.client.Update(ctx, updateJob); err != nil {
+				slog.Warn("Failed to mark job metrics as recorded", "job", job.Name, "error", err)
+				// Continue - metrics already recorded
 			}
-
-			duration := job.Status.CompletionTime.Sub(job.Status.StartTime.Time).Seconds()
-
-			metrics.LogCollectorJobs.WithLabelValues(nodeName, "success").Inc()
-			metrics.LogCollectorJobDuration.WithLabelValues(nodeName, "success").Observe(duration)
 		}
-
 		return true, nil
 	}
fault-remediation/pkg/remediation/deprecated_remediation_test.go (1)

17-40: Import grouping does not follow Go conventions.

The events import at line 19 is placed between standard library imports and third-party imports. Go convention is to group imports: standard library, then external packages, then internal packages.

♻️ Proposed fix
 import (
 	"context"
-	"github.com/nvidia/nvsentinel/fault-remediation/pkg/events"
 	"testing"
 	"text/template"

 	"github.com/google/uuid"
 	"github.com/stretchr/testify/assert"
 	corev1 "k8s.io/api/core/v1"
 	metameta "k8s.io/apimachinery/pkg/api/meta"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
 	"k8s.io/apimachinery/pkg/runtime/schema"
 	"k8s.io/client-go/discovery"
 	"k8s.io/client-go/discovery/cached/memory"
 	"k8s.io/client-go/dynamic"
 	"k8s.io/client-go/kubernetes/fake"
 	"k8s.io/client-go/rest"
 	"k8s.io/client-go/restmapper"

 	"github.com/nvidia/nvsentinel/data-models/pkg/model"
 	"github.com/nvidia/nvsentinel/data-models/pkg/protos"
 	"github.com/nvidia/nvsentinel/fault-remediation/pkg/config"
+	"github.com/nvidia/nvsentinel/fault-remediation/pkg/events"
 )
fault-remediation/pkg/reconciler/reconciler_test.go (1)

17-38: Import grouping does not follow Go conventions.

Imports are mixed: errors and testing/time are separated by internal package imports. Group standard library imports together, followed by external packages, then internal packages.

fault-remediation/pkg/reconciler/reconciler_e2e_test.go (2)

17-54: Import grouping does not follow Go conventions.

Imports are scattered with blank lines in unexpected places (lines 19, 23). Standard library, external, and internal packages should be in separate groups.


168-179: Consider consolidating package-level test variables.

Multiple package-level variables are declared across lines 168-179. Consider grouping related variables or using a test fixture struct to improve readability and make test setup/teardown clearer.

fault-remediation/pkg/reconciler/reconciler.go (9)

183-190: Consider error wrapping for context.

The error from RunLogCollectorJob is returned without additional context. As per coding guidelines, errors should be wrapped using fmt.Errorf("context: %w", err) to provide clarity about where the failure occurred.

📝 Proposed error wrapping
 	result, err := r.Config.RemediationClient.RunLogCollectorJob(ctx, healthEvent.NodeName, eventUID)
 	if err != nil {
 		slog.Error("Log collector job failed for node",
 			"node", healthEvent.NodeName,
 			"error", err)
 
-		return ctrl.Result{}, err
+		return ctrl.Result{}, fmt.Errorf("log collector job failed for node %s: %w", healthEvent.NodeName, err)
 	}

203-211: Add error context for label update failure.

The error from UpdateNVSentinelStateNodeLabel is returned without wrapping. Adding context will make debugging easier when this operation fails during the "remediating" state update.

📝 Proposed error wrapping
 	_, err := r.Config.StateManager.UpdateNVSentinelStateNodeLabel(ctx,
 		healthEventWithStatus.HealthEvent.NodeName,
 		statemanager.RemediatingLabelValue, false)
 	if err != nil {
 		slog.Error("Error updating node label to remediating", "error", err)
 		metrics.ProcessingErrors.WithLabelValues("label_update_error", nodeName).Inc()
 
-		return "", err
+		return "", fmt.Errorf("failed to update node label to remediating for node %s: %w", nodeName, err)
 	}

220-244: Good error handling pattern but consider adding more context.

The deferred state update pattern (attempting to set label to "failed" even when CR creation fails) is solid and aligns with the PR objective to handle errors properly. However, the returned errors lack context.

📝 Enhanced error wrapping for clarity
 	_, err = r.Config.StateManager.UpdateNVSentinelStateNodeLabel(ctx,
 		healthEventWithStatus.HealthEvent.NodeName,
 		remediationLabelValue, false)
 	if err != nil {
 		slog.Error("Error updating node label",
 			"label", remediationLabelValue,
 			"error", err)
 		metrics.ProcessingErrors.WithLabelValues("label_update_error", nodeName).Inc()
 
-		return "", errors.Join(createMaintenanceResourceError, err)
+		labelErr := fmt.Errorf("failed to update node label to %s for node %s: %w", remediationLabelValue, nodeName, err)
+		return "", errors.Join(createMaintenanceResourceError, labelErr)
 	}
 
 	if createMaintenanceResourceError != nil {
-		return "", createMaintenanceResourceError
+		return "", fmt.Errorf("failed to create maintenance resource for node %s: %w", nodeName, createMaintenanceResourceError)
 	}

259-264: Add error context for state clearing failure.

The error from ClearRemediationState should be wrapped with context per coding guidelines to aid debugging.

📝 Proposed error wrapping
 	if err := r.annotationManager.ClearRemediationState(ctx, nodeName); err != nil {
 		slog.Error("Failed to clear remediation state for node",
 			"node", nodeName,
 			"error", err)
 
-		return ctrl.Result{}, err
+		return ctrl.Result{}, fmt.Errorf("failed to clear remediation state for node %s: %w", nodeName, err)
 	}

267-272: Add error context for mark processed failure.

Per coding guidelines, wrap the error with context to clarify which operation failed.

📝 Proposed error wrapping
 	if err := watcherInstance.MarkProcessed(context.Background(), resumeToken); err != nil {
 		metrics.ProcessingErrors.WithLabelValues("mark_processed_error", nodeName).Inc()
 		slog.Error("Error updating resume token", "error", err)
 
-		return ctrl.Result{}, err
+		return ctrl.Result{}, fmt.Errorf("failed to mark cancellation event as processed for node %s: %w", nodeName, err)
 	}

328-344: Good error propagation for retry behavior.

The handling correctly propagates errors from runLogCollector and performRemediation to trigger retries, and uses errors.Join to combine multiple failures. The pattern of updating remediation status even when performRemediation fails is solid.

However, for consistency with coding guidelines, consider adding context to the final error returns to clarify which operation failed in the remediation flow.


291-296: Pattern: Multiple MarkProcessed errors lack context.

These four locations return errors from MarkProcessed or checkExistingCRStatus without wrapping. While the error propagation is correct for retry behavior, adding context would improve debuggability.

Consider wrapping these errors with context about the operation and node name, similar to the pattern suggested in other review comments.

Also applies to: 302-307, 316-321, 348-353


364-367: Add error context for document ID extraction.

For consistency with the error handling at line 385 and coding guidelines, wrap this error with context.

📝 Proposed error wrapping
 	documentID, err := utils.ExtractDocumentID(eventWithToken.Event)
 	if err != nil {
-		return err
+		return fmt.Errorf("failed to extract document ID from event: %w", err)
 	}

406-410: Add error context for annotation operations.

Both GetRemediationState and RemoveGroupFromState errors should be wrapped with context per coding guidelines.

📝 Proposed error wrapping
 	state, _, err := r.annotationManager.GetRemediationState(ctx, nodeName)
 	if err != nil {
 		slog.Error("Error getting remediation state", "node", nodeName, "error", err)
-		return true, "", err
+		return true, "", fmt.Errorf("failed to get remediation state for node %s: %w", nodeName, err)
 	}

 	// ... later ...

 	if err = r.annotationManager.RemoveGroupFromState(ctx, nodeName, group); err != nil {
 		slog.Error("Failed to remove CR from annotation", "error", err)
-		return false, "", err
+		return false, "", fmt.Errorf("failed to remove group %s from remediation state for node %s: %w", group, nodeName, err)
 	}

Also applies to: 438-441

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8788f69 and 1b3ff7f.

⛔ Files ignored due to path filters (1)
  • commons/go.sum is excluded by !**/*.sum
📒 Files selected for processing (26)
  • .gitignore
  • commons/go.mod
  • commons/pkg/statemanager/statemanager.go
  • commons/pkg/statemanager/statemanager_test.go
  • fault-remediation/main.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/annotation/deprecated_annotation.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/crstatus/deprecated_checker.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/fault_remediation_client_interface.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
  • fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
💤 Files with no reviewable changes (1)
  • fault-remediation/pkg/reconciler/fault_remediation_client_interface.go
✅ Files skipped from review due to trivial changes (1)
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • commons/go.mod
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/initializer/init.go
  • commons/pkg/statemanager/statemanager_test.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/main.go
  • fault-remediation/pkg/annotation/deprecated_annotation.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • fault-remediation/pkg/crstatus/deprecated_checker.go
  • commons/pkg/statemanager/statemanager.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • commons/pkg/statemanager/statemanager_test.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
🧠 Learnings (23)
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_interface.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/metrics/metrics.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/metrics/metrics.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • fault-remediation/pkg/initializer/init.go
  • commons/pkg/statemanager/statemanager_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • commons/pkg/statemanager/statemanager.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules

Applied to files:

  • fault-remediation/pkg/initializer/init.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • commons/pkg/statemanager/statemanager_test.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • commons/pkg/statemanager/statemanager_test.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • commons/pkg/statemanager/statemanager_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior

Applied to files:

  • fault-remediation/main.go
  • commons/pkg/statemanager/statemanager.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use structured logging via `log/slog` in Go code

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-10-29T12:40:29.621Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 143
File: fault-quarantine-module/pkg/informer/k8s_client.go:52-67
Timestamp: 2025-10-29T12:40:29.621Z
Learning: The clientcmd.BuildConfigFromFlags function in k8s.io/client-go/tools/clientcmd automatically handles in-cluster configuration as a fallback. When both masterUrl and kubeconfigPath parameters are empty strings, it internally attempts rest.InClusterConfig() before falling back to default config loading rules. No explicit in-cluster fallback logic is needed when using this function.

Applied to files:

  • fault-remediation/main.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
📚 Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use inline comments for complex logic only in Go code

Applied to files:

  • fault-remediation/pkg/metrics/metrics.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use separate informers for different Kubernetes resource types

Applied to files:

  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods

Applied to files:

  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
🧬 Code graph analysis (13)
fault-remediation/pkg/annotation/annotation.go (1)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (24-26)
  • AnnotationKey (12-12)
  • EquivalenceGroupState (29-32)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (3)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • NodeAnnotationManagerInterface (16-21)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
  • CRStatusCheckerInterface (9-11)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (18-24)
fault-remediation/pkg/initializer/init.go (3)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (2)
  • FaultRemediationClientInterface (29-34)
  • TemplateData (37-44)
fault-remediation/pkg/remediation/remediation.go (1)
  • NewCtrlRuntimeRemediationClient (47-95)
fault-remediation/pkg/config/config.go (2)
  • Template (27-30)
  • MaintenanceResource (18-24)
fault-remediation/pkg/crstatus/crstatus_test.go (2)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (18-24)
fault-remediation/pkg/crstatus/checker.go (1)
  • NewCtrlRuntimeCRStatusChecker (34-44)
fault-remediation/main.go (4)
fault-remediation/pkg/initializer/init.go (2)
  • InitializationParams (38-44)
  • InitializeAll (51-168)
fault-remediation/pkg/reconciler/reconciler.go (1)
  • FaultRemediationReconciler (61-69)
commons/pkg/auditlogger/roundtripper.go (1)
  • NewAuditingRoundTripper (42-47)
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
  • NewManager (41-43)
fault-remediation/pkg/annotation/deprecated_annotation.go (1)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (24-26)
  • AnnotationKey (12-12)
  • EquivalenceGroupState (29-32)
fault-remediation/pkg/remediation/remediation_test.go (5)
fault-remediation/pkg/remediation/remediation.go (2)
  • NewCtrlRuntimeRemediationClient (47-95)
  • CtrlRuntimeRemediationClient (38-45)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)
  • TemplateData (37-44)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (18-24)
data-models/pkg/protos/health_event.pb.go (1)
  • RecommendedAction_RESTART_BM (48-48)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventData (12-15)
fault-remediation/pkg/remediation/deprecated_remediation.go (11)
fault-remediation/pkg/config/config.go (1)
  • Template (27-30)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)
  • TemplateData (37-44)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • NodeAnnotationManagerInterface (16-21)
fault-remediation/pkg/crstatus/deprecated_checker.go (2)
  • CRStatusChecker (29-34)
  • NewCRStatusChecker (36-48)
fault-remediation/pkg/annotation/deprecated_annotation.go (1)
  • NewNodeAnnotationManager (38-42)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
  • CRStatusCheckerInterface (9-11)
fault-remediation/pkg/reconciler/reconciler_test.go (1)
  • CRStatusCheckerInterface (48-50)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventData (12-15)
platform-connectors/pkg/pipeline/factory.go (1)
  • Create (32-39)
fault-remediation/pkg/common/equivalence_groups.go (1)
  • GetRemediationGroupForAction (35-45)
fault-remediation/pkg/metrics/metrics.go (3)
  • LogCollectorErrors (86-92)
  • LogCollectorJobs (71-77)
  • LogCollectorJobDuration (78-85)
fault-remediation/pkg/crstatus/deprecated_checker.go (1)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (18-24)
fault-remediation/pkg/remediation/deprecated_remediation_test.go (4)
fault-remediation/pkg/config/config.go (1)
  • Template (27-30)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)
  • TemplateData (37-44)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventData (12-15)
fault-remediation/pkg/remediation/deprecated_remediation.go (1)
  • FaultRemediationClient (62-74)
fault-remediation/pkg/crstatus/checker.go (1)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (18-24)
fault-remediation/pkg/reconciler/reconciler_test.go (4)
fault-remediation/pkg/events/health_event.go (2)
  • HealthEventData (12-15)
  • HealthEventDoc (6-9)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • NodeAnnotationManagerInterface (16-21)
  • RemediationStateAnnotation (24-26)
  • EquivalenceGroupState (29-32)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
  • CRStatusCheckerInterface (9-11)
data-models/pkg/protos/health_event.pb.go (2)
  • RecommendedAction_RESTART_BM (48-48)
  • RecommendedAction_NONE (43-43)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (4)
fault-remediation/pkg/remediation/remediation.go (1)
  • NewCtrlRuntimeRemediationClient (47-95)
commons/pkg/statemanager/statemanager.go (4)
  • NewStateManager (209-213)
  • StateManager (200-203)
  • DrainSucceededLabelValue (170-170)
  • RemediatingLabelValue (174-174)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • AnnotationKey (12-12)
fault-remediation/pkg/metrics/metrics.go (6)
  • TotalEventsReceived (33-38)
  • EventHandlingDuration (62-68)
  • EventsProcessed (39-45)
  • CRStatusCreated (25-25)
  • CRStatusSkipped (26-26)
  • ProcessingErrors (46-52)
🪛 YAMLlint (1.37.1)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml

[error] 15-15: syntax error: expected , but found ''

(syntax)

🔇 Additional comments (38)
commons/pkg/statemanager/statemanager_test.go (1)

269-549: Comprehensive test coverage for controller-runtime state manager.

These tests provide excellent coverage of the controller-runtime state manager implementation, mirroring the existing client-go tests and ensuring both implementations behave consistently. The table-driven approach for state transition validation is particularly well-structured.

commons/pkg/statemanager/statemanager.go (2)

298-306: Clean constructor pattern for controller-runtime state manager.

The type declaration and constructor follow established patterns and properly return the StateManager interface, enabling flexible client selection.


310-388: Well-structured controller-runtime implementation.

The implementation properly mirrors the client-go version while adapting to controller-runtime patterns. The error handling, logging, and validation logic are consistent with the existing state manager. As discussed in previous reviews, relying on controller-runtime's reconciliation retry mechanism is an appropriate design choice for this context.

fault-remediation/pkg/remediation/templates/rebootnode-template.yaml (1)

15-15: YAMLlint error is a false positive for Go template syntax.

The yamllint syntax error on line 15 is expected because {{.ApiGroup}}/{{.Version}} is Go template syntax that will be rendered at runtime. This is not a valid concern.
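
To make the false positive concrete: a YAML parser reads an unquoted `{{` as the start of a flow mapping, so the raw template does not lint, while the rendered output does. The sketch below renders a hypothetical fragment with `text/template`. Only the field names `ApiGroup` and `Version` come from the review; the `kind` value, the struct, and the sample values are assumptions, and the real rebootnode-template.yaml is not reproduced here.

```go
package main

import (
	"strings"
	"text/template"
)

// templateData carries only the two fields the review mentions. The PR's
// actual TemplateData struct has more fields that are not shown here.
type templateData struct {
	ApiGroup string
	Version  string
}

// manifestTemplate is a hypothetical fragment, not the PR's template file.
const manifestTemplate = `apiVersion: {{.ApiGroup}}/{{.Version}}
kind: RebootNode
`

// renderManifest substitutes the placeholders and returns plain YAML text.
func renderManifest(data templateData) (string, error) {
	tmpl, err := template.New("manifest").Parse(manifestTemplate)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := tmpl.Execute(&out, data); err != nil {
		return "", err
	}
	return out.String(), nil
}
```

Linting the rendered string rather than the template source is one way to keep YAML validation in CI without tripping over the placeholders.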

fault-remediation/pkg/annotation/deprecated_annotation.go (1)

83-125: LGTM: GetRemediationState signature update is consistent.

The updated signature correctly returns the node alongside the state, and all return paths properly include the node value. Callers appropriately handle the extra return value.

fault-remediation/pkg/remediation/deprecated_remediation.go (2)

254-316: LGTM: AlreadyExists handling is well-structured.

The new handleCRCreateAlreadyExists helper properly handles the case where a CR already exists by updating the node annotation and logging appropriately. The separation into a helper method improves readability.


379-385: No issue found - constants are properly defined in the same package.

The constants logCollectorNodeLabel and logCollectorEventLabel are defined in remediation.go (lines 32-33) within the same package. In Go, package-level constants defined in one file are accessible from all other files in the same package without explicit imports. This code will compile successfully.

Likely an incorrect or invalid review comment.

fault-remediation/pkg/remediation/fault_remediation_client_interface.go (2)

29-34: LGTM: Clean interface design.

The FaultRemediationClientInterface provides a well-defined contract with appropriate method signatures for remediation operations, annotation management, and status checking.


36-44: LGTM: Well-structured TemplateData definition.

The TemplateData struct properly embeds config.MaintenanceResource and includes all necessary fields for template rendering and health event tracking.

fault-remediation/pkg/crstatus/crstatus_interface.go (1)

1-11: LGTM! Package documentation and interface are well-defined.

The package-level godoc has been added as requested in the previous review, and the interface follows Go conventions with a clear, focused contract.

fault-remediation/pkg/metrics/metrics.go (1)

15-93: LGTM! Metric exports and package reorganization are correct.

The package rename and metric variable exports follow Go conventions. The TODO comment at line 30 was already flagged in a previous review and is outside the scope of these changes.

fault-remediation/main.go (3)

122-136: LGTM!

The non-controller-runtime setup properly initializes components, sets up deferred cleanup, and handles the metrics server and event processing. The structure is clean and follows the expected patterns.


228-239: LGTM!

The controller-runtime setup correctly initializes components with the manager's client and properly defers cleanup. The wiring with mgr.GetClient() aligns with the controller-runtime pattern.


194-197: The current code is correct. rest.Config.Wrap() is a convenience method that modifies the config's transport wrapping in-place by composing wrappers onto WrapTransport. It does not return a value requiring reassignment. The approach used here is the idiomatic way to add HTTP transport wrappers in client-go, and it properly stacks multiple wrappers when needed.

fault-remediation/pkg/annotation/annotation.go (3)

15-25: LGTM!

The struct and constructor follow Go conventions and properly initialize the manager with the controller-runtime client.


106-131: LGTM!

ClearRemediationState properly handles the nil annotations case and uses the merge-from patch pattern correctly.


133-168: LGTM!

RemoveGroupFromState correctly handles the case where no groups remain by delegating to ClearRemediationState, and properly propagates errors.

fault-remediation/pkg/remediation/remediation.go (3)

46-95: LGTM!

The constructor properly validates template existence, parses the template, initializes dry-run mode, and sets up annotation manager and status checker. Error handling is thorough with context-wrapped errors.


105-200: LGTM!

CreateMaintenanceResource properly handles dry-run mode, creates owner references for garbage collection, handles IsAlreadyExists errors gracefully, and updates node annotation with the actual CR name. The previous duplicate annotation update issue has been addressed.


246-309: LGTM!

launchLogCollectorJob properly reads the manifest, sets labels for deduplication, handles the case of multiple existing jobs, and requeues after creation to check status later.

fault-remediation/pkg/remediation/deprecated_remediation_test.go (2)

347-369: LGTM!

Test correctly updated to use events.HealthEventData and handles the new (string, error) return signature from CreateMaintenanceResource.


396-418: LGTM!

Tests properly updated to use eventId parameter and handle the (ctrl.Result, error) return type from RunLogCollectorJob.

fault-remediation/pkg/crstatus/checker.go (3)

28-44: LGTM!

The struct and constructor properly initialize the status checker with the controller-runtime client and configuration.


58-69: LGTM!

The ObjectKey now correctly includes both Name and Namespace from the config, addressing the previous review concern about missing namespace for namespaced CRs.


71-105: LGTM!

The condition checking logic correctly traverses the unstructured status map, finds the configured condition type, and determines if the status is terminal.
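
A traversal of this kind can be sketched over plain `map[string]any` values, which is how unstructured Kubernetes objects are represented once decoded. The function names, the `status.conditions` layout with `type` and `status` keys, and the terminal values are assumptions made for illustration; the PR's checker.go is not reproduced here.

```go
package main

// findConditionStatus walks status.conditions in a decoded object and
// returns the "status" string of the condition whose "type" matches
// condType. The boolean is false when any level is missing or mistyped.
func findConditionStatus(obj map[string]any, condType string) (string, bool) {
	status, ok := obj["status"].(map[string]any)
	if !ok {
		return "", false
	}
	conditions, ok := status["conditions"].([]any)
	if !ok {
		return "", false
	}
	for _, c := range conditions {
		cond, ok := c.(map[string]any)
		if !ok {
			continue // skip entries that are not maps
		}
		if cond["type"] == condType {
			s, ok := cond["status"].(string)
			return s, ok
		}
	}
	return "", false
}

// isTerminal reports whether status is one of the configured terminal
// values. Which values count as terminal is a configuration choice.
func isTerminal(status string, terminal []string) bool {
	for _, t := range terminal {
		if status == t {
			return true
		}
	}
	return false
}
```

Every type assertion is checked, so a CR with no status yet, or with a malformed conditions list, yields "not found" instead of a panic.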

fault-remediation/pkg/reconciler/reconciler_test.go (3)

40-66: LGTM!

MockK8sClient properly updated with new method signatures using events.HealthEventData, ctrl.Result, and the correct interface types for annotation manager and status checker.


102-134: LGTM!

MockNodeAnnotationManager properly implements the updated interface with the new GetRemediationState signature returning (*annotation.RemediationStateAnnotation, *corev1.Node, error).


601-666: LGTM!

TestRunLogCollectorJobErrorScenarios is a well-structured table-driven test that covers success, failure, and requeue scenarios with proper assertions on the ctrl.Result return type.

fault-remediation/pkg/reconciler/reconciler_e2e_test.go (4)

208-217: Manager created but not started before using GetClient.

The manager is created at lines 208-213 and GetClient() is called at line 217, but the manager isn't started until the goroutine at lines 250-254. While controller-runtime allows this pattern, be aware that the client's cache won't be synced until the manager starts, which could cause timing issues in tests.


302-315: LGTM!

createTestRemediationClient properly uses remediation.NewCtrlRuntimeRemediationClient with the shared controller-runtime client and returns the interface type. Template data configuration is correct.


880-888: LGTM!

Metrics assertions properly use the dedicated metrics package symbols (metrics.TotalEventsReceived, metrics.EventsProcessed, etc.) and verify the expected behavior for CR creation and deduplication.


1312-1328: LGTM!

cleanupNodeAnnotations helper properly uses annotation.AnnotationKey for cleanup, maintaining consistency with the rest of the codebase.

fault-remediation/pkg/reconciler/reconciler.go (6)

25-28: LGTM: Clean interface-based refactoring.

The addition of new internal packages and the shift to interface-based dependencies (RemediationClient, annotationManager) improves testability and modularity. The exposed Config field and initialization flow are consistent.

Also applies to: 51-51, 66-67, 83-84


92-127: LGTM: Proper event parsing and routing.

The Reconcile method correctly records metrics, handles parse errors by marking them processed (preventing infinite retries on malformed events), and routes to appropriate handlers based on node quarantine status.


156-165: Verify: Should label update errors trigger retry?

The error from UpdateNVSentinelStateNodeLabel is logged and recorded in metrics but not propagated. Given the PR's goal to "throw errors to trigger retries", consider whether label update failures should prevent event processing and trigger a retry.

If the label update is critical for cluster state consistency, propagating the error might be appropriate. The current implementation could lead to events being skipped even when state updates fail.

Based on the PR objectives and the importance of label consistency, should this error be propagated to the caller?


383-386: LGTM: Proper error wrapping with context.

This correctly follows the coding guideline to wrap errors with context using fmt.Errorf("context: %w", err).


448-498: LGTM: Robust error handling with proper classification.

The method correctly:

  • Wraps errors with context (line 491)
  • Records metrics with appropriate error labels based on failure type
  • Marks malformed events as processed to prevent queue blocking
  • Returns structured HealthEventDoc type

503-584: LGTM: Clean utility methods.

The utility methods properly handle:

  • Watcher lifecycle management
  • Error aggregation in CloseAll using errors.Join
  • Controller-runtime integration with proper channel adaptation
  • Goroutine cleanup on context cancellation or channel closure

@ivelichkovich ivelichkovich force-pushed the remediationerrors branch 4 times, most recently from 6806866 to 560b1dd Compare January 9, 2026 00:34
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 17

Caution

Some comments are outside the diff and can't be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
fault-remediation/pkg/initializer/init.go (1)

38-43: Blocker: InitializationParams.config is unexported, likely breaking callers outside initializer.
If InitializeAll is called from another package (typical), InitializationParams{config: ...} won't compile. Rename/export it (e.g., RestConfig *rest.Config).

Proposed diff
 type InitializationParams struct {
-	config             *rest.Config
+	RestConfig         *rest.Config
 	TomlConfigPath     string
 	DryRun             bool
 	EnableLogCollector bool
 }
fault-remediation/pkg/crstatus/checker.go (1)

46-74: Incomplete comment and potential namespace handling for cluster-scoped resources.

  1. Line 50 has an incomplete comment (// note, if); it should be completed or removed.

  2. Line 67 uses resource.Namespace for the ObjectKey, but the config.MaintenanceResource struct has a Scope field that indicates whether the resource is cluster-scoped or namespaced. For cluster-scoped resources, ensure that Namespace is empty; otherwise, the lookup may fail silently. The current code logs and returns false (allowing creation), which provides a safe fallback, but you may want to explicitly handle scope.

Suggested fix for the incomplete comment
-	return true // note, if
+	return true // No configuration found; skip CR creation
🤖 Fix all issues with AI agents
In @.gitignore:
- Line 143: The .gitignore now contains a broad ".idea/" rule but still keeps
many specific .idea/* entries (e.g., ".idea/replstate.xml", ".idea/sonarlint/",
".idea/sonarIssues.xml"), causing redundancy and making the PR summary
inaccurate; fix by either removing all the specific .idea/* lines so the single
".idea/" entry covers them, or remove the general ".idea/" entry and keep the
explicit granular rules; apply the chosen approach consistently in the .gitignore
and update the PR summary to accurately reflect the consolidation or retained
granularity.

In @fault-remediation/main.go:
- Line 111: Replace the bare TODO comment "//TODO: setup informers for node and
job" with a TODO that references a tracked issue (e.g., include the issue number
or full issue URL) so it reads something like "// TODO: setup informers for node
and job β€” tracked in ISSUE-1234" (or a repo issue URL); ensure the comment
includes the canonical issue identifier per project guidelines and keep the same
intent text.

In @fault-remediation/pkg/annotation/annotation_test.go:
- Around line 146-168: The test TestClearRemediationState calls
AnnotationManager.ClearRemediationState and then immediately reassigns err when
calling client.Get, so the result of ClearRemediationState is never asserted;
after invoking annotationManager.ClearRemediationState(context.TODO(), nodeName)
capture and assert the call succeeded (e.g., assert.NoError(t, err) or
require.NoError(t, err)) before you reuse err for the client.Get call to ensure
failures in ClearRemediationState are detected.

In @fault-remediation/pkg/annotation/annotation.go:
- Around line 63-73: In UpdateRemediationState, the code assigns a default state
on the error path then immediately returns the error, making the assignment
dead/unused; either remove the unreachable assignment (delete the state =
&RemediationStateAnnotation{...} line) so the function simply logs and returns
the error, or if the intended behavior is to continue with a default state
instead of returning, initialize state and node to sane defaults and clear err
(do not return) so the rest of UpdateRemediationState can proceed; refer to
UpdateRemediationState, the local variables state/node/err and the call to
GetRemediationState to locate the change.
- Line 40: Replace the bare TODO comment in annotation.go ("TODO: maybe split
this up so it's not returning both node and state") with a TODO that references
the tracked issue ID or URL (e.g., "TODO: track in ISSUE-1234" or "TODO:
https://.../issues/1234") so the note points to a concrete ticket; update the
comment near the function or return logic that currently returns both node and
state (where the TODO is located) to include that issue reference.
- Around line 27-60: In GetRemediationState the JSON unmarshal error is
currently only logged and the function returns a nil error, hiding corrupt
annotations; change the handler for json.Unmarshal failure (in the block
referencing AnnotationKey and RemediationStateAnnotation) to return a wrapped
error (e.g. fmt.Errorf("failed to unmarshal remediation annotation for node %s:
%w", nodeName, err)) so the caller can retry/fail, or alternatively attempt to
clear the bad annotation via the client (m.client.Update) and return a clear
success/error outcome; ensure you return a non-nil error when unmarshal fails and
keep references to node and nodeName for context.

In @fault-remediation/pkg/initializer/init.go:
- Around line 82-97: The remediation client error message is misleading and the
kube client error is returned unwrapped; update the
remediation.NewRemediationClient error handling to return a clear, specific
message like "error initializing remediation client" (or similar) instead of
"ctrl runtime client", and wrap the kubernetes.NewForConfig error with context
using fmt.Errorf("error creating kube client: %w", err) so both failures provide
actionable context; adjust the return statements around
remediation.NewRemediationClient and kubernetes.NewForConfig accordingly in
init.go.

In @fault-remediation/pkg/reconciler/reconciler_test.go:
- Around line 49-51: Remove the locally defined CRStatusCheckerInterface type
declaration and replace any local usages with the imported
crstatus.CRStatusCheckerInterface; delete the type block "type
CRStatusCheckerInterface interface { IsSuccessful(ctx context.Context, crName
string) bool }" and ensure all references in the test (e.g., mock variables,
function signatures) use crstatus.CRStatusCheckerInterface, and if necessary
adjust imports to avoid unused-import or missing-symbol errors.

In @fault-remediation/pkg/remediation/fault_remediation_client_interface.go:
- Around line 29-35: Exported type FaultRemediationClientInterface lacks a godoc
comment; add a one-line Go doc comment immediately above the type declaration
that begins with "FaultRemediationClientInterface" and briefly describes its
purpose and role (e.g., that it defines methods for creating maintenance
resources, running log collector jobs, and providing access to annotation
manager, status checker, and config). Ensure the comment follows Go convention
(starts with the type name) and references the interface as a whole; leave the
existing method signatures unchanged.

In @fault-remediation/pkg/remediation/remediation_test.go:
- Around line 526-527: The test assertion uses assert.Equal with arguments
reversed; change the call in remediation_test.go from assert.Equal(t,
result.RequeueAfter, tt.requeueTime) to assert.Equal(t, tt.requeueTime,
result.RequeueAfter) so the expected value (tt.requeueTime) is first and the
actual (result.RequeueAfter) is second for clearer failure output.
- Around line 29-97: The table-driven test leaves tt.client nil which makes
TestNewCtrlRuntimeClient brittle; either populate tt.client with a real fake
controller-runtime client before calling NewRemediationClient or remove the
unused client field from the test cases. To fix, in TestNewCtrlRuntimeClient
initialize tt.client using controller-runtime's fake client builder (e.g.,
fake.NewClientBuilder().WithScheme(yourScheme).WithObjects(...).Build()) for
each subtest that needs a non-nil client and pass that into
NewRemediationClient, or simplify the table by deleting the client field and
always pass nil if the constructor is expected to accept nil.

In @fault-remediation/pkg/remediation/remediation.go:
- Around line 116-139: Validate and harden loadAndParseTemplate: ensure fileName
is a plain base name (no path separators or parent refs) and refuse values
containing "/" or "\" or ".." (or compare filepath.Base(fileName) == fileName),
then build the path and verify the resolved absolute path is inside the
mountPath root before reading; also set the template option to fail on missing
keys by calling tmpl = template.New(templateName).Option("missingkey=error")
prior to Parse so templates error on unknown data.
- Around line 43-114: Add proper godoc comments above the exported
FaultRemediationClient type and the NewRemediationClient function: for
FaultRemediationClient add a one-line summary describing its role (e.g.,
"FaultRemediationClient manages remediation templates, annotation manager and
status checking for remediation actions.") and for NewRemediationClient add a
summary plus brief parameter/return description (single sentence each)
explaining the client, dryRun flag, and remediationConfig parameters and what
the function returns (constructed *FaultRemediationClient or error). Ensure the
comments start immediately above the declarations and follow Go doc style (begin
with the symbol name).
- Around line 456-491: In FaultRemediationClient.checkLogCollectorComplete: fix
the annotation gating and nil-time panics by treating a missing annotations map
as "not counted" (i.e., always attempt to set jobMetricsAlreadyCountedAnnotation
when its value != trueStringVal), create or initialize updateJob.Annotations
before setting the flag, and perform nil checks for job.Status.StartTime and
job.Status.CompletionTime before computing duration (skip metrics or use safe
zero/early-return if either is nil); ensure you still call c.client.Update with
the patched annotation and only record metrics when duration is computed safely
and the annotation was not already true.
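
The loadAndParseTemplate hardening described above could look roughly like the sketch below. All function names here (`validateTemplateFileName`, `loadTemplateFile`, `renderStrict`) are hypothetical and the mount path handling is an assumption. It only illustrates the three ideas from the comment: accept plain base names only, confirm the resolved path stays under the mount root, and set `missingkey=error`.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"text/template"
)

// validateTemplateFileName accepts only a plain base name: no separators,
// no parent references, and not empty.
func validateTemplateFileName(fileName string) error {
	if fileName == "" || fileName == "." || fileName == ".." ||
		strings.ContainsAny(fileName, `/\`) || filepath.Base(fileName) != fileName {
		return fmt.Errorf("invalid template file name %q", fileName)
	}

	return nil
}

// loadTemplateFile reads fileName from mountPath after checking that the
// resolved path does not escape the mount root.
func loadTemplateFile(mountPath, fileName string) (string, error) {
	if err := validateTemplateFileName(fileName); err != nil {
		return "", err
	}

	root, err := filepath.Abs(mountPath)
	if err != nil {
		return "", fmt.Errorf("failed to resolve mount path: %w", err)
	}

	full := filepath.Join(root, fileName)

	rel, err := filepath.Rel(root, full)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("template %q resolves outside %q", fileName, root)
	}

	content, err := os.ReadFile(full)
	if err != nil {
		return "", fmt.Errorf("failed to read template %s: %w", fileName, err)
	}

	return string(content), nil
}

// renderStrict parses content with missingkey=error, so executing against a
// map that lacks a referenced key fails instead of printing "<no value>".
func renderStrict(content string, data map[string]string) (string, error) {
	tmpl, err := template.New("sketch").Option("missingkey=error").Parse(content)
	if err != nil {
		return "", err
	}

	var out strings.Builder
	if err := tmpl.Execute(&out, data); err != nil {
		return "", err
	}

	return out.String(), nil
}

func main() {
	fmt.Println(validateTemplateFileName("rebootnode-template.yaml"))
	fmt.Println(validateTemplateFileName("../secret.yaml"))

	_, err := renderStrict("{{.Missing}}", map[string]string{})
	fmt.Println("missing key is an error:", err != nil)
}
```

Note that `missingkey=error` only changes behavior for map lookups; a missing field on struct data already fails at execution time.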

In @fault-remediation/pkg/remediation/templates/rebootnode-template.yaml:
- Around line 15-20: The YAML scalars that begin with template substitutions
must be quoted to avoid invalid YAML; update the template to quote the
apiVersion and nodeName values so they become apiVersion:
"{{.ApiGroup}}/{{.Version}}" and spec.nodeName: "{{.NodeName}}". Locate the
apiVersion line and the spec nodeName line in the RebootNode template and wrap
the entire substitution (including the slash) in double quotes; keep other
fields (e.g., metadata.name) unchanged unless they also begin with a brace.
🧹 Nitpick comments (11)
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml (1)

104-108: Drop unnecessary quotes for consistency.
path: "/readyz" and port: "health" work, but quoting is inconsistent with livenessProbe and adds noise.

Proposed diff
           readinessProbe:
             httpGet:
-              path:  "/readyz"
-              port: "health"
+              path: /readyz
+              port: health
fault-remediation/pkg/remediation/remediation_test.go (1)

22-24: Consider envtest for higher-fidelity behavior where it matters.
Repo guidance suggests envtest over fake clients for Kubernetes controller behavior; these tests validate controller-runtime client flows, status, and Jobs, so fake-client semantics may miss real API behavior.

Also applies to: 243-246, 505-509

fault-remediation/pkg/annotation/annotation_test.go (2)

15-202: Align test names with the repo's naming convention.
Consider renaming to TestGetRemediationState_NodeNotFound_ReturnsError, etc., to match TestFunctionName_Scenario_ExpectedBehavior.


10-10: Consider envtest if these tests are intended to validate controller behavior.
Repo guidance prefers envtest over fake clients; if you’re relying on real API semantics for annotations/patches, fake client can diverge.

fault-remediation/main.go (1)

147-151: Cleanup runs with a cancelled ctx; use a fresh timeout context for CloseAll.
On SIGTERM, ctx is cancelled; reconciler.CloseAll(ctx) may be unable to close gracefully.

Proposed tweak
 defer func() {
-  if err := reconciler.CloseAll(ctx); err != nil {
+  shutdownCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+  defer cancel()
+  if err := reconciler.CloseAll(shutdownCtx); err != nil {
     slog.Error("failed to close datastore components", "error", err)
   }
 }()
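
A small runnable sketch of why the fresh context matters. `closeAll` is a hypothetical stand-in for reconciler.CloseAll that is assumed to honor context cancellation, and the 30-second timeout is simply the value from the proposed tweak.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// closeAll is a hypothetical stand-in for reconciler.CloseAll: it refuses to
// do any work once its context is cancelled.
func closeAll(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return fmt.Errorf("close aborted: %w", ctx.Err())
	default:
		return nil
	}
}

// cancelledCtxFails simulates SIGTERM: the run context is already cancelled
// when the deferred cleanup executes, so cleanup cannot proceed.
func cancelledCtxFails() bool {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()

	return closeAll(ctx) != nil
}

// freshCtxSucceeds runs the same cleanup with a new, bounded context that is
// independent of the cancelled run context.
func freshCtxSucceeds() bool {
	shutdownCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	return closeAll(shutdownCtx) == nil
}

func main() {
	fmt.Println("cleanup with cancelled ctx fails:", cancelledCtxFails())
	fmt.Println("cleanup with fresh ctx succeeds:", freshCtxSucceeds())
}
```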
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (3)

17-26: Non-standard import grouping.

The imports have inconsistent grouping with blank lines in unexpected places. Go convention groups imports into standard library, external packages, and internal packages, each separated by a single blank line.

Suggested import grouping
 import (
 	"context"
-
 	"log"
 	"os"
 	"path/filepath"
-	"sigs.k8s.io/controller-runtime/pkg/client"
 	"sync"
 	"testing"
 	"time"
+
+	"sigs.k8s.io/controller-runtime/pkg/client"
 	...

370-372: TODO comments should reference issues.

Per coding guidelines, TODO comments should reference GitHub issues. These TODOs indicate potential issues with state transition handling that should be tracked.

-	// TODO: ignoring error otherwise need to properly walk state transitions
-	_, _ = stateManager.UpdateNVSentinelStateNodeLabel(ctx, nodeName, statemanager.DrainSucceededLabelValue, false)
+	// TODO(#XXX): ignoring error otherwise need to properly walk state transitions
+	_, _ = stateManager.UpdateNVSentinelStateNodeLabel(ctx, nodeName, statemanager.DrainSucceededLabelValue, false)

1123-1125: Potential test flakiness with fixed sleep duration.

Using time.Sleep(500 * time.Millisecond) for synchronization can lead to flaky tests. Consider using assert.Eventually or require.Eventually with appropriate timeout and polling intervals for more reliable eventual consistency checks.
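
To illustrate the idea without pulling in testify, here is a standard-library sketch of the polling pattern that `assert.Eventually` provides. The `eventually` helper and the timings are assumptions chosen for the example, not code from the PR.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// eventually polls cond every tick until it returns true or timeout elapses.
// It returns as soon as the condition holds, instead of always waiting a
// fixed duration.
func eventually(cond func() bool, timeout, tick time.Duration) bool {
	deadline := time.Now().Add(timeout)

	for {
		if cond() {
			return true
		}

		if time.Now().After(deadline) {
			return false
		}

		time.Sleep(tick)
	}
}

// demoSuccess waits for a background goroutine to flip a flag.
func demoSuccess() bool {
	var done atomic.Bool

	go func() {
		time.Sleep(20 * time.Millisecond)
		done.Store(true)
	}()

	return eventually(done.Load, 2*time.Second, 5*time.Millisecond)
}

// demoTimeout shows the failure path: the condition never becomes true.
func demoTimeout() bool {
	return eventually(func() bool { return false }, 30*time.Millisecond, 5*time.Millisecond)
}

func main() {
	fmt.Println("condition met:", demoSuccess())
	fmt.Println("condition met after timeout:", demoTimeout())
}
```

Compared with a fixed sleep, the generous timeout only costs time when the test is actually failing.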

fault-remediation/pkg/reconciler/reconciler_test.go (2)

17-26: Non-standard import grouping.

Similar to the e2e test file, imports have inconsistent grouping. Standard library imports should be grouped together, followed by external packages, then internal packages.

Suggested import ordering
 import (
 	"context"
-	"errors"
 	"fmt"
-	"github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation"
-	"github.com/nvidia/nvsentinel/fault-remediation/pkg/events"
-	corev1 "k8s.io/api/core/v1"
-	ctrl "sigs.k8s.io/controller-runtime"
 	"testing"
 	"time"
+	"errors"
+
+	corev1 "k8s.io/api/core/v1"
+	ctrl "sigs.k8s.io/controller-runtime"
+	...
+
+	"github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation"
+	"github.com/nvidia/nvsentinel/fault-remediation/pkg/events"
 	...

120-135: MockNodeAnnotationManager always returns nil for Node.

The GetRemediationState mock always returns nil for the *corev1.Node return value. This is acceptable for current unit tests but may cause issues if tests need to verify node-related behavior. Consider documenting this limitation or adding a configurable node return value.

fault-remediation/pkg/reconciler/reconciler.go (1)

276-279: High cyclomatic complexity flagged by nolint.

The // nolint: cyclop // todo comment indicates this function has high complexity. While this is noted and not blocking, consider refactoring handleRemediationEvent in a follow-up to improve maintainability, for example by extracting the log collector, CR creation, and status update flows into separate helper methods.

Would you like me to open an issue to track the refactoring of handleRemediationEvent to reduce complexity?

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1b3ff7f and 6806866.

⛔ Files ignored due to path filters (1)
  • commons/go.sum is excluded by !**/*.sum
📒 Files selected for processing (29)
  • .gitignore
  • commons/go.mod
  • commons/pkg/statemanager/statemanager.go
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
  • distros/kubernetes/nvsentinel/values-full.yaml
  • distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
  • distros/kubernetes/nvsentinel/values-tilt.yaml
  • distros/kubernetes/nvsentinel/values.yaml
  • fault-remediation/main.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/annotation.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/remediation.go
  • fault-remediation/pkg/reconciler/remediation_test.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
  • fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
💤 Files with no reviewable changes (7)
  • distros/kubernetes/nvsentinel/values.yaml
  • distros/kubernetes/nvsentinel/values-full.yaml
  • fault-remediation/pkg/reconciler/remediation_test.go
  • distros/kubernetes/nvsentinel/values-tilt.yaml
  • distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
  • fault-remediation/pkg/reconciler/annotation.go
  • fault-remediation/pkg/reconciler/remediation.go
✅ Files skipped from review due to trivial changes (2)
  • commons/go.mod
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
🚧 Files skipped from review as they are similar to previous changes (5)
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • commons/pkg/statemanager/statemanager.go
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/main.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
🧠 Learnings (14)
📚 Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/annotation/annotation_interface.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_test.go
📚 Learning: 2025-10-29T12:40:29.621Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 143
File: fault-quarantine-module/pkg/informer/k8s_client.go:52-67
Timestamp: 2025-10-29T12:40:29.621Z
Learning: The clientcmd.BuildConfigFromFlags function in k8s.io/client-go/tools/clientcmd automatically handles in-cluster configuration as a fallback. When both masterUrl and kubeconfigPath parameters are empty strings, it internally attempts rest.InClusterConfig() before falling back to default config loading rules. No explicit in-cluster fallback logic is needed when using this function.

Applied to files:

  • fault-remediation/pkg/initializer/init.go
📚 Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
🧬 Code graph analysis (7)
fault-remediation/pkg/annotation/annotation_test.go (2)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (23-25)
  • AnnotationKey (11-11)
  • EquivalenceGroupState (28-34)
fault-remediation/pkg/annotation/annotation.go (1)
  • AnnotationManager (16-18)
fault-remediation/pkg/remediation/remediation_test.go (4)
fault-remediation/pkg/config/config.go (2)
  • Template (47-50)
  • MaintenanceResource (27-44)
data-models/pkg/protos/health_event.pb.go (1)
  • RecommendedAction_COMPONENT_RESET (44-44)
fault-remediation/pkg/remediation/remediation.go (2)
  • NewRemediationClient (56-114)
  • FaultRemediationClient (43-54)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventData (12-15)
fault-remediation/pkg/initializer/init.go (2)
fault-remediation/pkg/remediation/remediation.go (1)
  • NewRemediationClient (56-114)
commons/pkg/statemanager/statemanager.go (2)
  • NewStateManager (205-209)
  • StateManager (196-199)
fault-remediation/pkg/annotation/annotation.go (1)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (23-25)
  • AnnotationKey (11-11)
  • EquivalenceGroupState (28-34)
fault-remediation/main.go (3)
fault-remediation/pkg/initializer/init.go (2)
  • InitializationParams (38-43)
  • InitializeAll (49-148)
commons/pkg/auditlogger/roundtripper.go (1)
  • NewAuditingRoundTripper (42-47)
fault-remediation/pkg/reconciler/reconciler.go (1)
  • FaultRemediationReconciler (61-69)
fault-remediation/pkg/crstatus/checker.go (1)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (27-44)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (3)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventData (12-15)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • NodeAnnotationManagerInterface (15-20)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
  • CRStatusCheckerInterface (9-11)
🪛 YAMLlint (1.37.1)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml

[error] 15-15: syntax error: expected , but found ''

(syntax)

🔇 Additional comments (16)
.gitignore (1)

143-143: Verify scope: this change appears orthogonal to PR objectives.

The PR objectives describe retry behavior and fault-remediation system refactoring, but this change affects only .gitignore. Confirm that core remediation/retry logic changes are included elsewhere in the PR.

distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml (1)

96-111: Verify the app actually serves /readyz on the health port after this chart change.
This chart now hard-wires readiness to path: "/readyz" and port: "health" (Lines 106-107). If the binary doesn't expose /readyz on the health listener (or uses a different scheme/port), pods will flap.

fault-remediation/pkg/remediation/templates/rebootnode-template.yaml (1)

17-20: Confirm namespace handling for namespaced actions.
If RebootNode is namespaced and the client doesn't set namespace after rendering, omitting metadata.namespace will cause create failures (or default-namespace surprises). Consider adding namespace: "{{.Namespace}}" gated by scope, or ensure the code always injects namespace for namespaced actions.

fault-remediation/pkg/annotation/annotation_interface.go (1)

9-34: LGTM: clear, minimal public contract and JSON shape.
AnnotationKey constant + typed RemediationStateAnnotation / EquivalenceGroupState matches intended annotation payload.

fault-remediation/pkg/annotation/annotation.go (1)

62-105: and

fault-remediation/pkg/crstatus/checker.go (2)

76-90: LGTM: Condition checking logic is correct.

The checkCondition method properly extracts nested status/conditions from the unstructured object and delegates to findConditionStatus and isTerminal. The boolean logic correctly returns true (skip creation) when terminal conditions aren't met.


28-44: LGTM: Clean refactor to controller-runtime client.

The struct and constructor are properly updated to use client.Client from controller-runtime. This aligns with the PR's goal of consolidating on controller-runtime patterns.

fault-remediation/pkg/reconciler/reconciler_e2e_test.go (2)

208-217: LGTM: Proper controller-runtime manager setup for tests.

The manager is correctly initialized with the envtest config and metrics disabled (BindAddress: "0"), which is appropriate for test environments. Using envtest aligns with the coding guidelines.


302-329: LGTM: Test remediation client factory properly updated.

The createTestRemediationClient function correctly uses the new remediation.NewRemediationClient with controller-runtime client and returns the interface type. Template configuration is properly structured.

fault-remediation/pkg/reconciler/reconciler_test.go (2)

350-404: LGTM: Tests properly updated for new performRemediation signature.

The tests correctly handle the updated performRemediation signature that returns (string, error). Both success and failure paths are covered with appropriate assertions.


576-613: LGTM: Log collector test properly updated for new signature.

The test correctly uses the new RunLogCollectorJob signature returning (ctrl.Result, error) and properly verifies the conditional invocation based on EnableLogCollector and action type.

fault-remediation/pkg/reconciler/reconciler.go (5)

203-211: Early return on label update failure may leave state inconsistent.

If the label update to RemediatingLabelValue fails, the function returns immediately without attempting CR creation. This could leave the node in the previous state while the event is retried. Consider whether this is the desired behavior, or if the CR creation should still be attempted (with the label error logged but not blocking).

However, given the PR intent to "throw errors to trigger retries," this early return will cause controller-runtime to requeue and retry the entire operation, which is a valid approach for achieving consistency.

Verify that the retry behavior is acceptable for your use case. If label updates are flaky, this could cause repeated retries before CR creation succeeds.


218-245: LGTM: Error handling properly defers CR creation error for state update.

The code correctly captures createMaintenanceResourceError but continues to update the state label before returning. Using errors.Join to combine errors when both the CR creation and label update fail is the right approach. This ensures the caller receives all relevant error information for logging/debugging.
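
A minimal sketch of the pattern described here, assuming Go 1.20+ for `errors.Join`. `performRemediationSketch`, `fail`, and the sentinel errors are hypothetical names for illustration; the PR's performRemediation has a different signature.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errCreateCR    = errors.New("create maintenance resource failed")
	errUpdateLabel = errors.New("update state label failed")
)

// performRemediationSketch holds back the CR creation error, still attempts
// the label update, and reports every failure that occurred.
func performRemediationSketch(createCR, updateLabel func() error) error {
	createErr := createCR()
	labelErr := updateLabel() // attempted even when createErr != nil

	// errors.Join discards nil values and returns nil when all are nil.
	return errors.Join(createErr, labelErr)
}

// fail returns a step function that always yields err (which may be nil).
func fail(err error) func() error {
	return func() error { return err }
}

// bothFailuresVisible reports whether errors.Is can see each cause inside
// the joined error.
func bothFailuresVisible() bool {
	err := performRemediationSketch(fail(errCreateCR), fail(errUpdateLabel))

	return errors.Is(err, errCreateCR) && errors.Is(err, errUpdateLabel)
}

func main() {
	fmt.Println("both causes visible:", bothFailuresVisible())
	fmt.Println("no failures yields nil:", performRemediationSketch(fail(nil), fail(nil)) == nil)
}
```

Because the joined error is non-nil whenever either step fails, returning it from a reconcile path is enough to trigger a requeue.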


328-356: LGTM: Log collector and remediation flow properly chain errors.

The updated flow correctly:

  1. Returns early if log collector fails or requests requeue
  2. Captures performRemediationErr and uses it to determine nodeRemediatedStatus
  3. Uses errors.Join when both remediation and status update fail
  4. Returns the remediation error after status update attempt

This supports the PR goal of throwing errors to trigger retries.


418-459: Error handling in checkExistingCRStatus now returns errors instead of silently allowing.

The changes on lines 421 and 453-455 now return errors instead of silently allowing CR creation. This is a behavioral change:

  • Line 421: GetRemediationState failure now returns error (previously may have allowed creation)
  • Line 455: RemoveGroupFromState failure now returns (false, "", err) instead of allowing creation

This aligns with the PR intent to trigger retries on errors. However, ensure this doesn't cause infinite retry loops if these operations consistently fail for a node.

Verify that transient failures in annotation operations will eventually succeed on retry, and that permanent failures (e.g., node doesn't exist) are handled appropriately upstream.


66-87: LGTM: Public Config field and constructor updates.

The Config field is now exported (capitalized), and the constructor properly initializes both Config and annotationManager from the provided configuration. This aligns with the broader refactor to expose configuration for testing and external access.

Comment on lines +526 to +541
assert.Equal(t, result.RequeueAfter, tt.requeueTime)

⚠️ Potential issue | 🟡 Minor

Swap assert.Equal argument order (expected, actual).
assert.Equal(t, result.RequeueAfter, tt.requeueTime) should be assert.Equal(t, tt.requeueTime, result.RequeueAfter) for clearer failures.

Proposed diff
-			assert.Equal(t, result.RequeueAfter, tt.requeueTime)
+			assert.Equal(t, tt.requeueTime, result.RequeueAfter)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-			assert.Equal(t, result.RequeueAfter, tt.requeueTime)
+			assert.Equal(t, tt.requeueTime, result.RequeueAfter)
🤖 Prompt for AI Agents
In @fault-remediation/pkg/remediation/remediation_test.go around lines 526 -
527, The test assertion uses assert.Equal with arguments reversed; change the
call in remediation_test.go from assert.Equal(t, result.RequeueAfter,
tt.requeueTime) to assert.Equal(t, tt.requeueTime, result.RequeueAfter) so the
expected value (tt.requeueTime) is first and the actual (result.RequeueAfter) is
second for clearer failure output.

Comment on lines 43 to 129
type FaultRemediationClient struct {
	client     client.Client
	dryRunMode []string

	// Multi-template support
	remediationConfig config.TomlConfig
	templates         map[string]*template.Template // map from template file name to parsed template
	templateMountPath string

	annotationManager annotation.NodeAnnotationManagerInterface
	statusChecker     *crstatus.CRStatusChecker
}

func NewRemediationClient(
	client client.Client,
	dryRun bool,
	remediationConfig config.TomlConfig,
) (*FaultRemediationClient, error) {
	// Determine template mount path
	templateMountPath := remediationConfig.Template.MountPath
	if templateMountPath == "" {
		return nil, fmt.Errorf("template mount path is not configured")
	}

	// Pre-load and parse all templates
	templates := make(map[string]*template.Template)

	// Load templates for multi-template actions
	for actionName, maintenanceResource := range remediationConfig.RemediationActions {
		if maintenanceResource.TemplateFileName == "" {
			return nil, fmt.Errorf("remediation action %s is missing template file configuration", actionName)
		}

		tmpl, err := loadAndParseTemplate(templateMountPath, maintenanceResource.TemplateFileName, actionName)
		if err != nil {
			return nil, fmt.Errorf("failed to load template for action %s: %w", actionName, err)
		}

		templates[actionName] = tmpl
	}

	// Validate namespace configuration for namespaced resources
	for actionName, maintenanceResource := range remediationConfig.RemediationActions {
		if maintenanceResource.Scope == "Namespaced" && maintenanceResource.Namespace == "" {
			return nil, fmt.Errorf("remediation action %s is namespaced but missing namespace configuration", actionName)
		}
	}

	ctrlRuntimeRemediationClient := &FaultRemediationClient{
		client:            client,
		templates:         templates,
		templateMountPath: templateMountPath,
		remediationConfig: remediationConfig,
	}

	if dryRun {
		ctrlRuntimeRemediationClient.dryRunMode = []string{metav1.DryRunAll}
	} else {
		ctrlRuntimeRemediationClient.dryRunMode = []string{}
	}

	// Initialize annotation manager
	ctrlRuntimeRemediationClient.annotationManager = annotation.NewAnnotationManager(client)

	ctrlRuntimeRemediationClient.statusChecker = crstatus.NewCRStatusChecker(
		client,
		remediationConfig.RemediationActions,
		dryRun,
	)

	return ctrlRuntimeRemediationClient, nil
}

⚠️ Potential issue | 🟡 Minor

Add missing godoc for exported FaultRemediationClient and NewRemediationClient.
Both are exported but lack leading comments (per guidelines).

🤖 Prompt for AI Agents
In @fault-remediation/pkg/remediation/remediation.go around lines 43 - 114, Add
proper godoc comments above the exported FaultRemediationClient type and the
NewRemediationClient function: for FaultRemediationClient add a one-line summary
describing its role (e.g., "FaultRemediationClient manages remediation
templates, annotation manager and status checking for remediation actions.") and
for NewRemediationClient add a summary plus brief parameter/return description
(single sentence each) explaining the client, dryRun flag, and remediationConfig
parameters and what the function returns (constructed *FaultRemediationClient or
error). Ensure the comments start immediately above the declarations and follow
Go doc style (begin with the symbol name).

@ivelichkovich ivelichkovich force-pushed the remediationerrors branch 4 times, most recently from 4cea34a to aaf78c2 Compare January 9, 2026 01:03
@ivelichkovich ivelichkovich changed the title fix(remediation): retry on errors and throw errors to trigger retries fix(remediation): use ctrl-runtime business logic Jan 9, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@fault-remediation/pkg/annotation/annotation.go`:
- Around line 99-101: The error text is inconsistent with the operation:
m.client.Update(...) returns an error but the message says "failed to patch node
annotation"; update the error message to reflect the actual method (e.g., change
the fmt.Errorf call to "failed to update node annotation: %w") or alternatively
switch to using client.Patch if a patch is intended, ensuring the message and
method match (reference m.client.Update and the fmt.Errorf call).
- Around line 159-168: The code mutates the original node when preparing a
Patch: after creating patchBase := node.DeepCopy() the code sets
node.Annotations[AnnotationKey] = string(stateJSON) instead of applying changes
to a copy; change the logic to DeepCopy into updatedNode (or reuse patchBase),
ensure you initialize updatedNode.Annotations if nil, set
updatedNode.Annotations[AnnotationKey] = string(stateJSON), then call
m.client.Patch(ctx, updatedNode, client.MergeFrom(patchBase)) so the patch is
computed against the original snapshot (consistent with UpdateRemediationState).
♻️ Duplicate comments (21)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml (1)

15-20: Consider quoting template placeholders for YAML linter compatibility.

Static analysis flags this file due to unquoted Go template placeholders. While Go's text/template package will render this correctly, quoting the placeholders makes the template file itself parseable as YAML by linters.

♻️ Optional fix to satisfy YAML linters
-apiVersion: {{.ApiGroup}}/{{.Version}}
+apiVersion: "{{.ApiGroup}}/{{.Version}}"
 kind: RebootNode
 metadata:
-  name: maintenance-{{.NodeName}}-{{.HealthEventID}}
+  name: "maintenance-{{.NodeName}}-{{.HealthEventID}}"
 spec:
-  nodeName: {{.NodeName}}
+  nodeName: "{{.NodeName}}"
fault-remediation/pkg/remediation/remediation_test.go (3)

29-97: Initialize the fake client in TestNewRemediationClient.

The client field in the test struct (line 32) is never populated, so NewRemediationClient always receives nil. While the current implementation may tolerate this, it makes the test brittle against future changes that expect a non-nil client.

♻️ Proposed fix
 func TestNewRemediationClient(t *testing.T) {
 	tests := []struct {
 		name        string
-		client      client.Client
 		dryRun      bool
 		wantErr     bool
 		templateDir string
 	}{
 		// ... test cases ...
 	}

 	for _, tt := range tests {
 		t.Run(tt.name, func(t *testing.T) {
+			fakeClient := fake.NewClientBuilder().Build()
 			testConfig := config.TomlConfig{
 				// ...
 			}
-			result, err := NewRemediationClient(tt.client, tt.dryRun, testConfig)
+			result, err := NewRemediationClient(fakeClient, tt.dryRun, testConfig)

377-381: Dry-run test case missing templateDir will not test intended behavior.

This test case doesn't set templateDir. While it may pass because dry-run returns early, it doesn't verify that dry-run correctly skips job creation when the template is valid. Add templateDir: "templates" to properly test the dry-run skip scenario.

🔧 Proposed fix
 		{
 			name:          "Skip creation with dry run",
 			dryRun:        true,
+			templateDir:   "templates",
 			expectedError: false,
 		},

526-526: Swap assert.Equal argument order (expected, actual).

The testify convention is assert.Equal(t, expected, actual). Currently it's assert.Equal(t, result.RequeueAfter, tt.requeueTime) which reverses the order, producing confusing failure messages.

🔧 Proposed fix
-			assert.Equal(t, result.RequeueAfter, tt.requeueTime)
+			assert.Equal(t, tt.requeueTime, result.RequeueAfter)
fault-remediation/pkg/metrics/metrics.go (1)

30-30: TODO comment should reference an issue.

Per coding guidelines, TODO comments should reference issues in Go code. Please add an issue reference to this TODO.

Based on coding guidelines.

fault-remediation/main.go (1)

113-113: TODO comment should reference an issue.

Per coding guidelines, TODO comments should reference a GitHub issue for tracking.

Based on coding guidelines.

fault-remediation/pkg/annotation/annotation.go (1)

43-43: TODO should reference an issue.

Per coding guidelines, TODO comments should reference issues for tracking.

Based on coding guidelines.

fault-remediation/pkg/crstatus/checker.go (2)

34-44: Add godoc comment for exported constructor.

As per coding guidelines, function comments are required for all exported Go functions. The NewCRStatusChecker constructor is missing its godoc comment.

📝 Suggested godoc
+// NewCRStatusChecker creates a new CRStatusChecker with the provided client,
+// remediation action configuration, and dry-run mode setting.
 func NewCRStatusChecker(
 	client client.Client,
 	remediationActions map[string]config.MaintenanceResource,
 	dryRun bool,
 ) *CRStatusChecker {

Based on coding guidelines: Function comments required for all exported Go functions.


70-72: Consider differentiating "not found" from transient errors.

Currently, any error from client.Get allows CR creation by returning false. While this favors availability, transient network errors or permission issues could lead to duplicate CR creation attempts. Consider checking for NotFound specifically.

♻️ Suggested improvement
+import apierrors "k8s.io/apimachinery/pkg/api/errors"
 ...

 	if err := c.client.Get(ctx, key, obj); err != nil {
+		if apierrors.IsNotFound(err) {
+			slog.Debug("CR not found, allowing create", "crName", crName, "gvk", gvk.String())
+			return false
+		}
-		slog.Warn("Failed to get CR, allowing create", "crName", crName, "gvk", gvk.String(), "error", err)
+		slog.Error("Failed to get CR status, skipping create to avoid duplicates", "crName", crName, "gvk", gvk.String(), "error", err)
-		return false
+		return true // Skip creation on transient errors to avoid duplicates
 	}

This is a judgment call depending on your preference for availability vs. consistency. The current approach favors availability (create if unsure).

fault-remediation/pkg/remediation/remediation.go (7)

1-1: Add package-level godoc comment.

As per coding guidelines, package-level godoc is required for all Go packages.

📝 Suggested package documentation
+// Package remediation provides functionality for managing fault remediation workflows,
+// including maintenance resource creation, log collection, and remediation state tracking.
 package remediation

Based on coding guidelines: Package-level godoc required for all Go packages.


150-150: TODO comment should reference an issue.

As per coding guidelines, TODO comments should reference issues in Go code. The // nolint: cyclop // todo comment needs an issue reference.

Based on coding guidelines: TODO comments should reference issues.


387-392: Overwriting job labels discards manifest-defined labels.

Setting job.Labels = labels replaces any labels defined in the manifest template. Consider merging labels instead to preserve manifest defaults.

♻️ Suggested fix
 	labels := map[string]string{
 		logCollectorNodeLabel:  nodeName,
 		logCollectorEventLabel: eventUID,
 	}

-	job.Labels = labels
+	if job.Labels == nil {
+		job.Labels = make(map[string]string)
+	}
+	for k, v := range labels {
+		job.Labels[k] = v
+	}
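
The merge suggested above can be sketched on its own with plain maps. This is a minimal illustration, not the PR's code: `mergeLabels` and the label keys are hypothetical, and `maps.Copy` assumes Go 1.21 or newer.

```go
package main

import (
	"fmt"
	"maps"
)

// mergeLabels is a hypothetical helper (not from the PR). It returns a new map
// holding the manifest-defined labels overlaid with the controller's labels,
// so manifest defaults survive unless a controller label uses the same key.
func mergeLabels(manifest, controller map[string]string) map[string]string {
	merged := make(map[string]string, len(manifest)+len(controller))
	maps.Copy(merged, manifest)   // copying from a nil map is a no-op
	maps.Copy(merged, controller) // controller labels win on key conflicts
	return merged
}

func main() {
	// Label keys and values below are made up for illustration.
	manifest := map[string]string{"app": "log-collector", "tier": "ops"}
	controller := map[string]string{"example.com/node": "node-a", "tier": "debug"}
	fmt.Println(mergeLabels(manifest, controller))
}
```

Returning a fresh map rather than writing into the manifest's map leaves the caller's input untouched, which can matter if the parsed manifest is reused across reconciles.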

478-478: TODO comments should reference issues.

Multiple //nolint:nestif // todo comments need issue references as per coding guidelines.

Based on coding guidelines: TODO comments should reference issues.

Also applies to: 518-518, 571-571


117-140: Harden template loading against path traversal.

filepath.Join(mountPath, fileName) accepts ../ sequences from config, potentially allowing reads outside the intended directory. Validate that fileName is a base name without path separators.

🔒 Proposed fix
 func loadAndParseTemplate(mountPath, fileName, templateName string) (*template.Template, error) {
+	// Validate fileName is a plain base name (no path traversal)
+	if filepath.Base(fileName) != fileName || strings.Contains(fileName, "..") {
+		return nil, fmt.Errorf("invalid template file name: %s", fileName)
+	}
+
 	templatePath := filepath.Join(mountPath, fileName)
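
As a standalone sketch of the base-name rule described above, using only the standard library: `validateTemplateFileName` is a hypothetical helper name, not something in the PR, and the file names in `main` are examples.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// validateTemplateFileName is a hypothetical helper (not from the PR). It
// accepts only a plain base name, so joining the name onto the mount path
// cannot resolve to a location outside the mount directory.
func validateTemplateFileName(fileName string) error {
	// "." and ".." are their own base names, so they need an explicit check.
	if fileName == "" || fileName == "." || fileName == ".." {
		return fmt.Errorf("invalid template file name: %q", fileName)
	}
	// filepath.Base drops directory components; any difference from the
	// input means the input contained a path separator.
	if filepath.Base(fileName) != fileName {
		return fmt.Errorf("invalid template file name: %q", fileName)
	}
	return nil
}

func main() {
	names := []string{"rebootnode-template.yaml", "../secret.yaml", "sub/dir.yaml", ".."}
	for _, name := range names {
		fmt.Printf("%q -> %v\n", name, validateTemplateFileName(name))
	}
}
```

Checking the configured name before calling `filepath.Join` keeps the validation independent of how the mount path itself is written.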

214-216: Avoid logging rendered YAML at debug level - may leak secrets.

The rendered YAML could contain sensitive data embedded in templates (credentials, tokens). Consider removing this log or redacting sensitive fields.

♻️ Proposed fix
-	slog.Debug("Generated YAML from template",
-		"template", actionKey,
-		"yaml", yamlStr)
+	slog.Debug("Generated YAML from template",
+		"template", actionKey,
+		"yamlLength", len(yamlStr))

537-542: Missing nil check for job.Status.StartTime in failed job handling.

Lines 539 and 541 access job.Status.StartTime.Time without checking if StartTime is nil. A job could fail before starting (e.g., scheduling failure), causing a panic.

🐛 Suggested fix
 			var duration float64
-			if job.Status.CompletionTime != nil {
+			if job.Status.StartTime == nil {
+				duration = 0
+			} else if job.Status.CompletionTime != nil {
 				duration = job.Status.CompletionTime.Sub(job.Status.StartTime.Time).Seconds()
 			} else {
 				duration = time.Since(job.Status.StartTime.Time).Seconds()
 			}
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (2)

370-371: TODO comments should reference issues.

Multiple TODO comments throughout the test file lack issue references. As per coding guidelines, TODO comments should reference tracking issues.

Examples at lines 370, 432, 490, 521, 573.

Based on coding guidelines: TODO comments should reference issues.

Also applies to: 432-433, 490-492, 521-522, 573-575


302-328: Template path "./templates" is fragile for CI.

The relative path "./templates" depends on the working directory when go test runs, which can vary in CI environments. Consider resolving the path relative to the test file location.

♻️ Proposed fix using runtime.Caller
+import (
+  "runtime"
+  ...
+)

 func createTestRemediationClient(dryRun bool) (remediation.FaultRemediationClientInterface, error) {
+  _, thisFile, _, _ := runtime.Caller(0)
+  templatesDir := filepath.Join(filepath.Dir(thisFile), "templates")
+
   remediationConfig := config.TomlConfig{
     Template: config.Template{
-      MountPath: "./templates",
+      MountPath: templatesDir,
       FileName:  "rebootnode-template.yaml",
     },
fault-remediation/pkg/reconciler/reconciler.go (3)

66-67: Consider making Config field private.

The Config field was exported in this refactoring but appears to only be accessed within the package (excluding tests). If external access is not required, consider keeping it private (config) to reduce the public API surface. However, if tests in other packages need access, the current approach is acceptable.

#!/bin/bash
# Check if Config field is accessed outside the reconciler package (excluding test files)
echo "=== Config access outside reconciler package (excluding test files) ==="
rg '\.Config\.' --type go -g '!*_test.go' fault-remediation/pkg/ | grep -v 'pkg/reconciler/'

echo "=== Config access in main.go or initializer ==="
rg '\.Config' fault-remediation/main.go fault-remediation/pkg/initializer/

279-279: TODO comment should reference an issue.

The // nolint: cyclop // todo directive should reference a tracking issue per coding guidelines.

Based on coding guidelines: TODO comments should reference issues.


493-498: Error message is misleading.

Line 498 returns "error updating resume token: %w" but wraps err which is the parse error from eventutil.ParseHealthEventFromEvent, not a resume token error. The MarkProcessed error is logged but not returned.

🐛 Proposed fix
 		if markErr := watcherInstance.MarkProcessed(context.Background(), eventWithToken.ResumeToken); markErr != nil {
 			metrics.ProcessingErrors.WithLabelValues("mark_processed_error", "unknown").Inc()
 			slog.Error("Error updating resume token", "error", markErr)
 		}

-		return result, fmt.Errorf("error updating resume token: %w", err)
+		return result, fmt.Errorf("error parsing health event: %w", err)
🧹 Nitpick comments (7)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)

9-11: Add documentation and name interface method parameters.

The exported interface and method lack documentation, and the method parameters are unnamed, reducing API clarity. As per coding guidelines, function comments are required for all exported Go functions.

📝 Proposed documentation and parameter naming
+// CRStatusCheckerInterface determines whether a new Custom Resource should be created
+// based on the status of existing CRs.
 type CRStatusCheckerInterface interface {
-	ShouldSkipCRCreation(context.Context, string, string) bool
+	// ShouldSkipCRCreation checks if CR creation should be skipped for the given action and CR name.
+	// Returns true if creation should be skipped (e.g., CR already exists or is in progress), false otherwise.
+	ShouldSkipCRCreation(ctx context.Context, actionName string, crName string) bool
 }

Based on coding guidelines, function comments are required for all exported Go functions.

fault-remediation/pkg/annotation/annotation_test.go (1)

3-13: Consider grouping imports by category.

Imports are functional but could be organized into standard library, external, and internal groups for readability. This is a minor style consideration.

♻️ Optional import organization
 import (
 	"context"
 	"fmt"
-	"github.com/stretchr/testify/assert"
-	corev1 "k8s.io/api/core/v1"
-	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
-	"k8s.io/apimachinery/pkg/types"
-	"sigs.k8s.io/controller-runtime/pkg/client/fake"
 	"testing"
 	"time"
+
+	"github.com/stretchr/testify/assert"
+	corev1 "k8s.io/api/core/v1"
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/apimachinery/pkg/types"
+	"sigs.k8s.io/controller-runtime/pkg/client/fake"
 )
fault-remediation/main.go (1)

42-48: Move scheme declaration before init() for clarity.

The scheme variable is used in init() at lines 43-44 but declared after init() at line 48. Go allows this because package-level variables are initialized before any init() function runs, regardless of source order, but placing the declaration before init() improves readability.

♻️ Suggested reordering
+var scheme = runtime.NewScheme()
+
 func init() {
 	utilruntime.Must(corev1.AddToScheme(scheme))
 	utilruntime.Must(batchv1.AddToScheme(scheme))
 }

-var (
-	scheme = runtime.NewScheme()
+var (
 	// These variables will be populated during the build process
fault-remediation/pkg/initializer/init.go (1)

85-101: Consider a more descriptive log message.

The log message "Successfully initialized client" at line 101 is generic. Consider being more specific about what was initialized (e.g., "Successfully initialized remediation client and state manager").

♻️ Proposed fix
-	slog.Info("Successfully initialized client")
+	slog.Info("Successfully initialized remediation client and state manager")
fault-remediation/pkg/remediation/remediation.go (1)

406-412: Error message logs full job list which may be large.

Line 411 includes existingJobs.Items in the error message. For debugging, consider logging only job names/UIDs instead of the full object list.

♻️ Suggested fix
 	if len(existingJobs.Items) > 1 {
+		jobNames := make([]string, len(existingJobs.Items))
+		for i, j := range existingJobs.Items {
+			jobNames[i] = j.Name
+		}
 		return batchv1.Job{},
 			ctrl.Result{},
-			fmt.Errorf("expecting zero or one log collector job per event per node, found %v", existingJobs.Items)
+			fmt.Errorf("expecting zero or one log collector job per event per node, found %d: %v", len(existingJobs.Items), jobNames)
 	}
fault-remediation/pkg/reconciler/reconciler_test.go (2)

43-55: Mock ignores eventId parameter in RunLogCollectorJob.

The mock's runLogCollectorJobFn signature takes only nodeName but the interface method takes both nodeName and eventId. Line 54 ignores the eventId argument. For more accurate testing, consider updating the mock signature.

♻️ Suggested fix
 type MockK8sClient struct {
 	createMaintenanceResourceFn func(ctx context.Context, healthEventData *events.HealthEventData) (string, error)
-	runLogCollectorJobFn        func(ctx context.Context, nodeName string) (ctrl.Result, error)
+	runLogCollectorJobFn        func(ctx context.Context, nodeName string, eventId string) (ctrl.Result, error)
 	annotationManagerOverride   annotation.NodeAnnotationManagerInterface
 	realStatusChecker           crstatus.CRStatusCheckerInterface
 }
 ...
 func (m *MockK8sClient) RunLogCollectorJob(ctx context.Context, nodeName string, eventId string) (ctrl.Result, error) {
-	return m.runLogCollectorJobFn(ctx, nodeName)
+	return m.runLogCollectorJobFn(ctx, nodeName, eventId)
 }

85-110: Consider consolidating redundant mock wrapper types.

MockCRStatusChecker, TestCRStatusChecker, and MockCRStatusCheckerWrapper all implement similar functionality. Consider consolidating into a single mock type to reduce code duplication.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eaa87e1 and cae37c0.

⛔ Files ignored due to path filters (1)
  • commons/go.sum is excluded by !**/*.sum
📒 Files selected for processing (32)
  • .gitignore
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
  • distros/kubernetes/nvsentinel/values-full.yaml
  • distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
  • distros/kubernetes/nvsentinel/values-tilt.yaml
  • distros/kubernetes/nvsentinel/values.yaml
  • fault-remediation/go.mod
  • fault-remediation/main.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/annotation.go
  • fault-remediation/pkg/reconciler/annotation_test.go
  • fault-remediation/pkg/reconciler/fault_remediation_client_interface.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/remediation.go
  • fault-remediation/pkg/reconciler/remediation_test.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
  • fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
  • tilt/csp-api-mock/Tiltfile
💤 Files with no reviewable changes (9)
  • fault-remediation/pkg/reconciler/fault_remediation_client_interface.go
  • distros/kubernetes/nvsentinel/values.yaml
  • distros/kubernetes/nvsentinel/values-full.yaml
  • distros/kubernetes/nvsentinel/values-tilt.yaml
  • distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
  • fault-remediation/pkg/reconciler/annotation.go
  • fault-remediation/pkg/reconciler/annotation_test.go
  • fault-remediation/pkg/reconciler/remediation_test.go
  • fault-remediation/pkg/reconciler/remediation.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/events/health_event.go
  • .gitignore
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
🧰 Additional context used
📓 Path-based instructions (3)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/main.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/annotation/annotation_test.go
**/go.mod

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use go.mod for each service as a separate Go module with semantic import versioning

Files:

  • fault-remediation/go.mod
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/annotation/annotation_test.go
🧠 Learnings (32)
📓 Common learnings
Learnt from: ivelichkovich
Repo: NVIDIA/NVSentinel PR: 544
File: fault-remediation/pkg/annotation/annotation.go:131-166
Timestamp: 2026-01-15T18:23:48.147Z
Learning: In fault-remediation/pkg/annotation/annotation.go, the node annotation for remediation state is designed with the assumption that only one controller should be acting on a given equivalence group at a time. Concurrent modifications to the same part of the node annotation aren't expected in normal operation.
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/remediation/remediation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/crstatus/checker.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/main.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/crstatus/checker.go
📚 Learning: 2026-01-15T18:25:15.442Z
Learnt from: ivelichkovich
Repo: NVIDIA/NVSentinel PR: 544
File: fault-remediation/pkg/remediation/remediation.go:469-504
Timestamp: 2026-01-15T18:25:15.442Z
Learning: When handling Kubernetes Jobs, if batch/v1 JobComplete is true, Job.Status.StartTime is guaranteed to be non-nil by the API. Therefore, in remediation.go (and similar code paths), you can omit a nil check for StartTime when the Complete condition is true. Keep nil checks only for scenarios where StartTime may legitimately be absent (e.g., before the Job starts).

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/main.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/annotation/annotation_test.go
📚 Learning: 2026-01-06T21:31:36.113Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 490
File: janitor-provider/go.mod:70-70
Timestamp: 2026-01-06T21:31:36.113Z
Learning: In janitor-provider/go.mod, the dependency github.com/golang-jwt/jwt/v4 v4.5.1 is a transitive dependency from github.com/nebius/gosdk and cannot be directly upgraded without a replace directive or upstream fix in nebius/gosdk.

Applied to files:

  • fault-remediation/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning

Applied to files:

  • fault-remediation/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • fault-remediation/go.mod
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/crstatus/checker.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date

Applied to files:

  • fault-remediation/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use meaningful variable names such as `synced` over `ok` for cache sync checks

Applied to files:

  • fault-remediation/go.mod
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/annotation/annotation_test.go
📚 Learning: 2026-01-15T18:23:48.147Z
Learnt from: ivelichkovich
Repo: NVIDIA/NVSentinel PR: 544
File: fault-remediation/pkg/annotation/annotation.go:131-166
Timestamp: 2026-01-15T18:23:48.147Z
Learning: In fault-remediation/pkg/annotation/annotation.go, the node annotation for remediation state is designed with the assumption that only one controller should be acting on a given equivalence group at a time. Concurrent modifications to the same part of the node annotation aren't expected in normal operation.

Applied to files:

  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
📚 Learning: 2026-01-15T18:16:14.309Z
Learnt from: ivelichkovich
Repo: NVIDIA/NVSentinel PR: 544
File: fault-remediation/pkg/annotation/annotation.go:50-57
Timestamp: 2026-01-15T18:16:14.309Z
Learning: In fault-remediation/pkg/annotation/annotation.go, corrupt remediation state annotations are intentionally treated as empty state (returning an empty RemediationStateAnnotation) rather than returning an error. This graceful degradation prevents node workflows from getting stuck if manual actions or other issues corrupt the annotation. The unmarshal error is still logged for visibility.

Applied to files:

  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/annotation/annotation_test.go
📚 Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/annotation/annotation_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior

Applied to files:

  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-11-04T06:31:02.147Z
Learnt from: Gyan172004
Repo: NVIDIA/NVSentinel PR: 223
File: platform-connectors/pkg/nodemetadata/processor.go:0-0
Timestamp: 2025-11-04T06:31:02.147Z
Learning: In platform-connectors/pkg/nodemetadata/processor.go, the NewProcessor function does not perform a nil check on the config parameter because the caller is expected to guarantee a non-nil config is provided.

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-10-29T12:40:29.621Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 143
File: fault-quarantine-module/pkg/informer/k8s_client.go:52-67
Timestamp: 2025-10-29T12:40:29.621Z
Learning: The clientcmd.BuildConfigFromFlags function in k8s.io/client-go/tools/clientcmd automatically handles in-cluster configuration as a fallback. When both masterUrl and kubeconfigPath parameters are empty strings, it internally attempts rest.InClusterConfig() before falling back to default config loading rules. No explicit in-cluster fallback logic is needed when using this function.

Applied to files:

  • fault-remediation/main.go
  • fault-remediation/pkg/initializer/init.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code

Applied to files:

  • fault-remediation/pkg/initializer/init.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2026-01-14T02:33:07.679Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 689
File: janitor/pkg/controller/rebootnode_controller_test.go:371-436
Timestamp: 2026-01-14T02:33:07.679Z
Learning: In the NVSentinel janitor controller tests, tests that demonstrate original bugs or issues that were fixed by a PR should be kept for posterity, even if they reference removed functionality like MaxRebootRetries or RetryCount fields. These historical test cases serve as documentation of what problem was being solved.

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use inline comments for complex logic only in Go code

Applied to files:

  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/metrics/metrics.go
📚 Learning: 2026-01-12T05:13:24.947Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:347-372
Timestamp: 2026-01-12T05:13:24.947Z
Learning: In the NVSentinel codebase, all HealthEvent creation paths across health monitors (syslog-health-monitor, kubernetes-object-monitor, csp-health-monitor) consistently set GeneratedTimestamp to timestamppb.New(time.Now()), ensuring it is never nil in production flows.

Applied to files:

  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
📚 Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2026-01-09T18:55:38.501Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 490
File: distros/kubernetes/nvsentinel/charts/janitor-provider/templates/clusterrole.yaml:20-28
Timestamp: 2026-01-09T18:55:38.501Z
Learning: The janitor-provider gRPC service only requires get/list/watch permissions on nodes in its ClusterRole. It reads node metadata and delegates to CSP APIs. The janitor controller (separate component) performs actual Kubernetes node modifications including deletion and has its own RBAC configuration with appropriate write permissions.

Applied to files:

  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml
🧬 Code graph analysis (7)
fault-remediation/pkg/crstatus/crstatus_test.go (1)
fault-remediation/pkg/crstatus/checker.go (1)
  • NewCRStatusChecker (34-44)
fault-remediation/pkg/annotation/annotation.go (1)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (24-26)
  • AnnotationKey (12-12)
  • EquivalenceGroupState (29-35)
fault-remediation/pkg/initializer/init.go (3)
fault-remediation/pkg/remediation/remediation.go (1)
  • NewRemediationClient (57-115)
commons/pkg/statemanager/statemanager.go (2)
  • NewStateManager (206-210)
  • StateManager (197-200)
store-client/pkg/datastore/config.go (1)
  • LoadDatastoreConfig (27-44)
fault-remediation/pkg/remediation/remediation_test.go (3)
fault-remediation/pkg/config/config.go (2)
  • Template (47-50)
  • MaintenanceResource (27-44)
fault-remediation/pkg/remediation/remediation.go (2)
  • NewRemediationClient (57-115)
  • FaultRemediationClient (44-55)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventData (12-15)
fault-remediation/pkg/reconciler/reconciler.go (7)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)
  • FaultRemediationClientInterface (29-35)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • NodeAnnotationManagerInterface (16-21)
fault-remediation/pkg/metrics/metrics.go (5)
  • EventHandlingDuration (62-68)
  • TotalEventsReceived (33-38)
  • TotalUnsupportedRemediationActions (53-59)
  • ProcessingErrors (46-52)
  • EventsProcessed (39-45)
data-models/pkg/protos/health_event.pb.go (1)
  • RecommendedAction_NONE (96-96)
fault-remediation/pkg/events/health_event.go (2)
  • HealthEventDoc (6-9)
  • HealthEventData (12-15)
data-models/pkg/model/health_event_extentions.go (1)
  • HealthEventWithStatus (53-57)
store-client/pkg/utils/document_utils.go (1)
  • ExtractDocumentID (59-84)
fault-remediation/pkg/crstatus/checker.go (1)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (27-44)
fault-remediation/pkg/annotation/annotation_test.go (2)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (24-26)
  • AnnotationKey (12-12)
  • EquivalenceGroupState (29-35)
fault-remediation/pkg/annotation/annotation.go (1)
  • NodeAnnotationManager (19-21)
🪛 YAMLlint (1.37.1)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml

[error] 15-15: syntax error: expected , but found ''

(syntax)

🔇 Additional comments (27)
tilt/csp-api-mock/Tiltfile (1)

26-30: Port forward change is appropriate.

Remapping from 8080:8080 to 8081:8080 avoids conflicts with other services in local development. The container continues to listen on port 8080, and all in-cluster pod-to-pod references correctly use that port. No updates to other files are required.

fault-remediation/go.mod (1)

75-75: Confirm the dependency move is intentional and version alignment is correct.

The change successfully moves golang.org/x/sync from direct to indirect dependency. Verification confirms no direct imports of this package remain in the fault-remediation codebase, confirming this is appropriate cleanup after the controller-runtime refactoring.

However, note that v0.19.0 is a higher version number than the latest released stable version v0.17.0 (as of January 2026). This version discrepancy should be reviewed: v0.19.0 may be a pre-release or previously yanked version. Consider whether this version should be aligned with the latest stable release or if there is a specific reason for using a pre-release version.

⛔ Skipped due to learnings
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 490
File: janitor-provider/go.mod:70-70
Timestamp: 2026-01-06T21:31:36.113Z
Learning: In janitor-provider/go.mod, the dependency github.com/golang-jwt/jwt/v4 v4.5.1 is a transitive dependency from github.com/nebius/gosdk and cannot be directly upgraded without a replace directive or upstream fix in nebius/gosdk.
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use meaningful variable names such as `synced` over `ok` for cache sync checks
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml (3)

71-71: LGTM - watch verb added for nodes.

Adding watch permission for nodes aligns with the controller-runtime informer-based workflow introduced in this PR.


86-86: LGTM - update verb added for jobs.

Adding update permission for jobs resource is appropriate if the controller needs to modify job specifications or annotations.


39-52: The watch verb is not required for the remediation controller's RBAC configuration.

The remediation controller does not use controller-runtime informers to watch custom remediation CRs. Instead, it sources events from MongoDB change streams via a TypedChannel in SetupWithManager, then performs CRUD operations (create, get, list, update, patch) on the remediation resources themselves. The current RBAC verbs correctly reflect these actual API operations and do not need the watch verb.

Likely an incorrect or invalid review comment.

fault-remediation/pkg/crstatus/crstatus_test.go (2)

33-34: LGTM - Constructor signature updated correctly.

The constructor call is updated to match the new NewCRStatusChecker(client, remediationActions, dryRun) signature. Passing nil for the client is appropriate here since checkCondition operates on the provided *unstructured.Unstructured object directly without making client calls.


26-121: Well-structured table-driven tests.

The test covers comprehensive scenarios for condition checking: no status, condition true/false/unknown, and condition not found. The table-driven approach aligns with Go testing best practices.

fault-remediation/pkg/annotation/annotation_test.go (4)

15-120: Comprehensive table-driven tests for GetRemediationState.

The test covers key scenarios including node not found, missing annotation, corrupted annotation, and valid annotation parsing. Using Unix() for time comparison avoids nanosecond precision issues.


122-146: LGTM - UpdateRemediationState test.

Test correctly verifies state creation and retrieval with proper error assertions.


148-171: LGTM - ClearRemediationState test.

Test correctly verifies annotation removal with proper error assertions.


173-207: LGTM - RemoveGroupFromState test.

Test correctly verifies selective group removal while preserving other groups, with proper error assertions.

fault-remediation/pkg/remediation/remediation_test.go (2)

99-193: LGTM!

The E2E tests for missing template file scenarios are well-structured with proper temporary directory setup, clear test case isolation, and specific error message assertions.


195-344: LGTM!

The test function properly initializes the fake client with WithObjects, uses unique test case names, and correctly validates resource creation including dry-run behavior verification.

fault-remediation/pkg/metrics/metrics.go (1)

33-92: LGTM!

The metrics have been correctly exported with proper CamelCase naming conventions. The metric definitions (names, help texts, labels) remain consistent with their previous implementations while now being accessible from other packages.

fault-remediation/main.go (1)

105-175: LGTM on unified controller-runtime setup.

The setupCtrlRuntimeManagement function properly:

  • Configures the manager with leader election options
  • Wraps the HTTP transport with auditing
  • Sets up health/ready checks
  • Passes Config from manager to InitializationParams
  • Defers cleanup after successful initialization
fault-remediation/pkg/annotation/annotation.go (2)

1-28: LGTM on package structure and constructor.

Package-level godoc is present and the NodeAnnotationManager constructor is clean and follows Go conventions.


53-66: LGTM on graceful degradation for corrupt annotations.

The handling of unmarshal errors by returning an empty state rather than an error is intentional to prevent workflows from getting stuck. The nil map check at lines 62-64 properly guards against panics from JSON lacking the equivalenceGroups field.

Based on learnings about graceful degradation for corrupt annotations.
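
The graceful-degradation behavior described above can be sketched with the standard library alone. The types below are simplified, hypothetical stand-ins for RemediationStateAnnotation and EquivalenceGroupState (the `maintenanceCR` field name is an assumption), and `parseState` is not the repository's function.

```go
package main

import (
	"encoding/json"
	"log"
)

// groupState and remediationState are hypothetical, simplified stand-ins
// for the annotation types referenced in the review.
type groupState struct {
	MaintenanceCR string `json:"maintenanceCR"`
}

type remediationState struct {
	EquivalenceGroups map[string]groupState `json:"equivalenceGroups"`
}

// parseState never returns an error: a corrupt annotation is logged and
// treated as empty state so node workflows do not get stuck.
func parseState(raw string) remediationState {
	empty := remediationState{EquivalenceGroups: map[string]groupState{}}
	if raw == "" {
		return empty
	}

	var state remediationState
	if err := json.Unmarshal([]byte(raw), &state); err != nil {
		log.Printf("ignoring corrupt remediation annotation: %v", err)
		return empty
	}

	// Valid JSON may still omit the map; guard so later writes cannot panic.
	if state.EquivalenceGroups == nil {
		state.EquivalenceGroups = map[string]groupState{}
	}

	return state
}

func main() {
	inputs := []string{
		`{not json`,
		`{}`,
		`{"equivalenceGroups":{"restart":{"maintenanceCR":"cr-1"}}}`,
	}
	for _, raw := range inputs {
		state := parseState(raw)
		log.Printf("%d group(s) parsed", len(state.EquivalenceGroups))
	}
}
```

The trade-off is that a corrupted annotation silently resets remediation tracking for the node, which is why the unmarshal error is still logged.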

fault-remediation/pkg/initializer/init.go (2)

40-56: LGTM on updated initialization signature.

The InitializationParams struct correctly replaced KubeconfigPath with Config *rest.Config, and InitializeAll now properly accepts the controller-runtime client parameter.


135-151: LGTM on reconciler configuration wiring.

The ReconcilerConfig correctly wires the new RemediationClient and StateManager, and the reconciler is properly instantiated with all required dependencies.

fault-remediation/pkg/crstatus/checker.go (1)

78-92: LGTM - condition checking logic is sound.

The checkCondition method properly handles nested status/conditions extraction with appropriate error handling and fallback behavior.
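
A rough, dependency-free sketch of this kind of nested status/conditions lookup is shown below. It operates on a plain `map[string]any` instead of `*unstructured.Unstructured`, `conditionIsTrue` is a hypothetical name, and returning false when the status or condition is missing is an assumed fallback rather than the checker's documented behavior.

```go
package main

import "fmt"

// conditionIsTrue walks obj["status"]["conditions"] and reports whether the
// condition with the given type has status "True". Missing or malformed
// fields fall back to false (an assumption made for this sketch).
func conditionIsTrue(obj map[string]any, condType string) bool {
	status, ok := obj["status"].(map[string]any)
	if !ok {
		return false
	}

	conditions, ok := status["conditions"].([]any)
	if !ok {
		return false
	}

	for _, item := range conditions {
		cond, ok := item.(map[string]any)
		if !ok {
			continue
		}
		if cond["type"] == condType {
			return cond["status"] == "True"
		}
	}

	return false
}

func main() {
	obj := map[string]any{
		"status": map[string]any{
			"conditions": []any{
				map[string]any{"type": "Completed", "status": "True"},
			},
		},
	}
	fmt.Println(conditionIsTrue(obj, "Completed"))
	fmt.Println(conditionIsTrue(map[string]any{}, "Completed"))
}
```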

fault-remediation/pkg/remediation/remediation.go (1)

57-115: LGTM - Well-structured client initialization with comprehensive validation.

The constructor properly validates template mount path, loads and parses all templates upfront, validates namespace configuration for namespaced resources, and initializes annotation/status checking components.

fault-remediation/pkg/reconciler/reconciler_test.go (1)

200-244: LGTM - Table-driven tests properly structured.

Tests follow the recommended pattern with descriptive names and clear assertions.

fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)

731-939: LGTM - Comprehensive E2E test coverage.

The TestFullReconcilerWithMockedMongoDB_E2E test thoroughly covers:

  • Complete flow with event loop
  • CR creation and annotation verification
  • Deduplication behavior
  • Unquarantine event handling
  • Metrics verification

Good use of assert.Eventually for async operations.

fault-remediation/pkg/reconciler/reconciler.go (4)

171-194: LGTM - Log collector error handling correctly separates error and requeue cases.

The runLogCollector method now properly returns errors only when actual errors occur, and returns non-zero results for requeue without wrapping nil errors. This addresses the past review concern.


196-246: LGTM - Remediation flow properly handles CR creation success separately from label update failures.

The performRemediation method now correctly:

  1. Updates state to "remediating"
  2. Attempts CR creation and tracks success/failure
  3. Updates final state label
  4. Returns joined errors when both fail, but prioritizes CR creation success

This addresses the past review concern about FaultRemediated status being incorrectly set when CR creation succeeds but label update fails.


329-336: LGTM - Requeue handling for log collector is now correct.

The code properly separates error handling from requeue handling:

  • Returns error when err != nil
  • Returns result with nil error when !result.IsZero() (requeue case)

This fixes the previous issue of wrapping nil errors.


546-575: LGTM - Clean controller-runtime integration.

The SetupWithManager and AdaptEvents functions properly integrate the reconciler with controller-runtime:

  • Starts watcher stream with provided context
  • Uses typed channel source with proper event adaptation
  • Gracefully handles context cancellation and channel closure

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

@github-actions

Merging this branch changes the coverage (2 decrease, 2 increase)

| Impacted Packages | Coverage Δ | 🤖 |
|---|---|---|
| github.com/nvidia/nvsentinel/fault-remediation | 0.00% (ø) | |
| github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation | 32.12% (+32.12%) | 🌟 |
| github.com/nvidia/nvsentinel/fault-remediation/pkg/crstatus | 30.58% (-12.37%) | 💀 |
| github.com/nvidia/nvsentinel/fault-remediation/pkg/events | 0.00% (ø) | |
| github.com/nvidia/nvsentinel/fault-remediation/pkg/initializer | 0.00% (ø) | |
| github.com/nvidia/nvsentinel/fault-remediation/pkg/metrics | 0.00% (ø) | |
| github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler | 21.37% (-4.19%) | 👎 |
| github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation | 28.14% (+28.14%) | 🌟 |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
|---|---|---|---|---|---|
| github.com/nvidia/nvsentinel/fault-remediation/main.go | 0.00% (ø) | 433 (-8) | 0 | 433 (-8) | |
| github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation/annotation.go | 32.12% (+32.12%) | 358 (+358) | 115 (+115) | 243 (+243) | 🌟 |
| github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation/annotation_interface.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/nvidia/nvsentinel/fault-remediation/pkg/crstatus/checker.go | 30.58% (-12.37%) | 206 (+57) | 63 (-1) | 143 (+58) | 💀 |
| github.com/nvidia/nvsentinel/fault-remediation/pkg/crstatus/crstatus_interface.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/nvidia/nvsentinel/fault-remediation/pkg/events/health_event.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/nvidia/nvsentinel/fault-remediation/pkg/initializer/init.go | 0.00% (ø) | 248 (+88) | 0 | 248 (+88) | |
| github.com/nvidia/nvsentinel/fault-remediation/pkg/metrics/metrics.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/annotation.go | 0.00% (-24.29%) | 0 (-280) | 0 (-68) | 0 (-212) | 💀 💀 |
| github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/fault_remediation_client_interface.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler.go | 21.37% (-10.03%) | 1287 (+427) | 275 (+5) | 1012 (+422) | 💀 |
| github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/remediation.go | 0.00% (-20.48%) | 0 (-918) | 0 (-188) | 0 (-730) | 💀 💀 |
| github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation/fault_remediation_client_interface.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation/remediation.go | 28.14% (+28.14%) | 1272 (+1272) | 358 (+358) | 914 (+914) | 🌟 |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation/annotation_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/crstatus/crstatus_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/annotation_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/remediation_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation/remediation_test.go

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@fault-remediation/pkg/initializer/init.go`:
- Line 69: Update the error string returned by the initializer to use consistent
capitalization for the TOML format; replace "toml Config" in the fmt.Errorf call
(the return statement in init.go) with either "TOML config" or "TOML
configuration" so the message reads e.g. "error while loading the TOML config:
%w" for consistent, conventional casing.
♻️ Duplicate comments (21)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml (1)

15-20: Quote template placeholders to satisfy YAML static analysis.

The static analyzer flags a YAML syntax error because unquoted Go template placeholders are not valid YAML. While the template works correctly at runtime after Go processes it, quoting the placeholders makes the template file itself valid YAML and silences linter warnings.

Suggested fix
-apiVersion: {{.ApiGroup}}/{{.Version}}
+apiVersion: "{{.ApiGroup}}/{{.Version}}"
 kind: RebootNode
 metadata:
-  name: maintenance-{{.NodeName}}-{{.HealthEventID}}
+  name: "maintenance-{{.NodeName}}-{{.HealthEventID}}"
 spec:
-  nodeName: {{.NodeName}}
+  nodeName: "{{.NodeName}}"
fault-remediation/pkg/events/health_event.go (1)

15-15: Add package-level documentation.

The package lacks a godoc comment describing its purpose. As per coding guidelines, package-level godoc is required for all Go packages.

Proposed documentation
+// Package events provides data structures for representing health events
+// with support for different serialization formats (JSON and BSON).
 package events
fault-remediation/pkg/annotation/annotation.go (2)

113-115: Error message inconsistent with operation.

Line 113 uses m.client.Update but the error message says "failed to patch node annotation". For clarity, update the message to match the operation.

🔧 Proposed fix
 	if err = m.client.Update(ctx, updatedNode); err != nil {
-		return fmt.Errorf("failed to patch node annotation: %w", err)
+		return fmt.Errorf("failed to update node annotation: %w", err)
 	}

173-182: Bug: Mutating original node instead of a copy before patching.

Lines 173-178 create patchBase from node.DeepCopy(), then modify node.Annotations directly. The patch should modify a copy while keeping the original as the base. This is inconsistent with UpdateRemediationState which correctly modifies updatedNode.

🔧 Proposed fix
-	patchBase := node.DeepCopy()
-	if node.Annotations == nil {
-		node.Annotations = map[string]string{}
+	updatedNode := node.DeepCopy()
+	if updatedNode.Annotations == nil {
+		updatedNode.Annotations = map[string]string{}
 	}
 
-	node.Annotations[AnnotationKey] = string(stateJSON)
+	updatedNode.Annotations[AnnotationKey] = string(stateJSON)
 
-	if err = m.client.Patch(ctx, node, client.MergeFrom(patchBase)); err != nil {
+	if err = m.client.Patch(ctx, updatedNode, client.MergeFrom(node)); err != nil {
 		return fmt.Errorf("failed to patch node annotation: %w", err)
 	}
fault-remediation/pkg/remediation/remediation_test.go (2)

540-540: Swap assert.Equal arguments for clearer failure messages.

The testify convention is assert.Equal(t, expected, actual). Currently the arguments are reversed, which will produce confusing failure messages.

🔧 Proposed fix
-			assert.Equal(t, result.RequeueAfter, tt.requeueTime)
+			assert.Equal(t, tt.requeueTime, result.RequeueAfter)

378-395: Dry-run test case missing templateDir may not test intended behavior.

The "Skip creation with dry run" test case doesn't set templateDir. If RunLogCollectorJob checks for templates before dry-run logic, this test may fail for the wrong reason. Consider adding templateDir: "templates" to ensure the test exercises the dry-run skip path.

🔧 Proposed fix
 		{
 			name:          "Skip creation with dry run",
 			dryRun:        true,
+			templateDir:   "templates",
 			expectedError: false,
 		},
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (2)

15-15: Add package-level documentation.

The package declaration lacks a package-level godoc comment. As per coding guidelines, all Go packages require package-level documentation.

📝 Suggested package documentation
+// Package remediation provides interfaces and types for fault remediation operations,
+// including maintenance resource creation, log collection, and node annotation management.
 package remediation

Based on coding guidelines.


29-35: Add godoc for the exported interface.

The FaultRemediationClientInterface and its methods lack documentation. As per coding guidelines, function comments are required for all exported Go functions and interfaces.

📝 Suggested documentation
+// FaultRemediationClientInterface defines the contract for fault remediation operations,
+// including CR creation, log collection, and state management.
 type FaultRemediationClientInterface interface {
+	// CreateMaintenanceResource creates a maintenance CR for the given health event and returns the CR name.
 	CreateMaintenanceResource(ctx context.Context, healthEventData *events.HealthEventData) (string, error)
+	// RunLogCollectorJob orchestrates log collection for a node and event, returning a reconcile result.
 	RunLogCollectorJob(ctx context.Context, nodeName string, eventId string) (ctrl.Result, error)
+	// GetAnnotationManager returns the annotation manager for node remediation state tracking.
 	GetAnnotationManager() annotation.NodeAnnotationManagerInterface
+	// GetStatusChecker returns the CR status checker for determining CR creation eligibility.
 	GetStatusChecker() crstatus.CRStatusCheckerInterface
+	// GetConfig returns the remediation configuration.
 	GetConfig() *config.TomlConfig
 }

Based on coding guidelines.

fault-remediation/pkg/annotation/annotation_interface.go (1)

15-15: Package-level godoc missing.

While package-level godoc only needs to appear once per package, ensure one file in the annotation package (either this file or annotation.go) contains package-level documentation.

As per coding guidelines, package-level godoc is required for all Go packages.

fault-remediation/pkg/remediation/remediation.go (6)

15-15: Add package-level godoc comment.

As per coding guidelines, package-level godoc is required for all Go packages.

📝 Suggested package documentation
+// Package remediation provides functionality for managing fault remediation workflows,
+// including maintenance resource creation, log collection, and remediation state tracking.
 package remediation

Based on coding guidelines.


164-164: TODO comment should reference an issue.

As per coding guidelines, TODO comments should reference issues in Go code.

-// nolint: cyclop // todo
+// nolint: cyclop // TODO(#<issue-number>): refactor to reduce cyclomatic complexity

Based on coding guidelines.


492-492: TODO comment should reference an issue.

As per coding guidelines, TODO comments should reference issues in Go code.

Based on coding guidelines.


532-532: TODO comment should reference an issue.

As per coding guidelines, TODO comments should reference issues in Go code.

Based on coding guidelines.


585-585: TODO comment should reference an issue.

As per coding guidelines, TODO comments should reference issues in Go code.

Based on coding guidelines.


551-556: Add nil check for StartTime in failed job duration calculation.

For failed jobs, job.Status.StartTime may be nil if the job failed before starting (e.g., scheduling failure). The current code would panic at lines 553 and 555 in such cases. Based on learnings, StartTime is guaranteed only when JobComplete is true, not for failed jobs.

📝 Suggested fix
 			// Use job's actual duration for failed jobs too
 			var duration float64
-			if job.Status.CompletionTime != nil {
+			if job.Status.StartTime == nil {
+				duration = 0
+			} else if job.Status.CompletionTime != nil {
 				duration = job.Status.CompletionTime.Sub(job.Status.StartTime.Time).Seconds()
 			} else {
 				duration = time.Since(job.Status.StartTime.Time).Seconds()
 			}
fault-remediation/pkg/crstatus/checker.go (1)

34-44: Add godoc comment for exported constructor.

As per coding guidelines, function comments are required for all exported Go functions. Add a godoc comment describing the constructor's purpose and parameters.

📝 Suggested godoc
+// NewCRStatusChecker creates a new CRStatusChecker with the provided controller-runtime
+// client, remediation action configuration, and dry-run mode setting.
 func NewCRStatusChecker(
 	client client.Client,
 	remediationActions map[string]config.MaintenanceResource,

Based on coding guidelines.

fault-remediation/pkg/initializer/init.go (1)

94-97: Improve error message clarity.

The error message "error init kube client for state manager" is grammatically awkward. Consider a clearer phrasing.

📝 Suggested fix
-		return nil, fmt.Errorf("error init kube client for state manager: %w", err)
+		return nil, fmt.Errorf("failed to create kube client for state manager: %w", err)
fault-remediation/pkg/reconciler/reconciler.go (2)

66-67: Consider making Config field private.

The Config field is exported but based on past review comments, it's only accessed internally after initialization. Making it private (config instead of Config) would reduce the public API surface and follow Go conventions for unexported fields that aren't needed externally.

However, if external access is intentionally required for testing or other purposes, this can be deferred.


493-499: Misleading error message persists.

At line 498, the error message says "error updating resume token" but the wrapped error err is actually from eventutil.ParseHealthEventFromEvent. This creates confusion during debugging.

🐛 Proposed fix
 		if markErr := watcherInstance.MarkProcessed(context.Background(), eventWithToken.ResumeToken); markErr != nil {
 			metrics.ProcessingErrors.WithLabelValues("mark_processed_error", "unknown").Inc()
 			slog.Error("Error updating resume token", "error", markErr)
 		}

-		return result, fmt.Errorf("error updating resume token: %w", err)
+		return result, fmt.Errorf("error parsing health event: %w", err)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (2)

304-306: Template path may fail in CI environments.

The "./templates" relative path is sensitive to the working directory when go test is run. This can cause CI failures if tests are executed from a different directory.

♻️ Proposed fix using runtime.Caller
+import (
+	"path/filepath"
+	"runtime"
+)

 func createTestRemediationClient(dryRun bool) (remediation.FaultRemediationClientInterface, error) {
+	_, thisFile, _, _ := runtime.Caller(0)
+	templatesDir := filepath.Join(filepath.Dir(thisFile), "templates")
+
 	remediationConfig := config.TomlConfig{
 		Template: config.Template{
-			MountPath: "./templates",
+			MountPath: templatesDir,
 			FileName:  "rebootnode-template.yaml",
 		},
 		// ...
 	}

370-371: TODO comments should reference issues.

Multiple TODO comments throughout the test file lack issue references. Per coding guidelines, TODO comments should reference issues for tracking. Consider creating issues for:

  1. State transition handling (lines 370, 432, 490, 573)
  2. StateManager error behavior questions (lines 491, 521)

Example format: // TODO(#123): ignoring error - need to properly walk state transitions

🧹 Nitpick comments (14)
fault-remediation/pkg/events/health_event.go (1)

25-29: Clarify the godoc comment for HealthEventData.

The current comment "for compatibility" is vague. Consider specifying what compatibility this provides (e.g., MongoDB/BSON serialization).

Suggested improvement
-// HealthEventData represents health event data with string ID for compatibility
+// HealthEventData represents health event data with BSON tags for MongoDB storage.
 type HealthEventData struct {
 	ID                          string `bson:"_id,omitempty"`
 	model.HealthEventWithStatus `bson:",inline"`
 }
fault-remediation/pkg/remediation/remediation_test.go (1)

43-111: Initialize the fake client in test cases instead of leaving tt.client unused.

The test struct defines a client field at line 46 that is never populated. While NewRemediationClient may accept nil currently, this makes the test brittle. Either remove the unused field or initialize a fake client.

♻️ Proposed fix
 func TestNewRemediationClient(t *testing.T) {
 	tests := []struct {
 		name        string
-		client      client.Client
 		dryRun      bool
 		wantErr     bool
 		templateDir string
 	}{
 		// ... test cases ...
 	}

 	for _, tt := range tests {
 		t.Run(tt.name, func(t *testing.T) {
+			fakeClient := fake.NewClientBuilder().Build()
 			testConfig := config.TomlConfig{
 				// ...
 			}
-			result, err := NewRemediationClient(tt.client, tt.dryRun, testConfig)
+			result, err := NewRemediationClient(fakeClient, tt.dryRun, testConfig)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)

23-25: Add documentation and named parameters to the interface method.

The exported interface and method lack documentation, and the method parameters are unnamed which reduces API clarity. Per coding guidelines, function comments are required for all exported Go functions.

📝 Proposed improvement
+// CRStatusCheckerInterface determines whether a new Custom Resource should be created
+// based on the status of existing CRs.
 type CRStatusCheckerInterface interface {
-	ShouldSkipCRCreation(context.Context, string, string) bool
+	// ShouldSkipCRCreation checks if CR creation should be skipped for the given resource and node.
+	// Returns true if a CR already exists or creation should be skipped, false otherwise.
+	ShouldSkipCRCreation(ctx context.Context, resourceName string, nodeName string) bool
 }

Based on coding guidelines, function comments are required for all exported Go functions.

fault-remediation/main.go (1)

113-113: TODO comment should reference an issue.

Per coding guidelines, TODO comments should reference GitHub issues for tracking.

📝 Suggested change
-	//TODO: setup informers for node and job
+	// TODO(#<issue-number>): setup informers for node and job

Based on coding guidelines, TODO comments should reference issues in Go code.

fault-remediation/pkg/annotation/annotation.go (1)

57-57: TODO comment should reference an issue.

Per coding guidelines, TODO comments should reference GitHub issues for tracking.

📝 Suggested change
-	// TODO: maybe split this up so it's not returning both node and state
+	// TODO(#<issue-number>): consider splitting to separate node retrieval from state parsing

Based on coding guidelines, TODO comments should reference issues in Go code.

fault-remediation/pkg/initializer/init.go (1)

101-101: Update log message to be more specific.

The log message "Successfully initialized client" is generic. Since this follows the creation of both the remediation client and state manager, the message should reflect what was actually initialized.

📝 Suggested fix
-	slog.Info("Successfully initialized client")
+	slog.Info("Successfully initialized remediation client and state manager")
fault-remediation/pkg/annotation/annotation_interface.go (1)

29-35: Consider adding method-level documentation to the interface.

While the interface has a brief comment, the individual methods lack documentation. Adding short godoc comments for each method would improve API clarity for consumers.

📝 Suggested documentation
 // NodeAnnotationManagerInterface defines the interface for managing node annotations
 type NodeAnnotationManagerInterface interface {
+	// GetRemediationState retrieves the current remediation state annotation from the node.
 	GetRemediationState(ctx context.Context, nodeName string) (*RemediationStateAnnotation, *corev1.Node, error)
+	// UpdateRemediationState updates the remediation state for a specific equivalence group on the node.
 	UpdateRemediationState(ctx context.Context, nodeName string, group string, crName string, actionName string) error
+	// ClearRemediationState removes all remediation state from the node annotation.
 	ClearRemediationState(ctx context.Context, nodeName string) error
+	// RemoveGroupFromState removes a specific equivalence group from the node's remediation state.
 	RemoveGroupFromState(ctx context.Context, nodeName string, group string) error
 }
fault-remediation/pkg/remediation/remediation.go (4)

131-154: Consider validating template filename to prevent path traversal.

The loadAndParseTemplate function uses filepath.Join(mountPath, fileName) without validating that fileName doesn't contain path traversal sequences like ../. While the config is typically trusted, defense-in-depth suggests validating that the resolved path remains within the mount directory.

📝 Suggested validation
 func loadAndParseTemplate(mountPath, fileName, templateName string) (*template.Template, error) {
+	// Validate filename doesn't contain path traversal
+	if filepath.Base(fileName) != fileName {
+		return nil, fmt.Errorf("invalid template filename: %s", fileName)
+	}
+
 	templatePath := filepath.Join(mountPath, fileName)
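
One detail worth noting about the check above: `filepath.Base("..")` returns `".."`, and `filepath.Base("/")` returns `"/"`, so a comparison against `filepath.Base` alone lets those two values through. The sketch below is a slightly stricter variant under that observation. `validateTemplateFileName` is a hypothetical helper name, and rejecting backslashes on every platform is a conservative assumption rather than something the PR requires.

```go
package main

import (
	"fmt"
	"strings"
)

// validateTemplateFileName is a hypothetical helper. It accepts only a bare
// file name: not empty, not "." or "..", and free of path separators.
func validateTemplateFileName(fileName string) error {
	switch {
	case fileName == "", fileName == ".", fileName == "..":
		return fmt.Errorf("invalid template filename: %q", fileName)
	case strings.ContainsAny(fileName, `/\`):
		return fmt.Errorf("template filename must not contain path separators: %q", fileName)
	}
	return nil
}

func main() {
	for _, name := range []string{"rebootnode-template.yaml", "..", "../secret.yaml", "/"} {
		fmt.Printf("%q -> %v\n", name, validateTemplateFileName(name))
	}
}
```

On Go 1.20 or newer, `filepath.IsLocal` is another option when nested paths inside the mount directory should remain allowed.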

401-406: Merge labels instead of overwriting.

Setting job.Labels = labels replaces any labels defined in the manifest template. Consider merging labels to preserve manifest defaults.

📝 Suggested fix
 	labels := map[string]string{
 		logCollectorNodeLabel:  nodeName,
 		logCollectorEventLabel: eventUID,
 	}

-	job.Labels = labels
+	if job.Labels == nil {
+		job.Labels = make(map[string]string)
+	}
+	for k, v := range labels {
+		job.Labels[k] = v
+	}
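
The merge behavior in the suggested fix can be expressed as a small reusable function. This is a minimal sketch with hypothetical names: `mergeLabels` is not part of the PR, and the label keys in `main` are placeholders rather than the real values of `logCollectorNodeLabel` and `logCollectorEventLabel`.

```go
package main

import "fmt"

// mergeLabels is a hypothetical helper. It copies overrides into existing,
// allocating the map when it is nil, and returns the merged map. Keys present
// in both maps take the value from overrides.
func mergeLabels(existing, overrides map[string]string) map[string]string {
	if existing == nil {
		existing = make(map[string]string, len(overrides))
	}
	for k, v := range overrides {
		existing[k] = v
	}
	return existing
}

func main() {
	// Labels that a manifest template might already define (placeholder values).
	manifestLabels := map[string]string{"app": "log-collector"}

	merged := mergeLabels(manifestLabels, map[string]string{
		"example.com/node":  "node-1",
		"example.com/event": "event-123",
	})
	fmt.Println(len(merged), merged["app"], merged["example.com/node"])
}
```

On Go 1.21 or newer, `maps.Copy` from the standard library performs the same loop.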

58-69: Add godoc for exported FaultRemediationClient struct.

The exported struct lacks documentation. As per coding guidelines, function comments are required for all exported Go functions, and similar documentation is expected for exported types.

📝 Suggested documentation
+// FaultRemediationClient manages fault remediation workflows including maintenance
+// resource creation via templates, log collection, and node annotation state tracking.
 type FaultRemediationClient struct {

71-129: Add godoc for exported NewRemediationClient constructor.

The exported constructor lacks documentation describing its purpose and parameters.

📝 Suggested documentation
+// NewRemediationClient creates a new FaultRemediationClient with the provided controller-runtime
+// client, dry-run mode setting, and remediation configuration. It pre-loads and validates all
+// templates and initializes the annotation manager and status checker.
 func NewRemediationClient(
fault-remediation/pkg/crstatus/checker.go (1)

70-73: Consider differentiating NotFound from other errors.

Currently, any error from client.Get allows CR creation (returns false). This is appropriate for NotFound errors but transient network errors or permission issues could lead to duplicate CR creation attempts. The current approach favors availability over consistency.

This was previously flagged and the current behavior may be intentional. If so, consider adding a comment documenting the design decision.

📝 Optional: Add clarifying comment
 	if err := c.client.Get(ctx, key, obj); err != nil {
+		// Allow creation on any error (including transient errors) to favor availability.
+		// Duplicate CRs will be handled by the AlreadyExists check in the creation path.
 		slog.Warn("Failed to get CR, allowing create", "crName", crName, "gvk", gvk.String(), "error", err)
 		return false
 	}
fault-remediation/pkg/reconciler/reconciler_test.go (2)

53-55: Mock ignores eventId parameter.

The RunLogCollectorJob mock implementation ignores the eventId parameter entirely. While this simplifies test setup, it means tests won't catch bugs related to incorrect eventId propagation.

Consider capturing and validating the eventId in tests that care about it:

 func (m *MockK8sClient) RunLogCollectorJob(ctx context.Context, nodeName string, eventId string) (ctrl.Result, error) {
-	return m.runLogCollectorJobFn(ctx, nodeName)
+	return m.runLogCollectorJobFn(ctx, nodeName, eventId)
 }

Then update the mock function signature accordingly in tests where eventId validation matters.


636-643: Test case has copy-paste artifacts.

The "Log collector job return requeue" test case appears to have duplicated values from the previous case. The nodeName and description don't match the scenario being tested.

♻️ Suggested fix
 		{
 			name:           "Log collector job return requeue",
-			nodeName:       "test-node-fail",
+			nodeName:       "test-node-requeue",
 			jobResult:      false,
 			expectedResult: false,
-			description:    "Error path - job fails to complete",
+			description:    "Requeue path - job returns non-zero result",
 			returnedResult: ctrl.Result{RequeueAfter: 5 * time.Minute},
 		},
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cae37c0 and 4b221aa.

⛔ Files ignored due to path filters (1)
  • commons/go.sum is excluded by !**/*.sum
📒 Files selected for processing (31)
  • .gitignore
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
  • distros/kubernetes/nvsentinel/values-full.yaml
  • distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
  • distros/kubernetes/nvsentinel/values-tilt.yaml
  • distros/kubernetes/nvsentinel/values.yaml
  • fault-remediation/go.mod
  • fault-remediation/main.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/annotation.go
  • fault-remediation/pkg/reconciler/annotation_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/remediation.go
  • fault-remediation/pkg/reconciler/remediation_test.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
  • fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
  • tilt/csp-api-mock/Tiltfile
💤 Files with no reviewable changes (8)
  • distros/kubernetes/nvsentinel/values-full.yaml
  • fault-remediation/pkg/reconciler/remediation_test.go
  • fault-remediation/pkg/reconciler/annotation.go
  • distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
  • distros/kubernetes/nvsentinel/values.yaml
  • fault-remediation/pkg/reconciler/remediation.go
  • fault-remediation/pkg/reconciler/annotation_test.go
  • distros/kubernetes/nvsentinel/values-tilt.yaml
🚧 Files skipped from review as they are similar to previous changes (4)
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/go.mod
  • .gitignore
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/main.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
🧠 Learnings (29)
📓 Common learnings
Learnt from: ivelichkovich
Repo: NVIDIA/NVSentinel PR: 544
File: fault-remediation/pkg/annotation/annotation.go:131-166
Timestamp: 2026-01-15T18:23:48.147Z
Learning: In fault-remediation/pkg/annotation/annotation.go, the node annotation for remediation state is designed with the assumption that only one controller should be acting on a given equivalence group at a time. Concurrent modifications to the same part of the node annotation aren't expected in normal operation.
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2026-01-15T18:25:15.442Z
Learnt from: ivelichkovich
Repo: NVIDIA/NVSentinel PR: 544
File: fault-remediation/pkg/remediation/remediation.go:469-504
Timestamp: 2026-01-15T18:25:15.442Z
Learning: When handling Kubernetes Jobs, if batch/v1 JobComplete is true, Job.Status.StartTime is guaranteed to be non-nil by the API. Therefore, in remediation.go (and similar code paths), you can omit a nil check for StartTime when the Complete condition is true. Keep nil checks only for scenarios where StartTime may legitimately be absent (e.g., before the Job starts).

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/main.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2026-01-09T18:55:38.501Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 490
File: distros/kubernetes/nvsentinel/charts/janitor-provider/templates/clusterrole.yaml:20-28
Timestamp: 2026-01-09T18:55:38.501Z
Learning: The janitor-provider gRPC service only requires get/list/watch permissions on nodes in its ClusterRole. It reads node metadata and delegates to CSP APIs. The janitor controller (separate component) performs actual Kubernetes node modifications including deletion and has its own RBAC configuration with appropriate write permissions.

Applied to files:

  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml
📚 Learning: 2026-01-15T18:23:48.147Z
Learnt from: ivelichkovich
Repo: NVIDIA/NVSentinel PR: 544
File: fault-remediation/pkg/annotation/annotation.go:131-166
Timestamp: 2026-01-15T18:23:48.147Z
Learning: In fault-remediation/pkg/annotation/annotation.go, the node annotation for remediation state is designed with the assumption that only one controller should be acting on a given equivalence group at a time. Concurrent modifications to the same part of the node annotation aren't expected in normal operation.

Applied to files:

  • fault-remediation/main.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior

Applied to files:

  • fault-remediation/main.go
  • fault-remediation/pkg/annotation/annotation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code

Applied to files:

  • fault-remediation/main.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-11-04T06:31:02.147Z
Learnt from: Gyan172004
Repo: NVIDIA/NVSentinel PR: 223
File: platform-connectors/pkg/nodemetadata/processor.go:0-0
Timestamp: 2025-11-04T06:31:02.147Z
Learning: In platform-connectors/pkg/nodemetadata/processor.go, the NewProcessor function does not perform a nil check on the config parameter because the caller is expected to guarantee a non-nil config is provided.

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-10-29T12:40:29.621Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 143
File: fault-quarantine-module/pkg/informer/k8s_client.go:52-67
Timestamp: 2025-10-29T12:40:29.621Z
Learning: The clientcmd.BuildConfigFromFlags function in k8s.io/client-go/tools/clientcmd automatically handles in-cluster configuration as a fallback. When both masterUrl and kubeconfigPath parameters are empty strings, it internally attempts rest.InClusterConfig() before falling back to default config loading rules. No explicit in-cluster fallback logic is needed when using this function.

Applied to files:

  • fault-remediation/main.go
  • fault-remediation/pkg/initializer/init.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2026-01-14T02:33:07.679Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 689
File: janitor/pkg/controller/rebootnode_controller_test.go:371-436
Timestamp: 2026-01-14T02:33:07.679Z
Learning: In the NVSentinel janitor controller tests, tests that demonstrate original bugs or issues that were fixed by a PR should be kept for posterity, even if they reference removed functionality like MaxRebootRetries or RetryCount fields. These historical test cases serve as documentation of what problem was being solved.

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2026-01-15T18:16:14.309Z
Learnt from: ivelichkovich
Repo: NVIDIA/NVSentinel PR: 544
File: fault-remediation/pkg/annotation/annotation.go:50-57
Timestamp: 2026-01-15T18:16:14.309Z
Learning: In fault-remediation/pkg/annotation/annotation.go, corrupt remediation state annotations are intentionally treated as empty state (returning an empty RemediationStateAnnotation) rather than returning an error. This graceful degradation prevents node workflows from getting stuck if manual actions or other issues corrupt the annotation. The unmarshal error is still logged for visibility.

Applied to files:

  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions

Applied to files:

  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages

Applied to files:

  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use inline comments for complex logic only in Go code

Applied to files:

  • fault-remediation/pkg/remediation/remediation.go
📚 Learning: 2026-01-12T05:13:24.947Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:347-372
Timestamp: 2026-01-12T05:13:24.947Z
Learning: In the NVSentinel codebase, all HealthEvent creation paths across health monitors (syslog-health-monitor, kubernetes-object-monitor, csp-health-monitor) consistently set GeneratedTimestamp to timestamppb.New(time.Now()), ensuring it is never nil in production flows.

Applied to files:

  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code

Applied to files:

  • fault-remediation/pkg/initializer/init.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.

Applied to files:

  • fault-remediation/pkg/events/health_event.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
🧬 Code graph analysis (8)
fault-remediation/pkg/crstatus/crstatus_test.go (1)
fault-remediation/pkg/crstatus/checker.go (1)
  • NewCRStatusChecker (34-44)
fault-remediation/main.go (3)
commons/pkg/auditlogger/roundtripper.go (1)
  • NewAuditingRoundTripper (42-47)
fault-remediation/pkg/initializer/init.go (2)
  • InitializationParams (40-45)
  • InitializeAll (52-152)
fault-remediation/pkg/reconciler/reconciler.go (1)
  • FaultRemediationReconciler (61-69)
fault-remediation/pkg/remediation/remediation_test.go (3)
fault-remediation/pkg/config/config.go (2)
  • Template (47-50)
  • MaintenanceResource (27-44)
data-models/pkg/protos/health_event.pb.go (2)
  • RecommendedAction_RESTART_BM (101-101)
  • RecommendedAction_COMPONENT_RESET (97-97)
fault-remediation/pkg/remediation/remediation.go (2)
  • NewRemediationClient (71-129)
  • FaultRemediationClient (58-69)
fault-remediation/pkg/remediation/remediation.go (9)
fault-remediation/pkg/config/config.go (2)
  • Template (47-50)
  • MaintenanceResource (27-44)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • NodeAnnotationManagerInterface (30-35)
fault-remediation/pkg/crstatus/checker.go (1)
  • CRStatusChecker (28-32)
fault-remediation/pkg/annotation/annotation.go (1)
  • NewNodeAnnotationManager (38-42)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
  • CRStatusCheckerInterface (23-25)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventData (26-29)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)
  • TemplateData (38-52)
fault-remediation/pkg/common/equivalence_groups.go (1)
  • GetRemediationGroupForAction (46-56)
fault-remediation/pkg/metrics/metrics.go (2)
  • LogCollectorJobs (71-77)
  • LogCollectorJobDuration (78-85)
fault-remediation/pkg/initializer/init.go (3)
fault-remediation/pkg/remediation/remediation.go (1)
  • NewRemediationClient (71-129)
commons/pkg/statemanager/statemanager.go (2)
  • NewStateManager (206-210)
  • StateManager (197-200)
store-client/pkg/datastore/config.go (1)
  • LoadDatastoreConfig (27-44)
fault-remediation/pkg/annotation/annotation.go (1)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (38-40)
  • AnnotationKey (26-26)
  • EquivalenceGroupState (43-49)
fault-remediation/pkg/crstatus/checker.go (1)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (27-44)
fault-remediation/pkg/annotation/annotation_test.go (2)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (38-40)
  • AnnotationKey (26-26)
  • EquivalenceGroupState (43-49)
fault-remediation/pkg/annotation/annotation.go (1)
  • NodeAnnotationManager (33-35)
🪛 YAMLlint (1.37.1)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml

[error] 15-15: syntax error: expected , but found ''

(syntax)

 	var tomlConfig config.TomlConfig
 	if err := configmanager.LoadTOMLConfig(params.TomlConfigPath, &tomlConfig); err != nil {
-		return nil, fmt.Errorf("error while loading the toml config: %w", err)
+		return nil, fmt.Errorf("error while loading the toml Config: %w", err)

⚠️ Potential issue | 🟑 Minor

Minor: Inconsistent capitalization in error message.

The error message uses "toml Config" with capital C, which is inconsistent with typical error message conventions.

📝 Suggested fix
-		return nil, fmt.Errorf("error while loading the toml Config: %w", err)
+		return nil, fmt.Errorf("error while loading the toml config: %w", err)
🤖 Prompt for AI Agents
In `@fault-remediation/pkg/initializer/init.go` at line 69, Update the error
string returned by the initializer to use consistent capitalization for the TOML
format; replace "toml Config" in the fmt.Errorf call (the return statement in
init.go) with either "TOML config" or "TOML configuration" so the message reads
e.g. "error while loading the TOML config: %w" for consistent, conventional
casing.

@lalitadithya
Collaborator

/ok to test 4b221aa

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@fault-remediation/pkg/reconciler/reconciler_test.go`:
- Around line 636-643: The test case "Log collector job return requeue" in
reconciler_test.go duplicates the previous case's nodeName and description which
hurts clarity; update the test case values in the table (the struct literal used
for the table-driven tests) – change nodeName from "test-node-fail" to a unique
identifier like "test-node-requeue" and change description from "Error path -
job fails to complete" to something unique like "Error path - job returns
requeue" so the case is distinct when running tests and in reports; locate the
table entries in the test function (the slice of test structs used by Test... in
reconciler_test.go) and edit the fields for that specific case.

In `@fault-remediation/pkg/remediation/remediation_test.go`:
- Around line 43-111: TestNewRemediationClient declares tt.client but never sets
it, passing nil into NewRemediationClient and making the test brittle; create a
real fake controller-runtime client and assign it to tt.client before calling
NewRemediationClient. Specifically, in the test setup for
TestNewRemediationClient register the needed API types into a scheme (the same
scheme NewRemediationClient expects), build a fake client (e.g., with
controller-runtime's fake client builder) and assign it to tt.client so each
case uses a usable client instance; then call NewRemediationClient(tt.client,
tt.dryRun, testConfig) as before and assert results.
♻️ Duplicate comments (19)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml (1)

15-20: Consider quoting template placeholders for YAML linter compatibility.

The static analysis tool flags a syntax error because unquoted Go template placeholders aren't valid YAML. While the rendered output will be valid, quoting the values would make the template file itself pass YAML linting.

fault-remediation/pkg/remediation/fault_remediation_client_interface.go (2)

15-15: Add package-level documentation.

The package declaration lacks a package-level godoc comment. As per coding guidelines, all Go packages require package-level documentation.

📝 Suggested package documentation
+// Package remediation provides interfaces and types for fault remediation operations,
+// including maintenance resource creation, log collection, and node annotation management.
 package remediation

29-35: Add godoc for the exported interface and methods.

The FaultRemediationClientInterface and its methods lack documentation. As per coding guidelines, function comments are required for all exported Go functions and interfaces.

📝 Suggested documentation
+// FaultRemediationClientInterface defines the contract for fault remediation operations,
+// including CR creation, log collection, and state management.
 type FaultRemediationClientInterface interface {
+	// CreateMaintenanceResource creates a maintenance CR for the given health event and returns the CR name.
 	CreateMaintenanceResource(ctx context.Context, healthEventData *events.HealthEventData) (string, error)
+	// RunLogCollectorJob orchestrates log collection for a node and event, returning a reconcile result.
 	RunLogCollectorJob(ctx context.Context, nodeName string, eventId string) (ctrl.Result, error)
+	// GetAnnotationManager returns the annotation manager for node remediation state tracking.
 	GetAnnotationManager() annotation.NodeAnnotationManagerInterface
+	// GetStatusChecker returns the CR status checker for determining CR creation eligibility.
 	GetStatusChecker() crstatus.CRStatusCheckerInterface
+	// GetConfig returns the remediation configuration.
 	GetConfig() *config.TomlConfig
 }
fault-remediation/main.go (1)

113-113: TODO comment should reference an issue.

Per coding guidelines, TODO comments should reference a GitHub issue for tracking.

Based on coding guidelines.

fault-remediation/pkg/initializer/init.go (1)

69-69: Minor: Inconsistent capitalization in error message.

The error message uses "toml Config" with capital C, which is inconsistent with typical error message conventions.

📝 Suggested fix
-		return nil, fmt.Errorf("error while loading the toml Config: %w", err)
+		return nil, fmt.Errorf("error while loading the toml config: %w", err)
fault-remediation/pkg/annotation/annotation.go (2)

57-57: TODO should reference an issue.

Per coding guidelines, TODO comments should reference a GitHub issue for tracking.

Based on coding guidelines.


180-181: Error message says "patch" but operation is "Update".

Line 180 uses m.client.Update but the error message at line 181 says "failed to patch node annotation". For consistency, change the error message to "failed to update node annotation".

🔧 Proposed fix
 	if err = m.client.Update(ctx, updatedNode); err != nil {
-		return fmt.Errorf("failed to patch node annotation: %w", err)
+		return fmt.Errorf("failed to update node annotation: %w", err)
 	}
fault-remediation/pkg/remediation/remediation_test.go (2)

391-395: Dry-run test case missing templateDir will fail before reaching dry-run logic.

This test case doesn't set templateDir, so RunLogCollectorJob will fail when trying to load the template, not due to dry-run behavior. This doesn't test the intended dry-run skip scenario.

🔧 Proposed fix
 		{
 			name:          "Skip creation with dry run",
 			dryRun:        true,
+			templateDir:   "templates",
 			expectedError: false,
 		},

540-540: Swap assert.Equal argument order (expected, actual).

assert.Equal(t, result.RequeueAfter, tt.requeueTime) should be assert.Equal(t, tt.requeueTime, result.RequeueAfter) for clearer failure messages. The testify convention is assert.Equal(t, expected, actual).

🔧 Proposed fix
-			assert.Equal(t, result.RequeueAfter, tt.requeueTime)
+			assert.Equal(t, tt.requeueTime, result.RequeueAfter)
fault-remediation/pkg/crstatus/checker.go (1)

34-44: Add godoc comment for exported constructor.

As per coding guidelines, function comments are required for all exported Go functions.

fault-remediation/pkg/remediation/remediation.go (4)

164-164: TODO comment should reference an issue.

As per coding guidelines, TODO comments should reference issues in Go code.


401-406: Overwriting job labels may discard manifest-defined labels.

Setting job.Labels = labels replaces any labels defined in the manifest template. Consider merging labels instead.
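
A minimal sketch of the merge suggested above, assuming plain `map[string]string` labels. The helper name `mergeLabels` and the example label keys are hypothetical and are not taken from the PR; the sketch lets the controller-supplied labels win on a key collision, which is an assumption about the desired precedence.

```go
package main

import "fmt"

// mergeLabels returns a new map containing every entry of base, overlaid with
// every entry of overrides. On a key collision the value from overrides wins.
// Neither input map is modified, and nil inputs are treated as empty.
func mergeLabels(base, overrides map[string]string) map[string]string {
	merged := make(map[string]string, len(base)+len(overrides))
	for k, v := range base {
		merged[k] = v
	}
	for k, v := range overrides {
		merged[k] = v
	}
	return merged
}

func main() {
	// Hypothetical labels: manifestLabels stands in for labels defined in the
	// job manifest, controllerLabels for the labels set in code.
	manifestLabels := map[string]string{"app": "log-collector", "team": "sre"}
	controllerLabels := map[string]string{"app": "log-collector", "node": "node-1"}

	merged := mergeLabels(manifestLabels, controllerLabels)
	fmt.Println(len(merged), merged["team"], merged["node"])
}
```

With this shape, the assignment would become something like `job.Labels = mergeLabels(job.Labels, labels)`, so labels that exist only in the manifest are preserved.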


131-154: Harden template loading against path traversal.

filepath.Join(mountPath, fileName) will accept directory traversal sequences like ../ in fileName, potentially allowing access to files outside the intended mount path. Validate that fileName is a base name without path separators.

🔧 Proposed fix
 func loadAndParseTemplate(mountPath, fileName, templateName string) (*template.Template, error) {
+	// Validate fileName is a base name (no path separators or traversal)
+	if filepath.Base(fileName) != fileName || strings.Contains(fileName, "..") {
+		return nil, fmt.Errorf("invalid template file name: %s", fileName)
+	}
+
 	templatePath := filepath.Join(mountPath, fileName)
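
The check in the proposed fix can be exercised in isolation. The following is a sketch only: `validateTemplateFileName` is a hypothetical helper name, and it adds two checks beyond the proposed fix (an explicit separator check and a rejection of `""` and `"."`), because `filepath.Base("/")` returns `"/"` and `filepath.Base(".")` returns `"."`, so those inputs would pass a bare `Base` comparison.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// validateTemplateFileName accepts only a plain file name: no path separators,
// no ".." sequence, and not the empty or "." name. It is deliberately strict,
// so a name such as "a..b.yaml" is rejected as well.
func validateTemplateFileName(fileName string) error {
	if fileName == "" || fileName == "." ||
		strings.ContainsAny(fileName, `/\`) ||
		strings.Contains(fileName, "..") ||
		filepath.Base(fileName) != fileName {
		return fmt.Errorf("invalid template file name: %q", fileName)
	}

	return nil
}

func main() {
	for _, name := range []string{"rebootnode-template.yaml", "../secret", "a/b.yaml", "/"} {
		fmt.Printf("%q -> %v\n", name, validateTemplateFileName(name))
	}
}
```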

550-556: Potential nil pointer dereference on job.Status.StartTime for failed jobs.

Lines 553 and 555 access job.Status.StartTime.Time without checking if StartTime is nil. Unlike completed jobs, a failed job may not have started (e.g., failed during scheduling), making StartTime nil.

🐛 Proposed fix
 			// Use job's actual duration for failed jobs too
 			var duration float64
-			if job.Status.CompletionTime != nil {
+			if job.Status.StartTime == nil {
+				duration = 0
+			} else if job.Status.CompletionTime != nil {
 				duration = job.Status.CompletionTime.Sub(job.Status.StartTime.Time).Seconds()
 			} else {
 				duration = time.Since(job.Status.StartTime.Time).Seconds()
 			}
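
The nil guard above can be sketched without the Kubernetes types. In this illustration `*time.Time` stands in for the job's `StartTime` and `CompletionTime` fields, and `jobDurationSeconds`, `exampleStart`, `exampleEnd`, and `exampleNow` are hypothetical names, not code from the PR.

```go
package main

import (
	"fmt"
	"time"
)

// Fixed example timestamps so the results are deterministic.
var (
	exampleStart = time.Date(2026, 1, 15, 12, 0, 0, 0, time.UTC)
	exampleEnd   = exampleStart.Add(90 * time.Second)
	exampleNow   = exampleStart.Add(30 * time.Second)
)

// jobDurationSeconds reports how long a job ran, in seconds.
// A nil start means the job never started, so the duration is 0.
// If completion is nil, the duration is measured up to now.
func jobDurationSeconds(start, completion *time.Time, now time.Time) float64 {
	switch {
	case start == nil:
		return 0
	case completion != nil:
		return completion.Sub(*start).Seconds()
	default:
		return now.Sub(*start).Seconds()
	}
}

func main() {
	fmt.Println(jobDurationSeconds(nil, nil, exampleNow))                   // never started
	fmt.Println(jobDurationSeconds(&exampleStart, &exampleEnd, exampleNow)) // completed
	fmt.Println(jobDurationSeconds(&exampleStart, nil, exampleNow))         // still running or failed mid-run
}
```

Passing `now` as a parameter rather than calling `time.Since` inside the function keeps the helper easy to test.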
fault-remediation/pkg/reconciler/reconciler.go (3)

66-67: Consider making Config field private.

The Config field is exported but appears to only be accessed internally after initialization. Making it private (config instead of Config) would reduce the public API surface.


279-280: TODO comment should reference an issue.

The // nolint: cyclop // todo comment should reference a tracking issue per coding guidelines.


496-499: Misleading error message when returning parse error.

Line 498 returns fmt.Errorf("error updating resume token: %w", err) but err is from parsing the health event, not from updating the resume token. This will confuse debugging.

🐛 Proposed fix
 		if markErr := watcherInstance.MarkProcessed(context.Background(), eventWithToken.ResumeToken); markErr != nil {
 			metrics.ProcessingErrors.WithLabelValues("mark_processed_error", "unknown").Inc()
 			slog.Error("Error updating resume token", "error", markErr)
 		}

-		return result, fmt.Errorf("error updating resume token: %w", err)
+		return result, fmt.Errorf("error parsing health event: %w", err)
 	}
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (2)

301-328: Template path "./templates" is sensitive to working directory.

The relative path ./templates may cause test failures depending on where go test is invoked from. Consider resolving the path relative to the test file location using runtime.Caller.
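
One way to do this, as a sketch: `runtime.Caller(0)` reports the path of the source file that contains the call, and the templates directory can be joined onto its directory. The helper names `sourceDir` and `templateDir` are hypothetical. The path is the one recorded at compile time, so builds that use `-trimpath` may not yield an absolute path.

```go
package main

import (
	"fmt"
	"path/filepath"
	"runtime"
)

// sourceDir returns the directory of the source file that contains this
// function, as recorded by the compiler.
func sourceDir() (string, error) {
	_, file, _, ok := runtime.Caller(0)
	if !ok {
		return "", fmt.Errorf("runtime.Caller could not determine the source file")
	}

	return filepath.Dir(file), nil
}

// templateDir returns the "templates" directory that sits next to this source
// file, independent of the process working directory.
func templateDir() (string, error) {
	dir, err := sourceDir()
	if err != nil {
		return "", err
	}

	return filepath.Join(dir, "templates"), nil
}

func main() {
	dir, err := templateDir()
	if err != nil {
		fmt.Println("error:", err)
		return
	}

	fmt.Println(dir)
}
```

In a test file, the same helper would replace the literal `"./templates"` wherever the template directory is configured.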


370-371: TODO comments should reference issues.

Multiple TODO comments in this test file (lines 370-371, 432-433, 490-492, 521-522, 573-575) lack issue references. As per coding guidelines, TODO comments should reference tracking issues.

🧹 Nitpick comments (4)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)

23-25: Add documentation and name interface method parameters.

The exported interface and method lack documentation, and parameters are unnamed. Per coding guidelines, exported functions require comments, and named parameters improve API clarity.

📝 Proposed documentation and parameter naming
+// CRStatusCheckerInterface determines whether a new Custom Resource should be created
+// based on the status of existing CRs.
 type CRStatusCheckerInterface interface {
-	ShouldSkipCRCreation(context.Context, string, string) bool
+	// ShouldSkipCRCreation checks if CR creation should be skipped for the given resource and node.
+	// Returns true if an existing CR is in progress or creation should be skipped, false otherwise.
+	ShouldSkipCRCreation(ctx context.Context, resourceName string, nodeName string) bool
 }
fault-remediation/pkg/annotation/annotation_interface.go (1)

24-27: Consider adding a namespace prefix to the annotation key.

The annotation key latestFaultRemediationState lacks a domain prefix (e.g., dgxc.nvidia.com/). This could lead to naming collisions with other tools or operators that might use similar annotation names.

♻️ Suggested change
 const (
 	// AnnotationKey is the key for the node annotation that tracks remediation state
-	AnnotationKey = "latestFaultRemediationState"
+	AnnotationKey = "dgxc.nvidia.com/latestFaultRemediationState"
 )
fault-remediation/pkg/remediation/remediation.go (1)

146-151: Consider using missingkey=error template option.

The default text/template behavior silently renders missing keys as <no value>. Setting Option("missingkey=error") will cause template execution to fail explicitly when required data is missing.

♻️ Suggested change
-	tmpl := template.New(templateName)
+	tmpl := template.New(templateName).Option("missingkey=error")
 
 	tmpl, err = tmpl.Parse(string(templateContent))
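
The difference can be seen in a small standalone sketch. The `render` helper, the template text, and the `NodeName`/`EventID` keys are hypothetical. Note that `missingkey` only affects lookups on map data; a reference to a non-existent struct field already fails at execution time, so whether the option changes behavior in this PR depends on the type passed to `Execute`, which is not shown here.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// render parses text as a template and executes it against data.
// When strict is true, a missing map key makes Execute return an error
// instead of rendering "<no value>".
func render(text string, data map[string]any, strict bool) (string, error) {
	tmpl := template.New("example")
	if strict {
		tmpl = tmpl.Option("missingkey=error")
	}

	tmpl, err := tmpl.Parse(text)
	if err != nil {
		return "", fmt.Errorf("parse template: %w", err)
	}

	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", fmt.Errorf("execute template: %w", err)
	}

	return buf.String(), nil
}

func main() {
	data := map[string]any{"NodeName": "node-1"} // EventID is intentionally missing
	text := "node: {{.NodeName}} event: {{.EventID}}"

	out, err := render(text, data, false)
	fmt.Printf("default: %q, err=%v\n", out, err)

	_, err = render(text, data, true)
	fmt.Printf("strict:  err=%v\n", err)
}
```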
fault-remediation/pkg/reconciler/reconciler_test.go (1)

44-55: Mock ignores eventId parameter in RunLogCollectorJob.

The mock function signature at line 44 doesn't include eventId, but the method at line 53 accepts it. The parameter is silently ignored at line 54, which means tests cannot verify that eventId is correctly passed through.

♻️ Suggested fix to capture eventId
 type MockK8sClient struct {
 	createMaintenanceResourceFn func(ctx context.Context, healthEventData *events.HealthEventData) (string, error)
-	runLogCollectorJobFn        func(ctx context.Context, nodeName string) (ctrl.Result, error)
+	runLogCollectorJobFn        func(ctx context.Context, nodeName string, eventId string) (ctrl.Result, error)
 	annotationManagerOverride   annotation.NodeAnnotationManagerInterface
 	realStatusChecker           crstatus.CRStatusCheckerInterface
 }
 func (m *MockK8sClient) RunLogCollectorJob(ctx context.Context, nodeName string, eventId string) (ctrl.Result, error) {
-	return m.runLogCollectorJobFn(ctx, nodeName)
+	return m.runLogCollectorJobFn(ctx, nodeName, eventId)
 }

Then update all test callsites to include the eventId parameter in the mock function.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4b221aa and 6012a61.

⛔ Files ignored due to path filters (1)
  • commons/go.sum is excluded by !**/*.sum
📒 Files selected for processing (31)
  • .gitignore
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
  • distros/kubernetes/nvsentinel/values-full.yaml
  • distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
  • distros/kubernetes/nvsentinel/values-tilt.yaml
  • distros/kubernetes/nvsentinel/values.yaml
  • fault-remediation/go.mod
  • fault-remediation/main.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/annotation.go
  • fault-remediation/pkg/reconciler/annotation_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/remediation.go
  • fault-remediation/pkg/reconciler/remediation_test.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
  • fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
  • tilt/csp-api-mock/Tiltfile
💤 Files with no reviewable changes (8)
  • distros/kubernetes/nvsentinel/values.yaml
  • fault-remediation/pkg/reconciler/annotation_test.go
  • distros/kubernetes/nvsentinel/values-tilt.yaml
  • distros/kubernetes/nvsentinel/values-full.yaml
  • fault-remediation/pkg/reconciler/remediation_test.go
  • distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
  • fault-remediation/pkg/reconciler/annotation.go
  • fault-remediation/pkg/reconciler/remediation.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • tilt/csp-api-mock/Tiltfile
  • fault-remediation/pkg/events/health_event.go
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
  • fault-remediation/pkg/metrics/metrics.go
  • .gitignore
🧰 Additional context used
📓 Path-based instructions (3)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/main.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
**/go.mod

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use go.mod for each service as a separate Go module with semantic import versioning

Files:

  • fault-remediation/go.mod
🧠 Learnings (32)
📓 Common learnings
Learnt from: ivelichkovich
Repo: NVIDIA/NVSentinel PR: 544
File: fault-remediation/pkg/annotation/annotation.go:131-166
Timestamp: 2026-01-15T18:23:48.147Z
Learning: In fault-remediation/pkg/annotation/annotation.go, the node annotation for remediation state is designed with the assumption that only one controller should be acting on a given equivalence group at a time. Concurrent modifications to the same part of the node annotation aren't expected in normal operation.
📚 Learning: 2026-01-15T18:23:48.147Z
Learnt from: ivelichkovich
Repo: NVIDIA/NVSentinel PR: 544
File: fault-remediation/pkg/annotation/annotation.go:131-166
Timestamp: 2026-01-15T18:23:48.147Z
Learning: In fault-remediation/pkg/annotation/annotation.go, the node annotation for remediation state is designed with the assumption that only one controller should be acting on a given equivalence group at a time. Concurrent modifications to the same part of the node annotation aren't expected in normal operation.

Applied to files:

  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/main.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
📚 Learning: 2026-01-15T18:16:14.309Z
Learnt from: ivelichkovich
Repo: NVIDIA/NVSentinel PR: 544
File: fault-remediation/pkg/annotation/annotation.go:50-57
Timestamp: 2026-01-15T18:16:14.309Z
Learning: In fault-remediation/pkg/annotation/annotation.go, corrupt remediation state annotations are intentionally treated as empty state (returning an empty RemediationStateAnnotation) rather than returning an error. This graceful degradation prevents node workflows from getting stuck if manual actions or other issues corrupt the annotation. The unmarshal error is still logged for visibility.

Applied to files:

  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
📚 Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior

Applied to files:

  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages

Applied to files:

  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/annotation/annotation_interface.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions

Applied to files:

  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/annotation/annotation_interface.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code

Applied to files:

  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/main.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2026-01-15T18:25:15.442Z
Learnt from: ivelichkovich
Repo: NVIDIA/NVSentinel PR: 544
File: fault-remediation/pkg/remediation/remediation.go:469-504
Timestamp: 2026-01-15T18:25:15.442Z
Learning: When handling Kubernetes Jobs, if batch/v1 JobComplete is true, Job.Status.StartTime is guaranteed to be non-nil by the API. Therefore, in remediation.go (and similar code paths), you can omit a nil check for StartTime when the Complete condition is true. Keep nil checks only for scenarios where StartTime may legitimately be absent (e.g., before the Job starts).

Applied to files:

  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/main.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
📚 Learning: 2026-01-09T18:55:38.501Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 490
File: distros/kubernetes/nvsentinel/charts/janitor-provider/templates/clusterrole.yaml:20-28
Timestamp: 2026-01-09T18:55:38.501Z
Learning: The janitor-provider gRPC service only requires get/list/watch permissions on nodes in its ClusterRole. It reads node metadata and delegates to CSP APIs. The janitor controller (separate component) performs actual Kubernetes node modifications including deletion and has its own RBAC configuration with appropriate write permissions.

Applied to files:

  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2026-01-14T02:33:07.679Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 689
File: janitor/pkg/controller/rebootnode_controller_test.go:371-436
Timestamp: 2026-01-14T02:33:07.679Z
Learning: In the NVSentinel janitor controller tests, tests that demonstrate original bugs or issues that were fixed by a PR should be kept for posterity, even if they reference removed functionality like MaxRebootRetries or RetryCount fields. These historical test cases serve as documentation of what problem was being solved.

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code

Applied to files:

  • fault-remediation/pkg/initializer/init.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/go.mod
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-10-29T12:40:29.621Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 143
File: fault-quarantine-module/pkg/informer/k8s_client.go:52-67
Timestamp: 2025-10-29T12:40:29.621Z
Learning: The clientcmd.BuildConfigFromFlags function in k8s.io/client-go/tools/clientcmd automatically handles in-cluster configuration as a fallback. When both masterUrl and kubeconfigPath parameters are empty strings, it internally attempts rest.InClusterConfig() before falling back to default config loading rules. No explicit in-cluster fallback logic is needed when using this function.

Applied to files:

  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-11-04T06:31:02.147Z
Learnt from: Gyan172004
Repo: NVIDIA/NVSentinel PR: 223
File: platform-connectors/pkg/nodemetadata/processor.go:0-0
Timestamp: 2025-11-04T06:31:02.147Z
Learning: In platform-connectors/pkg/nodemetadata/processor.go, the NewProcessor function does not perform a nil check on the config parameter because the caller is expected to guarantee a non-nil config is provided.

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use inline comments for complex logic only in Go code

Applied to files:

  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/annotation/annotation_interface.go
📚 Learning: 2026-01-12T05:13:24.947Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:347-372
Timestamp: 2026-01-12T05:13:24.947Z
Learning: In the NVSentinel codebase, all HealthEvent creation paths across health monitors (syslog-health-monitor, kubernetes-object-monitor, csp-health-monitor) consistently set GeneratedTimestamp to timestamppb.New(time.Now()), ensuring it is never nil in production flows.

Applied to files:

  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
📚 Learning: 2026-01-06T21:31:36.113Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 490
File: janitor-provider/go.mod:70-70
Timestamp: 2026-01-06T21:31:36.113Z
Learning: In janitor-provider/go.mod, the dependency github.com/golang-jwt/jwt/v4 v4.5.1 is a transitive dependency from github.com/nebius/gosdk and cannot be directly upgraded without a replace directive or upstream fix in nebius/gosdk.

Applied to files:

  • fault-remediation/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning

Applied to files:

  • fault-remediation/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date

Applied to files:

  • fault-remediation/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use meaningful variable names such as `synced` over `ok` for cache sync checks

Applied to files:

  • fault-remediation/go.mod
📚 Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
🧬 Code graph analysis (10)
fault-remediation/pkg/annotation/annotation.go (1)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (38-40)
  • AnnotationKey (26-26)
  • EquivalenceGroupState (43-49)
fault-remediation/pkg/crstatus/crstatus_test.go (1)
fault-remediation/pkg/crstatus/checker.go (1)
  • NewCRStatusChecker (34-44)
fault-remediation/pkg/remediation/remediation_test.go (2)
fault-remediation/pkg/config/config.go (1)
  • Template (47-50)
fault-remediation/pkg/remediation/remediation.go (2)
  • NewRemediationClient (71-129)
  • FaultRemediationClient (58-69)
fault-remediation/pkg/initializer/init.go (3)
fault-remediation/pkg/remediation/remediation.go (1)
  • NewRemediationClient (71-129)
commons/pkg/statemanager/statemanager.go (2)
  • NewStateManager (206-210)
  • StateManager (197-200)
store-client/pkg/datastore/config.go (1)
  • LoadDatastoreConfig (27-44)
fault-remediation/pkg/annotation/annotation_test.go (2)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (38-40)
  • AnnotationKey (26-26)
  • EquivalenceGroupState (43-49)
fault-remediation/pkg/annotation/annotation.go (1)
  • NodeAnnotationManager (33-35)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (3)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventData (26-29)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • NodeAnnotationManagerInterface (30-35)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
  • CRStatusCheckerInterface (23-25)
fault-remediation/pkg/remediation/remediation.go (7)
fault-remediation/pkg/config/config.go (2)
  • Template (47-50)
  • MaintenanceResource (27-44)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • NodeAnnotationManagerInterface (30-35)
fault-remediation/pkg/crstatus/checker.go (2)
  • CRStatusChecker (28-32)
  • NewCRStatusChecker (34-44)
fault-remediation/pkg/annotation/annotation.go (1)
  • NewNodeAnnotationManager (38-42)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
  • CRStatusCheckerInterface (23-25)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventData (26-29)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)
  • TemplateData (38-52)
fault-remediation/pkg/crstatus/checker.go (1)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (27-44)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (5)
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
  • NewManager (41-43)
fault-remediation/pkg/remediation/remediation.go (1)
  • NewRemediationClient (71-129)
commons/pkg/statemanager/statemanager.go (3)
  • NewStateManager (206-210)
  • StateManager (197-200)
  • RemediatingLabelValue (171-171)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • AnnotationKey (26-26)
fault-remediation/pkg/metrics/metrics.go (4)
  • TotalEventsReceived (33-38)
  • EventHandlingDuration (62-68)
  • EventsProcessed (39-45)
  • ProcessingErrors (46-52)
fault-remediation/pkg/reconciler/reconciler_test.go (4)
fault-remediation/pkg/events/health_event.go (2)
  • HealthEventData (26-29)
  • HealthEventDoc (20-23)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • NodeAnnotationManagerInterface (30-35)
  • RemediationStateAnnotation (38-40)
  • EquivalenceGroupState (43-49)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
  • CRStatusCheckerInterface (23-25)
data-models/pkg/protos/health_event.pb.go (2)
  • RecommendedAction_RESTART_BM (101-101)
  • RecommendedAction_NONE (96-96)
🪛 YAMLlint (1.37.1)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml

[error] 15-15: syntax error: expected , but found ''

(syntax)

🔇 Additional comments (39)
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml (2)

67-71: LGTM: watch verb is required for controller-runtime.

Adding watch to the nodes resource is expected when migrating to controller-runtime, as it uses informers/watches to receive events about resource changes rather than polling.


83-89: The update permission on Jobs is required and properly used.

The controller updates Job annotations in three locations (fault-remediation/pkg/remediation/remediation.go lines 505, 545, 597) to track whether metrics have already been counted for a given Job execution state. Specifically, it adds the jobMetricsAlreadyCountedAnnotation annotation after detecting job completion, failure, or timeout. This deduplication mechanism prevents duplicate metrics recording when the reconciliation function is called multiple times for the same Job state transition.

This is a legitimate use of the update verb since Job specs are immutable but metadata (annotations) can be modified.
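The deduplication described above can be sketched as follows. This is a simplified illustration, not the code in `remediation.go`: the `jobMeta` type, the `recordOnce` helper, and the annotation key are all assumptions, and a real controller would persist the annotation with an update call against the API server.

```go
package main

// countedAnnotation is a hypothetical key standing in for the value of
// jobMetricsAlreadyCountedAnnotation, which is not shown in the review.
const countedAnnotation = "example.com/job-metrics-counted"

// jobMeta is a minimal stand-in for the metadata of a batch/v1 Job.
type jobMeta struct {
	Annotations map[string]string
}

// recordOnce increments counter the first time it sees a Job in a given
// state and marks the Job so later reconciles do not count it again.
// It returns true when the metric was recorded by this call.
func recordOnce(job *jobMeta, counter *int) bool {
	// Reading from a nil map is safe in Go and yields the zero value.
	if job.Annotations[countedAnnotation] == "true" {
		return false
	}

	*counter++

	if job.Annotations == nil {
		job.Annotations = map[string]string{}
	}

	job.Annotations[countedAnnotation] = "true"

	return true
}
```

Since the marker lives on the Job itself, it survives controller restarts, which an in-memory set of already-counted Jobs would not.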

fault-remediation/go.mod (1)

75-75: LGTM!

Moving golang.org/x/sync to indirect dependency is appropriate as the codebase now relies on controller-runtime's internal usage rather than direct imports.

fault-remediation/pkg/crstatus/crstatus_test.go (1)

33-34: LGTM!

The constructor call correctly aligns with the updated NewCRStatusChecker signature that now accepts (client, remediationActions, dryRun). Passing nil for the client is acceptable here since checkCondition only inspects the CR data structure without making API calls.

fault-remediation/pkg/annotation/annotation_test.go (4)

29-134: Well-structured table-driven test with good coverage.

The test covers key scenarios: node not found, missing annotation, invalid JSON annotation, and valid state parsing. Using Unix() for timestamp comparison is a pragmatic choice to avoid precision issues.


136-160: LGTM!

Test properly verifies the update flow with appropriate error assertions for both the update and subsequent read operations.


162-185: LGTM!

Test correctly verifies that ClearRemediationState removes the annotation from the node. Error assertions are properly in place.


187-221: LGTM!

Test correctly validates selective group removal while preserving unaffected groups. Error assertions are properly included.

fault-remediation/pkg/remediation/remediation_test.go (2)

113-207: LGTM!

The E2E tests for missing/empty template file configurations are well-structured and properly validate error messages. Good use of t.TempDir() for test isolation.


209-358: LGTM!

Well-structured table-driven test covering error, success, and dry-run scenarios. The test properly validates that CRs are created in non-dry-run mode and not persisted in dry-run mode.

fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)

37-52: LGTM!

The TemplateData struct is well-organized with clear field groupings and a helpful comment explaining CRD routing metadata. The fields align well with the template rendering needs.

fault-remediation/main.go (2)

42-45: LGTM!

Good practice to register schemes in init() for core/v1 and batch/v1 types needed by the controller.


105-175: LGTM!

The controller runtime management setup is well-structured:

  • Proper config wrapping with audit logger
  • Health/ready checks registered
  • Clean initialization flow via InitializeAll
  • Appropriate defer for cleanup
fault-remediation/pkg/initializer/init.go (1)

52-152: LGTM!

The initialization flow is well-structured:

  • Clean separation between remediation client and state manager creation
  • Proper error wrapping with context
  • Configuration validation before use
  • Good use of the factory pattern for watcher creation
fault-remediation/pkg/annotation/annotation.go (5)

15-18: LGTM!

Good package-level documentation that clearly describes the purpose of the annotation package.


67-78: LGTM!

Good implementation of graceful degradation - corrupt annotations are treated as empty state to prevent workflows from getting stuck. The nil map check at lines 76-78 properly handles the case where JSON lacks equivalenceGroups. Based on learnings, this is the intended design.
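A minimal sketch of this graceful-degradation pattern is shown below. The struct fields are assumptions for illustration (only the `equivalenceGroups` JSON key is mentioned in this review), and `parseState` is a hypothetical name rather than the function in `annotation.go`.

```go
package main

import (
	"encoding/json"
	"log"
)

// EquivalenceGroupState is a simplified stand-in; its field is an assumption.
type EquivalenceGroupState struct {
	MaintenanceCR string `json:"maintenanceCR"`
}

// RemediationStateAnnotation mirrors the shape implied by the review.
type RemediationStateAnnotation struct {
	EquivalenceGroups map[string]EquivalenceGroupState `json:"equivalenceGroups"`
}

// parseState decodes the annotation value. A corrupt or missing value yields
// an empty state instead of an error so the node workflow can continue.
func parseState(raw string) *RemediationStateAnnotation {
	state := &RemediationStateAnnotation{}

	if raw != "" {
		if err := json.Unmarshal([]byte(raw), state); err != nil {
			// Log for visibility, then fall back to a fresh empty state.
			log.Printf("ignoring corrupt remediation state annotation: %v", err)

			state = &RemediationStateAnnotation{}
		}
	}

	// Valid JSON without the equivalenceGroups key leaves the map nil.
	if state.EquivalenceGroups == nil {
		state.EquivalenceGroups = map[string]EquivalenceGroupState{}
	}

	return state
}
```

The trade-off is that a corrupt annotation silently loses prior remediation history for the node, which is why the unmarshal error is still logged.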


83-123: LGTM!

The UpdateRemediationState implementation correctly uses DeepCopy before mutation and properly initializes annotations map if nil. Based on learnings, the annotation is designed with the assumption that only one controller acts on a given equivalence group at a time, so the read-modify-write pattern without explicit retry is acceptable for this use case.


125-150: LGTM!

Clean implementation of ClearRemediationState with proper nil checks and DeepCopy usage.


152-187: LGTM!

Good implementation of RemoveGroupFromState with optimization to clear entire annotation when no groups remain. The implementation correctly delegates to ClearRemediationState in that case.
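The "clear everything when nothing remains" optimization can be sketched like this. The `removeGroup` helper and the plain `map[string]string` are simplifications chosen for illustration; the actual method operates on the node annotation and delegates to `ClearRemediationState`.

```go
package main

// removeGroup returns a copy of groups without the named group and reports
// whether the caller should clear the whole annotation because no groups
// remain. The input map is left unmodified.
func removeGroup(groups map[string]string, group string) (remaining map[string]string, clearAll bool) {
	remaining = make(map[string]string, len(groups))

	for name, cr := range groups {
		if name != group {
			remaining[name] = cr
		}
	}

	return remaining, len(remaining) == 0
}
```

Copying instead of deleting in place mirrors the DeepCopy-before-mutation approach noted for the other methods in this file.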

fault-remediation/pkg/annotation/annotation_interface.go (1)

29-49: Interface and type definitions look good.

The NodeAnnotationManagerInterface provides a clean abstraction for managing node remediation state. The struct types RemediationStateAnnotation and EquivalenceGroupState are well-defined with appropriate JSON tags for serialization.

fault-remediation/pkg/crstatus/checker.go (2)

46-76: ShouldSkipCRCreation logic is correct with namespace fix applied.

The method now correctly includes the namespace in the ObjectKey lookup (line 68), addressing the previous concern about namespaced CR lookups. The GVK is properly set on the unstructured object before the Get call.

The current error handling (lines 70-72) treats any Get error as "allow create". This favors availability over consistency, which appears intentional based on past discussion. If transient errors become a concern, consider checking for IsNotFound specifically.


78-113: Helper methods are well-implemented.

The checkCondition, findConditionStatus, and isTerminal methods correctly parse the unstructured object's status conditions to determine terminal states.

fault-remediation/pkg/remediation/remediation.go (3)

71-129: NewRemediationClient implementation is well-structured.

The constructor properly validates configuration, pre-loads templates, and initializes dependencies. Good defensive checks for missing template mount path and namespace configuration for namespaced resources.


165-262: CreateMaintenanceResource flow is correct.

The method properly handles:

  • Dry-run mode bypass
  • Template selection and rendering
  • Node owner reference setup
  • AlreadyExists error handling
  • Annotation state updates

The YAML logging at lines 228-230 uses Debug level, which limits exposure.


440-483: checkLogCollectorStatus correctly separates error and found states.

The refactored logic properly returns nil error on found==true and only wraps errors when err != nil, addressing the previous bug where found || err != nil incorrectly treated success as failure.

fault-remediation/pkg/reconciler/reconciler.go (3)

172-194: runLogCollector correctly handles result and error separately.

The refactored logic properly returns (result, nil) when err == nil, avoiding the previous bug where non-zero results with nil errors were incorrectly wrapped.


196-246: performRemediation error flow is correct.

The method properly:

  • Updates state to "remediating" first
  • Creates the maintenance resource
  • Updates the final label (succeeded/failed)
  • Returns joined errors when both CR creation and label update fail
  • Returns the CR name on success

329-361: handleRemediationEvent correctly separates log collector result from error.

Lines 329-336 properly check err != nil and !result.IsZero() separately, fixing the previous issue where requeue requests were incorrectly treated as errors.

fault-remediation/pkg/reconciler/reconciler_e2e_test.go (3)

168-179: Test setup correctly uses envtest and controller-runtime client.

The test environment properly initializes envtest, creates a controller-runtime manager, and obtains the client via mgr.GetClient(). This aligns with the coding guidelines to use envtest for testing Kubernetes controllers.


331-401: Test case FirstEvent_CreatesAnnotation is well-structured.

The test properly:

  • Cleans up node annotations before running
  • Creates a remediation client and reconciler
  • Exercises the performRemediation path
  • Verifies annotation creation and CR existence
  • Cleans up created resources

732-906: Full reconciler E2E test provides comprehensive coverage.

The TestFullReconcilerWithMockedMongoDB_E2E test covers the complete flow including:

  • CR creation from quarantine events
  • Deduplication of duplicate events
  • Annotation cleanup on unquarantine
  • Metrics verification

Good use of assert.Eventually for async operations.

fault-remediation/pkg/reconciler/reconciler_test.go (8)

17-39: LGTM!

Import organization follows Go conventions and includes the necessary packages for the updated interfaces (annotation, events, ctrl, corev1).


112-144: LGTM!

The MockNodeAnnotationManager correctly implements the updated NodeAnnotationManagerInterface with the new GetRemediationState signature returning (*annotation.RemediationStateAnnotation, *corev1.Node, error).


200-244: LGTM!

Well-structured table-driven test following Go testing conventions. The mock and assertions correctly validate reconciler creation with the updated interface signatures.


246-501: LGTM!

The remediation tests are well-structured with proper coverage of success, failure, and edge cases. The migration to events.HealthEventData and events.HealthEventDoc is correctly implemented, including the type conversions in test setup.


503-570: LGTM!

Comprehensive table-driven test covering the expected skip scenarios for different recommended actions.


818-954: LGTM!

The CR-based deduplication tests provide good coverage for various scenarios including no existing CR, succeeded CR, failed CR, and cross-action remediation with equivalence groups.


956-1019: LGTM!

Good regression test for Issue #441, ensuring log collector jobs are only triggered when shouldCreateCR is true. The test documentation clearly explains the fix rationale.


85-110: Unable to verify the interface implementation claims due to missing codebase context.

The review references code at fault-remediation/pkg/reconciler/reconciler_test.go (lines 85-110) with types MockCRStatusChecker, TestCRStatusChecker, and MockCRStatusCheckerWrapper, but these references cannot be located in the codebase through searches for the method signatures IsSuccessful or ShouldSkipCRCreation, or the type names themselves.

Without access to the actual code being reviewed, the claims about interface mismatches and orphaned mock types cannot be confirmed or refuted. The file path, code snippets, and interface definitions referenced in the review need to be re-verified to ensure they match the current state of the repository.


Comment on lines 636 to 643
 		{
 			name:           "Log collector job with api error",
 			nodeName:       "test-node-api-error",
 			jobResult:      false,
 			expectedResult: false,
 			description:    "Error path - kubernetes API error during job creation",
 		},
 		{
 			name:           "Log collector job with creation error",
 			nodeName:       "test-node-create-error",
 			jobResult:      false,
 			expectedResult: false,
 			description:    "Error path - job creation fails",
 		},
 		{
-			name:           "Log collector job timeout",
-			nodeName:       "test-node-timeout",
+			name:           "Log collector job return requeue",
+			nodeName:       "test-node-fail",
 			jobResult:      false,
 			expectedResult: false,
-			description:    "Error path - job times out",
+			description:    "Error path - job fails to complete",
+			returnedResult: ctrl.Result{RequeueAfter: 5 * time.Minute},
 		},
⚠️ Potential issue | 🟡 Minor

Test case has duplicate values from previous case.

The "Log collector job return requeue" test case reuses nodeName: "test-node-fail" and description: "Error path - job fails to complete" from the previous test case. These should be unique for clarity and accurate test reporting.

🔧 Suggested fix
 		{
 			name:           "Log collector job return requeue",
-			nodeName:       "test-node-fail",
+			nodeName:       "test-node-requeue",
 			jobResult:      false,
 			expectedResult: false,
-			description:    "Error path - job fails to complete",
+			description:    "Requeue path - job returns requeue result",
 			returnedResult: ctrl.Result{RequeueAfter: 5 * time.Minute},
 		},
πŸ“ Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
{
name: "Log collector job with api error",
nodeName: "test-node-api-error",
jobResult: false,
expectedResult: false,
description: "Error path - kubernetes API error during job creation",
},
{
name: "Log collector job with creation error",
nodeName: "test-node-create-error",
jobResult: false,
expectedResult: false,
description: "Error path - job creation fails",
},
{
name: "Log collector job timeout",
nodeName: "test-node-timeout",
name: "Log collector job return requeue",
nodeName: "test-node-fail",
jobResult: false,
expectedResult: false,
description: "Error path - job times out",
description: "Error path - job fails to complete",
returnedResult: ctrl.Result{RequeueAfter: 5 * time.Minute},
},
{
name: "Log collector job return requeue",
nodeName: "test-node-requeue",
jobResult: false,
expectedResult: false,
description: "Requeue path - job returns requeue result",
returnedResult: ctrl.Result{RequeueAfter: 5 * time.Minute},
},
🤖 Prompt for AI Agents
In `@fault-remediation/pkg/reconciler/reconciler_test.go` around lines 636 - 643,
The test case "Log collector job return requeue" in reconciler_test.go
duplicates the previous case's nodeName and description which hurts clarity;
update the test case values in the table (the struct literal used for the
table-driven tests) – change nodeName from "test-node-fail" to a unique
identifier like "test-node-requeue" and change description from "Error path -
job fails to complete" to something unique like "Error path - job returns
requeue" so the case is distinct when running tests and in reports; locate the
table entries in the test function (the slice of test structs used by Test... in
reconciler_test.go) and edit the fields for that specific case.

Comment on lines +43 to +111
func TestNewRemediationClient(t *testing.T) {
tests := []struct {
name string
client client.Client
dryRun bool
wantErr bool
templateDir string
}{
{
name: "file does not exist",
templateDir: "does-not-exist",
dryRun: false,
wantErr: true,
},
{
name: "file does exist",
templateDir: "templates",
dryRun: false,
wantErr: false,
},
{
name: "file does exist & dry-run",
templateDir: "templates",
dryRun: true,
wantErr: false,
},
}

for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
testConfig := config.TomlConfig{
Template: config.Template{
MountPath: tt.templateDir,
},
RemediationActions: map[string]config.MaintenanceResource{
protos.RecommendedAction_RESTART_BM.String(): {
Namespace: "dgxc-janitor",
Version: "v1alpha1",
ApiGroup: "janitor.dgxc.nvidia.com",
Kind: "RebootNode",
CompleteConditionType: "NodeReady",
TemplateFileName: "rebootnode-template.yaml",
},
protos.RecommendedAction_COMPONENT_RESET.String(): {
Namespace: "dgxc-janitor",
Version: "v1alpha1",
ApiGroup: "janitor.dgxc.nvidia.com",
Kind: "RebootNode",
CompleteConditionType: "NodeReady",
TemplateFileName: "rebootnode-template.yaml",
},
},
}
result, err := NewRemediationClient(tt.client, tt.dryRun, testConfig)
if tt.wantErr {
assert.Error(t, err)
assert.Nil(t, result)
} else {
assert.NoError(t, err)
assert.NotNil(t, result)
if tt.dryRun {
assert.Equal(t, []string{metav1.DryRunAll}, result.dryRunMode)
} else {
assert.Empty(t, result.dryRunMode)
}
}
})
}
}

⚠️ Potential issue | 🟡 Minor

Initialize a real fake client in TestNewRemediationClient.

The tt.client field is declared but never populated in the test cases, so NewRemediationClient(tt.client, ...) always receives nil. While this may work if the constructor tolerates a nil client during template loading, it makes the test brittle and doesn't validate actual client wiring.

🔧 Proposed fix
 func TestNewRemediationClient(t *testing.T) {
 	tests := []struct {
 		name        string
-		client      client.Client
 		dryRun      bool
 		wantErr     bool
 		templateDir string
 	}{
 		// ... test cases unchanged ...
 	}

 	for _, tt := range tests {
 		t.Run(tt.name, func(t *testing.T) {
+			fakeClient := fake.NewClientBuilder().Build()
 			testConfig := config.TomlConfig{
 				// ... config unchanged ...
 			}
-			result, err := NewRemediationClient(tt.client, tt.dryRun, testConfig)
+			result, err := NewRemediationClient(fakeClient, tt.dryRun, testConfig)
🤖 Prompt for AI Agents
In `@fault-remediation/pkg/remediation/remediation_test.go` around lines 43 - 111,
TestNewRemediationClient declares tt.client but never sets it, passing nil into
NewRemediationClient and making the test brittle; create a real fake
controller-runtime client and assign it to tt.client before calling
NewRemediationClient. Specifically, in the test setup for
TestNewRemediationClient register the needed API types into a scheme (the same
scheme NewRemediationClient expects), build a fake client (e.g., with
controller-runtime's fake client builder) and assign it to tt.client so each
case uses a usable client instance; then call NewRemediationClient(tt.client,
tt.dryRun, testConfig) as before and assert results.

@KaivalyaMDabhadkar

KaivalyaMDabhadkar commented Jan 16, 2026

/ok to test 75cfdc0

@github-actions

Merging this branch changes the coverage (2 decrease, 2 increase)

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/fault-remediation 0.00% (ø)
github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation 32.12% (+32.12%) 🌟
github.com/nvidia/nvsentinel/fault-remediation/pkg/crstatus 30.58% (-12.37%) 💀
github.com/nvidia/nvsentinel/fault-remediation/pkg/events 0.00% (ø)
github.com/nvidia/nvsentinel/fault-remediation/pkg/initializer 0.00% (ø)
github.com/nvidia/nvsentinel/fault-remediation/pkg/metrics 0.00% (ø)
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler 21.37% (-4.19%) 👎
github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation 28.16% (+28.16%) 🌟

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/fault-remediation/main.go 0.00% (ø) 433 (-8) 0 433 (-8)
github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation/annotation.go 32.12% (+32.12%) 358 (+358) 115 (+115) 243 (+243) 🌟
github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation/annotation_interface.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-remediation/pkg/crstatus/checker.go 30.58% (-12.37%) 206 (+57) 63 (-1) 143 (+58) 💀
github.com/nvidia/nvsentinel/fault-remediation/pkg/crstatus/crstatus_interface.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-remediation/pkg/events/health_event.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-remediation/pkg/initializer/init.go 0.00% (ø) 248 (+88) 0 248 (+88)
github.com/nvidia/nvsentinel/fault-remediation/pkg/metrics/metrics.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/annotation.go 0.00% (-24.29%) 0 (-280) 0 (-68) 0 (-212) 💀 💀
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler.go 21.37% (-10.03%) 1287 (+427) 275 (+5) 1012 (+422) 💀
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/remediation.go 0.00% (-20.48%) 0 (-918) 0 (-188) 0 (-730) 💀 💀
github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation/fault_remediation_client_interface.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation/remediation.go 28.16% (+28.16%) 1289 (+1289) 363 (+363) 926 (+926) 🌟

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation/annotation_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/crstatus/crstatus_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/annotation_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/remediation_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation/remediation_test.go

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@fault-remediation/pkg/reconciler/reconciler_test.go`:
- Around line 498-500: The comment in reconciler_test.go contradicts the
assertion: the test comment says remediation "should still succeed" but the test
asserts an error for r.performRemediation(ctx, healthEventDoc); either change
the comment to state that remediation should return an error or change the
assertion to assert.NoError(t, err) to match the "should succeed" expectation;
update the line containing r.performRemediation(...) and its surrounding comment
so the comment and the assertion (assert.Error vs assert.NoError) consistently
reflect the intended behavior.

In `@fault-remediation/pkg/reconciler/reconciler.go`:
- Around line 278-280: The TODO in the comment above the handleRemediationEvent
function on the FaultRemediationReconciler needs an issue reference per repo
guidelines; update the comment to replace "// nolint: cyclop // todo" with a
properly formatted tracker reference (for example "// nolint: cyclop // TODO:
track in ISSUE-<number>" or similar) and ensure the referenced issue ID/URL
exists and describes the cyclomatic complexity refactor for
handleRemediationEvent so reviewers can find the ticket.
♻️ Duplicate comments (23)
.gitignore (1)

143-143: Redundant .idea/ patterns remain.

The blanket .idea/ entry at line 143 ignores the entire .idea directory, making all the specific .idea/* patterns below (lines 149–197) redundant. These include .idea/replstate.xml, .idea/sonarlint/, .idea/httpRequests, .idea/codestream.xml, and others.

Choose one approach:

  1. Remove all specific .idea/* entries (lines 149, 164, 167–168, 171, 180, 183, 186–188, 191, 194, 197), OR
  2. Remove the general .idea/ entry at line 143 and keep the granular rules
🧹 Option 1: Remove redundant specific patterns (recommended)
 .idea/
 
 # CMake
 cmake-build-*/
 
-# Mongo Explorer plugin
-.idea/**/mongoSettings.xml
-
 # File-based project format
 *.iws
 
 # IntelliJ output
 out/
 
 # mpeltonen/sbt-idea plugin
 .idea_modules/
 
 # JIRA plugin
 atlassian-ide-plugin.xml
 
-# Cursive Clojure plugin
-.idea/replstate.xml
-
-# SonarLint plugin
-.idea/sonarlint/
-.idea/**/sonarlint/
-
-# SonarQube Plugin
-.idea/**/sonarIssues.xml
-
 # Crashlytics plugin
 com_crashlytics_export_strings.xml
 crashlytics.properties
 crashlytics-build.properties
 fabric.properties
 
-# Editor-based Rest Client
-.idea/httpRequests
-
-# Android studio serialized cache file
-.idea/caches/build_file_checksums.ser
-
-# Markdown Navigator plugin
-.idea/**/markdown-navigator.xml
-.idea/**/markdown-navigator-enh.xml
-.idea/**/markdown-navigator/
-
-# Cache file creation bug
-.idea/$CACHE_FILE$
-
-# CodeStream plugin
-.idea/codestream.xml
-
-# Azure Toolkit for IntelliJ plugin
-.idea/**/azureSettings.xml
-
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml (1)

15-20: Quote template placeholders to satisfy YAML static analysis.

The unquoted Go template placeholders cause YAML syntax errors during static analysis. While the rendered output will be valid YAML, quoting the placeholders makes the template file itself parseable.

Suggested fix
-apiVersion: {{.ApiGroup}}/{{.Version}}
+apiVersion: "{{.ApiGroup}}/{{.Version}}"
 kind: RebootNode
 metadata:
-  name: maintenance-{{.NodeName}}-{{.HealthEventID}}
+  name: "maintenance-{{.NodeName}}-{{.HealthEventID}}"
 spec:
-  nodeName: {{.NodeName}}
+  nodeName: "{{.NodeName}}"
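
To check that the quoted form still renders as expected, here is a minimal sketch using Go's `text/template`. The `templateData` struct and the sample values are assumptions for illustration; the data type the remediation client actually passes to the template is not shown in this thread.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// templateData is a hypothetical stand-in for the template input; its field
// names simply mirror the placeholders used in rebootnode-template.yaml.
type templateData struct {
	ApiGroup      string
	Version       string
	NodeName      string
	HealthEventID string
}

// rebootNodeTemplate is the quoted variant from the suggested fix.
const rebootNodeTemplate = `apiVersion: "{{.ApiGroup}}/{{.Version}}"
kind: RebootNode
metadata:
  name: "maintenance-{{.NodeName}}-{{.HealthEventID}}"
spec:
  nodeName: "{{.NodeName}}"
`

// renderRebootNode executes the quoted template and returns the YAML text.
func renderRebootNode(data templateData) (string, error) {
	tmpl, err := template.New("rebootnode").Parse(rebootNodeTemplate)
	if err != nil {
		return "", fmt.Errorf("parsing template: %w", err)
	}

	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", fmt.Errorf("executing template: %w", err)
	}

	return buf.String(), nil
}

func main() {
	out, err := renderRebootNode(templateData{
		ApiGroup:      "janitor.dgxc.nvidia.com",
		Version:       "v1alpha1",
		NodeName:      "node-1",
		HealthEventID: "abc123",
	})
	if err != nil {
		panic(err)
	}

	fmt.Print(out)
}
```

The quotes end up in the rendered YAML as ordinary string delimiters, so the values parse as the same strings they would without quoting.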
fault-remediation/main.go (1)

113-114: TODO comment should reference an issue.

Per coding guidelines, TODO comments should reference a GitHub issue for tracking.

-	//TODO: setup informers for node and job
+	// TODO(#<issue-number>): setup informers for node and job
fault-remediation/pkg/remediation/templates/log-collector-job.yaml (2)

18-20: Hardcoded namespace test will not work in production.

The namespace is hardcoded to test. This should be parameterized or set programmatically at runtime for the template to work across different environments.

Suggested fix - parameterize namespace
 metadata:
   generateName: log-collector-job-
-  namespace: test
+  namespace: {{.Namespace}}

Ensure the code that renders this template supplies the Namespace field in the template data.
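
One way to catch a missing `Namespace` value at render time is sketched below. It assumes the template data is a map and uses the `missingkey=error` option of `text/template`; the function name `renderJobMetadata` and the namespace value `nvsentinel` are hypothetical, not taken from the PR.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// jobMetadataTemplate mirrors the parameterized metadata from the suggestion.
const jobMetadataTemplate = `metadata:
  generateName: log-collector-job-
  namespace: {{.Namespace}}
`

// renderJobMetadata renders the snippet and returns an error when a key
// referenced by the template is absent from data.
func renderJobMetadata(data map[string]string) (string, error) {
	tmpl, err := template.New("log-collector-job").
		Option("missingkey=error").
		Parse(jobMetadataTemplate)
	if err != nil {
		return "", fmt.Errorf("parsing template: %w", err)
	}

	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", fmt.Errorf("executing template: %w", err)
	}

	return buf.String(), nil
}

func main() {
	out, err := renderJobMetadata(map[string]string{"Namespace": "nvsentinel"})
	fmt.Printf("%s(err=%v)\n", out, err)

	// An empty map has no Namespace entry, so execution fails instead of
	// silently producing an empty namespace.
	_, err = renderJobMetadata(map[string]string{})
	fmt.Println("missing Namespace rejected:", err != nil)
}
```

If the real code passes a struct instead of a map, a missing field is already an execution error, so the option would only matter for map-based data.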


30-32: Placeholder image test:test will fail at runtime.

The container image is set to a placeholder value that doesn't exist. This needs to reference a real log-collector image, either hardcoded to a production image or parameterized via template variable.

Suggested fix - parameterize image
       containers:
         - name: log-collector
-          image: test:test
+          image: {{.Image}}
+          imagePullPolicy: {{.ImagePullPolicy}}

Alternatively, reference the actual production image directly if it's known.

fault-remediation/pkg/events/health_event.go (1)

15-17: Add package-level documentation.

The package lacks a godoc comment describing its purpose. As per coding guidelines, package-level godoc is required for all Go packages.

Proposed documentation
+// Package events provides data structures for representing health events
+// with support for different serialization formats (JSON for document storage
+// and BSON for MongoDB compatibility).
 package events
fault-remediation/pkg/crstatus/crstatus_interface.go (1)

23-25: Add documentation and name interface method parameters.

The exported interface and its method lack documentation, and the method parameters are unnamed. Named parameters improve API clarity and self-documenting code.

πŸ“ Proposed documentation and parameter naming
+// CRStatusCheckerInterface determines whether a new Custom Resource should be created
+// based on the status of existing CRs.
 type CRStatusCheckerInterface interface {
+	// ShouldSkipCRCreation checks if CR creation should be skipped for the given resource and node.
+	// Returns true if a CR already exists or creation should be skipped, false otherwise.
-	ShouldSkipCRCreation(context.Context, string, string) bool
+	ShouldSkipCRCreation(ctx context.Context, resourceName string, nodeName string) bool
 }

As per coding guidelines, function comments are required for all exported Go functions.

fault-remediation/pkg/remediation/remediation_test.go (3)

43-111: Initialize a real fake client in TestNewRemediationClient.

The tt.client field is declared but never populated in the test cases, so NewRemediationClient(tt.client, ...) always receives nil. While this may work if the constructor tolerates a nil client during template loading, it makes the test brittle and doesn't validate actual client wiring.

🔧 Proposed fix
 func TestNewRemediationClient(t *testing.T) {
 	tests := []struct {
 		name        string
-		client      client.Client
 		dryRun      bool
 		wantErr     bool
 		templateDir string
 	}{
 		// ... test cases unchanged ...
 	}

 	for _, tt := range tests {
 		t.Run(tt.name, func(t *testing.T) {
+			fakeClient := fake.NewClientBuilder().Build()
 			testConfig := config.TomlConfig{
 				// ... config unchanged ...
 			}
-			result, err := NewRemediationClient(tt.client, tt.dryRun, testConfig)
+			result, err := NewRemediationClient(fakeClient, tt.dryRun, testConfig)

391-395: Dry-run test case missing templateDir will fail before reaching dry-run logic.

This test case doesn't set templateDir, so RunLogCollectorJob will fail when trying to load the template, not due to dry-run behavior. This doesn't test the intended dry-run skip scenario.

🔧 Proposed fix
 		{
 			name:          "Skip creation with dry run",
 			dryRun:        true,
+			templateDir:   "templates",
 			expectedError: false,
 		},

540-540: Swap assert.Equal argument order (expected, actual).

assert.Equal(t, result.RequeueAfter, tt.requeueTime) should be assert.Equal(t, tt.requeueTime, result.RequeueAfter) for clearer failure messages. The testify convention is expected value first, actual value second.

🔧 Proposed fix
-			assert.Equal(t, result.RequeueAfter, tt.requeueTime)
+			assert.Equal(t, tt.requeueTime, result.RequeueAfter)
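
To illustrate why the order matters, the sketch below uses a hypothetical `equalMessage` helper that loosely mimics how an assertion library labels its arguments. It is not testify's implementation, only a stdlib stand-in for the labelling behaviour.

```go
package main

import (
	"fmt"
	"time"
)

// equalMessage is a hypothetical helper: the first argument is reported as
// "expected" and the second as "actual", as in an (expected, actual) API.
// It returns an empty string when the values are equal.
func equalMessage(expected, actual any) string {
	if expected == actual {
		return ""
	}

	return fmt.Sprintf("Not equal:\nexpected: %v\nactual  : %v", expected, actual)
}

func main() {
	want := 5 * time.Minute
	got := 30 * time.Second

	// Correct order: the labels match what the test author intended.
	fmt.Println(equalMessage(want, got))

	// Swapped order: the comparison result is the same, but the labels are
	// reversed, so the failure message points at the wrong value.
	fmt.Println(equalMessage(got, want))
}
```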
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (2)

15-15: Add package-level documentation.

The package declaration lacks a package-level godoc comment.

πŸ“ Suggested package documentation
+// Package remediation provides interfaces and types for fault remediation operations,
+// including maintenance resource creation, log collection, and node annotation management.
 package remediation

As per coding guidelines, package-level godoc is required for all Go packages.


29-35: Add godoc for the exported interface and its methods.

The FaultRemediationClientInterface and its methods lack documentation.

πŸ“ Suggested documentation
+// FaultRemediationClientInterface defines the contract for fault remediation operations,
+// including CR creation, log collection, and state management.
 type FaultRemediationClientInterface interface {
+	// CreateMaintenanceResource creates a maintenance CR for the given health event and returns the CR name.
 	CreateMaintenanceResource(ctx context.Context, healthEventData *events.HealthEventData) (string, error)
+	// RunLogCollectorJob orchestrates log collection for a node and event, returning a reconcile result.
 	RunLogCollectorJob(ctx context.Context, nodeName string, eventId string) (ctrl.Result, error)
+	// GetAnnotationManager returns the annotation manager for node remediation state tracking.
 	GetAnnotationManager() annotation.NodeAnnotationManagerInterface
+	// GetStatusChecker returns the CR status checker for determining CR creation eligibility.
 	GetStatusChecker() crstatus.CRStatusCheckerInterface
+	// GetConfig returns the remediation configuration.
 	GetConfig() *config.TomlConfig
 }

As per coding guidelines, function comments are required for all exported Go functions.

fault-remediation/pkg/annotation/annotation.go (2)

57-57: TODO should reference an issue.

Per coding guidelines, TODO comments should reference issues in Go code.

πŸ“ Suggested fix
-	// TODO: maybe split this up so it's not returning both node and state
+	// TODO(`#XXX`): Consider splitting this to not return both node and state

180-181: Error message says "patch" but operation is "Update".

Line 180 uses m.client.Update but the error message says "failed to patch node annotation". For consistency, update the error message to match the operation.

🔧 Proposed fix
 	if err = m.client.Update(ctx, updatedNode); err != nil {
-		return fmt.Errorf("failed to patch node annotation: %w", err)
+		return fmt.Errorf("failed to update node annotation: %w", err)
 	}
fault-remediation/pkg/initializer/init.go (1)

69-69: Minor: Inconsistent capitalization in error message.

The error message uses "toml Config" with capital C, which is inconsistent with typical error message conventions.

πŸ“ Suggested fix
-		return nil, fmt.Errorf("error while loading the toml Config: %w", err)
+		return nil, fmt.Errorf("error while loading the toml config: %w", err)
fault-remediation/pkg/annotation/annotation_interface.go (1)

15-15: Add package-level godoc for annotation package.

Required by repo Go guidelines so godoc/lint recognizes the package.

✅ Suggested fix
+// Package annotation defines remediation-state node annotation types and interfaces.
 package annotation
fault-remediation/pkg/crstatus/checker.go (2)

34-38: Add godoc for exported constructor.

Repo guidelines require comments on exported functions.

✅ Suggested fix
+// NewCRStatusChecker creates a CRStatusChecker using the provided client,
+// remediation actions, and dry-run mode.
 func NewCRStatusChecker(
 	client client.Client,
 	remediationActions map[string]config.MaintenanceResource,
 	dryRun bool,
 ) *CRStatusChecker {

68-72: Differentiate NotFound vs other errors to avoid duplicate CRs.

Treating all Get errors as “allow create” can create duplicates during transient/API permission failures.

🔧 Suggested fix
+	"k8s.io/apimachinery/pkg/api/errors"
 ...
-	if err := c.client.Get(ctx, key, obj); err != nil {
-		slog.Warn("Failed to get CR, allowing create", "crName", crName, "gvk", gvk.String(), "error", err)
-		return false
-	}
+	if err := c.client.Get(ctx, key, obj); err != nil {
+		if errors.IsNotFound(err) {
+			slog.Debug("CR not found, allowing create", "crName", crName, "gvk", gvk.String())
+			return false
+		}
+		slog.Error("Failed to get CR status, skipping create", "crName", crName, "gvk", gvk.String(), "error", err)
+		return true
+	}
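
The decision logic in the suggestion can be sketched without a cluster. Here `errNotFound` and `errForbidden` are hypothetical sentinel errors standing in for what a real client would return (and what `errors.IsNotFound` from `k8s.io/apimachinery/pkg/api/errors` would detect). For brevity a successful lookup is treated as "skip"; the real checker would go on to inspect the CR's status.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors used only for this illustration.
var (
	errNotFound  = errors.New("not found")
	errForbidden = errors.New("forbidden")
)

// shouldSkipCreation decides whether CR creation should be skipped based on
// the error returned by a lookup of the existing CR.
func shouldSkipCreation(getErr error) bool {
	switch {
	case getErr == nil:
		// The CR exists. Simplification: skip without inspecting status.
		return true
	case errors.Is(getErr, errNotFound):
		// Nothing exists yet, so creating a CR cannot produce a duplicate.
		return false
	default:
		// Transient or permission failure: existence is unknown, so the
		// conservative choice is to skip rather than risk a duplicate.
		return true
	}
}

func main() {
	wrapped := fmt.Errorf("get RebootNode: %w", errNotFound)

	fmt.Println("not found ->", shouldSkipCreation(wrapped))
	fmt.Println("forbidden ->", shouldSkipCreation(errForbidden))
}
```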
fault-remediation/pkg/reconciler/reconciler.go (2)

219-246: CR creation success can be overwritten by label-update error.

If CreateMaintenanceResource succeeds but the label update fails, performRemediation returns an error and callers set FaultRemediated=false, potentially leaving a created CR but a failed DB status.

Consider returning creation success separately (or ignoring label-update errors when CR creation succeeds) so downstream status reflects CR creation.

🔧 Possible direction
-func (r *FaultRemediationReconciler) performRemediation(...) (string, error) {
+func (r *FaultRemediationReconciler) performRemediation(...) (crName string, created bool, err error) {
...
-	crName, createMaintenanceResourceError := ...
+	crName, createMaintenanceResourceError := ...
+	created = createMaintenanceResourceError == nil && crName != ""
...
-		return "", errors.Join(createMaintenanceResourceError, err)
+		return crName, created, errors.Join(createMaintenanceResourceError, err)
...
-	return crName, nil
+	return crName, created, nil
}
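
A self-contained sketch of that direction follows. The function signature, the callback parameters, and `errLabelUpdate` are assumptions made for illustration; only the idea of returning a separate `created` flag alongside an `errors.Join` result comes from the suggestion above.

```go
package main

import (
	"errors"
	"fmt"
)

// errLabelUpdate is a hypothetical error for a failed node label update.
var errLabelUpdate = errors.New("label update failed")

// performRemediation runs CR creation and the label update, and reports CR
// creation success separately from the combined error.
func performRemediation(
	createCR func() (string, error),
	updateLabel func() error,
) (crName string, created bool, err error) {
	var createErr error

	crName, createErr = createCR()
	created = createErr == nil && crName != ""

	labelErr := updateLabel()

	// errors.Join returns nil when every argument is nil.
	return crName, created, errors.Join(createErr, labelErr)
}

func main() {
	name, created, err := performRemediation(
		func() (string, error) { return "maintenance-node-1", nil },
		func() error { return errLabelUpdate },
	)

	// The caller can still record that the CR exists even though err != nil.
	fmt.Println(name, created, err)
}
```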

493-498: Return the correct error on parse failure.

You wrap the parse error with "error updating resume token", which is misleading when MarkProcessed succeeds.

✅ Suggested fix
-		return result, fmt.Errorf("error updating resume token: %w", err)
+		return result, fmt.Errorf("error parsing health event: %w", err)
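
As a small illustration of why the wrapping text should name the operation that failed, the sketch below wraps a parse failure with `%w`. The names `parseHealthEvent`, `handleEvent`, and `errMalformedEvent` are hypothetical and do not come from the reconciler.

```go
package main

import (
	"errors"
	"fmt"
)

// errMalformedEvent is a hypothetical sentinel for an unparsable event.
var errMalformedEvent = errors.New("malformed event")

// parseHealthEvent is a stand-in parser that rejects empty input.
func parseHealthEvent(raw string) error {
	if raw == "" {
		return errMalformedEvent
	}

	return nil
}

// handleEvent wraps the parse error with context describing the parse step,
// so the message matches the operation that actually failed.
func handleEvent(raw string) error {
	if err := parseHealthEvent(raw); err != nil {
		return fmt.Errorf("error parsing health event: %w", err)
	}

	return nil
}

// isMalformed reports whether err wraps errMalformedEvent.
func isMalformed(err error) bool {
	return errors.Is(err, errMalformedEvent)
}

func main() {
	err := handleEvent("")
	fmt.Println(err)
	fmt.Println("wraps malformed event:", isMalformed(err))
}
```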
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (2)

301-307: Make template path resilient to working directory changes.

"./templates" is brittle in CI; resolve relative to the test file.

🔧 Suggested fix
+import (
+	...
+	"runtime"
+	...
+)
...
 func createTestRemediationClient(dryRun bool) (remediation.FaultRemediationClientInterface, error) {
+	_, thisFile, _, _ := runtime.Caller(0)
+	templatesDir := filepath.Join(filepath.Dir(thisFile), "templates")
 	remediationConfig := config.TomlConfig{
 		Template: config.Template{
-			MountPath: "./templates",
+			MountPath: templatesDir,
 			FileName:  "rebootnode-template.yaml",
 		},
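
The path resolution on its own can be sketched as below. `templatesDir` is a hypothetical helper name; the technique is the same `runtime.Caller(0)` plus `filepath.Join` combination shown in the suggestion, with the `ok` result checked rather than discarded.

```go
package main

import (
	"fmt"
	"path/filepath"
	"runtime"
)

// templatesDir returns the "templates" directory that sits next to this
// source file, independent of the process working directory. The second
// result is false when caller information is unavailable.
func templatesDir() (string, bool) {
	_, thisFile, _, ok := runtime.Caller(0)
	if !ok {
		return "", false
	}

	return filepath.Join(filepath.Dir(thisFile), "templates"), true
}

func main() {
	dir, ok := templatesDir()
	fmt.Println(dir, ok)
}
```

The file path reported by `runtime.Caller` comes from build-time information, so this approach suits tests run from a source checkout rather than relocated binaries.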

370-372: TODOs need issue references.

There are multiple TODOs without issue IDs (e.g., Lines 370+, 432+, 491+, 521+). Please attach tracker IDs per guidelines.

✅ Example fix
-// TODO: ignoring error otherwise need to properly walk state transitions
+// TODO(NVS-1234): handle state transitions before asserting remediation flow
fault-remediation/pkg/reconciler/reconciler_test.go (1)

637-642: Make requeue test case unique for clearer reporting.

The requeue case duplicates nodeName/description from the previous case, which makes failures harder to interpret.

✅ Suggested fix
 	{
 		name:           "Log collector job return requeue",
-		nodeName:       "test-node-fail",
+		nodeName:       "test-node-requeue",
 		jobResult:      false,
 		expectedResult: false,
-		description:    "Error path - job fails to complete",
+		description:    "Requeue path - job returns requeue result",
 		returnedResult: ctrl.Result{RequeueAfter: 5 * time.Minute},
 	},
🧹 Nitpick comments (3)
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml (1)

96-111: LGTM - Unified health/readiness probes.

The removal of conditional gating for health/ready probes aligns well with the unified controller-runtime flow. Both probes now consistently use the health port.

Minor style note: There's inconsistent quoting between port: health (line 99) and port: "health" (line 107). Consider standardizing to unquoted named ports for consistency.

Optional: Standardize port references
           readinessProbe:
             httpGet:
-              path:  "/readyz"
-              port: "health"
+              path: "/readyz"
+              port: health
fault-remediation/pkg/events/health_event.go (1)

25-29: Improve godoc comment for HealthEventData.

The current comment mentions "string ID for compatibility" but doesn't explain the BSON tags or MongoDB context, which would help future maintainers understand when to use this type vs HealthEventDoc.

Suggested improvement
-// HealthEventData represents health event data with string ID for compatibility
+// HealthEventData represents health event data with BSON tags for MongoDB storage compatibility.
 type HealthEventData struct {
 	ID                          string `bson:"_id,omitempty"`
 	model.HealthEventWithStatus `bson:",inline"`
 }
fault-remediation/pkg/remediation/templates/log-collector-job.yaml (1)

21-32: Consider adding resource limits and backoffLimit for production hardening.

The Job template lacks:

  • spec.backoffLimit to control retry behavior (defaults to 6)
  • Container resource requests/limits
  • Security context

These may be acceptable for a log collection job, but consider adding them for production use.

Optional production hardening
 spec:
   ttlSecondsAfterFinished: 3600
+  backoffLimit: 3
   template:
     metadata:
       labels:
         app: log-collector
     spec:
       serviceAccountName: log-collector-job
       restartPolicy: Never
       containers:
         - name: log-collector
           image: test:test
+          resources:
+            requests:
+              memory: "64Mi"
+              cpu: "100m"
+            limits:
+              memory: "256Mi"
+              cpu: "500m"
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 75cfdc0 and db379df.

⛔ Files ignored due to path filters (1)
  • commons/go.sum is excluded by !**/*.sum
📒 Files selected for processing (31)
  • .gitignore
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
  • distros/kubernetes/nvsentinel/values-full.yaml
  • distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
  • distros/kubernetes/nvsentinel/values-tilt.yaml
  • distros/kubernetes/nvsentinel/values.yaml
  • fault-remediation/go.mod
  • fault-remediation/main.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/annotation.go
  • fault-remediation/pkg/reconciler/annotation_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/remediation.go
  • fault-remediation/pkg/reconciler/remediation_test.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
  • fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
  • tilt/csp-api-mock/Tiltfile
💤 Files with no reviewable changes (8)
  • fault-remediation/pkg/reconciler/annotation_test.go
  • fault-remediation/pkg/reconciler/remediation_test.go
  • distros/kubernetes/nvsentinel/values.yaml
  • distros/kubernetes/nvsentinel/values-tilt.yaml
  • distros/kubernetes/nvsentinel/values-full.yaml
  • distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
  • fault-remediation/pkg/reconciler/annotation.go
  • fault-remediation/pkg/reconciler/remediation.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • fault-remediation/go.mod
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/main.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/reconciler/reconciler.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
🧠 Learnings (29)
📓 Common learnings
Learnt from: ivelichkovich
Repo: NVIDIA/NVSentinel PR: 544
File: fault-remediation/pkg/annotation/annotation.go:131-166
Timestamp: 2026-01-15T18:23:48.147Z
Learning: In fault-remediation/pkg/annotation/annotation.go, the node annotation for remediation state is designed with the assumption that only one controller should be acting on a given equivalence group at a time. Concurrent modifications to the same part of the node annotation aren't expected in normal operation.
📚 Learning: 2025-11-26T13:54:26.427Z
Learnt from: rupalis-nv
Repo: NVIDIA/NVSentinel PR: 361
File: distros/kubernetes/nvsentinel/values-tilt.yaml:215-223
Timestamp: 2025-11-26T13:54:26.427Z
Learning: For values-tilt.yaml files: Keep documentation minimal and implementation-focused. Unlike production values.yaml files which need comprehensive inline comments, values-tilt.yaml is a Tilt-specific testing/development override file that doesn't require extensive documentation.

Applied to files:

  • tilt/csp-api-mock/Tiltfile
📚 Learning: 2026-01-12T05:13:24.947Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:347-372
Timestamp: 2026-01-12T05:13:24.947Z
Learning: In the NVSentinel codebase, all HealthEvent creation paths across health monitors (syslog-health-monitor, kubernetes-object-monitor, csp-health-monitor) consistently set GeneratedTimestamp to timestamppb.New(time.Now()), ensuring it is never nil in production flows.

Applied to files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
📚 Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.

Applied to files:

  • fault-remediation/pkg/events/health_event.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions

Applied to files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/crstatus/checker.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code

Applied to files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/main.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/crstatus/checker.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages

Applied to files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
📚 Learning: 2026-01-15T18:25:15.442Z
Learnt from: ivelichkovich
Repo: NVIDIA/NVSentinel PR: 544
File: fault-remediation/pkg/remediation/remediation.go:469-504
Timestamp: 2026-01-15T18:25:15.442Z
Learning: When handling Kubernetes Jobs, if batch/v1 JobComplete is true, Job.Status.StartTime is guaranteed to be non-nil by the API. Therefore, in remediation.go (and similar code paths), you can omit a nil check for StartTime when the Complete condition is true. Keep nil checks only for scenarios where StartTime may legitimately be absent (e.g., before the Job starts).

Applied to files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/main.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2026-01-14T02:33:07.679Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 689
File: janitor/pkg/controller/rebootnode_controller_test.go:371-436
Timestamp: 2026-01-14T02:33:07.679Z
Learning: In the NVSentinel janitor controller tests, tests that demonstrate original bugs or issues that were fixed by a PR should be kept for posterity, even if they reference removed functionality like MaxRebootRetries or RetryCount fields. These historical test cases serve as documentation of what problem was being solved.

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2026-01-15T18:23:48.147Z
Learnt from: ivelichkovich
Repo: NVIDIA/NVSentinel PR: 544
File: fault-remediation/pkg/annotation/annotation.go:131-166
Timestamp: 2026-01-15T18:23:48.147Z
Learning: In fault-remediation/pkg/annotation/annotation.go, the node annotation for remediation state is designed with the assumption that only one controller should be acting on a given equivalence group at a time. Concurrent modifications to the same part of the node annotation aren't expected in normal operation.

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/main.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-27T15:42:48.142Z
Learnt from: lalitadithya
Repo: NVIDIA/NVSentinel PR: 459
File: docs/cancelling-breakfix.md:58-90
Timestamp: 2025-11-27T15:42:48.142Z
Learning: The label `k8saas.nvidia.com/ManagedByNVSentinel` exists in NVSentinel and is used to opt nodes out of automated break-fix workflows. The prefix `k8saas.nvidia.com/` is configurable via the `labelPrefix` setting in the fault-quarantine Helm values, and the documentation correctly uses the default value.

Applied to files:

  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior

Applied to files:

  • fault-remediation/main.go
  • fault-remediation/pkg/annotation/annotation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-11-04T06:31:02.147Z
Learnt from: Gyan172004
Repo: NVIDIA/NVSentinel PR: 223
File: platform-connectors/pkg/nodemetadata/processor.go:0-0
Timestamp: 2025-11-04T06:31:02.147Z
Learning: In platform-connectors/pkg/nodemetadata/processor.go, the NewProcessor function does not perform a nil check on the config parameter because the caller is expected to guarantee a non-nil config is provided.

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-10-29T12:40:29.621Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 143
File: fault-quarantine-module/pkg/informer/k8s_client.go:52-67
Timestamp: 2025-10-29T12:40:29.621Z
Learning: The clientcmd.BuildConfigFromFlags function in k8s.io/client-go/tools/clientcmd automatically handles in-cluster configuration as a fallback. When both masterUrl and kubeconfigPath parameters are empty strings, it internally attempts rest.InClusterConfig() before falling back to default config loading rules. No explicit in-cluster fallback logic is needed when using this function.

Applied to files:

  • fault-remediation/main.go
  • fault-remediation/pkg/initializer/init.go
📚 Learning: 2026-01-15T18:16:14.309Z
Learnt from: ivelichkovich
Repo: NVIDIA/NVSentinel PR: 544
File: fault-remediation/pkg/annotation/annotation.go:50-57
Timestamp: 2026-01-15T18:16:14.309Z
Learning: In fault-remediation/pkg/annotation/annotation.go, corrupt remediation state annotations are intentionally treated as empty state (returning an empty RemediationStateAnnotation) rather than returning an error. This graceful degradation prevents node workflows from getting stuck if manual actions or other issues corrupt the annotation. The unmarshal error is still logged for visibility.

Applied to files:

  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
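
The graceful-degradation behavior described in this learning can be sketched roughly as follows. This is not the code from annotation.go: `remediationState` and its `Groups` field are hypothetical stand-ins for the real `RemediationStateAnnotation` type, whose fields are not shown in this review.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log/slog"
)

// remediationState is a hypothetical, simplified stand-in for the
// RemediationStateAnnotation type; the real fields are not shown in this review.
type remediationState struct {
	Groups map[string]string `json:"groups"`
}

// parseState returns an empty state when the annotation is missing or corrupt,
// logging the unmarshal error instead of returning it to the caller.
func parseState(raw string) remediationState {
	empty := remediationState{Groups: map[string]string{}}
	if raw == "" {
		return empty
	}

	var state remediationState
	if err := json.Unmarshal([]byte(raw), &state); err != nil {
		slog.Error("failed to unmarshal remediation state, treating as empty", "error", err)
		return empty
	}

	if state.Groups == nil {
		state.Groups = map[string]string{}
	}

	return state
}

func main() {
	fmt.Println(len(parseState(`{"groups":{"restart":"cr-1"}}`).Groups))
	fmt.Println(len(parseState(`{not json`).Groups))
}
```

Returning a fresh empty value on failure, rather than the partially decoded struct, keeps a half-parsed annotation from leaking into later decisions.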
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code

Applied to files:

  • fault-remediation/pkg/initializer/init.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/crstatus/checker.go
📚 Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
🧬 Code graph analysis (6)
fault-remediation/pkg/remediation/remediation_test.go (3)
fault-remediation/pkg/config/config.go (2)
  • Template (47-50)
  • MaintenanceResource (27-44)
fault-remediation/pkg/remediation/remediation.go (2)
  • NewRemediationClient (71-129)
  • FaultRemediationClient (58-69)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventData (26-29)
fault-remediation/main.go (2)
commons/pkg/auditlogger/roundtripper.go (1)
  • NewAuditingRoundTripper (42-47)
fault-remediation/pkg/initializer/init.go (2)
  • InitializationParams (40-45)
  • InitializeAll (52-152)
fault-remediation/pkg/initializer/init.go (3)
fault-remediation/pkg/remediation/remediation.go (1)
  • NewRemediationClient (71-129)
commons/pkg/statemanager/statemanager.go (2)
  • NewStateManager (206-210)
  • StateManager (197-200)
store-client/pkg/datastore/config.go (1)
  • LoadDatastoreConfig (27-44)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (5)
fault-remediation/pkg/config/config.go (2)
  • TomlConfig (59-71)
  • Template (47-50)
fault-remediation/pkg/remediation/remediation.go (1)
  • NewRemediationClient (71-129)
commons/pkg/statemanager/statemanager.go (2)
  • NewStateManager (206-210)
  • StateManager (197-200)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • AnnotationKey (26-26)
fault-remediation/pkg/metrics/metrics.go (6)
  • TotalEventsReceived (33-38)
  • EventHandlingDuration (62-68)
  • EventsProcessed (39-45)
  • CRStatusCreated (25-25)
  • CRStatusSkipped (26-26)
  • ProcessingErrors (46-52)
fault-remediation/pkg/crstatus/checker.go (1)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (27-44)
fault-remediation/pkg/reconciler/reconciler.go (8)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)
  • FaultRemediationClientInterface (29-35)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • NodeAnnotationManagerInterface (30-35)
fault-remediation/pkg/metrics/metrics.go (5)
  • EventHandlingDuration (62-68)
  • TotalEventsReceived (33-38)
  • TotalUnsupportedRemediationActions (53-59)
  • ProcessingErrors (46-52)
  • EventsProcessed (39-45)
commons/pkg/statemanager/statemanager.go (1)
  • StateManager (197-200)
data-models/pkg/protos/health_event.pb.go (1)
  • RecommendedAction_NONE (96-96)
fault-remediation/pkg/events/health_event.go (2)
  • HealthEventDoc (20-23)
  • HealthEventData (26-29)
store-client/pkg/storewatcher/watch_store.go (1)
  • ChangeStreamWatcher (43-57)
store-client/pkg/utils/document_utils.go (1)
  • ExtractDocumentID (59-84)
🪛 YAMLlint (1.37.1)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml

[error] 15-15: syntax error: expected , but found ''

(syntax)

🔇 Additional comments (13)
tilt/csp-api-mock/Tiltfile (1)

26-30: LGTM! Host port change to avoid conflicts.

The port forward change from 8080:8080 to 8081:8080 shifts only the host-side port for local development while keeping the container port at 8080. No other configurations require updates: the Kubernetes deployment definitions and service endpoints correctly reference the internal container port (8080), not the host port.

fault-remediation/main.go (2)

42-48: Scheme initialization order is correct.

The package-level scheme variable is initialized before init() runs, so adding resources to it in init() works correctly. This properly registers core/v1 and batch/v1 resources for the controller-runtime manager.
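
The ordering this comment relies on is a Go language guarantee: package-level variables are initialized before any `init()` function in the same package runs. A minimal sketch, using a plain map as a hypothetical stand-in for the runtime scheme (the names `registry` and `newRegistry` are illustrative only):

```go
package main

import "fmt"

// registry is a hypothetical stand-in for the runtime scheme: a package-level
// variable whose initializer runs before any init() function in the package.
var registry = newRegistry()

// newRegistry returns an empty, non-nil map.
func newRegistry() map[string]bool {
	return map[string]bool{}
}

// init runs after package-level variables are initialized, so registry is
// already a usable, non-nil map here.
func init() {
	registry["core/v1"] = true
	registry["batch/v1"] = true
}

func main() {
	fmt.Println(len(registry))
}
```

If `registry` were declared without an initializer, the writes in `init()` would panic on a nil map, which is why the declaration order matters for this pattern.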


105-175: Well-structured controller-runtime setup.

The unified setup flow properly:

  • Wraps config with auditing round-tripper
  • Configures manager with health probes, metrics, and leader election
  • Initializes components with the manager's client
  • Defers cleanup after successful initialization
  • Sets up the reconciler with the manager before starting

This follows controller-runtime best practices and implements proper shutdown handling with context cancellation as per coding guidelines.

fault-remediation/pkg/remediation/remediation_test.go (2)

113-207: LGTM!

Good end-to-end test coverage for template file validation scenarios. The tests properly verify that initialization fails with appropriate error messages when template configuration is missing or empty.


209-358: LGTM!

Well-structured table-driven tests covering key scenarios: non-existent node, successful creation, and dry-run mode. The tests properly use fake clients and validate object creation through the Kubernetes API.

fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)

37-52: LGTM!

The TemplateData struct is well-documented with clear inline comments explaining each field's purpose.

fault-remediation/pkg/initializer/init.go (2)

85-101: LGTM with a minor suggestion.

The initialization flow correctly creates the remediation client and state manager with proper error wrapping. The error messages are now appropriately descriptive.

The log message at line 101 could be more specific (e.g., "Successfully initialized remediation client and state manager"), but this is a minor point.


135-144: LGTM!

The reconciler configuration properly wires the new RemediationClient and StateManager components.

fault-remediation/pkg/annotation/annotation.go (5)

15-42: LGTM!

Good package-level documentation and well-documented struct with constructor. The implementation follows Go conventions.


44-81: LGTM!

The GetRemediationState method correctly handles missing annotations by returning an empty state. The intentional graceful degradation for corrupt annotations (returning empty state rather than error) prevents workflows from getting stuck while still logging the error for visibility. Based on learnings, this is the intended design.


83-123: LGTM!

The UpdateRemediationState method correctly implements the read-modify-write pattern. Based on learnings, the design assumes only one controller acts on a given equivalence group at a time, making explicit retry-on-conflict unnecessary for this use case.


125-150: LGTM!

The ClearRemediationState method properly handles the case where annotations are nil and correctly removes the annotation key.


152-187: LGTM overall!

The RemoveGroupFromState method correctly handles group removal and optimizes by clearing the entire annotation when no groups remain. The logic for delegating to ClearRemediationState when empty is clean.
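
The "clear the annotation when no groups remain" optimization can be illustrated with a plain map. This is only a sketch under assumptions: `removeGroup` and the group and CR names are hypothetical, and the real method operates on node annotations through the Kubernetes client.

```go
package main

import "fmt"

// removeGroup deletes one group from the state and reports whether the whole
// annotation should be cleared because no groups remain.
func removeGroup(groups map[string]string, group string) (shouldClear bool) {
	delete(groups, group)
	return len(groups) == 0
}

func main() {
	// Hypothetical equivalence groups mapped to the CRs created for them.
	groups := map[string]string{"restart": "cr-a", "reset": "cr-b"}

	// One group is still present after the first removal.
	fmt.Println(removeGroup(groups, "restart"))

	// Nothing is left after the second removal, so a caller would delete the
	// annotation key instead of writing back an empty state.
	fmt.Println(removeGroup(groups, "reset"))
}
```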

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines 498 to +500
 	// Even with label update errors, remediation should still succeed
-	success, crName, err := r.performRemediation(ctx, healthEventDoc)
-	assert.NoError(t, err)
-	assert.True(t, success)
-	assert.Equal(t, "test-cr-label-error", crName)
+	_, err := r.performRemediation(ctx, healthEventDoc)
+	assert.Error(t, err)

⚠️ Potential issue | 🟡 Minor

Test comment conflicts with asserted behavior.

Comment says remediation "should still succeed" but the test asserts an error. Update the comment or the assertion to match the expected behavior.

✅ Suggested fix (comment)
-	// Even with label update errors, remediation should still succeed
+	// Label update errors should be surfaced to the caller
🤖 Prompt for AI Agents
In `@fault-remediation/pkg/reconciler/reconciler_test.go` around lines 498 - 500,
The comment in reconciler_test.go contradicts the assertion: the test comment
says remediation "should still succeed" but the test asserts an error for
r.performRemediation(ctx, healthEventDoc); either change the comment to state
that remediation should return an error or change the assertion to
assert.NoError(t, err) to match the "should succeed" expectation; update the
line containing r.performRemediation(...) and its surrounding comment so the
comment and the assertion (assert.Error vs assert.NoError) consistently reflect
the intended behavior.

Comment on lines 278 to 280
// handleRemediationEvent processes remediation for quarantined nodes
// nolint: cyclop // todo
func (r *FaultRemediationReconciler) handleRemediationEvent(

⚠️ Potential issue | 🟡 Minor

TODO needs an issue reference.

Repo guideline: TODOs must point to a tracker.

✅ Suggested fix
-// nolint: cyclop // todo
+// nolint: cyclop // TODO(NVS-1234): refactor handleRemediationEvent to reduce cyclomatic complexity
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-// handleRemediationEvent processes remediation for quarantined nodes
-// nolint: cyclop // todo
-func (r *FaultRemediationReconciler) handleRemediationEvent(
+// handleRemediationEvent processes remediation for quarantined nodes
+// nolint: cyclop // TODO(NVS-1234): refactor handleRemediationEvent to reduce cyclomatic complexity
+func (r *FaultRemediationReconciler) handleRemediationEvent(
🤖 Prompt for AI Agents
In `@fault-remediation/pkg/reconciler/reconciler.go` around lines 278 - 280, The
TODO in the comment above the handleRemediationEvent function on the
FaultRemediationReconciler needs an issue reference per repo guidelines; update
the comment to replace "// nolint: cyclop // todo" with a properly formatted
tracker reference (for example "// nolint: cyclop // TODO: track in
ISSUE-<number>" or similar) and ensure the referenced issue ID/URL exists and
describes the cyclomatic complexity refactor for handleRemediationEvent so
reviewers can find the ticket.

@KaivalyaMDabhadkar
Contributor

/ok to test db379df

@github-actions

Merging this branch changes the coverage (2 decrease, 2 increase)

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/fault-remediation 0.00% (ø)
github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation 32.12% (+32.12%) 🌟
github.com/nvidia/nvsentinel/fault-remediation/pkg/crstatus 30.58% (-12.37%) 💀
github.com/nvidia/nvsentinel/fault-remediation/pkg/events 0.00% (ø)
github.com/nvidia/nvsentinel/fault-remediation/pkg/initializer 0.00% (ø)
github.com/nvidia/nvsentinel/fault-remediation/pkg/metrics 0.00% (ø)
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler 21.37% (-4.19%) 👎
github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation 28.16% (+28.16%) 🌟

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/fault-remediation/main.go 0.00% (ø) 433 (-8) 0 433 (-8)
github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation/annotation.go 32.12% (+32.12%) 358 (+358) 115 (+115) 243 (+243) 🌟
github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation/annotation_interface.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-remediation/pkg/crstatus/checker.go 30.58% (-12.37%) 206 (+57) 63 (-1) 143 (+58) 💀
github.com/nvidia/nvsentinel/fault-remediation/pkg/crstatus/crstatus_interface.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-remediation/pkg/events/health_event.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-remediation/pkg/initializer/init.go 0.00% (ø) 248 (+88) 0 248 (+88)
github.com/nvidia/nvsentinel/fault-remediation/pkg/metrics/metrics.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/annotation.go 0.00% (-24.29%) 0 (-280) 0 (-68) 0 (-212) 💀 💀
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler.go 21.37% (-10.03%) 1287 (+427) 275 (+5) 1012 (+422) 💀
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/remediation.go 0.00% (-20.48%) 0 (-918) 0 (-188) 0 (-730) 💀 💀
github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation/fault_remediation_client_interface.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation/remediation.go 28.16% (+28.16%) 1289 (+1289) 363 (+363) 926 (+926) 🌟

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.
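
For readers checking the table, the percentages appear to be covered statements divided by total statements. The sketch below recomputes three rows from the "Total" and "Covered" columns above; the helper names are illustrative and are not part of the coverage tool.

```go
package main

import "fmt"

// coveragePercent returns covered/total as a percentage, or 0 when there are
// no statements to cover.
func coveragePercent(covered, total int) float64 {
	if total == 0 {
		return 0
	}
	return float64(covered) / float64(total) * 100
}

// formatCoverage renders the percentage with two decimals, as in the report.
func formatCoverage(covered, total int) string {
	return fmt.Sprintf("%.2f%%", coveragePercent(covered, total))
}

func main() {
	fmt.Println("annotation.go:", formatCoverage(115, 358))
	fmt.Println("checker.go:", formatCoverage(63, 206))
	fmt.Println("reconciler.go:", formatCoverage(275, 1287))
}
```

The zero-total guard matches the interface-only files in the table, which report 0.00% with no statements at all.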

Changed unit test files

  • github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation/annotation_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/crstatus/crstatus_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/annotation_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/remediation_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation/remediation_test.go

Signed-off-by: Igor Velichkovich <[email protected]>
@natherz97
Contributor

/ok to test 4abe534

Signed-off-by: Igor Velichkovich <[email protected]>
@natherz97
Contributor

/ok to test f5530d5

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @.github/workflows/e2e-test.yml:
- Line 161: Remove the temporary test-filtering environment variable: delete or
unset the TEST_EXTRA_FLAGS entry that sets '-run ^TestLogCollectorFailure$' so
the workflow no longer restricts E2E runs to only TestLogCollectorFailure;
locate the TEST_EXTRA_FLAGS definition in the e2e workflow and either remove the
line or restore it to the intended/default value so the full E2E suite runs.
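To illustrate why the filter above narrows the run, here is a minimal Go sketch of how a `-run` pattern selects top-level tests. It assumes the pattern is applied as an ordinary unanchored regular expression to each test name, which is how `go test -run` treats top-level names. The helper `selectTests` and every test name other than `TestLogCollectorFailure` are hypothetical and exist only for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// selectTests returns the names that a -run style pattern would select.
// It is a hypothetical helper for illustration, not part of the repository.
// It only models top-level test names (no slash-separated subtests).
func selectTests(pattern string, names []string) ([]string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}

	var selected []string

	for _, name := range names {
		// MatchString is unanchored: anchors must be written in the pattern.
		if re.MatchString(name) {
			selected = append(selected, name)
		}
	}

	return selected, nil
}

func main() {
	// Only the first name comes from the review comment; the rest are made up.
	names := []string{
		"TestLogCollectorFailure",
		"TestLogCollectorFailureRetry",
		"TestNodeAnnotation",
	}

	for _, pattern := range []string{"^TestLogCollectorFailure$", "TestLogCollector", ""} {
		selected, err := selectTests(pattern, names)
		if err != nil {
			fmt.Println("invalid pattern:", err)
			continue
		}

		fmt.Printf("%q selects %d of %d tests: %v\n", pattern, len(selected), len(names), selected)
	}
}
```

With the anchored pattern only one name is selected, while an empty pattern matches every name, which corresponds to running the full suite once the filter is removed.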
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4abe534 and f5530d5.

📒 Files selected for processing (2)
  • .github/workflows/e2e-test.yml
  • tests/helpers/kube.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/helpers/kube.go
🧰 Additional context used
🧠 Learnings (1)
πŸ““ Common learnings
Learnt from: ivelichkovich
Repo: NVIDIA/NVSentinel PR: 544
File: fault-remediation/pkg/annotation/annotation.go:131-166
Timestamp: 2026-01-15T18:23:55.797Z
Learning: In fault-remediation/pkg/annotation/annotation.go, the node annotation for remediation state is designed with the assumption that only one controller should be acting on a given equivalence group at a time. Concurrent modifications to the same part of the node annotation aren't expected in normal operation.
🔇 Additional comments (1)
.github/workflows/e2e-test.yml (1)

204-204: LGTM!

Collecting job resources in debug artifacts is a useful addition for debugging log-collector job failures introduced in this PR.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Signed-off-by: Igor Velichkovich <[email protected]>
@natherz97
Contributor

/ok to test ff06947

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tests/Makefile`:
- Line 71: The Makefile test target currently restricts E2E runs with the `-run
^TestLogCollectorFailure$` filter; remove that `-run` argument from the
`$(GOTESTSUM)` invocation so the command runs the full test suite (restore
`$(GOTESTSUM) --format standard-verbose -- -tags=$(TEST_TAGS)
$(TEST_EXTRA_FLAGS) ./... -coverprofile=coverage.txt -covermode atomic
-coverpkg=github.com/nvidia/nvsentinel/$(MODULE_NAME)/...`), and if the filter
was left intentionally, add a comment explaining why and create a tracking issue
to restore full coverage instead of leaving the filter in place.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f5530d5 and ff06947.

📒 Files selected for processing (3)
  • .github/workflows/e2e-test.yml
  • distros/kubernetes/nvsentinel/charts/fault-remediation/files/log-collector-job.yaml
  • tests/Makefile
🧰 Additional context used
🧠 Learnings (5)
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

  • tests/Makefile
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • tests/Makefile
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • tests/Makefile
📚 Learning: 2025-12-22T16:16:31.660Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.

Applied to files:

  • tests/Makefile
📚 Learning: 2025-11-27T15:42:48.142Z
Learnt from: lalitadithya
Repo: NVIDIA/NVSentinel PR: 459
File: docs/cancelling-breakfix.md:58-90
Timestamp: 2025-11-27T15:42:48.142Z
Learning: The label `k8saas.nvidia.com/ManagedByNVSentinel` exists in NVSentinel and is used to opt nodes out of automated break-fix workflows. The prefix `k8saas.nvidia.com/` is configurable via the `labelPrefix` setting in the fault-quarantine Helm values, and the documentation correctly uses the default value.

Applied to files:

  • distros/kubernetes/nvsentinel/charts/fault-remediation/files/log-collector-job.yaml
🔇 Additional comments (2)
distros/kubernetes/nvsentinel/charts/fault-remediation/files/log-collector-job.yaml (1)

23-24: LGTM. Job-level label matches pod labels.
Nice to have the Job metadata labeled consistently with the pod template for selection and tracking.

.github/workflows/e2e-test.yml (1)

203-203: LGTM!

Good addition to capture Job resources on failure. This aligns well with the remediation refactoring that introduces job-based log collector workflows. The command follows the established pattern with proper error handling (|| true).

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Signed-off-by: Igor Velichkovich <[email protected]>
@natherz97
Contributor

/ok to test cad8be0

Signed-off-by: Igor Velichkovich <[email protected]>
@natherz97
Contributor

/ok to test c2fc2de


5 participants