Skip to content

DNM: Reconcile follow up - Fix reconciliation loop using semantic comparison #467

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 8 commits into from

Conversation

andreaskaris
Copy link
Contributor

@andreaskaris andreaskaris commented Aug 20, 2025

Follow-up to #461 - to be rebased and merged once #461 lands

Fix reconciliation loop using semantic comparison

This PR resolves a reconciliation loop issue in the bpfman-operator where resources were being
continuously updated even when no meaningful changes were present.

Changes

Controller Logic:

  • Modified assureResource function to accept a needsUpdate callback parameter
  • Added semantic deep equality comparison using k8s.io/apimachinery/pkg/api/equality
  • Resources are now only updated when actual differences exist between desired and current state
  • Applied conditional updates to ConfigMap, CSIDriver, SecurityContextConstraints, and DaemonSet
    resources

YAML Manifest Synchronization:

  • Updated daemonset.yaml and metrics-proxy-daemonset.yaml to include missing Kubernetes default
    fields
  • Added fields like terminationMessagePath, terminationMessagePolicy, apiVersion in fieldRef,
    updateStrategy, etc.
  • This ensures the YAML manifests match what Kubernetes actually deploys, reducing spec drift

Problem Solved

Previously, the operator would continuously reconcile resources because:

  1. The YAML manifests were missing default fields that Kubernetes adds automatically
  2. The assureResource function always performed updates without checking if changes were necessary
  3. This created an endless loop where the operator detected differences between desired and actual
    state

Testing

The changes maintain backward compatibility while preventing unnecessary resource churn. The operator
now only updates resources when there are actual semantic differences in the specifications.

andreaskaris and others added 8 commits August 20, 2025 19:01
Transition from ConfigMap to Config CRD to enable cascading deletion
via ownerReferences for namespace and cluster-scoped resources like
CSIDriver (ConfigMaps are namespace-scoped and cannot be owners of
cluster-scoped resources). Additional benefit of Config CRD is
that configuration can be properly validated by the API server.
Refactor of reconciliation logic with improved logging and error
handling, unit tests.

Signed-off-by: Andreas Karis <[email protected]>
Add finalizer to Config CRD to prevent race condition during deletion
where operator could recreate resources while Kubernetes garbage
collection is trying to delete them via owner references.

Changes:
- Add BpfmanConfigFinalizer constant
- Add finalizer RBAC permissions via controller-gen
- Check deletion timestamp before reconciliation
- Add finalizer management to prevent race conditions
- Add handleDeletion for coordinated cleanup
- Update tests for two-phase reconciliation
- Fix undeploy targets to handle cluster-scoped Config CRD

This ensures safe deletion sequence:
1. User deletes Config
2. Deletion timestamp set, finalizer blocks deletion
3. Operator stops creating/updating resources
4. Operator removes finalizer
5. Kubernetes garbage collection proceeds via owner references

Signed-off-by: Andrew McDermott <[email protected]>
Add a new run-local Makefile target that allows developers to run the
bpfman-operator locally for development and debugging purposes.
The target automatically scales down the cluster deployment to avoid
conflicts and runs the operator with debug logging enabled.

Signed-off-by: Andreas Karis <[email protected]>
Add conditional updates to Config reconciliation using semantic deep
equality comparison to prevent unnecessary resource updates. This
resolves reconciliation loops where the operator would continuously
update resources even when no meaningful changes were present.

Additionally, sync the DaemonSet and metrics-proxy YAML manifests
with the actual deployed resources by adding missing Kubernetes
default fields like terminationMessagePath, apiVersion in fieldRef,
and updateStrategy settings.

Signed-off-by: Andreas Karis <[email protected]>
@andreaskaris andreaskaris changed the title Reconcile follow up - Fix reconciliation loop using semantic comparison DNM: Reconcile follow up - Fix reconciliation loop using semantic comparison Aug 20, 2025
Copy link

codecov bot commented Aug 21, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 0.00%. Comparing base (9bb7c45) to head (24fa019).
⚠️ Report is 8 commits behind head on main.

Additional details and impacted files
@@     Coverage Diff     @@
##   main   #467   +/-   ##
===========================
===========================

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants