fix: return success on DEL when CNI cache is missing #1498
kvaps wants to merge 1 commit into `k8snetworkplumbingwg:master` from
Conversation
When a pod sandbox fails during creation (e.g. due to network issues), the CNI ADD never completes and no cache file is written. On subsequent cleanup attempts, `CmdDel` tries to reconstruct delegate configs from the Kubernetes API and call DEL on each delegate plugin. However, delegate plugins (e.g. kube-ovn) require runtime state that only exists after a successful ADD (socket paths, cached results), causing DEL to fail with errors like "server_socket is required in cni.conf".

This creates a deadlock: containerd cannot release the sandbox name reservation because CNI DEL fails, and CNI DEL fails because ADD never completed. The constant DEL retry storm also overloads the multus daemon, causing new ADD requests to time out.

Fix this by returning success immediately when the cache file does not exist (`os.IsNotExist`), since there is nothing to clean up if ADD never completed. This matches the existing behavior for the case when both cache and pod are missing (lines 979-982 before this change).

Related issues:
- k8snetworkplumbingwg#1454
- k8snetworkplumbingwg#1446
- containerd/containerd#11504

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
When CNI ADD never completes, multus DEL fails because delegate plugins require runtime state that only exists after a successful ADD. This creates a deadlock where containerd cannot release sandbox name reservations, permanently blocking pod creation on the affected node.

Build a custom multus-cni image with a patch that returns success on DEL when no cache file exists (ADD was never completed).

Upstream PR: k8snetworkplumbingwg/multus-cni#1498
Tracking issue: #2310

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Code Review
This pull request simplifies the `CmdDel` logic in `pkg/multus/multus.go` by skipping the deletion process when a cache file is missing, which indicates that the ADD operation was never completed. This change prevents failures during sandbox cleanup by avoiding attempts to reconstruct configurations from the API without the necessary runtime state. Feedback was provided to enhance the observability of error logs by including the container ID and to clarify log messages when cache parsing fails. Additionally, a refactor was suggested to combine redundant return paths for better readability.
```go
if os.IsNotExist(err) {
	logging.Verbosef("Multus: no cache found for container %s (ADD was never completed), skip DEL", args.ContainerID)
	return nil
}
// The options to continue with a delete have been exhausted
// We cannot exit with an error as this may cause a sandbox to never get deleted.
logging.Errorf("Multus: failed to get the cached delegates file: %v, cannot properly delete", err)
return nil
```
The error logging and logic in this block can be improved for better observability and clarity:

- Context: The log message at line 968 loses the pod/container context. Including `args.ContainerID` (as done in the `Verbosef` call at line 963) would significantly help in identifying which container failed cleanup in the logs.
- Clarity: If `err` is `nil` (which occurs if the cache file was read successfully but failed to unmarshal at line 941), the log will show `<nil>`, which is misleading. It should ideally distinguish between a filesystem error and a parsing failure.
- Simplification: Since both branches return `nil`, the logic can be combined to reduce redundancy and improve readability.
Suggested change. Current code:

```go
if os.IsNotExist(err) {
	logging.Verbosef("Multus: no cache found for container %s (ADD was never completed), skip DEL", args.ContainerID)
	return nil
}
// The options to continue with a delete have been exhausted
// We cannot exit with an error as this may cause a sandbox to never get deleted.
logging.Errorf("Multus: failed to get the cached delegates file: %v, cannot properly delete", err)
return nil
```

Suggested replacement:

```go
if os.IsNotExist(err) {
	logging.Verbosef("Multus: no cache found for container %s (ADD was never completed), skip DEL", args.ContainerID)
} else {
	// The options to continue with a delete have been exhausted
	// We cannot exit with an error as this may cause a sandbox to never get deleted.
	logging.Errorf("Multus: failed to get or parse the cached delegates file for container %s: %v, cannot properly delete", args.ContainerID, err)
}
return nil
```
## Summary

- Build custom multus-cni image with a patch that fixes the sandbox cleanup deadlock
- When CNI ADD never completes, DEL now returns success instead of failing
- Prevents stale sandbox name reservations from permanently blocking pod creation

## Problem

After a node disruption (e.g. DRBD/kube-ovn issues during upgrade), containerd accumulates stale sandbox name reservations. Cleanup fails because multus calls delegate plugins for DEL without cached state, and they reject the incomplete config. This blocks all new pods on the affected node indefinitely.

## Changes

- Add `images/multus-cni/` with Dockerfile and patch for custom image build
- Add `image` target to multus Makefile
- Add multus to root Makefile build targets
- Update DaemonSet template to use the custom image

## Upstream

- PR: k8snetworkplumbingwg/multus-cni#1498
- Tracking: #2310

## Test plan

- [x] Built and pushed image to ghcr.io/cozystack/cozystack/multus-cni
- [x] Deployed to cluster, verified fix in multus logs ("no cache found... skip DEL")
- [x] Force-deleted stuck pods, all 24 pods started successfully within 70s

## Summary by CodeRabbit

* **New Features**
  * Added Multus CNI container image build process to the system.
* **Bug Fixes**
  * Fixed delete operation behavior when cache is unavailable, improving reliability of network cleanup operations.
* **Chores**
  * Updated Multus CNI container image reference in deployment configuration to latest version with build improvements.
When CNI ADD never completes, multus DEL fails because delegate plugins require runtime state that only exists after a successful ADD. This creates a deadlock where containerd cannot release sandbox name reservations, permanently blocking pod creation on the affected node.

Build a custom multus-cni image with a patch that returns success on DEL when no cache file exists (ADD was never completed).

Upstream PR: k8snetworkplumbingwg/multus-cni#1498
Tracking issue: #2310

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
(cherry picked from commit 094b80a)
We had the same issue specified here.
Summary
Problem
After a node disruption, containerd accumulates stale sandbox name reservations. When it tries
to clean them up, CNI DEL is called through multus. Without a cache file, multus reconstructs
the config from the Kubernetes API, but delegate plugins (e.g. kube-ovn) reject the config
because it lacks runtime state (socket paths, cached results) that only exists after ADD.
This causes:
Fix
If the cache file does not exist (`os.IsNotExist`), return `nil` immediately since there is nothing to clean up. This matches the existing graceful fallback when both cache and pod are missing (the `else` branch at the former line 978).

Related issues
Test plan