@ixhamza ixhamza commented Nov 17, 2025

Motivation and Context

This fixes a deadlock that occurs when snapshot expiry tasks are cancelled while locks are held. The deadlock causes the system to hang with multiple threads blocked indefinitely, requiring a system restart. The issue manifests under heavy snapshot automount load combined with memory pressure triggering ARC pruning.

Description

The deadlock occurs when the snapshot expiry task, ARC memory reclamation, and lock acquisition form a circular dependency. The sequence is:

  1. snapentry_expire task spawns an umount process via call_usermodehelper() and waits for completion
  2. Concurrently, memory pressure triggers arc_prune which acquires locks (z_teardown_lock)
  3. arc_prune calls zfs_exit_fs(), which calls zfsctl_snapshot_unmount_delay() to reschedule snapshot expiry
  4. zfsctl_snapshot_unmount_delay() attempts to cancel the running expiry task with taskq_cancel_id()
  5. The old (blocking) taskq_cancel_id() waits for the expiry task to complete, while arc_prune still holds the locks taken in step 2
  6. The umount process spawned in step 1 blocks trying to acquire locks held by arc_prune
  7. Circular dependency: expiry task waits for umount → umount waits for arc_prune → arc_prune waits for expiry task
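
To make the cycle concrete, below is a minimal userspace model of the same three-way wait, written with pthreads. This is an illustration only, not ZFS code: a plain mutex stands in for the z_teardown_lock rwsem, threads stand in for the taskq workers and the umount process, and pthread_join() stands in for the blocking waits. Compiled with -lpthread, the program hangs just as the kernel threads do.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t teardown_lock = PTHREAD_MUTEX_INITIALIZER; /* models z_teardown_lock */
static pthread_t expiry_tid;

/* Models the spawned umount helper: it needs the lock "arc_prune" already holds. */
static void *umount_helper(void *arg) {
    pthread_mutex_lock(&teardown_lock);    /* blocks: held by "arc_prune" */
    pthread_mutex_unlock(&teardown_lock);
    return NULL;
}

/* Models snapentry_expire: spawns the helper and waits for it to complete. */
static void *expiry_task(void *arg) {
    pthread_t helper;
    sleep(1);                              /* let "arc_prune" take the lock first */
    pthread_create(&helper, NULL, umount_helper, NULL);
    pthread_join(helper, NULL);            /* blocks: helper never gets the lock */
    return NULL;
}

/* Models arc_prune: takes the lock, then waits for the expiry task to finish,
 * just as the old blocking taskq_cancel_id() did. */
static void *arc_prune_task(void *arg) {
    pthread_mutex_lock(&teardown_lock);
    pthread_join(expiry_tid, NULL);        /* blocks: expiry waits for the helper */
    pthread_mutex_unlock(&teardown_lock);
    return NULL;
}

int main(void) {
    pthread_t prune;
    pthread_create(&expiry_tid, NULL, expiry_task, NULL);
    pthread_create(&prune, NULL, arc_prune_task, NULL);
    pthread_join(prune, NULL);             /* never returns: three-way deadlock */
    printf("unreachable\n");
    return 0;
}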

The fix adds a boolean wait parameter to taskq_cancel_id():

  • wait=B_TRUE: Block until the task completes (the previous behavior, retained for all existing callers)
  • wait=B_FALSE: Return EBUSY immediately if the task is already running (non-blocking)

The zfs_exit_fs() path now uses non-blocking cancellation (wait=B_FALSE), breaking the deadlock by returning immediately when the expiry task is already running. Additional changes include removing the per-entry se_taskqid_lock (all taskqid operations are now protected by the global zfs_snapshot_lock held as WRITER) and adding an se_in_umount flag to prevent recursive waits when zfsctl_destroy() is called during unmount.
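
For illustration, here is a minimal sketch of what the non-blocking caller path might look like. This is an assumption based on the description above, not the actual patch: the helper name zfsctl_snapshot_unmount_cancel_nowait is hypothetical and the position of the wait flag in taskq_cancel_id() may differ; zfs_expire_taskq, se_taskqid, and TASKQID_INVALID follow the existing zfs_ctldir.c naming.

/*
 * Hypothetical sketch (not the actual patch): try to cancel a pending expiry
 * task without blocking.  If the task is already running, the extended
 * taskq_cancel_id(tq, id, B_FALSE) is assumed to return EBUSY immediately,
 * and the taskqid is left alone so the running task can clear it itself
 * when it completes.
 */
static void
zfsctl_snapshot_unmount_cancel_nowait(zfs_snapentry_t *se)
{
	int error;

	ASSERT(RW_WRITE_HELD(&zfs_snapshot_lock));

	error = taskq_cancel_id(zfs_expire_taskq, se->se_taskqid, B_FALSE);
	if (error == 0)
		se->se_taskqid = TASKQID_INVALID;	/* cancelled before it ran */
	/* error == EBUSY: task is running; do not wait while locks are held */
}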

Hung Task Stack Trace:

[  363.714403] INFO: task spl_delay_taskq:182 blocked for more than 120 seconds.
[  363.717021]       Not tainted 6.12.43-production+ #658
[  363.718977] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.723435] task:spl_delay_taskq state:D stack:0     pid:182   tgid:182   ppid:2      flags:0x00004000
[  363.726758] Call Trace:
[  363.728106]  <TASK>
[  363.728906]  __schedule+0x40b/0x13b0
[  363.730275]  ? _raw_spin_unlock_irqrestore+0x27/0x50
[  363.732264]  ? _raw_spin_unlock_irqrestore+0x27/0x50
[  363.734081]  ? try_to_wake_up+0xec/0x760
[  363.735543]  schedule+0x2a/0x100
[  363.737732]  schedule_timeout+0x14c/0x160
[  363.739305]  ? _raw_spin_unlock+0x19/0x40
[  363.740977]  ? __queue_work.part.0+0xf1/0x3e0
[  363.742622]  ? preempt_count_add+0x7b/0xc0
[  363.744391]  wait_for_completion_state+0x135/0x1e0
[  363.746124]  call_usermodehelper_exec+0x174/0x1b0
[  363.747942]  call_usermodehelper+0x93/0xb0
[  363.749654]  zfsctl_snapshot_unmount+0xf3/0x240
[  363.751478]  snapentry_expire+0x7a/0x180
[  363.753263]  taskq_thread+0x284/0x5d0
[  363.754641]  ? __pfx_default_wake_function+0x10/0x10
[  363.756703]  ? __pfx_taskq_thread+0x10/0x10
[  363.758220]  kthread+0xf3/0x120
[  363.759537]  ? __pfx_kthread+0x10/0x10
[  363.761427]  ret_from_fork+0x3d/0x60
[  363.762816]  ? __pfx_kthread+0x10/0x10
[  363.765424]  ret_from_fork_asm+0x1a/0x30
[  363.767505]  </TASK>
[  363.768697] INFO: task arc_prune:189 blocked for more than 120 seconds.
[  363.771110]       Not tainted 6.12.43-production+ #658
[  363.773377] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.776205] task:arc_prune       state:D stack:0     pid:189   tgid:189   ppid:2      flags:0x00004000
[  363.780464] Call Trace:
[  363.781598]  <TASK>
[  363.782449]  __schedule+0x40b/0x13b0
[  363.784040]  ? __lruvec_stat_mod_folio+0xbd/0xe0
[  363.785724]  schedule+0x2a/0x100
[  363.786978]  schedule_preempt_disabled+0x18/0x30
[  363.788985]  rwsem_down_read_slowpath+0x24e/0x480
[  363.790726]  down_read+0x4b/0xc0
[  363.792269]  zfsctl_snapshot_unmount_delay+0x23/0xe0
[  363.794082]  zfs_exit_fs+0x85/0x90
[  363.795393]  zfs_exit+0x12/0x30
[  363.798031]  zfs_prune+0xb9/0x2d0
[  363.799474]  zpl_prune_sb+0x90/0xa0
[  363.802208]  ? __pfx_zpl_prune_sb+0x10/0x10
[  363.803754]  arc_prune_task+0x22/0x40
[  363.806804]  taskq_thread+0x284/0x5d0
[  363.809803]  ? __pfx_default_wake_function+0x10/0x10
[  363.813805]  ? __pfx_taskq_thread+0x10/0x10
[  363.817002]  kthread+0xf3/0x120
[  363.818191]  ? __pfx_kthread+0x10/0x10
[  363.819600]  ret_from_fork+0x3d/0x60
[  363.821363]  ? __pfx_kthread+0x10/0x10
[  363.823222]  ret_from_fork_asm+0x1a/0x30
[  363.825182]  </TASK>
[  363.880214] INFO: task spl_delay_taskq:4477 blocked for more than 120 seconds.
[  363.882827]       Not tainted 6.12.43-production+ #658
[  363.884875] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.887698] task:spl_delay_taskq state:D stack:0     pid:4477  tgid:4477  ppid:2      flags:0x00004000
[  363.891378] Call Trace:
[  363.892475]  <TASK>
[  363.893290]  __schedule+0x40b/0x13b0
[  363.894648]  ? dequeue_entities+0x52c/0x6e0
[  363.896465]  ? psi_group_change+0x126/0x340
[  363.897992]  ? kvm_sched_clock_read+0x11/0x20
[  363.899622]  schedule+0x2a/0x100
[  363.901044]  schedule_preempt_disabled+0x18/0x30
[  363.902777]  rwsem_down_write_slowpath+0x239/0x5b0
[  363.905349]  ? __pv_queued_spin_lock_slowpath+0xa0/0x380
[  363.907299]  down_write+0x62/0x80
[  363.908683]  snapentry_expire+0x35/0x180
[  363.910141]  taskq_thread+0x284/0x5d0
[  363.911515]  ? __pfx_default_wake_function+0x10/0x10
[  363.913767]  ? __pfx_taskq_thread+0x10/0x10
[  363.915376]  kthread+0xf3/0x120
[  363.916721]  ? __pfx_kthread+0x10/0x10
[  363.918094]  ret_from_fork+0x3d/0x60
[  363.919428]  ? __pfx_kthread+0x10/0x10
[  363.921025]  ret_from_fork_asm+0x1a/0x30
[  363.922524]  </TASK>
[  363.923765] INFO: task spl_delay_taskq:4483 blocked for more than 121 seconds.
[  363.926382]       Not tainted 6.12.43-production+ #658
[  363.928677] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.931518] task:spl_delay_taskq state:D stack:0     pid:4483  tgid:4483  ppid:2      flags:0x00004000
[  363.935072] Call Trace:
[  363.936243]  <TASK>
[  363.937038]  __schedule+0x40b/0x13b0
[  363.938397]  ? _raw_spin_unlock_irqrestore+0x27/0x50
[  363.940394]  ? dl_server_stop+0x2f/0x40
[  363.941788]  schedule+0x2a/0x100
[  363.943024]  schedule_preempt_disabled+0x18/0x30
[  363.945136]  rwsem_down_read_slowpath+0x24e/0x480
[  363.946850]  down_read+0x4b/0xc0
[  363.948342]  zfsctl_snapshot_unmount+0x71/0x240
[  363.949988]  snapentry_expire+0x7a/0x180
[  363.951465]  taskq_thread+0x284/0x5d0
[  363.953075]  ? __pfx_default_wake_function+0x10/0x10
[  363.954900]  ? __pfx_taskq_thread+0x10/0x10
[  363.957267]  kthread+0xf3/0x120
[  363.958974]  ? __pfx_kthread+0x10/0x10
[  363.961105]  ret_from_fork+0x3d/0x60
[  363.963014]  ? __pfx_kthread+0x10/0x10
[  363.965210]  ret_from_fork_asm+0x1a/0x30
[  363.966702]  </TASK>

How Has This Been Tested?

Reproduction script:

zpool create -f testpool mirror /dev/sdc /dev/sdd -O mountpoint=none
mkdir -p /run/testfs
zfs create -o mountpoint=/run/testfs -o snapdir=visible testpool/testfs
echo 1 > /sys/module/zfs/parameters/zfs_expire_snapshot
echo 524288 > /sys/module/zfs/parameters/zfs_arc_dnode_limit
for i in {1..1000}; do zfs snapshot testpool/testfs@snap$i; done
export SLEEP_AMOUNT=1
for group in {0..9}; do
    for proc in {0..9}; do
        start=$((group * 100 + proc * 10 + 1))
        end=$((start + 9))
        bash -c "for attempt in {1..43200}; do for i in {$start..$end}; do
        ls /run/testfs/.zfs/snapshot/snap\$i/ >/dev/null 2>&1 & done;
        sleep \$SLEEP_AMOUNT; echo attempt; done" &
    done
    sleep 1
done

Results:

  • Without the fix: deadlock occurs within ~5 minutes of running the reproduction script; the system hangs with hung task warnings
  • With the fix: no deadlock after 24+ hours of continuous testing with the reproduction script
  • Verified that task cancellation works correctly in both blocking and non-blocking modes

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


@ixhamza ixhamza force-pushed the SEE-365-snapshot-deadlock branch 3 times, most recently from 035a7e1 to e5f8df9 on November 17, 2025 15:03
@behlendorf behlendorf added the Status: Code Review Needed label on Nov 17, 2025
@ixhamza ixhamza force-pushed the SEE-365-snapshot-deadlock branch 2 times, most recently from ae31dec to 40afb39 on November 18, 2025 10:23
A deadlock occurs when snapshot expiry tasks are cancelled while holding
locks. The snapshot expiry task (snapentry_expire) spawns an umount
process and waits for it to complete. Concurrently, ARC memory pressure
triggers arc_prune which calls zfs_exit_fs(), attempting to cancel the
expiry task while holding locks. The umount process spawned by the
expiry task blocks trying to acquire locks held by arc_prune, which is
blocked waiting for the expiry task to complete. This creates a circular
dependency: expiry task waits for umount, umount waits for arc_prune,
arc_prune waits for expiry task.

Fix by adding non-blocking cancellation support to taskq_cancel_id().
The zfs_exit_fs() path calls zfsctl_snapshot_unmount_delay() to
reschedule the unmount, which needs to cancel any existing expiry task.
It now uses non-blocking cancellation to avoid waiting while holding
locks, breaking the deadlock by returning immediately when the task is
already running.

The per-entry se_taskqid_lock has been removed, with all taskqid
operations now protected by the global zfs_snapshot_lock held as
WRITER. Additionally, an se_in_umount flag prevents recursive waits when
zfsctl_destroy() is called during unmount. The taskqid is now only
cleared by the caller on successful cancellation; running tasks clear
their own taskqid upon completion.

Signed-off-by: Ameer Hamza <[email protected]>
@ixhamza ixhamza force-pushed the SEE-365-snapshot-deadlock branch from 40afb39 to 17811cc on November 18, 2025 22:07