@ixhamza ixhamza commented Nov 17, 2025

Motivation and Context

This fixes a deadlock that occurs when snapshot expiry tasks are cancelled while locks are held. The deadlock causes the system to hang with multiple threads blocked indefinitely, requiring a system restart. The issue manifests under heavy snapshot automount load combined with memory pressure triggering ARC pruning.

Description

The deadlock occurs when the snapshot expiry task, ARC memory reclamation, and lock acquisition form a circular dependency. The sequence is:

  1. snapentry_expire task spawns an umount process via call_usermodehelper() and waits for completion
  2. Concurrently, memory pressure triggers arc_prune which acquires locks (z_teardown_lock)
  3. arc_prune calls zfs_exit_fs(), which calls zfsctl_snapshot_unmount_delay() to reschedule snapshot expiry
  4. zfsctl_snapshot_unmount_delay() attempts to cancel the running expiry task with taskq_cancel_id()
  5. The old (blocking) taskq_cancel_id() waits for the expiry task to complete, while arc_prune still holds the locks taken in step 2
  6. The umount process spawned in step 1 blocks trying to acquire locks held by arc_prune
  7. Circular dependency: expiry task waits for umount → umount waits for arc_prune → arc_prune waits for expiry task
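
To make the cycle concrete, below is a minimal userspace model of the same three-way wait, written with pthreads. This is an illustration only, not ZFS code: a plain mutex stands in for the z_teardown_lock rwsem, threads stand in for the taskq workers and the umount process, and pthread_join() stands in for the blocking waits. Compiled with -lpthread, the program hangs just as the kernel threads do.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t teardown_lock = PTHREAD_MUTEX_INITIALIZER; /* models z_teardown_lock */
static pthread_t expiry_tid;

/* Models the spawned umount helper: it needs the lock "arc_prune" already holds. */
static void *umount_helper(void *arg) {
    pthread_mutex_lock(&teardown_lock);    /* blocks: held by "arc_prune" */
    pthread_mutex_unlock(&teardown_lock);
    return NULL;
}

/* Models snapentry_expire: spawns the helper and waits for it to complete. */
static void *expiry_task(void *arg) {
    pthread_t helper;
    sleep(1);                              /* let "arc_prune" take the lock first */
    pthread_create(&helper, NULL, umount_helper, NULL);
    pthread_join(helper, NULL);            /* blocks: helper never gets the lock */
    return NULL;
}

/* Models arc_prune: takes the lock, then waits for the expiry task to finish,
 * just as the old blocking taskq_cancel_id() did. */
static void *arc_prune_task(void *arg) {
    pthread_mutex_lock(&teardown_lock);
    pthread_join(expiry_tid, NULL);        /* blocks: expiry waits for the helper */
    pthread_mutex_unlock(&teardown_lock);
    return NULL;
}

int main(void) {
    pthread_t prune;
    pthread_create(&expiry_tid, NULL, expiry_task, NULL);
    pthread_create(&prune, NULL, arc_prune_task, NULL);
    pthread_join(prune, NULL);             /* never returns: three-way deadlock */
    printf("unreachable\n");
    return 0;
}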

The fix adds a boolean wait parameter to taskq_cancel_id():

  • wait=B_TRUE: Block until the task completes (the previous behavior, retained for all existing callers)
  • wait=B_FALSE: Return EBUSY immediately if the task is already running (non-blocking)

The zfs_exit_fs() path now uses non-blocking cancellation (wait=B_FALSE), breaking the deadlock by returning immediately when the expiry task is already running. Additional changes include removing the per-entry se_taskqid_lock (all taskqid operations are now protected by the global zfs_snapshot_lock held as WRITER) and adding an se_in_umount flag to prevent recursive waits when zfsctl_destroy() is called during unmount.
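
For illustration, here is a minimal sketch of what the non-blocking caller path might look like. This is an assumption based on the description above, not the actual patch: the helper name zfsctl_snapshot_unmount_cancel_nowait is hypothetical and the position of the wait flag in taskq_cancel_id() may differ; zfs_expire_taskq, se_taskqid, and TASKQID_INVALID follow the existing zfs_ctldir.c naming.

/*
 * Hypothetical sketch (not the actual patch): try to cancel a pending expiry
 * task without blocking.  If the task is already running, the extended
 * taskq_cancel_id(tq, id, B_FALSE) is assumed to return EBUSY immediately,
 * and the taskqid is left alone so the running task can clear it itself
 * when it completes.
 */
static void
zfsctl_snapshot_unmount_cancel_nowait(zfs_snapentry_t *se)
{
	int error;

	ASSERT(RW_WRITE_HELD(&zfs_snapshot_lock));

	error = taskq_cancel_id(zfs_expire_taskq, se->se_taskqid, B_FALSE);
	if (error == 0)
		se->se_taskqid = TASKQID_INVALID;	/* cancelled before it ran */
	/* error == EBUSY: task is running; do not wait while locks are held */
}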

Hung Task Stack Trace:

[  363.714403] INFO: task spl_delay_taskq:182 blocked for more than 120 seconds.
[  363.717021]       Not tainted 6.12.43-production+ #658
[  363.718977] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.723435] task:spl_delay_taskq state:D stack:0     pid:182   tgid:182   ppid:2      flags:0x00004000
[  363.726758] Call Trace:
[  363.728106]  <TASK>
[  363.728906]  __schedule+0x40b/0x13b0
[  363.730275]  ? _raw_spin_unlock_irqrestore+0x27/0x50
[  363.732264]  ? _raw_spin_unlock_irqrestore+0x27/0x50
[  363.734081]  ? try_to_wake_up+0xec/0x760
[  363.735543]  schedule+0x2a/0x100
[  363.737732]  schedule_timeout+0x14c/0x160
[  363.739305]  ? _raw_spin_unlock+0x19/0x40
[  363.740977]  ? __queue_work.part.0+0xf1/0x3e0
[  363.742622]  ? preempt_count_add+0x7b/0xc0
[  363.744391]  wait_for_completion_state+0x135/0x1e0
[  363.746124]  call_usermodehelper_exec+0x174/0x1b0
[  363.747942]  call_usermodehelper+0x93/0xb0
[  363.749654]  zfsctl_snapshot_unmount+0xf3/0x240
[  363.751478]  snapentry_expire+0x7a/0x180
[  363.753263]  taskq_thread+0x284/0x5d0
[  363.754641]  ? __pfx_default_wake_function+0x10/0x10
[  363.756703]  ? __pfx_taskq_thread+0x10/0x10
[  363.758220]  kthread+0xf3/0x120
[  363.759537]  ? __pfx_kthread+0x10/0x10
[  363.761427]  ret_from_fork+0x3d/0x60
[  363.762816]  ? __pfx_kthread+0x10/0x10
[  363.765424]  ret_from_fork_asm+0x1a/0x30
[  363.767505]  </TASK>
[  363.768697] INFO: task arc_prune:189 blocked for more than 120 seconds.
[  363.771110]       Not tainted 6.12.43-production+ #658
[  363.773377] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.776205] task:arc_prune       state:D stack:0     pid:189   tgid:189   ppid:2      flags:0x00004000
[  363.780464] Call Trace:
[  363.781598]  <TASK>
[  363.782449]  __schedule+0x40b/0x13b0
[  363.784040]  ? __lruvec_stat_mod_folio+0xbd/0xe0
[  363.785724]  schedule+0x2a/0x100
[  363.786978]  schedule_preempt_disabled+0x18/0x30
[  363.788985]  rwsem_down_read_slowpath+0x24e/0x480
[  363.790726]  down_read+0x4b/0xc0
[  363.792269]  zfsctl_snapshot_unmount_delay+0x23/0xe0
[  363.794082]  zfs_exit_fs+0x85/0x90
[  363.795393]  zfs_exit+0x12/0x30
[  363.798031]  zfs_prune+0xb9/0x2d0
[  363.799474]  zpl_prune_sb+0x90/0xa0
[  363.802208]  ? __pfx_zpl_prune_sb+0x10/0x10
[  363.803754]  arc_prune_task+0x22/0x40
[  363.806804]  taskq_thread+0x284/0x5d0
[  363.809803]  ? __pfx_default_wake_function+0x10/0x10
[  363.813805]  ? __pfx_taskq_thread+0x10/0x10
[  363.817002]  kthread+0xf3/0x120
[  363.818191]  ? __pfx_kthread+0x10/0x10
[  363.819600]  ret_from_fork+0x3d/0x60
[  363.821363]  ? __pfx_kthread+0x10/0x10
[  363.823222]  ret_from_fork_asm+0x1a/0x30
[  363.825182]  </TASK>
[  363.880214] INFO: task spl_delay_taskq:4477 blocked for more than 120 seconds.
[  363.882827]       Not tainted 6.12.43-production+ #658
[  363.884875] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.887698] task:spl_delay_taskq state:D stack:0     pid:4477  tgid:4477  ppid:2      flags:0x00004000
[  363.891378] Call Trace:
[  363.892475]  <TASK>
[  363.893290]  __schedule+0x40b/0x13b0
[  363.894648]  ? dequeue_entities+0x52c/0x6e0
[  363.896465]  ? psi_group_change+0x126/0x340
[  363.897992]  ? kvm_sched_clock_read+0x11/0x20
[  363.899622]  schedule+0x2a/0x100
[  363.901044]  schedule_preempt_disabled+0x18/0x30
[  363.902777]  rwsem_down_write_slowpath+0x239/0x5b0
[  363.905349]  ? __pv_queued_spin_lock_slowpath+0xa0/0x380
[  363.907299]  down_write+0x62/0x80
[  363.908683]  snapentry_expire+0x35/0x180
[  363.910141]  taskq_thread+0x284/0x5d0
[  363.911515]  ? __pfx_default_wake_function+0x10/0x10
[  363.913767]  ? __pfx_taskq_thread+0x10/0x10
[  363.915376]  kthread+0xf3/0x120
[  363.916721]  ? __pfx_kthread+0x10/0x10
[  363.918094]  ret_from_fork+0x3d/0x60
[  363.919428]  ? __pfx_kthread+0x10/0x10
[  363.921025]  ret_from_fork_asm+0x1a/0x30
[  363.922524]  </TASK>
[  363.923765] INFO: task spl_delay_taskq:4483 blocked for more than 121 seconds.
[  363.926382]       Not tainted 6.12.43-production+ #658
[  363.928677] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.931518] task:spl_delay_taskq state:D stack:0     pid:4483  tgid:4483  ppid:2      flags:0x00004000
[  363.935072] Call Trace:
[  363.936243]  <TASK>
[  363.937038]  __schedule+0x40b/0x13b0
[  363.938397]  ? _raw_spin_unlock_irqrestore+0x27/0x50
[  363.940394]  ? dl_server_stop+0x2f/0x40
[  363.941788]  schedule+0x2a/0x100
[  363.943024]  schedule_preempt_disabled+0x18/0x30
[  363.945136]  rwsem_down_read_slowpath+0x24e/0x480
[  363.946850]  down_read+0x4b/0xc0
[  363.948342]  zfsctl_snapshot_unmount+0x71/0x240
[  363.949988]  snapentry_expire+0x7a/0x180
[  363.951465]  taskq_thread+0x284/0x5d0
[  363.953075]  ? __pfx_default_wake_function+0x10/0x10
[  363.954900]  ? __pfx_taskq_thread+0x10/0x10
[  363.957267]  kthread+0xf3/0x120
[  363.958974]  ? __pfx_kthread+0x10/0x10
[  363.961105]  ret_from_fork+0x3d/0x60
[  363.963014]  ? __pfx_kthread+0x10/0x10
[  363.965210]  ret_from_fork_asm+0x1a/0x30
[  363.966702]  </TASK>

How Has This Been Tested?

Reproduction script:

zpool create -f testpool mirror /dev/sdc /dev/sdd -O mountpoint=none
mkdir -p /run/testfs
zfs create -o mountpoint=/run/testfs -o snapdir=visible testpool/testfs
echo 1 > /sys/module/zfs/parameters/zfs_expire_snapshot
echo 524288 > /sys/module/zfs/parameters/zfs_arc_dnode_limit
for i in {1..1000}; do zfs snapshot testpool/testfs@snap$i; done
export SLEEP_AMOUNT=1
for group in {0..9}; do
    for proc in {0..9}; do
        start=$((group * 100 + proc * 10 + 1))
        end=$((start + 9))
        bash -c "for attempt in {1..43200}; do for i in {$start..$end}; do
        ls /run/testfs/.zfs/snapshot/snap\$i/ >/dev/null 2>&1 & done;
        sleep \$SLEEP_AMOUNT; echo attempt; done" &
    done
    sleep 1
done

Results:

  • Without the fix: deadlock occurs within ~5 minutes of running the reproduction script; the system hangs with hung task warnings
  • With the fix: no deadlock after 24+ hours of continuous testing with the reproduction script
  • Verified that task cancellation works correctly in both blocking and non-blocking modes

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


@ixhamza ixhamza force-pushed the SEE-365-snapshot-deadlock branch 3 times, most recently from 035a7e1 to e5f8df9 on November 17, 2025 15:03
@behlendorf behlendorf added the Status: Code Review Needed label on Nov 17, 2025
@ixhamza ixhamza force-pushed the SEE-365-snapshot-deadlock branch 2 times, most recently from ae31dec to 40afb39 on November 18, 2025 10:23
A deadlock occurs when snapshot expiry tasks are cancelled while holding
locks. The snapshot expiry task (snapentry_expire) spawns an umount
process and waits for it to complete. Concurrently, ARC memory pressure
triggers arc_prune which calls zfs_exit_fs(), attempting to cancel the
expiry task while holding locks. The umount process spawned by the
expiry task blocks trying to acquire locks held by arc_prune, which is
blocked waiting for the expiry task to complete. This creates a circular
dependency: expiry task waits for umount, umount waits for arc_prune,
arc_prune waits for expiry task.

Fix by adding non-blocking cancellation support to taskq_cancel_id().
The zfs_exit_fs() path calls zfsctl_snapshot_unmount_delay() to
reschedule the unmount, which needs to cancel any existing expiry task.
It now uses non-blocking cancellation to avoid waiting while holding
locks, breaking the deadlock by returning immediately when the task is
already running.

The per-entry se_taskqid_lock has been removed, with all taskqid
operations now protected by the global zfs_snapshot_lock held as
WRITER. Additionally, an se_in_umount flag prevents recursive waits when
zfsctl_destroy() is called during unmount. The taskqid is now only
cleared by the caller on successful cancellation; running tasks clear
their own taskqid upon completion.

Signed-off-by: Ameer Hamza <[email protected]>
@ixhamza ixhamza force-pushed the SEE-365-snapshot-deadlock branch from 40afb39 to 17811cc on November 18, 2025 22:07