avoid meta/data shrink when dnode is over quota #17522

Closed
wants to merge 1 commit

Conversation

shodanshok
Contributor

Fix: #17487

When iterating over millions of files, the dnode cache grows almost unbounded, consuming nearly all available ARC memory and pushing out valuable metadata and data. This is probably due to the kernel not releasing its dentry and inode caches, keeping dnodes pinned and unable to be pruned.

This patch avoids shrinking metadata and data when the dnode cache is over quota, forcing the kernel to drop its caches and, in turn, enabling the ZFS shrinker thread to prune the now-unpinned dnodes.
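
For illustration only, here is a minimal standalone sketch of the gating idea. It is not the actual module change; the structure, function name, and sizes are invented. The point is just the decision: while dnodes are over quota, meta/data eviction is skipped so that memory pressure lands on the kernel's dentry/inode caches, which in turn unpins dnodes for the ZFS prune path.

    /* Illustrative sketch only -- not the actual patch; names and sizes are invented. */
    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    struct arc_state {
        uint64_t dnode_size;   /* current dnode cache footprint, in bytes */
        uint64_t dnode_limit;  /* dnode quota, a fixed fraction of arc_max */
    };

    /* Skip meta/data eviction while dnodes are over quota. */
    static bool
    skip_meta_data_shrink(const struct arc_state *as)
    {
        return (as->dnode_size > as->dnode_limit);
    }

    int
    main(void)
    {
        struct arc_state as = {
            .dnode_size  = 3ULL << 30,  /* 3 GiB of dnodes ...       */
            .dnode_limit = 1ULL << 30,  /* ... against a 1 GiB quota */
        };

        if (skip_meta_data_shrink(&as))
            printf("dnodes over quota: keep meta/data, let the kernel drop dentries/inodes\n");
        else
            printf("dnodes within quota: shrink meta/data as usual\n");
        return (0);
    }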

Motivation and Context

See above.

Description

See above.

How Has This Been Tested?

The patch received only minimal testing inside a virtual machine, clearly increasing performance by letting ZFS keep, rather than evict, valuable metadata. It works for me, but additional review is needed.

Some performance numbers, obtained by running time du -hs on a directory containing 10M files:

  • on a fast NVMe-backed pool, the second run needs ~50s (patched) versus ~1.15m (unpatched);
  • on a slower (simulated) HDD-backed pool, the second run needs ~50s (patched) versus ~5m (unpatched).

The patched version does not require any disk access at all for successive runs, while the unpatched one re-reads the same metadata from disk each time.
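
For reference, the traversal itself can be reproduced without du. The following standalone C sketch (the default path and the counters are just examples) stat()s every file under a directory, which is exactly the access pattern that fills the kernel dentry/inode caches and, behind them, the ZFS dnode cache:

    /* Rough stand-in for the du traversal used in the test; the default path is an example. */
    #define _XOPEN_SOURCE 700
    #include <ftw.h>
    #include <sys/stat.h>
    #include <stdio.h>
    #include <stdint.h>

    static uint64_t nfiles, nblocks;

    static int
    visit(const char *path, const struct stat *st, int type, struct FTW *ftwbuf)
    {
        (void) path;
        (void) ftwbuf;
        if (type == FTW_F) {            /* regular file reached via stat() */
            nfiles++;
            nblocks += (uint64_t)st->st_blocks;
        }
        return (0);                     /* keep walking */
    }

    int
    main(int argc, char **argv)
    {
        const char *root = (argc > 1) ? argv[1] : "/tank/manyfiles";

        if (nftw(root, visit, 64, FTW_PHYS) != 0) {
            perror("nftw");
            return (1);
        }
        printf("%llu files, ~%llu KiB allocated\n",
            (unsigned long long)nfiles,
            (unsigned long long)(nblocks / 2));  /* st_blocks is in 512-byte units */
        return (0);
    }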

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@amotin
Member

amotin commented Jul 9, 2025

While I can appreciate the attempt to make the kernel free memory instead, since there can be other caches that need to be evicted even outside of the VFS, I wonder whether it can happen that the kernel cannot release those inodes/dnodes. Won't it end up in more OOMs?

One of the problems I see here is that arc_dnode_limit, which you rely on, is not adaptive: it is a fixed fraction of arc_max. In this change you even removed my attempt to dynamically scale pruning with a fraction of unevictable metadata, which should scale with ARC size. The biggest question to me here is why pruning no longer works. Does something else hold dnodes? Are there caches that pruning does not know about and that are not evicted without system-wide memory pressure?

PS: This version fails to build:

   /tmp/zfs-build-zfs-H1sfgxIm/BUILD/zfs-2.3.99/module/zfs/arc.c:4433:1: error: ‘arc_mf’ defined but not used [-Werror=unused-function]
   4433 | arc_mf(uint64_t x, uint64_t multiplier, uint64_t divisor)
        | ^~~~~~

@shodanshok
Contributor Author

shodanshok commented Jul 9, 2025

While I can appreciate the attempt to make the kernel free memory instead, since there can be other caches that need to be evicted even outside of the VFS, I wonder whether it can happen that the kernel cannot release those inodes/dnodes. Won't it end up in more OOMs?

Well, if the kernel cannot really release these dentries/inodes, then some application is keeping millions of files open. While this can end in more OOMs, doesn't the system already have a bigger problem in that case? That said, maybe we can check S_RECLAIMABLE and, if it is low, permit metadata shrinking. However, as part of ZFS's own cache can be counted as such, I am not sure it would work (we are always going to have at least some S_RECLAIMABLE).
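
For illustration, a minimal userspace sketch of that idea (not kernel code; it assumes S_RECLAIMABLE means the SReclaimable counter in /proc/meminfo, and the watermark value is an arbitrary example) could look like this:

    /* Userspace sketch of the S_RECLAIMABLE check; the watermark value is arbitrary. */
    #include <stdio.h>
    #include <stdbool.h>

    /* Return SReclaimable from /proc/meminfo, in kB (0 if not found). */
    static unsigned long long
    sreclaimable_kb(void)
    {
        FILE *f = fopen("/proc/meminfo", "r");
        char line[256];
        unsigned long long kb = 0;

        if (f == NULL)
            return (0);
        while (fgets(line, sizeof (line), f) != NULL) {
            if (sscanf(line, "SReclaimable: %llu kB", &kb) == 1)
                break;
        }
        fclose(f);
        return (kb);
    }

    int
    main(void)
    {
        const unsigned long long low_watermark_kb = 256ULL * 1024;  /* example: 256 MiB */
        unsigned long long kb = sreclaimable_kb();
        bool allow_meta_shrink = (kb < low_watermark_kb);

        printf("SReclaimable=%llu kB -> %s metadata shrink\n",
            kb, allow_meta_shrink ? "permit" : "defer");
        return (0);
    }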

One of the problems I see here is that arc_dnode_limit, which you rely on, is not adaptive: it is a fixed fraction of arc_max. In this change you even removed my attempt to dynamically scale pruning with a fraction of unevictable metadata, which should scale with ARC size.

True, I tried to simplify it as much as possible to avoid other issues (i.e., possibly bad scaling).

The biggest question to me here is why pruning no longer works. Does something else hold dnodes? Are there caches that pruning does not know about and that are not evicted without system-wide memory pressure?

Yes, this is the big underlying question. I tried the same on a CentOS 7 box with ZFS 2.0.7, and there dnode reclaim works correctly, albeit too slowly to prevent metadata shrinking. So I think we have two issues:

  • dnode shrinking does not work with current kernels (at least what RHEL 8/9 and Debian 12 ship) and ZFS 2.1-2.3;
  • even with working dnode reclaim, valuable metadata can be evicted, which should not happen, as hitting the disk is almost always worse than recreating the required dnodes from the cached metadata.

PS: This version fails to build:

   /tmp/zfs-build-zfs-H1sfgxIm/BUILD/zfs-2.3.99/module/zfs/arc.c:4433:1: error: ‘arc_mf’ defined but not used [-Werror=unused-function]
   4433 | arc_mf(uint64_t x, uint64_t multiplier, uint64_t divisor)
        | ^~~~~~

I removed this now-unused function, thanks.

@shodanshok
Contributor Author

Closing, see #17542

shodanshok closed this Jul 15, 2025
shodanshok mentioned this pull request Jul 18, 2025