Flush on upsert with threshold eviction enabled #12030
roryharr wants to merge 2 commits into anza-xyz:master from
Conversation
pub fn should_flush(&self, entries_in_bin: usize) -> bool {
/// Returns true when `entries_in_bin` exceeds the per-bin high-water mark, indicating
/// the bin is over the threshold and flush/eviction should occur.
pub fn bin_at_threshold(&self, entries_in_bin: usize) -> bool {
should_flush no longer made sense at the call site, since it is now used to determine eviction. The rename could be handled as a separate PR, but it is hard to justify in isolation.
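The renamed helper amounts to a simple high-water-mark check. A minimal sketch, with assumed struct and field names rather than the real agave types:

```rust
/// Minimal sketch of a per-bin high-water-mark check. The struct and field
/// names here are assumptions for illustration, not the actual agave code.
pub struct BucketMapHolderStats {
    /// Assumed per-bin entry budget (e.g. derived from the index size limit).
    count_entries_per_bin_limit: usize,
}

impl BucketMapHolderStats {
    /// Returns true when `entries_in_bin` exceeds the per-bin high-water mark,
    /// indicating that flush/eviction should occur.
    pub fn bin_at_threshold(&self, entries_in_bin: usize) -> bool {
        entries_in_bin > self.count_entries_per_bin_limit
    }
}
```

Naming the predicate after the state it detects (`bin_at_threshold`) rather than the action it triggers (`should_flush`) keeps it honest now that the caller may evict rather than flush.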
Err(err) => disk.grow(err),
}
}
| self.get_only_in_mem(pubkey, false, |entry| { |
Originally I was trying to avoid this lookup, but I couldn't figure out a way to do so without adding memory. I attempted to use a CAS, but it would have required an Arc, or required the same lookup.
&& self.storage.bin_at_threshold(map.len())
&& !map.contains_key(pubkey)
{
    let evict_key = map
In practice, even the smallest bin size (25GB) has 24k entries and <50 dirty entries. With randomized binning, targeting a specific bin is now difficult, so the risk is low.
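The `evict_key` selection above has to pick a victim out of the bin. One plausible shape for that selection, as a hedged sketch over a plain `HashMap` stand-in (the real bin type and selection policy may differ):

```rust
use std::collections::HashMap;

/// Sketch of picking an eviction victim from an over-threshold bin: prefer a
/// clean (non-dirty) entry, so dropping it from memory loses no pending state.
/// The map layout (key -> dirty flag) is a simplification for illustration.
fn pick_evict_key(map: &HashMap<u64, bool>) -> Option<u64> {
    map.iter()
        .find(|(_, &dirty)| !dirty)
        .map(|(&k, _)| k)
}
```

With ~24k entries and fewer than 50 dirty ones per bin, a clean victim is found almost immediately, which is why this scan does not show up as a cost in practice.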
if v.dirty() {
    candidates_to_flush.push(*k);
if collect_flush_candidates {
This is to avoid double flushes in foreground/background.
user_fn: impl FnOnce(SlotListWriteGuard<T>) -> RT,
) -> Option<RT> {
self.get_internal_inner(pubkey, |entry| {
    let mut write_through_args: Option<(Slot, T)> = None;
This covers the non-upsert paths: purge_exact, clean_rooted_entries, and purge_roots. With this, all paths flush as appropriate.
pub capacity_in_mem: AtomicUsize,
pub flush_entries_updated_on_disk: AtomicU64,
pub flush_entries_evicted_from_mem: AtomicU64,
pub flush_entries_updated_on_disk_immediate: AtomicU64,
Useful to differentiate what is happening in the immediate path vs the background path.
reclaims,
reclaim,
);
should_write_through =
Upsert path: This code doesn't modify the cached path at all, only the store path.
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #12030     +/-   ##
=========================================
  Coverage    83.3%    83.4%
=========================================
  Files         861      861
  Lines      322256   322491    +235
=========================================
+ Hits       268750   269023    +273
+ Misses      53506    53468     -38
I think we have one edge case here now. At startup, when generating the index, we identify any accounts with duplicate versions. Those duplicates are later passed to agave/accounts-db/src/accounts_index.rs (line 1141 in 82e67b6). This index entry won't be written back to disk explicitly by startup, and without a background thread writing to disk anymore, it won't be picked up there either. However, if we shrink/squash this storage later, then we will clean it up. So not never, but not guaranteed. It may be good to replace the call to

(Edit: And we could do this write-through-instead-of-make-dirty on its own, right now too.)
Interesting. I guess I didn't see this edge case because my testing was with fastboot; with fastboot the obsolete accounts are filtered during generate_index_for_slot and never added to the index at all. I think this can be fixed by making clean_and_unref_slot_list_on_startup call a variation on slot_list_mut. In general I think that's a better design. I'll post a separate PR to be merged first on master.
brooksprumo
left a comment
Did a first pass. Still need to go over tests too.
Posted here: #12069

Also ran perf numbers with this change (leading to flush). They were a little surprising: the extra time is very minimal. As you can see, ~3M entries are flushed right at the end of startup now, which is the right amount.
…ngle slot list entries to disk during upsert

This resolves bins overshooting and doubling in size, which defeated the purpose of having fixed-size bins
- Split up disk write into an individual function
- Simplified dirty check
- Updated commenting
- Renaming
a45766c to 927509b
let disk_entry = [(slot, account_info.into())];
let grow_us = Self::write_to_disk(disk, pubkey, &disk_entry);
Self::update_stat(&self.stats().flush_entries_updated_on_disk_immediate, 1);
Self::update_stat(&self.stats().flush_grow_us, grow_us);
My thought is to keep this as a single counter. All the grows should be during write_through in threshold mode. So if there is grow timing it must come from threshold mode.
The other counter (flush_entries_updated_on_disk_immediate vs flush_entries_updated_on_disk_background) is useful to separate for debug/validation purposes. It allows us to verify that flush_entries_updated_on_disk_background is zero as expected.
Yep, I see my wording was ambiguous. I agree with a single metric for the grow time. I meant the caller would know if the 'num updated' was immediate vs background.
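The split this thread settles on, a single grow timer but separate "num updated" counters per path, might look like the sketch below. The counter names mirror the diff; the surrounding struct is a simplified assumption, not the real stats type:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Simplified stats sketch: one "num updated" counter per flush path lets us
/// verify that the background counter stays at zero when write-through
/// (threshold) mode is on, while grow time stays a single shared metric.
#[derive(Default)]
struct FlushStats {
    flush_entries_updated_on_disk: AtomicU64,           // background path
    flush_entries_updated_on_disk_immediate: AtomicU64, // write-through path
    flush_grow_us: AtomicU64,                           // single grow timer
}

impl FlushStats {
    fn record(&self, immediate: bool, n: u64, grow_us: u64) {
        let counter = if immediate {
            &self.flush_entries_updated_on_disk_immediate
        } else {
            &self.flush_entries_updated_on_disk
        };
        counter.fetch_add(n, Ordering::Relaxed);
        self.flush_grow_us.fetch_add(grow_us, Ordering::Relaxed);
    }
}
```

The caller already knows which path it is on, so distinguishing the counters costs one branch and gives a free validation signal (background should read zero in threshold mode).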
brooksprumo
left a comment
Code's looking good. One question below. Will go through tests next.
// This is a rare case; background eviction clears the excess over time.
if self.should_write_through
    && self.storage.is_bin_at_threshold(map.len())
    && !map.contains_key(pubkey)
Can we skip this 'contains' check?
I would imagine the common case is that when a pubkey exists in the map, we'll end up in the if arm above. Once we get to the else here, yes there is a chance this pubkey could've been inserted before we grab the bin's write lock.
In that case, if the pubkey does exist already, then we wouldn't strictly need to evict an entry. But if we do anyway, the index will still be correct, we just may have to do a lookup from disk later.
My reasoning is that searching through the bin for the pubkey is expensive (in the call to contains_key()), and I think the common case is that 'contains' returns false. So if we checked above and the pubkey didn't exist, and write_through is true, and is_bin_at_threshold is true, then we go ahead and always evict, which saves us the contains check.
Wdyt?
Ha, I tried a few other ways to remove it; I didn't think of just removing it.
I would say profile it, but I don't think this will show up on profiles: it only triggers the extra lookup when
- The bin is at the threshold
- The item we are looking for wasn't found in memory.
I do agree that the chance of that performing an unneeded eviction is extremely low due to the tightness of the race.
This will also be resolved in the follow-up PR discussed separately: with post-insertion eviction, there is no need for the contains call; eviction can just be done on the Vacant branch.
So I can change it here, but I'm just not sure there's much benefit, and I will be fixing it in the future!
Up to you though.
Side Note: One other interesting option to think about is grabbing the disk entry before grabbing the write lock. It might have a small race though.
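The follow-up idea (evict only on the Vacant branch, making the separate contains check unnecessary) could look roughly like this. It uses a plain `HashMap` stand-in and an arbitrary victim choice, both simplifications rather than the real bin code:

```rust
use std::collections::HashMap;

/// Sketch of post-insertion eviction: insert via the entry API, and only when
/// the key was actually vacant (a genuine growth of the bin) evict some other
/// entry to bring the bin back under its threshold. No contains_key() needed,
/// because the Vacant/Occupied branch already answers that question.
fn insert_with_eviction(
    map: &mut HashMap<u64, u64>,
    key: u64,
    value: u64,
    threshold: usize,
) -> Option<u64> {
    use std::collections::hash_map::Entry;
    let was_vacant = match map.entry(key) {
        Entry::Occupied(mut o) => {
            o.insert(value); // key already cached: no growth, no eviction
            false
        }
        Entry::Vacant(v) => {
            v.insert(value);
            true
        }
    };
    if was_vacant && map.len() > threshold {
        // Pick any victim other than the key we just inserted (real code
        // would prefer a clean entry; this is a simplification).
        let victim = map.keys().find(|&&k| k != key).copied();
        if let Some(k) = victim {
            map.remove(&k);
            return Some(k);
        }
    }
    None
}
```

Because the eviction decision happens after the insert, the transient race where another thread inserts the same pubkey first simply lands in the Occupied branch and evicts nothing.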
// Confirm all entries are clean (write-through fired).
for pubkey in &initial_pubkeys {
    index.get_only_in_mem(pubkey, false, |entry| {
        let entry = entry.expect("entry should be in memory");
        assert!(!entry.dirty());
    });
}
Should we also assert the pubkeys have entries on disk too?


This resolves bins overshooting and doubling in size, which defeated the purpose of having fixed-size bins.
Problem
Threshold-based eviction is designed to keep each in-memory bin limited to a static size. However, under load, this can overshoot, leading to bins doubling in size. It is plausible that all bins could double in size, and potentially grow more. This defeats the purpose of the feature.
Summary of Changes
Data: Comparisons
- 25GB disk index with this change: RoryPpbo6f9tYF4e8Y9twGBwkDCpjQXYsnWGYj5cTE2
- 50GB disk index without this change: 3XU8nQ1Wfz5u36KxPRw92pCQe4Qx58vYQQzypmVrC9vx
- Edge canary with in-memory index: mce1wLhL38KeLgaxNmpgFUipKySJwChtsCqbfbuqBi1
- Edge canary with in-memory index: mce2QVkKHe1dyuNdRSXUreZkpQVnFxcPBdM9Ya5Hns9
Biggest downside is disk write bandwidth increase:
Note
This comparison uses BtJmUemxN7YQHCdkAmPiDitHEEH9eSMrMJWeHKERX7hE as it is running a 25GB index. All other data was gathered earlier.
This is ~50 MB/s and scales with the number of pubkeys that are only stored once per flush cycle (so any unique writes, and any rare accesses). I don't think this is scalable in the long term, but there is a fair bit of headroom before disk bandwidth is a concern. And if disk bandwidth does become a concern, the outcome is just slow replay; crashing will not occur.
Read Bandwidth looks fine:

Store times look fine:

Replay times: No degradation to median or max.
Median:
Max:
Memory: Stable

Max Bin Count is extremely stable:

This was a little surprising to me, so I looked to see whether any background evictions are occurring, and there are:

As for foreground evictions, they are occurring at very low rates:

Finally, comparing the maximum bin value with the background flush: some minor overshoot can be seen on the base model, but none with this change.

Fixes #