
Flush on upsert with threshold eviction enabled#12030

Open
roryharr wants to merge 2 commits intoanza-xyz:masterfrom
roryharr:flush_on_upsert_resolved_race

Conversation


@roryharr roryharr commented Apr 17, 2026

This resolves bins overshooting and doubling in size, which defeats the purpose of having fixed-size bins.

Problem

Threshold-based eviction is designed to keep each in-memory bin limited to a static size. However, under load, bins can overshoot this limit and double in size. It is plausible for all bins to double, and potentially grow further, which defeats the purpose of the feature.

Summary of Changes

  • Always flush during upsert, or when an entry is modified to a 'normal' state
  • If a memory bin is full, evict an entry before adding a new entry
  • Disable background flushing during threshold testing (it is unneeded and creates flush races)
  • Tests

Data: Comparisons
25GB Disk Index with this change: RoryPpbo6f9tYF4e8Y9twGBwkDCpjQXYsnWGYj5cTE2
50GB Disk index without this change: 3XU8nQ1Wfz5u36KxPRw92pCQe4Qx58vYQQzypmVrC9vx
Edge Canary with in memory index: mce1wLhL38KeLgaxNmpgFUipKySJwChtsCqbfbuqBi1
Edge Canary with in memory index: mce2QVkKHe1dyuNdRSXUreZkpQVnFxcPBdM9Ya5Hns9

Biggest downside is disk write bandwidth increase:

Note

This comparison uses BtJmUemxN7YQHCdkAmPiDitHEEH9eSMrMJWeHKERX7hE since it is running a 25GB index. All other data was gathered earlier.

image

This is ~50MB/s and scales with the number of pubkeys that are stored only once per flush cycle (so any unique writes, and any rare accesses). I don't think this is scalable in the long term, but there is a fair bit of headroom before disk bandwidth is a concern. And if disk bandwidth does become a concern, the outcome is just slow replay; crashing will not occur.

Read Bandwidth looks fine:
image

Store times look fine:
image

Replay times: No degradation to median or max.
Median: image

Max: image

Memory: Stable
image

Max Bin Count is extremely stable:
image

This was a little surprising to me, so I looked to see whether any background evictions are occurring, and there are:
image

As for foreground evictions, they are occurring at very low rates
image

Finally, comparing the maximum bin value with the background flush: some minor overshoot can be seen on the base model, but none with this change.
image

Fixes #

pub fn should_flush(&self, entries_in_bin: usize) -> bool {
/// Returns true when `entries_in_bin` exceeds the per-bin high-water mark, indicating
/// the bin is over the threshold and flush/eviction should occur.
pub fn bin_at_threshold(&self, entries_in_bin: usize) -> bool {
roryharr (author):

should_flush no longer made sense on the calling side, as it's now used to determine eviction. It could be handled as a separate PR, but it is hard to justify in isolation.

Err(err) => disk.grow(err),
}
}
self.get_only_in_mem(pubkey, false, |entry| {
roryharr (author):

Originally I was trying to avoid this lookup, but I couldn't figure out any way to do so without adding memory. I attempted to use a CAS, but it would've required an Arc, or required the same lookup.

&& self.storage.bin_at_threshold(map.len())
&& !map.contains_key(pubkey)
{
let evict_key = map
roryharr (author):

In practice, even the smallest bin size (25GB) has 24k entries and <50 dirty entries. With randomized binning, targeting a specific bin is now difficult so the risk is low.


if v.dirty() {
candidates_to_flush.push(*k);
if collect_flush_candidates {
roryharr (author):

This is to avoid double flushes in foreground/background.

user_fn: impl FnOnce(SlotListWriteGuard<T>) -> RT,
) -> Option<RT> {
self.get_internal_inner(pubkey, |entry| {
let mut write_through_args: Option<(Slot, T)> = None;
roryharr (author):

This covers the non-upsert paths: purge_exact, clean_rooted_entries, and purge_roots. With this, all paths flush as appropriate.

pub capacity_in_mem: AtomicUsize,
pub flush_entries_updated_on_disk: AtomicU64,
pub flush_entries_evicted_from_mem: AtomicU64,
pub flush_entries_updated_on_disk_immediate: AtomicU64,
roryharr (author):

Useful to differentiate what is happening in the immediate path vs the background path.

reclaims,
reclaim,
);
should_write_through =
roryharr (author):

Upsert path: This code doesn't modify the cached path at all, only the store path.


codecov-commenter commented Apr 17, 2026

Codecov Report

❌ Patch coverage is 97.42647% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.4%. Comparing base (89cf826) to head (927509b).
⚠️ Report is 4 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff            @@
##           master   #12030    +/-   ##
========================================
  Coverage    83.3%    83.4%            
========================================
  Files         861      861            
  Lines      322256   322491   +235     
========================================
+ Hits       268750   269023   +273     
+ Misses      53506    53468    -38     

@roryharr roryharr requested a review from brooksprumo April 18, 2026 04:47
@roryharr roryharr marked this pull request as ready for review April 18, 2026 04:47

brooksprumo commented Apr 20, 2026

I think we have one edge case here now. At startup when generating the index, we identify any accounts with duplicate versions. Those duplicates are later passed to mark_obsolete_accounts_at_startup(), which eventually calls clean_and_unref_slot_list_on_startup(), and inside we mark the entry as 'dirty':

entry.mark_dirty();

This index entry won't be written back to disk explicitly by startup. And without a background thread writing to disk anymore, it won't be picked up there anymore. However, if we shrink/squash this storage later, then we will clean it up. So not never, but not guaranteed.

It may be good to replace the call to mark_dirty() with writing to disk. Wdyt?

(Edit: And could do this write-through-instead-of-make-dirty on its own, right now too.)

roryharr (author) replied:

I think we have one edge case here now. At startup when generating the index, we identify any accounts with duplicate versions. Those duplicates are later passed to mark_obsolete_accounts_at_startup(), which eventually calls clean_and_unref_slot_list_on_startup(), and inside we mark the entry as 'dirty':

entry.mark_dirty();

This index entry won't be written back to disk explicitly by startup. And without a background thread writing to disk anymore, it won't be picked up there anymore. However, if we shrink/squash this storage later, then we will clean it up. So not never, but not guaranteed.

It may be good to replace the call to mark_dirty() with writing to disk. Wdyt?

Interesting. I guess I didn't see this edge case because my testing was with fastboot. With fastboot the obsolete accounts are filtered during generate_index_for_slot and never added to the index at all.

I think this can be fixed by making clean_and_unref_slot_list_on_startup call a variation on slot_list_mut. In general I think it's a better design.

I'll post a separate PR to be merged first on master.

@brooksprumo brooksprumo left a comment


Did a first pass. Still need to go over tests too.

roryharr (author), following up on the reply above:

Posted here: #12069

Also ran perf numbers with this change (leading to flush):

They were a little surprising: the extra time is very minimal.
The first entry is with the new PR, the second is baseline, and the third is with this PR on top of the new PR:
image

So low that I double checked:
image

As you can see, ~3m entries are flushed right at the end of startup now, which is the right amount.

…ngle slot list entries to disk during upsert

This resolves bins overshooting and doubling in size, which defeats the purpose of having fixed-size bins
- Split up disk write into individual function
- Simplified dirty check
- Updated commenting
- Renaming
@roryharr roryharr force-pushed the flush_on_upsert_resolved_race branch from a45766c to 927509b April 21, 2026 18:04
let disk_entry = [(slot, account_info.into())];
let grow_us = Self::write_to_disk(disk, pubkey, &disk_entry);
Self::update_stat(&self.stats().flush_entries_updated_on_disk_immediate, 1);
Self::update_stat(&self.stats().flush_grow_us, grow_us);
roryharr (author) commented Apr 21, 2026:
My thought is to keep this as a single counter. All the grows should be during write_through in threshold mode. So if there is grow timing it must come from threshold mode.

The other counter (flush_entries_updated_on_disk_immediate vs flush_entries_updated_on_disk_background) is useful to separate for debug/validation purposes. It allows us to verify that flush_entries_updated_on_disk_background is zero as expected.


Yep, I see my wording was ambiguous. I agree with a single metric for the grow time. I meant the caller would know if the 'num updated' was immediate vs background.

@brooksprumo brooksprumo left a comment

Code's looking good. One question below. Will go through tests next.

// This is a rare case; background eviction clears the excess over time.
if self.should_write_through
&& self.storage.is_bin_at_threshold(map.len())
&& !map.contains_key(pubkey)

Can we skip this 'contains' check?

I would imagine the common case is that when a pubkey exists in the map, we'll end up in the if arm above. Once we get to the else here, yes there is a chance this pubkey could've been inserted before we grab the bin's write lock.

In that case, if the pubkey does exist already, then we wouldn't strictly need to evict an entry. But if we do anyway, the index will still be correct, we just may have to do a lookup from disk later.

My reasoning is that searching through the bin for the pubkey is expensive (in the call to contains_key()), and I think the common case is that 'contains' returns false. So if we checked above and the pubkey didn't exist, and write_through is true, and is_bin_at_threshold is true, then we go ahead and always evict, which saves us the contains check.

Wdyt?

roryharr (author) replied Apr 21, 2026:

Ha, I tried a few other ways to remove it; I didn't think of just removing it outright.

I would say profile it, but I don't think this will show up on profiles: it only triggers the extra lookup when

  1. The bin is at the threshold
  2. The item we are looking for wasn't found in memory.

I do agree that the chance of that performing an unneeded eviction is extremely low due to the tightness of the race.

This will also be resolved in the follow-up PR discussed separately: with post-insertion eviction, there is no need for the contains call; eviction can just happen on the Vacant branch.

So I can change it here, but I'm just not sure there's much benefit, and I will be fixing it in the future!

Up to you though.

Side Note: One other interesting option to think about is grabbing the disk entry before grabbing the write lock. It might have a small race though.

@brooksprumo brooksprumo self-requested a review April 21, 2026 20:35
Comment on lines +2974 to +2980
// Confirm all entries are clean (write-through fired).
for pubkey in &initial_pubkeys {
index.get_only_in_mem(pubkey, false, |entry| {
let entry = entry.expect("entry should be in memory");
assert!(!entry.dirty());
});
}

Should we also assert the pubkeys have entries on disk too?
