Flush on upsert with threshold eviction enabled #12030
roryharr wants to merge 2 commits into anza-xyz:master from
Conversation
pub fn should_flush(&self, entries_in_bin: usize) -> bool {
/// Returns true when `entries_in_bin` exceeds the per-bin high-water mark, indicating
/// the bin is over the threshold and flush/eviction should occur.
pub fn bin_at_threshold(&self, entries_in_bin: usize) -> bool {
should_flush no longer made sense at the call site, since it is now used to determine eviction. The rename could be handled as a separate PR, but it is hard to justify in isolation.
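The renamed helper amounts to a simple high-water-mark check. A minimal sketch, with assumed struct and field names rather than the real agave types:

```rust
/// Minimal sketch of a per-bin high-water-mark check. The struct and field
/// names here are assumptions for illustration, not the actual agave code.
pub struct BucketMapHolderStats {
    /// Assumed per-bin entry budget (e.g. derived from the index size limit).
    count_entries_per_bin_limit: usize,
}

impl BucketMapHolderStats {
    /// Returns true when `entries_in_bin` exceeds the per-bin high-water mark,
    /// indicating that flush/eviction should occur.
    pub fn bin_at_threshold(&self, entries_in_bin: usize) -> bool {
        entries_in_bin > self.count_entries_per_bin_limit
    }
}
```

Naming the predicate after the state it detects (`bin_at_threshold`) rather than the action it triggers (`should_flush`) keeps it honest now that the caller may evict rather than flush.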
Err(err) => disk.grow(err),
}
}
| self.get_only_in_mem(pubkey, false, |entry| { |
Originally I was trying to avoid this lookup, but I couldn't figure out a way to do so without adding memory. I attempted to use a CAS, but it would have required an Arc, or required the same lookup.
&& self.storage.bin_at_threshold(map.len())
&& !map.contains_key(pubkey)
{
    let evict_key = map
In practice, even the smallest bin size (25GB) has 24k entries and <50 dirty entries. With randomized binning, targeting a specific bin is now difficult, so the risk is low.
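The `evict_key` selection above has to pick a victim out of the bin. One plausible shape for that selection, as a hedged sketch over a plain `HashMap` stand-in (the real bin type and selection policy may differ):

```rust
use std::collections::HashMap;

/// Sketch of picking an eviction victim from an over-threshold bin: prefer a
/// clean (non-dirty) entry, so dropping it from memory loses no pending state.
/// The map layout (key -> dirty flag) is a simplification for illustration.
fn pick_evict_key(map: &HashMap<u64, bool>) -> Option<u64> {
    map.iter()
        .find(|(_, &dirty)| !dirty)
        .map(|(&k, _)| k)
}
```

With ~24k entries and fewer than 50 dirty ones per bin, a clean victim is found almost immediately, which is why this scan does not show up as a cost in practice.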
if v.dirty() {
    candidates_to_flush.push(*k);
if collect_flush_candidates {
This is to avoid double flushes in foreground/background.
user_fn: impl FnOnce(SlotListWriteGuard<T>) -> RT,
) -> Option<RT> {
self.get_internal_inner(pubkey, |entry| {
    let mut write_through_args: Option<(Slot, T)> = None;
This covers the non-upsert paths: purge_exact, clean_rooted_entries, and purge_roots. With this, all paths flush as appropriate.
pub capacity_in_mem: AtomicUsize,
pub flush_entries_updated_on_disk: AtomicU64,
pub flush_entries_evicted_from_mem: AtomicU64,
pub flush_entries_updated_on_disk_immediate: AtomicU64,
Useful to differentiate what is happening in the immediate path vs the background path.
reclaims,
reclaim,
);
should_write_through =
Upsert path: This code doesn't modify the cached path at all, only the store path.
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #12030     +/-   ##
=========================================
  Coverage    83.3%    83.4%
=========================================
  Files         861      861
  Lines      322256   322491    +235
=========================================
+ Hits       268750   269023    +273
+ Misses      53506    53468     -38
I think we have one edge case here now. At startup, when generating the index, we identify any accounts with duplicate versions. Those duplicates are later passed to agave/accounts-db/src/accounts_index.rs (line 1141 in 82e67b6). This index entry won't be written back to disk explicitly by startup, and without a background thread writing to disk anymore, it won't be picked up there either. However, if we shrink/squash this storage later, then we will clean it up. So not never, but not guaranteed. It may be good to replace the call to

(Edit: And we could do this write-through-instead-of-make-dirty on its own, right now too.)
Interesting. I guess I didn't see this edge case because my testing was with fastboot; with fastboot the obsolete accounts are filtered during generate_index_for_slot and never added to the index at all. I think this can be fixed by making clean_and_unref_slot_list_on_startup call a variation on slot_list_mut. In general I think that's a better design. I'll post a separate PR to be merged first on master.
brooksprumo
left a comment
Did a first pass. Still need to go over tests too.
Posted here: #12069

Also ran perf numbers with this change (leading to flush). They were a little surprising: the extra time is very minimal. As you can see, ~3M entries are flushed right at the end of startup now, which is the right amount.
…ngle slot list entries to disk during upsert

This resolves bins overshooting and doubling in size, which defeated the purpose of having fixed-size bins
- Split up disk write into an individual function
- Simplified dirty check
- Updated commenting
- Renaming
a45766c to 927509b
let disk_entry = [(slot, account_info.into())];
let grow_us = Self::write_to_disk(disk, pubkey, &disk_entry);
Self::update_stat(&self.stats().flush_entries_updated_on_disk_immediate, 1);
Self::update_stat(&self.stats().flush_grow_us, grow_us);
My thought is to keep this as a single counter. All the grows should be during write_through in threshold mode. So if there is grow timing it must come from threshold mode.
The other counter (flush_entries_updated_on_disk_immediate vs flush_entries_updated_on_disk_background) is useful to separate for debug/validation purposes. It allows us to verify that flush_entries_updated_on_disk_background is zero as expected.
Yep, I see my wording was ambiguous. I agree with a single metric for the grow time. I meant the caller would know if the 'num updated' was immediate vs background.
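The split this thread settles on, a single grow timer but separate "num updated" counters per path, might look like the sketch below. The counter names mirror the diff; the surrounding struct is a simplified assumption, not the real stats type:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Simplified stats sketch: one "num updated" counter per flush path lets us
/// verify that the background counter stays at zero when write-through
/// (threshold) mode is on, while grow time stays a single shared metric.
#[derive(Default)]
struct FlushStats {
    flush_entries_updated_on_disk: AtomicU64,           // background path
    flush_entries_updated_on_disk_immediate: AtomicU64, // write-through path
    flush_grow_us: AtomicU64,                           // single grow timer
}

impl FlushStats {
    fn record(&self, immediate: bool, n: u64, grow_us: u64) {
        let counter = if immediate {
            &self.flush_entries_updated_on_disk_immediate
        } else {
            &self.flush_entries_updated_on_disk
        };
        counter.fetch_add(n, Ordering::Relaxed);
        self.flush_grow_us.fetch_add(grow_us, Ordering::Relaxed);
    }
}
```

The caller already knows which path it is on, so distinguishing the counters costs one branch and gives a free validation signal (background should read zero in threshold mode).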
brooksprumo
left a comment
Code's looking good. One question below. Will go through tests next.
// This is a rare case; background eviction clears the excess over time.
if self.should_write_through
    && self.storage.is_bin_at_threshold(map.len())
    && !map.contains_key(pubkey)
Can we skip this 'contains' check?
I would imagine the common case is that when a pubkey exists in the map, we'll end up in the if arm above. Once we get to the else here, yes there is a chance this pubkey could've been inserted before we grab the bin's write lock.
In that case, if the pubkey does exist already, then we wouldn't strictly need to evict an entry. But if we do anyway, the index will still be correct, we just may have to do a lookup from disk later.
My reasoning is that searching through the bin for the pubkey is expensive (in the call to contains_key()), and I think the common case is that 'contains' returns false. So if we checked above and the pubkey didn't exist, and write_through is true, and is_bin_at_threshold is true, then we go ahead and always evict, which saves us the contains check.
Wdyt?
Ha, I tried a few other ways to remove it; I didn't think of just removing it.
I would say profile it, but I don't think this will show up on profiles: it only triggers the extra lookup when
- The bin is at the threshold
- The item we are looking for wasn't found in memory.
I do agree that the chance of that performing an unneeded eviction is extremely low due to the tightness of the race.
This will also be resolved in the follow-up PR discussed separately: with post-insertion eviction, there is no need for the contains call; eviction can just be done on the Vacant branch.
So I can change it here, but I'm just not sure there's much benefit, and I will be fixing it in the future!
Up to you though.
Side Note: One other interesting option to think about is grabbing the disk entry before grabbing the write lock. It might have a small race though.
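The follow-up idea (evict only on the Vacant branch, making the separate contains check unnecessary) could look roughly like this. It uses a plain `HashMap` stand-in and an arbitrary victim choice, both simplifications rather than the real bin code:

```rust
use std::collections::HashMap;

/// Sketch of post-insertion eviction: insert via the entry API, and only when
/// the key was actually vacant (a genuine growth of the bin) evict some other
/// entry to bring the bin back under its threshold. No contains_key() needed,
/// because the Vacant/Occupied branch already answers that question.
fn insert_with_eviction(
    map: &mut HashMap<u64, u64>,
    key: u64,
    value: u64,
    threshold: usize,
) -> Option<u64> {
    use std::collections::hash_map::Entry;
    let was_vacant = match map.entry(key) {
        Entry::Occupied(mut o) => {
            o.insert(value); // key already cached: no growth, no eviction
            false
        }
        Entry::Vacant(v) => {
            v.insert(value);
            true
        }
    };
    if was_vacant && map.len() > threshold {
        // Pick any victim other than the key we just inserted (real code
        // would prefer a clean entry; this is a simplification).
        let victim = map.keys().find(|&&k| k != key).copied();
        if let Some(k) = victim {
            map.remove(&k);
            return Some(k);
        }
    }
    None
}
```

Because the eviction decision happens after the insert, the transient race where another thread inserts the same pubkey first simply lands in the Occupied branch and evicts nothing.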
// Confirm all entries are clean (write-through fired).
for pubkey in &initial_pubkeys {
    index.get_only_in_mem(pubkey, false, |entry| {
        let entry = entry.expect("entry should be in memory");
        assert!(!entry.dirty());
    });
}
Should we also assert the pubkeys have entries on disk too?


This resolves bins overshooting and doubling in size, which defeated the purpose of having fixed-size bins.
Problem
Threshold-based eviction is designed to keep each in-memory bin limited to a static size. However, under load, this can overshoot, leading to bins doubling in size. It is plausible that all bins could double in size, and potentially grow more. This defeats the purpose of the feature.
Summary of Changes
Data: Comparisons
- 25GB disk index with this change: RoryPpbo6f9tYF4e8Y9twGBwkDCpjQXYsnWGYj5cTE2
- 50GB disk index without this change: 3XU8nQ1Wfz5u36KxPRw92pCQe4Qx58vYQQzypmVrC9vx
- Edge canary with in-memory index: mce1wLhL38KeLgaxNmpgFUipKySJwChtsCqbfbuqBi1
- Edge canary with in-memory index: mce2QVkKHe1dyuNdRSXUreZkpQVnFxcPBdM9Ya5Hns9
Biggest downside is disk write bandwidth increase:
Note
This comparison uses BtJmUemxN7YQHCdkAmPiDitHEEH9eSMrMJWeHKERX7hE as it is running a 25GB index. All other data was gathered earlier.
This is ~50 MB/s and scales with the number of pubkeys that are only stored once per flush cycle (so any unique writes, and any rare accesses). I don't think this is scalable in the long term, but there is a fair bit of headroom before disk bandwidth is a concern. And if disk bandwidth does become a concern, the outcome is just slow replay; crashing will not occur.
Read Bandwidth looks fine:

Store times look fine:

Replay times: No degradation to median or max.
Median:
Max:
Memory: Stable

Max Bin Count is extremely stable:

This was a little surprising to me, so I looked to see whether any background evictions are occurring, and there are:

As for foreground evictions, they are occurring at very low rates:

Finally, comparing the maximum bin value with the background flush: some minor overshoot can be seen on the base model, but none with this change.

Fixes #