[MongoDB Storage] Pre-calculate checksums when compacting #341
Currently, checksums are calculated by summing over all data in each bucket. We then cache the result in memory and update it incrementally as new data arrives.
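For illustration, a minimal sketch of this incremental scheme, assuming an additive 32-bit checksum per bucket (all names here are hypothetical, not the actual storage API):

```typescript
// Illustrative only: an additive 32-bit bucket checksum, computed once by
// scanning all operations, then folded forward incrementally.
interface BucketChecksumState {
  lastOpId: bigint; // highest op id included in the checksum
  checksum: number; // running 32-bit additive checksum
  opCount: number;  // number of operations summed so far
}

const cache = new Map<string, BucketChecksumState>();

function addOp(state: BucketChecksumState, opId: bigint, opChecksum: number) {
  // Additive checksums wrap at 2^32, so new operations can be folded in
  // without re-reading the data that was already summed.
  state.checksum = (state.checksum + opChecksum) >>> 0;
  state.lastOpId = opId;
  state.opCount += 1;
}
```

The expensive part is populating `checksum` the first time: that initial scan is what times out on large buckets.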
The issue is that for large buckets, the initial summing can be very slow and time out. #338 mitigates this by increasing the timeout, but it can still cause large delays when users connect after the process has been restarted.
We could theoretically keep a checksum per bucket up-to-date while replicating, but in practice we may need older checksums for the last minute or two, which this won't provide.
So the workaround here is to pre-compute checksums for each bucket as part of the `compact` process, requiring very little additional overhead. If we assume a daily compact job, this would give a cached checksum that covers most cases, unless the majority of the bucket was created in the last day.

Additionally, this starts calculating some stats per bucket: the total number and size of operations at the last compact, and since then. This is not 100% accurate/consistent in all cases, but it would be a starting point for scheduling more incremental/on-demand compact jobs based on the number of new operations in each bucket (future PR).
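As a rough sketch of what compact could persist per bucket, assuming the Node.js MongoDB driver; all field and function names here are illustrative, not the actual schema:

```typescript
import { Collection, Long } from 'mongodb';

// Hypothetical shape of the per-bucket state written during compact.
interface BucketStateDocument {
  _id: string;                // bucket name
  compacted_op_id: Long;      // last op covered by the pre-computed checksum
  compacted_checksum: number; // additive checksum up to compacted_op_id
  compacted_op_count: number; // total operations at last compact
  compacted_size: number;     // total size of operations at last compact
  ops_since_compact: number;  // maintained incrementally by replication
  size_since_compact: number;
}

// After compacting a bucket, persist the checksum so that a later checksum
// request only needs to sum operations newer than compacted_op_id.
async function storeCompactedState(
  buckets: Collection<BucketStateDocument>,
  bucket: string,
  opId: Long,
  checksum: number,
  opCount: number,
  size: number
): Promise<void> {
  await buckets.updateOne(
    { _id: bucket },
    {
      $set: {
        compacted_op_id: opId,
        compacted_checksum: checksum,
        compacted_op_count: opCount,
        compacted_size: size,
        ops_since_compact: 0,
        size_since_compact: 0,
      },
    },
    { upsert: true }
  );
}
```

With something like this in place, a checksum request at an op id at or beyond `compacted_op_id` only needs to sum the operations added since the last compact, rather than re-scanning the whole bucket.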
TODO:
Alternatives
We could apply a caching technique similar to the current in-memory caching: Cache a series of past checksums, and expire them occasionally.
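A minimal sketch of that alternative, assuming an in-memory map keyed by bucket and op id with time-based expiry (names and the retention policy are assumptions):

```typescript
// Illustrative only: a rolling history of (bucket, opId) -> checksum
// entries, expired periodically.
interface HistoricalChecksum {
  checksum: number; // additive checksum up to this op id
  cachedAt: number; // Date.now() at insertion time
}

// key: `${bucket}/${opId}`
const checksumHistory = new Map<string, HistoricalChecksum>();

function recordChecksum(bucket: string, opId: bigint, checksum: number) {
  checksumHistory.set(`${bucket}/${opId}`, { checksum, cachedAt: Date.now() });
}

function expireOlderThan(maxAgeMs: number) {
  const cutoff = Date.now() - maxAgeMs;
  for (const [key, entry] of checksumHistory) {
    if (entry.cachedAt < cutoff) {
      checksumHistory.delete(key);
    }
  }
}
```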
Caveats with this approach: