@rkistner rkistner commented Aug 27, 2025

Currently, checksums are calculated by summing over all data in each bucket. We then cache the result in memory and incrementally update it as new data arrives.
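A minimal sketch of that in-memory caching, for illustration only. The `Op` shape, the `fetchOps` callback and the `BucketChecksumCache` class are hypothetical names, not the service's actual API, and the sketch assumes per-operation checksums are unsigned 32-bit integers that are summed with wraparound.

```typescript
// Hypothetical operation shape; not the actual service types.
interface Op {
  opId: number; // assumed to increase monotonically within a bucket
  checksum: number; // assumed to be an unsigned 32-bit integer
}

// Assumption: checksums are additive modulo 2^32.
function addChecksums(a: number, b: number): number {
  return (a + b) >>> 0;
}

class BucketChecksumCache {
  private readonly cache = new Map<string, { lastOpId: number; checksum: number }>();
  private readonly fetchOps: (bucket: string, afterOpId: number) => Op[];

  // fetchOps is a hypothetical storage callback that returns the
  // operations of a bucket with opId greater than afterOpId, in order.
  constructor(fetchOps: (bucket: string, afterOpId: number) => Op[]) {
    this.fetchOps = fetchOps;
  }

  getChecksum(bucket: string): number {
    // A cold cache starts from zero, so the first call sums the whole bucket.
    // This is the slow path described above.
    const cached = this.cache.get(bucket) ?? { lastOpId: 0, checksum: 0 };
    let lastOpId = cached.lastOpId;
    let checksum = cached.checksum;

    // Later calls only add the operations written since the cached entry.
    for (const op of this.fetchOps(bucket, lastOpId)) {
      checksum = addChecksums(checksum, op.checksum);
      lastOpId = op.opId;
    }

    this.cache.set(bucket, { lastOpId, checksum });
    return checksum;
  }
}
```

Because the cache lives in memory, a process restart empties it and every bucket goes back through the full scan on first use.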

The issue is that for large buckets, the initial summing can be very slow and time out. #338 mitigates this by increasing the timeout, but users can still see large delays when they connect after the process has restarted.

We could theoretically keep a checksum per bucket up to date while replicating, but in practice we may need older checksums for the last minute or two, which a single running checksum won't provide.

So the workaround here is to pre-compute checksums for each bucket as part of the compact process, which adds very little overhead. If we assume a daily compact job, this would give a cached checksum that covers most cases unless the majority of the bucket was created in the last day.
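The idea can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `BucketOp`, `CompactedState`, `compactBucket` and `checksumAt` are hypothetical names, and checksums are assumed to be additive modulo 2^32. The compact step persists a checksum up to some operation id, and a later request only sums the operations after that point.

```typescript
// Hypothetical shapes; not the actual storage schema.
interface BucketOp {
  opId: number;
  checksum: number; // assumed unsigned 32-bit
}

interface CompactedState {
  compactedOpId: number; // last operation covered by the stored checksum
  checksum: number;
}

// Assumption: checksums are additive modulo 2^32.
function add32(a: number, b: number): number {
  return (a + b) >>> 0;
}

// Computed as a side effect of compacting, which already reads every operation.
function compactBucket(ops: BucketOp[]): CompactedState {
  let checksum = 0;
  let compactedOpId = 0;
  for (const op of ops) {
    checksum = add32(checksum, op.checksum);
    compactedOpId = Math.max(compactedOpId, op.opId);
  }
  return { compactedOpId, checksum };
}

// Checksum of the bucket as of `checkpoint`.
function checksumAt(state: CompactedState | undefined, ops: BucketOp[], checkpoint: number): number {
  let checksum = 0;
  let afterOpId = 0;
  // The stored checksum is only usable for checkpoints at or after the compact point.
  if (state !== undefined && state.compactedOpId <= checkpoint) {
    checksum = state.checksum;
    afterOpId = state.compactedOpId;
  }
  for (const op of ops) {
    if (op.opId > afterOpId && op.opId <= checkpoint) {
      checksum = add32(checksum, op.checksum);
    }
  }
  return checksum;
}
```

In this sketch a checkpoint slightly older than the latest one can still be served, as long as it is not older than the last compact. In a real store the second function would query only the operations after `compactedOpId` rather than filter an in-memory array.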

Additionally, this starts calculating some stats per bucket: total number and size of operations at the last compact, and since then. This is not 100% accurate/consistent in all cases, but it would be a starting point for scheduling more incremental/on-demand compact jobs based on the number of new operations in each bucket (future PR).
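As a rough illustration of what such per-bucket stats could look like, and how a future scheduler might use them: the field names, the helper functions and the threshold below are all hypothetical and are not taken from this PR.

```typescript
// Hypothetical per-bucket stats record.
interface BucketStats {
  compactedCount: number; // operations present at the last compact
  compactedBytes: number; // total size of those operations
  addedCount: number; // operations written since the last compact
  addedBytes: number; // total size of those operations
}

function emptyStats(): BucketStats {
  return { compactedCount: 0, compactedBytes: 0, addedCount: 0, addedBytes: 0 };
}

// Called for each newly replicated operation.
function recordOp(stats: BucketStats, sizeBytes: number): BucketStats {
  return {
    ...stats,
    addedCount: stats.addedCount + 1,
    addedBytes: stats.addedBytes + sizeBytes
  };
}

// Called after a compact with the totals that remain in the bucket.
function recordCompact(remainingCount: number, remainingBytes: number): BucketStats {
  return {
    compactedCount: remainingCount,
    compactedBytes: remainingBytes,
    addedCount: 0,
    addedBytes: 0
  };
}

// Hypothetical scheduling rule: compact once enough new operations have accumulated.
function shouldCompact(stats: BucketStats, minNewOps: number): boolean {
  return stats.addedCount >= minNewOps;
}
```

A count-based rule like `shouldCompact` is only one possible policy. The byte totals would allow a size-based rule instead.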

TODO:

  • After an initial replication, we need another compact before the checksums are cached, which could result in a long period where users cannot sync due to the timeout. We need to run a compact before switching over to the newly replicated copy.

Alternatives

We could apply a caching technique similar to the current in-memory caching: Cache a series of past checksums, and expire them occasionally.

Caveats with this approach:

  • It would still not completely solve the case of the first checksum calculation being slow.
  • Managing and expiring these would become trickier.
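For comparison, the alternative could look roughly like the sketch below. This approach is not implemented in this PR, and `ChecksumSnapshot`, `ChecksumHistory` and the age-based expiry rule are hypothetical.

```typescript
// Hypothetical snapshot of a bucket checksum at a given operation id.
interface ChecksumSnapshot {
  opId: number;
  checksum: number;
  createdAt: number; // milliseconds; passed in by the caller so the sketch stays deterministic
}

class ChecksumHistory {
  private snapshots: ChecksumSnapshot[] = [];
  private readonly maxAgeMs: number;

  constructor(maxAgeMs: number) {
    this.maxAgeMs = maxAgeMs;
  }

  record(opId: number, checksum: number, now: number): void {
    this.snapshots.push({ opId, checksum, createdAt: now });
  }

  // Latest snapshot that does not extend past the requested checkpoint.
  findAtOrBefore(checkpoint: number): ChecksumSnapshot | undefined {
    let best: ChecksumSnapshot | undefined;
    for (const snapshot of this.snapshots) {
      if (snapshot.opId <= checkpoint && (best === undefined || snapshot.opId > best.opId)) {
        best = snapshot;
      }
    }
    return best;
  }

  // Drop snapshots older than maxAgeMs.
  expire(now: number): void {
    this.snapshots = this.snapshots.filter((s) => now - s.createdAt <= this.maxAgeMs);
  }
}
```

When `findAtOrBefore` returns `undefined`, the caller still has to sum the bucket from the start, which is the first caveat above. Choosing when to call `record` and `expire` is the second.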

changeset-bot bot commented Aug 27, 2025

🦋 Changeset detected

Latest commit: 5da2232

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 11 packages
| Name | Type |
| --- | --- |
| @powersync/service-module-postgres-storage | Minor |
| @powersync/service-module-mongodb-storage | Minor |
| @powersync/service-core-tests | Minor |
| @powersync/service-module-postgres | Minor |
| @powersync/service-module-mongodb | Minor |
| @powersync/service-core | Minor |
| @powersync/service-module-mysql | Minor |
| @powersync/service-schema | Minor |
| @powersync/service-image | Minor |
| @powersync/service-module-core | Patch |
| test-client | Patch |

@rkistner rkistner marked this pull request as ready for review August 28, 2025 08:40

@stevensJourney stevensJourney left a comment

Looks good to me. I couldn't spot any issues from my side.

@rkistner rkistner merged commit 6d4a4d1 into main Aug 28, 2025
22 checks passed
@rkistner rkistner deleted the compact-checksums branch August 28, 2025 11:56