Fix two bugs in the flag_existing_quarantined_media background update#19901
Conversation
…media The `flag_existing_quarantined_media` background update paged through `remote_media_cache` with `media_origin >= ? AND media_id > ?`, which ANDs the two columns independently. Once an origin was fully processed, rows in a later origin whose `media_id` was <= the last processed `media_id` would be silently skipped and never flagged. Use a proper row-value tuple comparison `(media_origin, media_id) > (?, ?)` (matching the table's unique constraint/index), as done elsewhere in the codebase. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`flag_quarantined` ran both the local and remote queries on every iteration. Once one table was exhausted but the other still had rows to process, the background update kept returning a positive count, so the finished table's (now empty) query needlessly re-ran on every subsequent iteration until the whole update completed. Track per-table completion (`local_done`/`remote_done`) in the background update progress and skip a table's query once it has returned an empty batch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The `flag_existing_quarantined_media` background update shipped (in 94/03) with a broken remote media query that silently skipped some already-quarantined remote media, and servers may have already run it to completion. Now that the query is fixed, add a schema delta that deletes any existing background update row and re-inserts it so the update runs again from scratch and flags the media missed on the first run. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add the changelog entry for PR #19901, noting the bug was introduced in v1.152.0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
88661e0 to
5ed45d6
Compare
| # We use a >= ? on the media origin to avoid missing records when media IDs | ||
| # collide between origins (the table's unique constraint is on `(media_origin, media_id)`). | ||
| # Filtering by `(media_origin, media_id)` also makes sure we're using an index. |
There was a problem hiding this comment.
We're losing this context
There was a problem hiding this comment.
I'm not quite sure what context, but 99b235f makes it clearer that we're using an index and that we'll paginate over every row. The >= comment is no longer relevant.
| # We page through `remote_media_cache` with a tuple comparison on | ||
| # `(media_origin, media_id)` (matching the table's unique constraint and | ||
| # index). Comparing the columns independently (e.g. | ||
| # `media_origin >= ? AND media_id > ?`) would incorrectly skip rows in a | ||
| # newly-reached origin whose media_id is <= the last processed media_id. |
There was a problem hiding this comment.
This comment is a bit clunky for what it's trying to convey.
| @@ -0,0 +1 @@ | |||
| Fix the `flag_existing_quarantined_media` background update skipping some quarantined remote media and re-running exhausted queries unnecessarily. Introduced in v1.152.0. | |||
There was a problem hiding this comment.
How did you notice these flaws?
There was a problem hiding this comment.
I saw that it was taking ~2s to do 1 item, and while staring at the query realised that it was wrong. That didn't explain the slowness, but then I realised we were repeatedly calling the query with the same args.
|
fwiw, we spotted the remote media bug after the merge but didn't consider it important enough to fix. Remote media is purged quickly enough where it doesn't provide much value into the hashing system, and if it was to be re-downloaded or accessed then the built-in Synapse auto-quarantine function would have the same effect as our tooling. Further, as we roll out hash matching at more and more levels, we're likely to catch future media before it reaches Synapse's stores anyway. The initial import is fully intended to be best effort, not complete. |
| -- We delete any existing row first: on servers where the update already completed the row | ||
| -- was removed, and on servers where it's still pending/mid-run this clears the stale | ||
| -- progress so the re-insert below starts cleanly (and avoids a primary key collision). | ||
| DELETE FROM background_updates WHERE update_name = 'flag_existing_quarantined_media'; |
There was a problem hiding this comment.
it's a huge shame to lose so many months of progress, especially when we're still running the background update on matrix.org. Can we update the job to run after the existing one or otherwise de-duplicate data going into the table?
There was a problem hiding this comment.
This is at odds with what you wrote above. Why do you care?
fwiw, we spotted the remote media bug after the merge but didn't consider it important enough to fix.
The initial import is fully intended to be best effort, not complete.
There was a problem hiding this comment.
On matrix.org we can manually set to "last_local_media_id":"xJFXwTD" and then it will start from where it left of for local media.
This is at odds with what you wrote above. Why do you care?
I think we care about local media, but not about remote media, maybe?
Co-authored-by: Eric Eastwood <erice@element.io>
I mean, fair, but I'm not a fan of leaving around broken code. I reckon it missed the majority of remote media. (It is also causing the local media to be handled at ~0.5Hz) |
The
flag_existing_quarantined_mediabackground update (added in #19558, shipped in v1.152.0) back-populates thequarantined_media_changestable with media that was already quarantined. It has two bugs, fixed here in one commit each.1. Some quarantined remote media is silently skipped
The remote-media query paged through
remote_media_cachewith:This ANDs the two key columns independently rather than comparing them as a tuple. Once an origin has been fully processed (e.g.
media_idreacheszzzfor origina.example), rows in a later origin whosemedia_idis<=the last processedmedia_id(e.g.b.example/aaa) fail themedia_id > ?test and are never flagged.Fixed by using a proper row-value tuple comparison:
Both the minimum supported SQLite (3.37.2) and PostgreSQL support row-value comparisons.
2. Exhausted queries keep re-running every iteration
flag_quarantinedran both the local and remote queries on every iteration. When one table was exhausted but the other still had rows, the update kept returning a positive count, so the finished table's (now empty) query needlessly re-ran on every subsequent iteration until the whole update completed.This could add significant time to the transaction, as since the rows were deleted it could scan a significant portion of the table each time.
Fixed by tracking per-table completion (
local_done/remote_done) in the background-update progress and skipping a table's query once it has returned an empty batch. The progress dict was also restructured into an incremental build for readability.A thrid commit will cause this background job to run again. This will re-insert media again into the
quarantined_media_changes, but that should be fine.cc @turt2live @MadLittleMods