Skip to content

Fix two bugs in the flag_existing_quarantined_media background update#19901

Merged
erikjohnston merged 6 commits into
developfrom
erikj/fix/flag-existing-quarantined-media
Jul 2, 2026
Merged

Fix two bugs in the flag_existing_quarantined_media background update#19901
erikjohnston merged 6 commits into
developfrom
erikj/fix/flag-existing-quarantined-media

Conversation

@erikjohnston

Copy link
Copy Markdown
Member

The flag_existing_quarantined_media background update (added in #19558, shipped in v1.152.0) back-populates the quarantined_media_changes table with media that was already quarantined. It has two bugs, fixed here in one commit each.

1. Some quarantined remote media is silently skipped

The remote-media query paged through remote_media_cache with:

WHERE quarantined_by IS NOT NULL
    AND media_origin >= ? AND media_id > ?

This ANDs the two key columns independently rather than comparing them as a tuple. Once an origin has been fully processed (e.g. media_id reaches zzz for origin a.example), rows in a later origin whose media_id is <= the last processed media_id (e.g. b.example / aaa) fail the media_id > ? test and are never flagged.

Fixed by using a proper row-value tuple comparison:

WHERE quarantined_by IS NOT NULL
    AND (media_origin, media_id) > (?, ?)

Both the minimum supported SQLite (3.37.2) and PostgreSQL support row-value comparisons.

2. Exhausted queries keep re-running every iteration

flag_quarantined ran both the local and remote queries on every iteration. When one table was exhausted but the other still had rows, the update kept returning a positive count, so the finished table's (now empty) query needlessly re-ran on every subsequent iteration until the whole update completed.

This could add significant time to the transaction, as since the rows were deleted it could scan a significant portion of the table each time.

Fixed by tracking per-table completion (local_done / remote_done) in the background-update progress and skipping a table's query once it has returned an empty batch. The progress dict was also restructured into an incremental build for readability.


A thrid commit will cause this background job to run again. This will re-insert media again into the quarantined_media_changes, but that should be fine.

cc @turt2live @MadLittleMods

erikjohnston and others added 4 commits July 1, 2026 10:15
…media

The `flag_existing_quarantined_media` background update paged through
`remote_media_cache` with `media_origin >= ? AND media_id > ?`, which ANDs the
two columns independently. Once an origin was fully processed, rows in a
later origin whose `media_id` was <= the last processed `media_id` would be
silently skipped and never flagged.

Use a proper row-value tuple comparison `(media_origin, media_id) > (?, ?)`
(matching the table's unique constraint/index), as done elsewhere in the
codebase.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`flag_quarantined` ran both the local and remote queries on every iteration.
Once one table was exhausted but the other still had rows to process, the
background update kept returning a positive count, so the finished table's
(now empty) query needlessly re-ran on every subsequent iteration until the
whole update completed.

Track per-table completion (`local_done`/`remote_done`) in the background
update progress and skip a table's query once it has returned an empty batch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The `flag_existing_quarantined_media` background update shipped (in 94/03) with
a broken remote media query that silently skipped some already-quarantined
remote media, and servers may have already run it to completion. Now that the
query is fixed, add a schema delta that deletes any existing background update
row and re-inserts it so the update runs again from scratch and flags the media
missed on the first run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add the changelog entry for PR #19901, noting the bug was introduced in
v1.152.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@erikjohnston erikjohnston force-pushed the erikj/fix/flag-existing-quarantined-media branch from 88661e0 to 5ed45d6 Compare July 1, 2026 10:51
@erikjohnston erikjohnston marked this pull request as ready for review July 1, 2026 13:52
@erikjohnston erikjohnston requested a review from a team as a code owner July 1, 2026 13:52
@MadLittleMods MadLittleMods added A-Media-Repository A-Abuse Reports, media quarantine, policy servers, etc labels Jul 1, 2026
Comment on lines -285 to -287
# We use a >= ? on the media origin to avoid missing records when media IDs
# collide between origins (the table's unique constraint is on `(media_origin, media_id)`).
# Filtering by `(media_origin, media_id)` also makes sure we're using an index.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're losing this context

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure what context, but 99b235f makes it clearer that we're using an index and that we'll paginate over every row. The >= comment is no longer relevant.

Comment thread synapse/storage/databases/main/room.py Outdated
Comment on lines +297 to +301
# We page through `remote_media_cache` with a tuple comparison on
# `(media_origin, media_id)` (matching the table's unique constraint and
# index). Comparing the columns independently (e.g.
# `media_origin >= ? AND media_id > ?`) would incorrectly skip rows in a
# newly-reached origin whose media_id is <= the last processed media_id.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is a bit clunky for what it's trying to convey.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have tried to reword a little 99b235f

Comment thread changelog.d/19901.bugfix Outdated
Comment thread changelog.d/19901.bugfix Outdated
@@ -0,0 +1 @@
Fix the `flag_existing_quarantined_media` background update skipping some quarantined remote media and re-running exhausted queries unnecessarily. Introduced in v1.152.0.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you notice these flaws?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw that it was taking ~2s to do 1 item, and while staring at the query realised that it was wrong. That didn't explain the slowness, but then I realised we were repeatedly calling the query with the same args.

@turt2live

Copy link
Copy Markdown
Member

fwiw, we spotted the remote media bug after the merge but didn't consider it important enough to fix. Remote media is purged quickly enough where it doesn't provide much value into the hashing system, and if it was to be re-downloaded or accessed then the built-in Synapse auto-quarantine function would have the same effect as our tooling. Further, as we roll out hash matching at more and more levels, we're likely to catch future media before it reaches Synapse's stores anyway.

The initial import is fully intended to be best effort, not complete.

-- We delete any existing row first: on servers where the update already completed the row
-- was removed, and on servers where it's still pending/mid-run this clears the stale
-- progress so the re-insert below starts cleanly (and avoids a primary key collision).
DELETE FROM background_updates WHERE update_name = 'flag_existing_quarantined_media';

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a huge shame to lose so many months of progress, especially when we're still running the background update on matrix.org. Can we update the job to run after the existing one or otherwise de-duplicate data going into the table?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is at odds with what you wrote above. Why do you care?

fwiw, we spotted the remote media bug after the merge but didn't consider it important enough to fix.

The initial import is fully intended to be best effort, not complete.

@erikjohnston erikjohnston Jul 2, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On matrix.org we can manually set to "last_local_media_id":"xJFXwTD" and then it will start from where it left of for local media.

This is at odds with what you wrote above. Why do you care?

I think we care about local media, but not about remote media, maybe?

erikjohnston and others added 2 commits July 2, 2026 09:54
Co-authored-by: Eric Eastwood <erice@element.io>
@erikjohnston

Copy link
Copy Markdown
Member Author

fwiw, we spotted the remote media bug after the merge but didn't consider it important enough to fix.

I mean, fair, but I'm not a fan of leaving around broken code. I reckon it missed the majority of remote media.

(It is also causing the local media to be handled at ~0.5Hz)

@erikjohnston erikjohnston merged commit cb60c15 into develop Jul 2, 2026
45 of 47 checks passed
@erikjohnston erikjohnston deleted the erikj/fix/flag-existing-quarantined-media branch July 2, 2026 09:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-Abuse Reports, media quarantine, policy servers, etc A-Media-Repository

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants