Skip to content

MSC4140: rate limit management against delay_id#19830

Open
AndrewFerr wants to merge 19 commits into
element-hq:developfrom
AndrewFerr:msc4140-ratelimit-delay_id
Open

MSC4140: rate limit management against delay_id#19830
AndrewFerr wants to merge 19 commits into
element-hq:developfrom
AndrewFerr:msc4140-ratelimit-delay_id

Conversation

@AndrewFerr

@AndrewFerr AndrewFerr commented Jun 5, 2026

Copy link
Copy Markdown
Member

Do this instead of rate limiting based on IP address, to avoid the possibility of multiple unauthenticated clients on a shared IP address getting put into the same rate-limit quota and getting unfairly blocked.

This follows the suggestion made by the MSC: https://github.com/matrix-org/matrix-spec-proposals/blob/toger5/expiring-events-keep-alive/proposals/4140-delayed-events-futures.md#rate-limiting-for-heartbeats

Depends on #19794. Either that should be merged first, or this PR should absorb it. (Perhaps this would be a good use-case for stacked PRs.)

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Code style is correct (run the linters)

This is to allow authed requests to have their own ratelimit quotas.
Plus, since element-hq#19152, the delayed event management ratelimit hasn't
considered the requesting device ID, so don't mention that anymore.
@AndrewFerr AndrewFerr requested a review from a team as a code owner June 5, 2026 16:07
@AndrewFerr AndrewFerr marked this pull request as draft June 5, 2026 16:15
@AndrewFerr AndrewFerr marked this pull request as ready for review June 8, 2026 15:46
Do this instead of rate limiting based on IP address, to avoid the
possibility of multiple unauthenticated clients on a shared IP address
getting put into the same rate-limit quota and getting unfairly blocked.

This follows the suggestion made by the MSC:
https://github.com/matrix-org/matrix-spec-proposals/blob/toger5/expiring-events-keep-alive/proposals/4140-delayed-events-futures.md#rate-limiting-for-heartbeats
For unauthenticated requests to delayed event management endpoints,
use different rate limits for different source IP addresses, instead of
having all such requests share the same limit (per delay_id).
Currently, it should be awaited only by the /cancel and /send delayed
event management endpoints, but not /restart, which can be run by
workers that do not set _initialized_from_db and thus can't await it.
@AndrewFerr AndrewFerr force-pushed the msc4140-ratelimit-delay_id branch from 8a56dfe to 3745bd7 Compare June 10, 2026 14:44
Comment thread synapse/handlers/delayed_events.py
anoadragon453 and others added 5 commits June 18, 2026 10:58
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
Also update docstring on _mgmt_ratelimit to mention how it now keys
ratelimits on delay_id + requester
@AndrewFerr

Copy link
Copy Markdown
Member Author

If it would help with the review, I could rebase this PR's branch to exclude commits from #19794, as that is now merged.

This comment was marked as resolved.

"""
assert self._is_master
await self._mgmt_ratelimit(request)
await self._mgmt_ratelimit(request, delay_id)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this weaken the rate-limiting too much? Now that a user can spam requests, as long as they cycle through (effectively unlimited) delay IDs?

I get the reasoning that an external service would end up getting rate-limited if it's managing lots of delay IDs for a bunch of users... but would it not make sense to just exclude such a service from rate-limiting?

And finally, I get that lots of logged-out users on the same corporate VPN would end up getting rate-limited due to sharing one IP... but at this point, what is the rate-limit protecting against? Seems it's only stopping a buggy client from spamming the same endpoint accidentally?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this weaken the rate-limiting too much? Now that a user can spam requests, as long as they cycle through (effectively unlimited) delay IDs?

The worst that such spammy requests would do is make the server look up from a database index (to check for the existence of the target delayed event). IMO this is an acceptable cost to allow for the kind of rate-limiting this PR is after, but if that's still too costly, we can consider another approach.

would it not make sense to just exclude such a service from rate-limiting?

That's not a bad idea, but then the question is how Synapse would identify such an external service, given the "service" is currently just any arbitrary HTTP client that's been given some delay IDs.

what is the rate-limit protecting against? Seems it's only stopping a buggy client from spamming the same endpoint accidentally?

Yes, be it a buggy or malicious client. It takes non-trivial work to reset a delayed event (past the initial quick check for the existence of the delayed event), so it's worth something to protect it with a rate limit.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It bears mentioning that the long-term solution to all of this is to have the external services be OAuth 2.0 clients, with a token that has scoped access to these management endpoints (detailed in MSC4140). Rate limits could then be applied per client, as the clients would be identifiable by their token. The idea of using delay_id grouping is just a stopgap measure to apply some kind of rate limit in the absence of a way to identify an unauthed external service.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants