MSC4140: rate limit management against delay_id#19830
Conversation
This is to allow authed requests to have their own ratelimit quotas.
Plus, since element-hq#19152, the delayed event management ratelimit hasn't considered the requesting device ID, so don't mention that anymore.
Do this instead of rate limiting based on IP address, to avoid the possibility of multiple unauthenticated clients on a shared IP address getting put into the same rate-limit quota and getting unfairly blocked. This follows the suggestion made by the MSC: https://github.com/matrix-org/matrix-spec-proposals/blob/toger5/expiring-events-keep-alive/proposals/4140-delayed-events-futures.md#rate-limiting-for-heartbeats
For unauthenticated requests to delayed event management endpoints, use different rate limits for different source IP addresses, instead of having all such requests share the same limit (per delay_id).
Currently, it should be awaited only by the /cancel and /send delayed event management endpoints, but not /restart, which can be run by workers that do not set _initialized_from_db and thus can't await it.
8a56dfe to
3745bd7
Compare
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
Also update docstring on _mgmt_ratelimit to mention how it now keys ratelimits on delay_id + requester
|
If it would help with the review, I could rebase this PR's branch to exclude commits from #19794, as that is now merged. |
| """ | ||
| assert self._is_master | ||
| await self._mgmt_ratelimit(request) | ||
| await self._mgmt_ratelimit(request, delay_id) |
There was a problem hiding this comment.
Does this weaken the rate-limiting too much? Now that a user can spam requests, as long as they cycle through (effectively unlimited) delay IDs?
I get the reasoning that an external service would end up getting rate-limited if it's managing lots of delay IDs for a bunch of users... but would it not make sense to just exclude such a service from rate-limiting?
And finally, I get that lots of logged-out users on the same corporate VPN would end up getting rate-limited due to sharing one IP... but at this point, what is the rate-limit protecting against? Seems it's only stopping a buggy client from spamming the same endpoint accidentally?
There was a problem hiding this comment.
Does this weaken the rate-limiting too much? Now that a user can spam requests, as long as they cycle through (effectively unlimited) delay IDs?
The worst that such spammy requests would do is make the server look up from a database index (to check for the existence of the target delayed event). IMO this is an acceptable cost to allow for the kind of rate-limiting this PR is after, but if that's still too costly, we can consider another approach.
would it not make sense to just exclude such a service from rate-limiting?
That's not a bad idea, but then the question is how Synapse would identify such an external service, given the "service" is currently just any arbitrary HTTP client that's been given some delay IDs.
what is the rate-limit protecting against? Seems it's only stopping a buggy client from spamming the same endpoint accidentally?
Yes, be it a buggy or malicious client. It takes non-trivial work to reset a delayed event (past the initial quick check for the existence of the delayed event), so it's worth something to protect it with a rate limit.
There was a problem hiding this comment.
It bears mentioning that the long-term solution to all of this is to have the external services be OAuth 2.0 clients, with a token that has scoped access to these management endpoints (detailed in MSC4140). Rate limits could then be applied per client, as the clients would be identifiable by their token. The idea of using delay_id grouping is just a stopgap measure to apply some kind of rate limit in the absence of a way to identify an unauthed external service.
Do this instead of rate limiting based on IP address, to avoid the possibility of multiple unauthenticated clients on a shared IP address getting put into the same rate-limit quota and getting unfairly blocked.
This follows the suggestion made by the MSC: https://github.com/matrix-org/matrix-spec-proposals/blob/toger5/expiring-events-keep-alive/proposals/4140-delayed-events-futures.md#rate-limiting-for-heartbeats
Depends on #19794. Either that should be merged first, or this PR should absorb it. (Perhaps this would be a good use-case for stacked PRs.)
Pull Request Checklist
EventStoretoEventWorkerStore.".code blocks.