add middleware for request prioritization #33951
Conversation
Branch force-pushed from f3096c1 to 6c499a9.
TBH, according to your test results, I do not see that it is really useful.
I think as a first step, just distinguishing between logged-in and not is enough. It also looks like it could be added for the API router.
We've been running this patch for a while, and while it takes a bit of tuning to get the right settings for a given server, it has alleviated a lot of outage concerns from waves of scrapers that don't respect robots.txt.
This was in the initial version we ran, but then users who are attempting to log in can still experience lower-priority traffic. Since the major source of traffic for public code hosting is scrapers trying to download content, deprioritizing those routes mitigates that concern.
So the original intention is to fight crawlers/spiders/robots?
Yes, I mentioned this in my description.
Sorry, I missed that part. The "perform the following" part and the "result" part attracted my attention. If the intention is to "protect against malicious scrapers", I think the ... Or, like https://github.com/go-gitea/gitea/pull/33951/files#r2006668195 said, mark the routes with priority.
I think I have some ideas about how to handle these "route paths" clearly. Could I push some commits into your branch?
Could you send a PR to my branch, or sketch it out in another commit? This would prevent merge conflicts on my branch.
Sorry, I don't quite understand what that means exactly.
We can use the chi router's "RoutePattern" and make the code testable.
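As a rough illustration of that suggestion (not the actual patch), here is a minimal sketch of how chi's `RoutePattern` could drive a testable priority mapping; the package, function, priority names, and route patterns below are hypothetical:

```go
// Hypothetical sketch only: classify a request's priority from its matched chi
// route pattern, so the mapping can be unit-tested without the full router.
// None of these names come from the PR.
package qos

import (
	"net/http"

	"github.com/go-chi/chi/v5"
)

type Priority int

const (
	PriorityLow Priority = iota
	PriorityDefault
	PriorityHigh
)

func RequestPriority(loggedIn bool, r *http.Request) Priority {
	if loggedIn {
		// Signed-in users are never deprioritized in this sketch.
		return PriorityHigh
	}
	rctx := chi.RouteContext(r.Context())
	if rctx == nil {
		return PriorityDefault
	}
	switch rctx.RoutePattern() {
	// Example patterns for expensive, crawler-attractive repo endpoints;
	// the real route patterns in Gitea may differ.
	case "/{username}/{reponame}/commits/*", "/{username}/{reponame}/blame/*", "/{username}/{reponame}/src/*":
		return PriorityLow
	default:
		return PriorityDefault
	}
}
```

Keeping the classification in a small pure function like this is what makes it straightforward to cover with table-driven tests, which seems to be the point of the RoutePattern suggestion.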
Just getting back to this now, as I was dealing with other priorities. I could incorporate the above changes, but the near-complete rewrite of my PR makes it difficult to isolate the changes you are asking for, and makes it incompatible with the production implementation we are running. This makes it difficult to test these changes against the live traffic we are seeing to ensure they still perform well.
This adds a middleware for overload protection that is intended to help protect against malicious scrapers. It does this via [`codel`](https://github.com/bohde/codel), which will perform the following:

1. Limit the number of in-flight requests to some user-defined max
2. When in-flight requests have reached their max, begin queuing requests, with logged-in requests having priority above logged-out requests
3. Once a request has been queued for too long, it has a percentage chance to be rejected based upon how overloaded the entire system is

When a server experiences more traffic than it can handle, this has the effect of keeping latency low for logged-in users, while rejecting just enough requests from logged-out users to keep the service from being overloaded.

Below are benchmarks showing a system at 100% capacity and 200% capacity in a few different configurations. The 200% capacity is shown to highlight an extreme case. I used [hey](https://github.com/rakyll/hey) to simulate the bot traffic:

```
hey -z 1m -c 10 "http://localhost:3000/rowan/demo/issues?state=open&type=all&labels=&milestone=0&project=0&assignee=0&poster=0&q=fix"
```

The concurrency of 10 was chosen from experiments where my local server began to experience higher latency.

Results

| Method | Concurrency | p95 latency | Successful RPS | Requests Dropped |
|--------|-------------|-------------|----------------|------------------|
| QoS Off | 10 | 0.2960s | 44 rps | 0% |
| QoS Off | 20 | 0.5667s | 44 rps | 0% |
| QoS On | 20 | 0.4409s | 48 rps | 10% |
| QoS On 50% Logged In* | 10 | 0.3891s | 33 rps | 7% |
| QoS On 50% Logged Out* | 10 | 2.0388s | 13 rps | 6% |

Logged-in users were given the additional parameter `-H "Cookie: i_like_gitea=<session>"`. Tests with `*` were run at the same time, representing a workload with mixed logged-in & logged-out users. Results are separated to show prioritization, and how logged-in users experience a 100ms latency increase under load, compared to the 1.6 seconds logged-out users see.
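To make the three steps above concrete, here is a heavily simplified, self-contained sketch of the admission behavior (limit in-flight requests, queue with logged-in priority, reject when queued too long). It is not the `codel` library and not the PR's code; all names, timeouts, and limits are invented for illustration:

```go
// Simplified stand-in for the described behavior; NOT the codel library and
// NOT the PR's code. All names, timeouts, and limits are invented.
package qosdemo

import (
	"context"
	"net/http"
	"time"
)

type Limiter struct {
	slots chan struct{} // capacity == maximum in-flight requests
}

func NewLimiter(maxInFlight int) *Limiter {
	return &Limiter{slots: make(chan struct{}, maxInFlight)}
}

// Middleware admits requests up to the in-flight limit; excess requests wait,
// with logged-in users allowed to wait longer before being rejected with 503.
// (codel itself rejects probabilistically based on system overload rather than
// using a fixed per-priority deadline like this sketch does.)
func (l *Limiter) Middleware(isLoggedIn func(*http.Request) bool) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			maxWait := 100 * time.Millisecond
			if isLoggedIn(r) {
				maxWait = 2 * time.Second
			}
			ctx, cancel := context.WithTimeout(r.Context(), maxWait)
			defer cancel()

			select {
			case l.slots <- struct{}{}:
				// Got an in-flight slot; release it when the handler finishes.
				defer func() { <-l.slots }()
				next.ServeHTTP(w, r)
			case <-ctx.Done():
				// Queued too long: shed the request instead of adding more load.
				http.Error(w, "server is overloaded, please retry later", http.StatusServiceUnavailable)
			}
		})
	}
}
```

In a chi-based router this would typically be registered with something like `r.Use(limiter.Middleware(isLoggedIn))`; the actual PR relies on codel's controlled-delay queue for the rejection decision.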
Branch force-pushed from 351be38 to 7cbcef3.
I applied review feedback, and synthesized it into something that is easily backportable to 1.23 by avoiding APIs added in
This allows key flows such as login and explore to still be prioritized well, with only the repo-specific endpoints likely to be throttled. In my experience, this catches the vast majority of what scrapers are targeting.
An early draft to try to split registerWebRoutes; it's too long now. Maybe this change will be useful for go-gitea#1872, go-gitea#33951 ... Signed-off-by: a1012112796 <[email protected]>
I haven't seen your proposed code, but I have prototyped a version of it by serving a redirect to the login page and testing the behavior in local benchmarks in #33951 (comment), and it is worse than the 503 method, because it now also needs to handle the login page request. These results are in line with my intuition of how it works in #33951 (comment). Ultimately, I can only ever see this introducing a regression in both performance and latency in a production environment, because it strictly adds work that a server needs to do. If I understand you, you think the added redirect will provide a nicer user experience, while the additional load of the login page will not meaningfully cause a regression. Because this algorithm only ever rejects traffic when the server is already overloaded, any load added at this point, even if it is small, makes the overload worse and causes more traffic to be rejected. This is a similar, but slightly different, failure mode to a retry storm.
I'd certainly look at these, but I don't think the goals of those are necessarily relevant to this PR. My goal in this PR is to keep friction low for anonymous users and crawlers that respect our robots.txt, which is why it only drops traffic if it needs to.
If you have no objection, I will propose one.
I think I have explained above: the test is just a test; production results speak.
That's what I suggested above: I will complete it and make it configurable so that site admins can choose the behavior they need. I think I could propose a "Proof-of-Work or rate-limit-token" solution, and you could try it in your production to see whether it is better.
If you're fine merging this one, I would look at a follow-up.
I would, however, need to see the proposed code before I could determine whether I could test it in a production environment. Either way, I'm happy to take a look.
Just like I said: I have no objection to this; maintainers who admin public sites could help, maybe: @techknowlogick @lunny
OK, I will propose one (maybe replacing this one while achieving the same or better result).
I would need to see it of course, but I don't think the goals described in #33966 (comment) are the same as this one's. I think they may be complementary, but in our case we have public repos that we want both indexed by search engines and accessible by users who are not logged in, and we want to keep friction as low as possible for those cases. I would prefer to merge this as is, and address any others in a follow-up.
-> Rate-limit anonymous accesses to expensive pages when the server is overloaded #34167. I think this one covers all the cases. (PS: benchmarks and local tests aren't the crawler's behavior, so I think production results speak, but I don't have a public instance.)
The algorithm there is clearly a derivative of the one I am proposing here, but it rejects more requests in practice since it does not enqueue requests. It also takes part of this PR and introduces it in a new one, while stripping my authorship. This is evident in details such as the default config setting, which is copy-pasted from this PR but really only makes sense in the context of this algorithm.

I've done a lot of legwork on this PR, including testing, benchmarking, and refining it on production crawler traffic. I'm always open to feedback and ideas on how to improve an implementation, but I have found this process very frustrating, from the initial dismissive response, to the ongoing bikeshedding, and now the rewrite while stripping my authorship.
I have explained many times: production speaks. You can't say that it is worse without testing it in a real production environment. The crawlers' behavior is different from your local testing.
I don't understand. If you are talking about "authorship", you copied my ... And by the way, "really only makes sense in the context of this algorithm" is not true; I have explained this in that PR's description. The "queue" work could be done by the browser and the crawler.
I also found it frustrating to fix various problems caused by various design problems (more than half of these 1400 PRs https://github.com/go-gitea/gitea/pulls/wxiaoguang are about "fixing this", "fixing that"). I have asked many times that "showing a blank 503 error page is not friendly to end users", but got no reasonable response. You just kept ignoring this fact. Again, for the "authorship", please show me how to respect the authorship for your copy of my ... Thank you very much.
By the way, that PR #34167 is only initial work, there are still some TODOs, and if it looks good, you could still integrate the ...
If that is the intent, then that PR should follow this one. This implements general overload protection, not just crawler mitigations. For example, if a logged-in user performed a crawl, overload protection would still kick in and keep latency down for all logged-in users. This PR aims to solve the more general problem of keeping the server from having an outage. This implementation does complement other crawler mitigation techniques, but they would likely benefit from extensions to this middleware.
Thank you both for your contributions! Regarding the 503 page: I believe it's better to show an internal custom 503 page than a blank 503 page to anonymous visitors when the site is overloaded. A blank 503 often gives the impression that it comes from a reverse proxy and that the site is completely down, which might discourage users from returning. As for authorship, when a pull request includes commits from multiple contributors, a co-author trailer is automatically added upon merging, so we don't need to worry about proper attribution in that case.
Thank you for the feedback! I have implemented a custom 503 page when the client requests HTML in 5078f67. In an effort to keep the response light in some cases, it falls back to a plain-text response if the client doesn't request HTML.
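For illustration, a minimal sketch of that kind of content negotiation (HTML error page for browsers, plain text otherwise); this is not the code from 5078f67, and the response bodies and header choices are placeholders:

```go
// Hypothetical sketch of the Accept-header negotiation described above; not
// the actual code from commit 5078f67.
package qosdemo

import (
	"net/http"
	"strings"
)

func writeOverloaded(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Retry-After", "30") // hint for well-behaved clients
	if strings.Contains(r.Header.Get("Accept"), "text/html") {
		// Browsers get a small, self-contained HTML page instead of a blank 503.
		w.Header().Set("Content-Type", "text/html; charset=utf-8")
		w.WriteHeader(http.StatusServiceUnavailable)
		_, _ = w.Write([]byte("<!doctype html><title>503</title><h1>Server is busy</h1><p>Please try again in a moment.</p>"))
		return
	}
	// Non-browser clients (crawlers, scripts) get a lightweight plain-text reply.
	http.Error(w, "server is overloaded, please retry later", http.StatusServiceUnavailable)
}
```

Keeping the HTML inline and small keeps the rejection path cheap, which matters because this code only runs when the server is already overloaded.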
I'm happy to take this to another issue, but my frustration is around new PRs that @wxiaoguang has opened, with rewritten commits from this PR, while not maintaining the commit history and thus my authorship. For example, I feel that #34167 is largely based on this code and has copy-pasted some parts of it, even if the goals are not the same. Additionally, #34024 seems to also have taken portions of this PR, rewritten them in a new commit, and removed my authorship. For example, the middleware in 3ae9466#diff-e12ddfac3974ba21f1c9c7389e69beccb1503a67f7e12803d7eb52b6fb133bfcR18 has a lot of similarities to the middleware in my original commit f3096c1#diff-ccb5640b0d6352850437df3f2f9b6416a667f4233d5c7b98d5ad2ad871a3fea1R58, including classifying similar endpoints as either "low priority" or "expensive" and checking for long polling in a similar manner. My commit predates @wxiaoguang's by 6 days, and they were aware of it since they reviewed it the same day I submitted my PR. Since #34024 does not reference my commit or this PR, I wasn't aware of its existence until just recently.
I have made a PR for you: bohde#1, but there was no response in 2 weeks, while users were waiting. TBH I have no interest in wasting time on inactive or stale PRs. At the least, in the past I have had to help improve some of your PRs or propose follow-up fixes for them. A PR with "Maintainers are allowed to edit this pull request" should be open for maintainers to edit. I don't understand why you are complaining this time.
That's not a rewrite; I wrote every line from scratch. Hopefully you can understand. So there is no commit history. Again, as I said, your PR (this PR) DIDN'T include any authorship of mine, EITHER. So why complain? You just added (force-pushed) some new commits later, after you complained.
And, one more thing, I just found this new commit of yours: