Fix broken link checking CI and broken decoupler pseudobulk link #236
Conversation
for more information, see https://pre-commit.ci
Nonono. What you’re doing is the coding equivalent of breaking into a house because you forgot your umbrella back when you were invited.
Before we deceptively pretend we’re a browser, how about trying this in polite ways?
- We should set a custom user agent. This is the most normal and basic thing every single HTTP bot should do!
- We could do a HEAD request instead of a GET. IDK if HTTPX downloads the full page when used like this, but doing a HEAD request should still work and signals that we don’t even want a lot of data.
- If all that fails, we should ask RTD nicely to be removed from their blocklist.
If all that fails and RTD can’t recommend any other recourse, then we can try this.
Thanks for the analogy; I initially just wanted to fix a broken link and may have gotten carried away. In my defense, though, I did try both 1 and 2 first and they were insufficient. So with 1 and 2 down, I guess we will have to try 3! Since I do not have any official role within scverse or an email address, perhaps you could reach out, @flying-sheep? The form is available at: https://app.readthedocs.org/support/ Alternatively, I could create an issue on https://github.com/readthedocs/readthedocs.org/issues if that's okay with you.
I just filed an issue. I think support is more for when your own hosted service gets taken down: readthedocs/readthedocs.org#12471
Yeah, I think it's just that scverse-tutorials doesn't get the love it deserves.
@maltekuehl why did you try to use a browser-like user agent? That’s already deceptive. No wonder this would be blocked! Check out my commits: you’re supposed to use a completely custom one: https://user-agents.net/bots It still got a 403, but probably because we ended up on some blocklist or so. /edit: I guess I know why. It’s insane: when you search the web for “httpx user agent”, the first two results for me are blogspam articles that blatantly tell you to “fake” user agents and “avoid detection” instead of doing the right thing. Makes me angry: only the third tutorial actually tells you the right thing: https://proxiesapi.com/articles/customizing-httpx-user-agents-for-effective-api-requests
I tried with a completely custom one, too, though it did not end up in the commit history. That was also locally from my IP, and GitHub IPs are not specific to this repository either, so we did not end up on a block list; it just never worked in any configuration. Since the 403 error is non-obvious (rather than a 429), the initial objective when trying out different fixes was to find out what was causing the problem. I am still not 100% sure the 403 is even related to the AI bot block, as in other configurations I was getting a 429 instead, which would be the more expected response.
That makes sense! Good point about the block list and GH IPs. Let’s wait for RTD to respond!
Automated Nova fix for scverse#236

**Original PR**: scverse#236
**Base**: `scverse/scverse-tutorials@main`
**Source SHA**: `5f0a3df5ed1e7da0f3cecd966247680e07eb7440`
**Nova Hint**: fix link checker CI; update broken links; target minimal 2-file fix
**Nova Mode**: local

This patch aims to be minimal and CI-verifiable.

🤖 Generated with [Nova CI-Rescue](https://github.com/anthropics/nova-ci-rescue)

Co-Authored-By: Nova <[email protected]>
It seems that RTD has blocked bot-like requests and introduced rate limits, which caused the original pipeline to fail. To be more respectful of RTD and fix this issue, I have adapted the CI to instantiate a Chromium browser for requests and to use a HEAD instead of a GET request, since a HEAD request does not download the page's body. I have also added a timeout.
The problem was another example of AI companies making the web worse for everyone: https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/
Additionally, I have fixed a broken decoupler link. The current link is broken; see https://scverse.org/learn/
One question I have: why was the failing CI not caught for four months? I see multiple people watching the repository. If it was caught and ignored for priority reasons, no problem; but if it went unnoticed, that might be something to fix.
@grst @Zethson