
Conversation

@maltekuehl commented Sep 11, 2025

It seems that RTD blocked bot-like requests and introduced rate limits, which caused the original pipeline to fail. To be more respectful of RTD and fix this issue, I have adapted the CI to instantiate a Chromium browser for requests and to use HEAD instead of GET requests, since a HEAD request does not download the page's body. I have also added a timeout.
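
For reference, here is a minimal sketch of roughly what the adapted check looks like, assuming Playwright is the tool driving Chromium; the function name and example URL are illustrative, not the actual CI code:

```python
# Minimal sketch, assuming Playwright drives the Chromium browser;
# names and the example URL are illustrative, not the actual CI code.
from playwright.sync_api import sync_playwright

def link_ok(url: str, timeout_ms: int = 10_000) -> bool:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        context = browser.new_context()
        # HEAD avoids downloading the page body; the timeout keeps the
        # checker from hanging on unresponsive hosts.
        response = context.request.head(url, timeout=timeout_ms)
        ok = response.ok
        browser.close()
        return ok

print(link_ok("https://decoupler-py.readthedocs.io/"))
```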

The problem was another example of AI companies making the web worse for everyone: https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/

Additionally, I have fixed a broken decoupler link; the current link on https://scverse.org/learn/ is broken.

One question I have is why the failing CI was not caught for 4 months. I see multiple people watching the repository. If it was caught and ignored because of priority, no problem; but if it was not caught, that might be something to fix.

@grst @Zethson

@maltekuehl changed the title from "Fix broken decoupler pseudobulk link" to "Fix broken link checking CI and broken decoupler pseudobulk link" on Sep 11, 2025
@flying-sheep (Member) left a comment

Nonono. What you’re doing is the coding equivalent of breaking into a house because you forgot your umbrella back when you were invited.

Before we deceptively pretend we’re a browser, how about trying a few polite approaches first?

  1. We should set a custom user agent. This is the most normal and basic thing every single HTTP bot should do!
  2. We could do a HEAD request instead of a GET. IDK if HTTPX downloads the full page when used like this, but doing a HEAD request should still work and signals that we don’t even want a lot of data.
  3. If all that fails, we should ask RTD nicely to be removed from their blocklist.

If all that fails and RTD can’t recommend any other recourse, then we can try this.
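
For points 1 and 2, a minimal sketch of what that could look like, assuming httpx; the bot name and contact URL are placeholders, not the project's actual values:

```python
import httpx

# Sketch of points 1 and 2: an honest, custom user agent plus a HEAD
# request. The bot name and contact URL are placeholders.
headers = {"User-Agent": "scverse-link-checker/1.0 (+https://scverse.org)"}
with httpx.Client(headers=headers, timeout=10.0, follow_redirects=True) as client:
    response = client.head("https://decoupler-py.readthedocs.io/")
    print(response.status_code)
```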

@maltekuehl (Author) commented Sep 11, 2025

Thanks for the analogy; initially I just wanted to fix a broken link and might have gotten carried away. In my defense, though, I did try both 1 and 2 first, and they were insufficient.

So with 1 and 2 down, I guess we will have to try 3! Since I do not have any official role within scverse and no scverse email address, perhaps you could reach out, @flying-sheep? The support form is available at: https://app.readthedocs.org/support/

Alternatively, I could create an issue on https://github.com/readthedocs/readthedocs.org/issues if okay with you.

@flying-sheep (Member)

I just filed an issue. I think support is more for when your own hosted service gets taken down: readthedocs/readthedocs.org#12471

@grst (Collaborator) commented Sep 12, 2025

> One question I have is why the failing CI was not caught for 4 months. I see multiple people watching the repository. If it was caught and ignored because of priority, no problem; but if it was not caught, that might be something to fix.

Yeah, I think it's just that scverse-tutorials doesn't get the love it deserves.

@flying-sheep (Member) commented Sep 12, 2025

@maltekuehl why did you try to use a browser-like user agent? That’s already deceptive. No wonder this would be blocked! Check out my commits: you’re supposed to use a completely custom one: https://user-agents.net/bots

It still got a 403, but probably because we ended up on some blocklist or something.

/edit: I guess I know why. It’s insane: when you search the web for "httpx user agent", the first two results for me are blogspam articles that blatantly tell you to "fake" user agents and "avoid detection" instead of doing the right thing. Makes me angry.

Only the third tutorial actually tells you the right thing: https://proxiesapi.com/articles/customizing-httpx-user-agents-for-effective-api-requests

@maltekuehl (Author)

I tried with a completely custom one, too, though it did not end up in the commit history. That was also locally from my own IP, and GitHub IPs are not specific to this repository, so we did not end up on a block list; it just never worked in any configuration. Since the 403 error is non-obvious (rather than a 429), the initial objective when trying out different fixes was to find out what was causing the problem. I am still not 100% sure the 403 is even related to the AI bot block, as in other configurations I was getting a 429 instead, which would be the more expected response.
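
For illustration, this is roughly how I was probing the difference between the two responses (a sketch, assuming httpx; the URL is illustrative):

```python
import httpx

# Hypothetical probe: 403 suggests a block (e.g. a user-agent or bot
# filter), 429 suggests rate limiting. The URL is illustrative.
resp = httpx.head("https://decoupler-py.readthedocs.io/", timeout=10.0)
if resp.status_code == 403:
    print("403 Forbidden: likely blocked")
elif resp.status_code == 429:
    print("429 Too Many Requests: rate limited; Retry-After =",
          resp.headers.get("Retry-After"))
else:
    print("Status:", resp.status_code)
```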

@flying-sheep (Member)

That makes sense! Good point about the block list and GH IPs. Let’s wait for RTD to respond!

seabass011 pushed a commit to seabass011/scverse-tutorials that referenced this pull request Sep 13, 2025
Automated Nova fix for scverse#236

**Original PR**: scverse#236
**Base**: `scverse/scverse-tutorials@main`
**Source SHA**: `5f0a3df5ed1e7da0f3cecd966247680e07eb7440`
**Nova Hint**: fix link checker CI; update broken links; target minimal 2-file fix
**Nova Mode**: local

This patch aims to be minimal and CI-verifiable.

🤖 Generated with [Nova CI-Rescue](https://github.com/anthropics/nova-ci-rescue)

Co-Authored-By: Nova <[email protected]>