Run BrowserTemplate fallback browser in headless mode #2967
Merged
When the HTTP scraper raises a ScraperErrorGroup, BrowserTemplate falls back to a real Chrome session. It was constructed without a headless flag, so it defaulted to a visible window. There's no codepath in CLI or server runs that actually interacts with that window, and on Wayland/Xwayland sessions the subprocess often can't authenticate to the X server, which kills the fallback entirely.
Collaborator
Soon the entire browser will be replaced by https://github.com/ultrafunkamsterdam/nodriver
Hit this while running a long `--all` crawl of `readnovelfull.com/godly-stay-home-dad-v1.html`. Around chapter 1017 the HTTP scraper started failing (almost certainly Cloudflare kicking in after ~40 minutes of sustained requests). `BrowserTemplate._override_scraper_get_soup` caught the `ScraperErrorGroup` and tried to fall back to Selenium-driven Chrome, at which point everything wedged.

Cause: `BrowserTemplate.browser` constructs the fallback as `Browser(cookie_store=self.scraper.cookies)` with no `headless` flag, so it defaults to a visible window. My session is GNOME on Wayland with Xwayland (`XAUTHORITY=/run/user/1000/.mutter-Xwaylandauth.XXXXX`); the Chrome subprocess Selenium spawns can't authenticate to the X server, prints `Authorization required, but no authorization protocol specified`, and the driver dies with `NoSuchDriverException`. The fallback holds `self._lock` (an `EventLock`) while Chrome retries with a 120s page-load timeout, so concurrent worker threads pile up and trip `TimeoutError('Failed to acquire semaphore')` from `TaskManager`. Throughput collapsed from 2.3 s/chapter to 122 s/chapter and the run was effectively dead.

I couldn't find anywhere in the codebase that actually needs a visible window for the fallback: `_override_scraper_get_soup`, `_override_scraper_get_image`, and `_override_scraper_get_json` all just want the resulting HTML / screenshot / JSON. Forcing `headless=True` here side-steps the X11 dance and matches what already happens automatically on machines with no display (`webdriver/local.py:42-43` flips headless on when `Platform.has_display` is false, but on Wayland that check returns true via tkinter even when Chrome itself can't reach the server).

One-line fix: pass `headless=True` when constructing the fallback `Browser`.
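The diff itself is not reproduced in this description, so here is a minimal sketch of what the change presumably amounts to. Only the `Browser(cookie_store=self.scraper.cookies)` call and the `headless=True` argument come from the text above. The `Browser`, `Scraper`, and `BrowserTemplate` classes below are hypothetical stand-ins, and the `headless=False` default and lazy `browser` property are assumptions for illustration, not the project's actual code.

```python
# Hypothetical stand-ins: the real classes live in the project and do far
# more. Only the constructor call in `browser` mirrors the PR description.


class Browser:
    def __init__(self, cookie_store=None, headless=False):
        # Assumption: without an explicit flag the browser is not headless.
        self.cookie_store = cookie_store
        self.headless = headless


class Scraper:
    def __init__(self):
        self.cookies = {}


class BrowserTemplate:
    def __init__(self):
        self.scraper = Scraper()
        self._browser = None

    @property
    def browser(self):
        # Build the fallback browser lazily, on first use.
        if self._browser is None:
            # Before: Browser(cookie_store=self.scraper.cookies)
            # After: force headless so no reachable display is required.
            self._browser = Browser(
                cookie_store=self.scraper.cookies,
                headless=True,
            )
        return self._browser
```

With this shape, the fallback never depends on whether the Chrome subprocess can reach an X server, which is the failure described above.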
Test plan
- Run an `--all` crawl long enough to trigger the fallback. Confirm Chrome starts (headless) and the run continues instead of dying with `NoSuchDriverException` + semaphore timeouts.