Run BrowserTemplate fallback browser in headless mode #2967

Merged
dipu-bd merged 1 commit into lncrawl:dev from apetros:fix/headless-browser-fallback on May 13, 2026

Contributor

apetros commented May 13, 2026

Hit this while running a long --all crawl of readnovelfull.com/godly-stay-home-dad-v1.html. Around chapter 1017 the HTTP scraper started failing (almost certainly Cloudflare kicking in after ~40 minutes of sustained requests). BrowserTemplate._override_scraper_get_soup caught the ScraperErrorGroup and tried to fall back to Selenium-driven Chrome — at which point everything wedged.

Cause: BrowserTemplate.browser constructs the fallback as Browser(cookie_store=self.scraper.cookies) with no headless flag, so it defaults to a visible window. My session is GNOME on Wayland with Xwayland (XAUTHORITY=/run/user/1000/.mutter-Xwaylandauth.XXXXX); the Chrome subprocess Selenium spawns can't authenticate to the X server, prints "Authorization required, but no authorization protocol specified", and the driver dies with NoSuchDriverException. The fallback holds self._lock (an EventLock) while Chrome retries with a 120s page-load timeout, so concurrent worker threads pile up behind it and trip TimeoutError('Failed to acquire semaphore') from TaskManager. Per-chapter time went from 2.3 s to 122 s and the run was effectively dead.
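
For anyone who wants to see the pile-up in isolation, here is a minimal sketch of the failure mode using only the standard library. It is not the project's code: threading.Lock stands in for the EventLock, a plain Semaphore stands in for whatever TaskManager uses internally, and the function name and parameters are made up for illustration.

```python
import threading


def simulate_wedged_fallback(pool_size=2, submit_timeout=0.1):
    """Return what a new task sees while a stuck fallback holds the lock."""
    fallback_lock = threading.Lock()               # stand-in for self._lock
    worker_slots = threading.Semaphore(pool_size)  # stand-in for the task pool
    lock_held = threading.Event()
    release = threading.Event()
    slot_taken = threading.Semaphore(0)

    def stuck_fallback():
        # Simulates a browser launch that keeps retrying while holding the lock.
        with fallback_lock:
            lock_held.set()
            release.wait()

    def worker():
        # A worker keeps its pool slot for as long as it waits on the lock.
        with worker_slots:
            slot_taken.release()
            with fallback_lock:
                pass

    holder = threading.Thread(target=stuck_fallback)
    holder.start()
    lock_held.wait()

    workers = [threading.Thread(target=worker) for _ in range(pool_size)]
    for t in workers:
        t.start()
    for _ in workers:
        slot_taken.acquire()  # wait until every slot is occupied

    # A new task cannot get a slot before its timeout expires.
    if worker_slots.acquire(timeout=submit_timeout):
        worker_slots.release()
        outcome = "acquired"
    else:
        outcome = "Failed to acquire semaphore"

    release.set()  # let the fallback finish so all threads can exit
    holder.join()
    for t in workers:
        t.join()
    return outcome
```

Calling simulate_wedged_fallback() reports the semaphore failure even though no worker has crashed: every slot is held by a thread that is simply waiting for the lock.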

I couldn't find anywhere in the codebase that actually needs a visible window for the fallback — _override_scraper_get_soup, _override_scraper_get_image, and _override_scraper_get_json all just want the resulting HTML / screenshot / JSON. Forcing headless=True here side-steps the X11 dance and matches what already happens automatically on machines with no display (webdriver/local.py:42-43 flips headless on when Platform.has_display is false — but on Wayland that check returns true via tkinter even when Chrome itself can't reach the server).
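
As a rough illustration of why "a display exists" and "Chrome can reach it" are different questions, here is a sketch of a more conservative check based on session environment variables. This is an alternative I am assuming for discussion, not what webdriver/local.py does; should_run_headless is a hypothetical helper, and treating every Wayland session as headless is a deliberate simplification.

```python
import os


def should_run_headless(env=None):
    """Guess whether a spawned browser should run headless (hypothetical helper)."""
    env = os.environ if env is None else env
    has_x_display = bool(env.get("DISPLAY"))
    on_wayland = (
        env.get("XDG_SESSION_TYPE") == "wayland"
        or bool(env.get("WAYLAND_DISPLAY"))
    )
    # No X display at all: a visible window is impossible.
    # Wayland session: X11 clients go through Xwayland, where authorization
    # may fail for subprocesses, so this sketch assumes headless is safer.
    return on_wayland or not has_x_display
```

Passing a dict instead of reading os.environ keeps the check easy to exercise, for example should_run_headless({"DISPLAY": ":0", "XDG_SESSION_TYPE": "x11"}) for a plain X11 session.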

One-line fix:

browser = Browser(cookie_store=self.scraper.cookies, headless=True)
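
To show the change in context, here is a self-contained sketch of the pattern with stand-in classes. Browser, Scraper, and Template below are simplified placeholders rather than the project's real classes, and the lazy caching in the property is an assumption about how the fallback is held.

```python
class Browser:
    """Stand-in for the project's Browser; it only records its arguments."""

    def __init__(self, cookie_store=None, headless=False):
        self.cookie_store = cookie_store
        self.headless = headless


class Scraper:
    """Stand-in HTTP scraper that owns the cookie store."""

    def __init__(self):
        self.cookies = {}


class Template:
    """Stand-in for BrowserTemplate with a lazily created fallback browser."""

    def __init__(self):
        self.scraper = Scraper()
        self._browser = None

    @property
    def browser(self):
        # Build the fallback once, sharing cookies with the HTTP scraper
        # and always requesting headless mode.
        if self._browser is None:
            self._browser = Browser(
                cookie_store=self.scraper.cookies,
                headless=True,
            )
        return self._browser
```

With this shape, Template().browser.headless is True regardless of what the display check reports, which is the behavior the one-line change is after.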

Test plan

  • On a Linux Wayland desktop, re-run a --all crawl long enough to trigger the fallback. Confirm Chrome starts (headless) and the run continues instead of dying with NoSuchDriverException + semaphore timeouts.
  • Sanity-check that nothing else (sources that explicitly drive the browser) is broken by the forced headless mode.

When the HTTP scraper raises a ScraperErrorGroup, BrowserTemplate falls
back to a real Chrome session. It was constructed without a headless
flag, so it defaulted to a visible window. There's no codepath in CLI or
server runs that actually interacts with that window, and on
Wayland/Xwayland sessions the subprocess often can't authenticate to
the X server, which kills the fallback entirely.

dipu-bd merged commit a92013b into lncrawl:dev on May 13, 2026
6 checks passed
Collaborator

dipu-bd commented May 13, 2026

soon the entire browser will be replaced by https://github.com/ultrafunkamsterdam/nodriver
