Practical guide for quickly fixing broken scraping when a source site changes.
Primary job:
- Make small, surgical tweaks (usually CSS selectors or minor parsing logic) so headlines and links for a given site work again.
- Do not redesign architecture; use the existing tools and patterns.
Backend fetch choice:
- Whether a site uses plain HTTP, Selenium, or (future) Playwright is decided centrally (e.g. in `*_report_settings.py` and shared fetch helpers).
- For now, assume Selenium is used where a headless browser is required; you generally do not need to change how that is selected.
- Focus on: given the HTML/DOM we already fetch, extract the right items via minimal, local changes.
Audience:
- Humans and LLM agents maintaining feed scraping.
- Focused, step-by-step, and aligned with existing tooling:
  - `site_debugger.py`
  - `test_site_debug.py`
  - `custom_site_handlers.py`
  - `seleniumfetch.py`
  - `*_report_settings.py`
Use this checklist when a site shows missing or empty items:
- Confirm only one or a few sites are broken:
  - If all feeds are empty → check network, config, or global logic.
  - If only specific sites are empty or weird → likely a per-site scraping issue.
- Check logs:
  - Look in:
    - `linuxreport.log`
    - Any relevant service logs (systemd, web server)
  - Search for:
    - HTTP errors (403/404/500)
    - Timeouts
    - Parsing errors
    - Messages from `seleniumfetch.py` / `playwrightfetch.py` about empty content or failed selectors.
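A quick first pass over the log can be done with `grep`; the log content and domain below are made up for illustration (in practice you would search the real `linuxreport.log`):

```shell
# Write a tiny fake log so the search pattern has something to match.
printf 'INFO fetched site-a ok\nERROR no entries found for example.com\n' > sample.log
# Line-numbered search for the broken domain and common failure phrases.
grep -nE 'example\.com|no entries found|failed to extract|HTTP (403|404|500)' sample.log
```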
If logs confirm "no entries found" or DOM mismatch for a specific site, continue below.
`site_debugger.py` is your primary tool for seeing what LinuxReport actually receives from a site.
Key guarantees:
- The first debug run (via Selenium or requests) will:
- Fetch the page once.
- Save the full HTML snapshot (and optional JS/console/report) to disk.
- All subsequent selector/parsing experiments for that snapshot MUST:
- Use the saved HTML file only.
- Avoid additional network calls for the same investigation.
This is enforced via:
- `DebugConfig()` + `SiteDebugger.debug_requests()` / `SiteDebugger.debug_selenium()` for the initial fetch.
- `SiteDebugger.debug_from_html()` for all follow-up analysis using the saved HTML.
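The fetch-once guarantee can be pictured with a minimal sketch; `load_or_fetch` and the snapshot filename are illustrative stand-ins, not the real `SiteDebugger` API:

```python
# Sketch of the "fetch once, then reuse the snapshot" rule that
# site_debugger.py enforces. All names here are illustrative.
import tempfile
from pathlib import Path

def load_or_fetch(path: Path, fetch):
    """Read a saved snapshot if present; hit the network only on a miss."""
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch()                      # the single allowed network call
    path.write_text(html, encoding="utf-8")
    return html

calls = []
def fake_fetch():
    calls.append(1)
    return "<html>snapshot</html>"

with tempfile.TemporaryDirectory() as tmp:
    snap = Path(tmp) / "example_debug.html"
    first = load_or_fetch(snap, fake_fetch)   # fetches and saves
    second = load_or_fetch(snap, fake_fetch)  # reuses the local file
print(len(calls), first == second)  # → 1 True
```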
Typical workflow (examples; adjust arguments to your env):
- Run the debugger against a problematic URL (one network call):
  - Use `SiteDebugger.debug_requests()` or `SiteDebugger.debug_selenium()` to:
    - Fetch the page similarly to production.
    - Save `{site}_debug_*.html` plus related artifacts.
  - Optionally use the same user agent, headers, and Tor/Selenium/Playwright paths.
- Inspect outputs:
  - Look for:
    - HTTP status.
    - Final URL after redirects.
    - Snippets of HTML around the expected headline/link elements.
    - Whether content is loaded server-side or via JavaScript only.
- Iterate on selectors locally, with no further network calls:
  - Point `debug_from_html()` at the saved HTML file.
  - Adjust CSS selectors / parsing logic until results look correct.
  - This loop never re-fetches the live site.
- Decide:
  - If the HTML contains the items in clear tags, this is a selector/parsing issue.
  - If the HTML is mostly empty but a browser shows content:
    - The site is JS-heavy → it should use `seleniumfetch.py` or `playwrightfetch.py`.
  - If you see bot-blocking or captcha pages:
    - Consider Tor, different user agents, or backing off that source.
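Step 3 above (iterating against the saved snapshot) amounts to re-parsing a local file until the extraction looks right. A stdlib-only sketch, with a hypothetical container class and snapshot string (production code would use `debug_from_html()` and CSS selectors on the fetched DOM):

```python
# Stdlib-only sketch of iterating on extraction logic against a saved
# snapshot; class names, markup, and the snapshot string are hypothetical.
from html.parser import HTMLParser

class HeadlineParser(HTMLParser):
    """Collect (title, href) pairs from links inside 'post' containers."""
    def __init__(self):
        super().__init__()
        self.depth = 0       # nesting level inside matching containers
        self.href = None     # href of the <a> currently being read
        self.items = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and "post" in attrs.get("class", ""):
            self.depth += 1
        elif tag == "a" and self.depth:
            self.href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
        elif tag == "a":
            self.href = None

    def handle_data(self, data):
        if self.href and data.strip():
            self.items.append((data.strip(), self.href))

snapshot = '<div class="post"><a href="/a1">Kernel 6.9 released</a></div>'
parser = HeadlineParser()
parser.feed(snapshot)
print(parser.items)  # → [('Kernel 6.9 released', '/a1')]
```

Each tweak to the selector logic re-runs against the same saved string or file, so the live site is never touched during iteration.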
Use `test_site_debug.py` to:
- Validate that `site_debugger.py` semantics stay correct.
- Confirm that once an HTML snapshot exists, re-runs reuse the local file instead of performing new fetches (see the VentureBeat example).
- Provide regression protection when modifying debugging utilities.
Most breakages are due to minor HTML structure changes.
Key places:
- `custom_site_handlers.py`:
  - Site-specific parsing logic and normalizers.
  - Add/update handlers here rather than hacking generic code.
- `image_parser.py` and `image_utils.py`:
  - Responsible for image candidate extraction and scoring.
  - Adjust only if image selection is broken across sites or for a class of layouts.
General approach:
- Use `site_debugger.py` to capture the current HTML.
- Identify the new CSS selectors / DOM patterns for:
  - Article containers
  - Title/URL
  - (Optional) summary, date, image
- Implement or adjust a handler in `custom_site_handlers.py`:
  - Keep it narrowly scoped to that domain.
  - Reuse shared helpers where possible.
- Update the relevant `*_report_settings.py` if:
  - Feed endpoints changed.
  - The site switched from RSS to HTML-only or vice versa.
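A handler built along these lines might look like the following sketch; the registry decorator, the domain, and the regex-based extraction are all hypothetical stand-ins (real handlers in `custom_site_handlers.py` may be structured differently and would normally parse the DOM rather than regex raw HTML):

```python
# Illustrative handler-registration pattern, narrowly scoped to one domain,
# emitting feedparser-shaped entries. All names here are hypothetical.
import re

HANDLERS = {}

def handler(domain):
    """Register a parse function for exactly one domain."""
    def wrap(fn):
        HANDLERS[domain] = fn
        return fn
    return wrap

@handler("example.com")
def parse_example(html):
    """Emit entries with at minimum a title and a link."""
    entries = []
    for m in re.finditer(r'<a class="headline" href="([^"]+)">([^<]+)</a>', html):
        entries.append({"title": m.group(2), "link": m.group(1)})
    return entries

html = '<a class="headline" href="https://example.com/x">Big news</a>'
print(HANDLERS["example.com"](html))
```

Keeping the per-domain logic behind a registry like this means generic code stays untouched when one site's markup changes.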
Then:
- Run targeted tests:
  - `pytest tests/test_extract_titles.py`
  - `pytest tests/test_dedup.py` (if you changed how items are normalized)
Most fixes do not involve changing how sites are fetched.
Key points:
- The decision to use Selenium vs plain HTTP is made centrally (e.g. in `*_report_settings.py` and shared fetch utilities).
- For broken sites, assume:
  - The correct fetch path (including Selenium when needed) is already chosen.
  - Your job is to adjust how we parse the HTML/DOM returned by that path, usually in `custom_site_handlers.py`.
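Conceptually, the central config maps each site to a fetch path, so parsing code never chooses a browser itself. The dictionary shape below is a hypothetical illustration, not the actual `*_report_settings.py` format:

```python
# Hypothetical per-site fetch configuration; the real settings layout may
# differ, but the idea is the same: config, not parsing code, decides
# whether a headless browser is used.
SITES = {
    "example.com": {"fetch": "selenium", "url": "https://example.com/news"},
    "plain.example": {"fetch": "http", "url": "https://plain.example/feed"},
}

def needs_browser(domain: str) -> bool:
    """Parsing code can ask the config, but never picks a driver itself."""
    return SITES[domain]["fetch"] == "selenium"

print(needs_browser("example.com"), needs_browser("plain.example"))  # → True False
```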
Only consider fetch-path changes if:
- Logs + `site_debugger.py` clearly show we are consistently hitting bot-block pages, CAPTCHAs, or empty shells that cannot be parsed.
- In that rare case:
  - Update the relevant `*_report_settings.py` and/or shared fetch config following existing patterns.
  - Keep logic centralized; do not add ad-hoc Selenium calls scattered around.
  - Run `pytest tests/test_browser_switch.py` and `pytest tests/selenium_test.py` if you changed fetch-selection logic.
Your pipelines already log hints when scraping returns no entries.
When a site breaks:
- Search logs for:
  - That site’s domain
  - Messages like:
    - "no entries found"
    - "failed to extract"
    - Timeouts or HTTP status codes
- Map log messages back to:
  - `custom_site_handlers.py`
  - `seleniumfetch.py`
  - `playwrightfetch.py`
  - The relevant `*_report_settings.py` entry
Then:
- Adjust the appropriate handler or configuration.
- Re-run:
  - `site_debugger.py` for that site
  - Targeted tests as in the previous sections
Follow these constraints to keep the system robust:
- Keep site-specific logic isolated:
  - Prefer functions/blocks in `custom_site_handlers.py` keyed by domain or pattern.
  - Avoid hardcoding site-specific behavior deep inside generic code.
- Maintain a consistent data shape:
  - Handlers should output entries that look like existing feedparser-based entries.
  - This ensures deduplication, templates, and downstream logic keep working.
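For example, an entry emitted by a handler should carry at least the fields downstream code reads from feedparser-style entries; the exact key set below is an assumption modeled on common feedparser fields, so check existing handlers for the authoritative shape:

```python
# Hypothetical minimal entry shape, modeled on common feedparser fields.
entry = {
    "title": "Kernel 6.9 released",           # needed by templates/dedup
    "link": "https://example.com/kernel-69",  # needed by templates/dedup
    "summary": "Optional short blurb",
    "published": "2024-05-12T00:00:00Z",
}
required = {"title", "link"}
print(required <= entry.keys())  # → True
```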
- Let central config drive Selenium usage:
  - Assume the Selenium vs HTTP choice is already correct unless logs/debugging prove otherwise.
  - Reuse existing Selenium helpers; do not implement your own driver management.
- Respect caching:
  - If you change scraping logic:
    - Consider whether caches should be invalidated or keys adjusted.
    - Avoid designs that cause cache stampedes or bypass caches.
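One common way to honor the caching constraint is to version the cache key, so changed parsing logic misses old entries instead of serving stale results; the key scheme below is a hypothetical sketch, not the project's actual cache layer:

```python
# Hypothetical versioned cache key: bumping PARSER_VERSION invalidates
# previously cached parses without touching the cache backend itself.
import hashlib

PARSER_VERSION = 2  # bump when parsing logic changes

def cache_key(url: str) -> str:
    """Derive a stable key that changes when the parser version does."""
    raw = f"{url}|v{PARSER_VERSION}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]

old = hashlib.sha256(b"https://example.com|v1").hexdigest()[:16]
print(cache_key("https://example.com") != old)  # → True
```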
- Validate with tests:
  - Always run `pytest tests/test_extract_titles.py`.
  - Also run any specialized tests matching the components you modified.
When a site breaks:
- Confirm it’s isolated:
  - Compare with other sites; inspect `linuxreport.log`.
- Debug the fetch:
  - Run `site_debugger.py` for that URL.
  - Inspect the HTML/DOM and HTTP status.
- Fix the logic:
  - Update or add a handler in `custom_site_handlers.py`.
  - If the site is JS-heavy, configure Selenium/Playwright via `*_report_settings.py` and existing fetch helpers.
- Re-check locally:
  - Run `site_debugger.py` again.
  - Confirm expected entries appear.
- Run targeted tests:
  - `pytest tests/test_extract_titles.py`
  - `pytest tests/test_dedup.py` (if normalization changed)
  - Selenium/Playwright tests if those paths were touched.
- Deploy and monitor:
  - Watch logs for the domain.
  - Verify entries appear on the live site.
This document is intentionally scoped and practical. From `agents.md`, link to `Scraping.md` when scraping-specific work is needed, and use this guide as the canonical playbook for fixing broken sites.