strigil

Strigil scrapes PDFs, text, and images from websites at high quality and stores them locally.

Repository: strigil · Author: Seth Strickland · License: MIT

License

MIT License. Copyright (c) 2025 Seth Strickland. See LICENSE.

Install and run

From the project directory:

pip install -e .

This installs the package in editable mode and registers the strigil and strigil-gui console scripts. You can then run:

strigil --url https://example.com/page [URL2 ...] [--out-dir output] [--delay 1] [--crawl] [--max-depth 2] [--same-domain-only]

Archival mode (default): Images are filtered for archival quality: a default --min-image-size of 100k, GIFs skipped, thumbnail/full-size pairs deduplicated, and banners, logos, and social icons skipped. Use --all-images to include all images, thumbnails included.

Filter by size: --min-image-size 50k and/or --max-image-size 5m (suffixes k/m for KB/MB). Use --all-images to disable the default 100k minimum.
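For reference, the k/m suffixes stand for kilobytes and megabytes. A minimal sketch of how such a value could be parsed (a hypothetical helper assuming 1024-byte units, not strigil's actual code):

```python
def parse_size(value: str) -> int:
    """Parse a size like '50k' or '5m' into bytes (hypothetical; assumes 1024-based units)."""
    value = value.strip().lower()
    units = {"k": 1024, "m": 1024 * 1024}
    if value and value[-1] in units:
        return int(float(value[:-1]) * units[value[-1]])
    return int(value)

print(parse_size("100k"))  # 102400 under the 1024-byte assumption
```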

Or open the simple GUI:

strigil-gui

Use --js for JS-heavy pages (e.g. NYPL Digital Collections). Playwright is installed automatically with the package; on first --js use, the Chromium browser is downloaded if needed.

Supported archival sources: NYPL Digital Collections, CONTENTdm, IIIF manifests, HathiTrust (babel.hathitrust.org), EEBO (ProQuest), ECCO (Gale), Internet Archive (full IIIF + metadata API fallback), Wellcome Collection (Catalogue API → IIIF manifest), Stanford PURL, Digital Bodleian, Library of Congress.

Discovery hints: Use --expected-images N when a page should have ~N images (triggers fallbacks if fewer found). Use --source ADAPTER to force a specific adapter (e.g. wellcome, archive_org) when auto-detection fails.

Optional: install tqdm for a progress bar (per-page in crawl, per-asset on single page):

pip install -e ".[progress]"

Use --no-progress to disable the bar (e.g. in scripts).

Use --keep-awake to prevent system/display sleep during long scrapes. On Linux this requires the systemd-inhibit binary (usually provided by your distro's systemd package, e.g. sudo apt install systemd). If --keep-awake is used on Linux and the binary is missing, the app prints a hint on how to install it.

If dependencies are missing

When you run strigil or strigil-gui, the app auto-installs missing dependencies on first run. Required deps (httpx, beautifulsoup4, lxml) are installed, then the app exits; run the command again. Optional deps (Playwright, tqdm, readability-lxml) are installed and the app continues. Set STRIGIL_AUTO_INSTALL_DEPS=0 to disable auto-install.

You can also install manually:

  • From PyPI: pip install strigil
  • From source: pip install -e . (in the project directory)
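The STRIGIL_AUTO_INSTALL_DEPS gate mentioned above might be checked along these lines (a sketch, not the actual implementation):

```python
import os

def auto_install_enabled() -> bool:
    # STRIGIL_AUTO_INSTALL_DEPS=0 disables auto-install; unset or any other value enables it
    return os.environ.get("STRIGIL_AUTO_INSTALL_DEPS", "1") != "0"
```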

If Playwright is not installed, an optional hint is shown for JS rendering (--js).

Workers and hardware autodetect

For crawl mode, the scraper auto-detects CPU count and caps parallel workers (default: up to 12) for faster scraping. Override with --workers N. To see detected hardware (CPU, memory if available, suggested workers), run strigil --hardware.

Faster crawl and scrape: Use --aggressiveness aggressive (or --workers 12 --delay 0.15) for maximum speed. More workers = more pages in parallel; lower delay = less wait between requests. Per-page asset downloads and image HEADs also run with higher parallelism (up to 8 assets, 6 HEADs).
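Bounding per-page parallelism as described (e.g. up to 8 concurrent asset downloads) is typically done with a semaphore; a minimal sketch with a stub download coroutine standing in for the real fetcher:

```python
import asyncio

async def download(url: str) -> str:
    await asyncio.sleep(0)  # stand-in for the real HTTP fetch
    return url

async def fetch_assets(urls: list[str], limit: int = 8) -> list[str]:
    sem = asyncio.Semaphore(limit)  # at most `limit` downloads in flight
    async def guarded(url: str) -> str:
        async with sem:
            return await download(url)
    return await asyncio.gather(*(guarded(u) for u in urls))

results = asyncio.run(fetch_assets([f"asset{i}" for i in range(3)]))
print(results)  # ['asset0', 'asset1', 'asset2']
```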

Aggressiveness (auto from hardware and power): Use --aggressiveness auto (default) to let the scraper pick conservative, balanced, or aggressive. On AC power with strong hardware it suggests aggressive; on battery it throttles by battery %: below 20% it is always conservative, while at 50% or above it may allow balanced if hardware allows. Run strigil --hardware to see power state, battery %, and the suggested preset.

strigil --url https://example.com --crawl --workers 6 --delay 0.3
strigil --url https://example.com --crawl --aggressiveness aggressive

Cloudflare and bot protection

Some sites use Cloudflare, DDoS-GUARD, or similar protection and may return 403, challenge pages, or block non-browser traffic. The built-in --js mode (Playwright) can help on JS-heavy or bot-detecting sites, but it does not solve Cloudflare challenges.

Recommended: FlareSolverr — The best way to handle Cloudflare is FlareSolverr. It solves challenges in a dedicated headless browser and returns cleared HTML; no manual step and no headed browser required.

  1. Run FlareSolverr (e.g. Docker):
    • Docker Compose (recommended): docker compose up -d flaresolverr — uses pinned version from docker-compose.yml; run docker compose pull flaresolverr && docker compose up -d flaresolverr to update. Renovate will open PRs when new FlareSolverr versions are released.
    • Plain Docker: docker run -d -p 8191:8191 ghcr.io/flaresolverr/flaresolverr:latest
  2. Enable it when scraping:
    • Auto-engage: If FlareSolverr is running at http://localhost:8191 (or FLARESOLVERR_URL), the scraper automatically retries via FlareSolverr when it detects Cloudflare (403 or challenge page). No flag needed.
    • CLI (explicit): strigil --url https://... --flaresolverr (uses http://localhost:8191 or FLARESOLVERR_URL)
    • CLI with custom URL: strigil --url https://... --flaresolverr http://host:8191
    • Env: set FLARESOLVERR_URL=http://localhost:8191 and pass --flaresolverr
    • GUI: check “FlareSolverr (Cloudflare bypass)” and optionally set the FlareSolverr API URL.
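Under the hood, a FlareSolverr request is a JSON POST to its /v1 endpoint; a sketch of the request body (the network call itself is omitted):

```python
import json
import os

def flaresolverr_payload(url: str, max_timeout_ms: int = 60000) -> dict:
    # FlareSolverr v1 API body; the cleared HTML comes back in solution.response
    return {"cmd": "request.get", "url": url, "maxTimeout": max_timeout_ms}

endpoint = os.environ.get("FLARESOLVERR_URL", "http://localhost:8191") + "/v1"
body = json.dumps(flaresolverr_payload("https://example.com/page"))
```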

When FlareSolverr is used (auto or explicit), page HTML is requested through it; asset downloads (images, PDFs) still use the normal fetcher. With crawl, the scraper uses one worker when FlareSolverr is enabled (same as with --js).

Alternative: human bypass — If FlareSolverr is not available or a site still blocks it, use --human-bypass to solve the challenge yourself in a visible browser (see Cloudflare: human bypass below).

Rate limits and automatic throttling

When the scraper receives HTTP 429 (Too Many Requests) or a 200 response whose body indicates a rate limit (e.g. “Rate limit reached”, “too many requests”), it:

  • Waits before retrying: uses the Retry-After header when present (seconds or HTTP-date), otherwise 30–60s.
  • Throttles subsequent requests: a per-Fetcher delay is applied so all following requests (HTML, images, HEADs) are slowed until the backend recovers. The delay decays gradually after successful responses.

502/503/504 (Bad Gateway, Service Unavailable, Gateway Timeout) are retried up to 6 times with a 5s base wait so flaky upstream servers (e.g. IIIF image servers) often succeed on retry.

Failed assets: If particular images or PDFs time out or fail after retries, the scraper records them and runs a retry pass after the main download (longer timeout, sequential). Use --no-retry-failed to skip this pass, or --retry-timeout 120 to set the retry timeout in seconds (default 90). Still-failed URLs are written to output/<domain>/failed_urls.txt and file names to errata. Retry later with --retry-from output/<domain>/failed_urls.txt.

Sites like Archive-It that return a rate-limit message in the HTML body are handled the same way: wait, retry, and throttle.

Iterations and auto timeout (single-page)

On 403 or slow responses, the scraper retries automatically:

  • Iterations: Single-page runs retry up to --max-iterations (default 3). Each iteration uses a longer delay and timeout; if the first attempt gets 403, the next iteration automatically uses the browser (--js) when Playwright is installed. For sites behind Cloudflare or similar protection, see Cloudflare and bot protection above.
  • Auto timeout: Per-request timeout scales with each retry (30s → 60s → 120s, capped at 120s). The base timeout can be overridden in code if needed.

strigil --url https://strict.site/page --max-iterations 5
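The timeout schedule above amounts to doubling a 30s base and capping at 120s; as a sketch (hypothetical helper):

```python
def request_timeout(attempt: int, base: float = 30.0, cap: float = 120.0) -> float:
    # attempt 0 -> 30s, 1 -> 60s, 2+ -> 120s (capped)
    return min(base * (2 ** attempt), cap)

print([request_timeout(i) for i in range(4)])  # [30.0, 60.0, 120.0, 120.0]
```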

Cloudflare: human bypass

When FlareSolverr is not an option (e.g. not installed or the site still blocks it), use --human-bypass to solve the Cloudflare challenge yourself in a visible browser:

strigil --url https://syri.ac/digimss/... --human-bypass --no-robots --crawl --max-depth 3

  1. A browser window opens and loads the page.
  2. If Cloudflare appears, solve the challenge (e.g. click "Verify you are human").
  3. When the real page has loaded, return to the terminal and press Enter.
  4. Strigil continues scraping using your authenticated session.

--human-bypass implies --js and uses a headed browser. For crawl mode, omit --same-domain-only to follow cross-domain links to manuscript viewers.

Building a standalone bundle

To build a standalone folder with the CLI and GUI (no Python required on the target machine):

pip install -e ".[bundle]"
pyinstaller strigil.spec

Output is in dist/strigil/: run strigil or strigil-gui from that folder. The GUI uses the bundled strigil executable in the same directory when you click Scrape.

Install packages (Mac, Windows, Linux)

Build an install package for the current platform (folder + archive):

  • macOS: ./scripts/build_mac.sh → dist/strigil-mac.zip
  • Linux: ./scripts/build_linux.sh → dist/strigil-linux.tar.gz
  • Windows: scripts\build_windows.bat → dist\strigil-win.zip

Each script runs pip install -e ".[bundle]", pyinstaller strigil.spec, then creates the archive. Unzip (or unpack the tarball) and run strigil or strigil-gui from the strigil folder.

Docker

Light image (CLI only, no GUI):

docker build -t strigil .
docker run --rm -v "$(pwd)/output:/strigil/output" strigil --url https://example.com --out-dir /strigil/output

Override the default URL and options by passing args after the image name.

CI: build all OS and Docker

On push/PR to main or master, GitHub Actions:

  • Builds PyInstaller bundles on Ubuntu, macOS, and Windows and uploads:
    • strigil-<os> – the dist/strigil/ folder
    • strigil-<os>-install – install package: strigil-win.zip, strigil-mac.zip, or strigil-linux.tar.gz
  • Builds the Docker image and runs a quick smoke test.

See .github/workflows/build.yml.

Version and release

  • Version check: python scripts/check_version.py verifies pyproject.toml matches CHANGELOG (run before release).
  • Auto-release: python-semantic-release is configured. Use conventional commits (feat:, fix:, BREAKING CHANGE:) for automatic version bumps on push to main.
