Strigil: PDFs, text, and images from websites at high quality, stored locally.
Repository: strigil · Author: Seth Strickland · License: MIT
MIT License. Copyright (c) 2025 Seth Strickland. See LICENSE.
From the project directory:
```shell
pip install -e .
```

This installs the package in editable mode and registers the `strigil` and `strigil-gui` console scripts. You can then run:
```shell
strigil --url https://example.com/page [URL2 ...] [--out-dir output] [--delay 1] [--crawl] [--max-depth 2] [--same-domain-only]
```

Archival mode (default): Images are filtered for archival quality: `--min-image-size 100k` by default, GIFs skipped, thumbnail/full-size duplicates deduplicated, and banners/logos/social icons skipped. Use `--all-images` to include all images, including thumbnails.
Filter by size: `--min-image-size 50k` and/or `--max-image-size 5m` (suffixes `k`/`m` for KB/MB). Use `--all-images` to disable the default 100k minimum.
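The `k`/`m` suffix convention above can be sketched as a small parser. This is an illustrative helper, not strigil's actual implementation; the function name and the 1024-based multipliers are assumptions:

```python
def parse_size(value: str) -> int:
    """Parse a size string such as '50k' or '5m' into bytes.

    Hypothetical sketch of the --min-image-size / --max-image-size
    suffix handling described above (k = KB, m = MB, 1024-based).
    """
    value = value.strip().lower()
    if value.endswith("k"):
        return int(float(value[:-1]) * 1024)          # kilobytes
    if value.endswith("m"):
        return int(float(value[:-1]) * 1024 * 1024)   # megabytes
    return int(value)                                  # plain bytes

print(parse_size("50k"))   # 51200
print(parse_size("5m"))    # 5242880
```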
Or open the simple GUI:
```shell
strigil-gui
```

Use `--js` for JS-heavy pages (e.g. NYPL Digital Collections). Playwright is installed automatically with the package; on first `--js` use, Chromium is downloaded if needed.
Supported archival sources: NYPL Digital Collections, CONTENTdm, IIIF manifests, HathiTrust (babel.hathitrust.org), EEBO (ProQuest), ECCO (Gale), Internet Archive (full IIIF + metadata API fallback), Wellcome Collection (Catalogue API → IIIF manifest), Stanford PURL, Digital Bodleian, Library of Congress.
Discovery hints: Use `--expected-images N` when a page should have ~N images (triggers fallbacks if fewer are found). Use `--source ADAPTER` to force a specific adapter (e.g. `wellcome`, `archive_org`) when auto-detection fails.
Optional: install tqdm for a progress bar (per-page in crawl, per-asset on single page):
pip install -e ".[progress]"Use --no-progress to disable the bar (e.g. in scripts).
Use `--keep-awake` to prevent system/display sleep during long scrapes. On Linux this requires the `systemd-inhibit` binary (usually provided by your distro's systemd package, e.g. `sudo apt install systemd`). If `--keep-awake` is used on Linux without it installed, the app prints an install hint.
When you run `strigil` or `strigil-gui`, the app auto-installs missing dependencies on first run. Required deps (`httpx`, `beautifulsoup4`, `lxml`) are installed and the app exits; run the command again. Optional deps (Playwright, tqdm, readability-lxml) are installed and the app continues. Set `STRIGIL_AUTO_INSTALL_DEPS=0` to disable auto-install.
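The decision logic for that first-run check might look like the following sketch. The `STRIGIL_AUTO_INSTALL_DEPS` variable comes from the docs above; the helper names and the exact module list are assumptions:

```python
import importlib.util
import os
import subprocess
import sys

# Import names for the required deps httpx, beautifulsoup4, lxml.
REQUIRED = ["httpx", "bs4", "lxml"]

def auto_install_enabled() -> bool:
    # STRIGIL_AUTO_INSTALL_DEPS=0 disables auto-install, per the docs above.
    return os.environ.get("STRIGIL_AUTO_INSTALL_DEPS", "1") != "0"

def missing_required() -> list[str]:
    return [m for m in REQUIRED if importlib.util.find_spec(m) is None]

def ensure_deps() -> None:
    """Install missing required deps, then exit so the user reruns the command."""
    missing = missing_required()
    if missing and auto_install_enabled():
        subprocess.check_call([sys.executable, "-m", "pip", "install", *missing])
        sys.exit("Dependencies installed; run the command again.")
```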
- From PyPI:

  ```shell
  pip install strigil
  ```

- From source (in the project directory):

  ```shell
  pip install -e .
  ```
If Playwright is not installed, the app shows an optional hint about JS rendering (`--js`).
For crawl mode, the scraper auto-detects CPU count and caps parallel workers (default: up to 12) for faster scraping. Override with `--workers N`. To see detected hardware (CPU, memory if available, suggested workers), run `strigil --hardware`.
Faster crawl and scrape: Use `--aggressiveness aggressive` (or `--workers 12 --delay 0.15`) for maximum speed. More workers = more pages in parallel; a lower delay = less wait between requests. Per-page asset downloads and image HEADs also run with higher parallelism (up to 8 parallel assets, 6 HEADs).
Aggressiveness (auto from hardware and power): Use `--aggressiveness auto` (default) to let the scraper pick conservative, balanced, or aggressive. On AC power with strong hardware it suggests aggressive; on battery it throttles by charge level: below 20% always conservative, at 50%+ it may allow balanced if hardware allows. Run `strigil --hardware` to see power source, battery %, and the suggested preset.
```shell
strigil --url https://example.com --crawl --workers 6 --delay 0.3
strigil --url https://example.com --crawl --aggressiveness aggressive
```

Some sites use Cloudflare, DDoS-GUARD, or similar protection and may return 403, challenge pages, or block non-browser traffic. The built-in `--js` mode (Playwright) can help on JS-heavy or bot-detecting sites, but it does not solve Cloudflare challenges.
Recommended: FlareSolverr — The best way to handle Cloudflare is FlareSolverr. It solves challenges in a dedicated headless browser and returns cleared HTML; no manual step and no headed browser required.
- Run FlareSolverr (e.g. Docker):
  - Docker Compose (recommended): `docker compose up -d flaresolverr` uses the pinned version from `docker-compose.yml`; run `docker compose pull flaresolverr && docker compose up -d flaresolverr` to update. Renovate will open PRs when new FlareSolverr versions are released.
  - Plain Docker: `docker run -d -p 8191:8191 ghcr.io/flaresolverr/flaresolverr:latest`
- Enable it when scraping:
  - Auto-engage: If FlareSolverr is running at `http://localhost:8191` (or `FLARESOLVERR_URL`), the scraper automatically retries via FlareSolverr when it detects Cloudflare (403 or a challenge page). No flag needed.
  - CLI (explicit): `strigil --url https://... --flaresolverr` (uses `http://localhost:8191` or `FLARESOLVERR_URL`)
  - CLI with custom URL: `strigil --url https://... --flaresolverr http://host:8191`
  - Env: set `FLARESOLVERR_URL=http://localhost:8191` and pass `--flaresolverr`
  - GUI: check "FlareSolverr (Cloudflare bypass)" and optionally set the FlareSolverr API URL.
When FlareSolverr is used (auto or explicit), page HTML is requested through it; asset downloads (images, PDFs) still use the normal fetcher. With crawl, the scraper uses one worker when FlareSolverr is enabled (same as with `--js`).
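For reference, the FlareSolverr API takes a JSON POST to its `/v1` endpoint with a `request.get` command; the cleared HTML comes back in `solution.response`. How strigil wires this up internally is not shown here; this sketch only builds the request payload:

```python
import json

def flaresolverr_payload(url: str, max_timeout_ms: int = 60000) -> dict:
    """Build a FlareSolverr v1 'request.get' payload.

    Field names follow the FlareSolverr API; the default timeout
    value here is an assumption.
    """
    return {"cmd": "request.get", "url": url, "maxTimeout": max_timeout_ms}

# POSTed as JSON to e.g. http://localhost:8191/v1; the cleared page
# HTML is then found in response["solution"]["response"].
payload = flaresolverr_payload("https://example.com/page")
print(json.dumps(payload))
```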
- FlareSolverr: https://github.com/FlareSolverr/FlareSolverr — proxy server to bypass Cloudflare and DDoS-GUARD protection (MIT license).
Alternative: human bypass — If FlareSolverr is not available or a site still blocks it, use `--human-bypass` to solve the challenge yourself in a visible browser (see Cloudflare: human bypass below).
When the scraper receives HTTP 429 (Too Many Requests) or a 200 response whose body indicates a rate limit (e.g. “Rate limit reached”, “too many requests”), it:
- Waits before retrying: uses the `Retry-After` header when present (seconds or HTTP-date), otherwise 30–60s.
- Throttles subsequent requests: a per-Fetcher delay is applied so all following requests (HTML, images, HEADs) are slowed until the backend recovers. The delay decays gradually after successful responses.
502/503/504 (Bad Gateway, Service Unavailable, Gateway Timeout) are retried up to 6 times with a 5s base wait so flaky upstream servers (e.g. IIIF image servers) often succeed on retry.
Failed assets: If particular images or PDFs time out or fail after retries, the scraper records them and runs a retry pass after the main download (longer timeout, sequential). Use `--no-retry-failed` to skip this pass, or `--retry-timeout 120` to set the retry timeout in seconds (default 90). Still-failed URLs are written to `output/<domain>/failed_urls.txt` and file names to errata. Retry later with `--retry-from output/<domain>/failed_urls.txt`.
Sites like Archive-It that return a rate-limit message in the HTML body are handled the same way: wait, retry, and throttle.
On 403 or slow responses, the scraper retries automatically:
- Iterations: Single-page runs retry up to `--max-iterations` (default 3). Each iteration uses a longer delay and timeout; if the first attempt gets 403, the next iteration automatically uses the browser (`--js`) when Playwright is installed. For sites behind Cloudflare or similar protection, see Cloudflare and bot protection above.
- Auto timeout: Per-request timeout scales with each retry (30s → 60s → 120s, capped at 120s); override the base timeout in code if needed.
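The timeout progression above amounts to doubling per retry under a cap. A minimal sketch (the function name is hypothetical; the 30s base and 120s cap come from the docs):

```python
def timeout_for_attempt(attempt: int, base: float = 30.0, cap: float = 120.0) -> float:
    """Double the per-request timeout each retry, capped at 120s."""
    return min(base * (2 ** attempt), cap)

print([timeout_for_attempt(i) for i in range(4)])  # [30.0, 60.0, 120.0, 120.0]
```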
```shell
strigil --url https://strict.site/page --max-iterations 5
```

When FlareSolverr is not an option (e.g. not installed or the site still blocks it), use `--human-bypass` to solve the Cloudflare challenge yourself in a visible browser:

```shell
strigil --url https://syri.ac/digimss/... --human-bypass --no-robots --crawl --max-depth 3
```

- A browser window opens and loads the page.
- If Cloudflare appears, solve the challenge (e.g. click "Verify you are human").
- When the real page has loaded, return to the terminal and press Enter.
- Strigil continues scraping using your authenticated session.
`--human-bypass` implies `--js` and uses a headed browser. For crawl mode, omit `--same-domain-only` to follow cross-domain links to manuscript viewers.
To build a standalone folder with the CLI and GUI (no Python required on the target machine):
pip install -e ".[bundle]"
pyinstaller strigil.specOutput is in dist/strigil/: run strigil or strigil-gui from that folder. The GUI uses the bundled strigil executable in the same directory when you click Scrape.
Build an install package for the current platform (folder + archive):
| Platform | Script | Output |
|---|---|---|
| macOS | `./scripts/build_mac.sh` | `dist/strigil-mac.zip` |
| Linux | `./scripts/build_linux.sh` | `dist/strigil-linux.tar.gz` |
| Windows | `scripts\build_windows.bat` | `dist\strigil-win.zip` |
Each script runs `pip install -e ".[bundle]"`, then `pyinstaller strigil.spec`, then creates the archive. Unzip (or unpack the tarball) and run `strigil` or `strigil-gui` from the `strigil` folder.
Light image (CLI only, no GUI):
```shell
docker build -t strigil .
docker run --rm -v "$(pwd)/output:/strigil/output" strigil --url https://example.com --out-dir /strigil/output
```

Override the default URL and options by passing args after the image name.
On push/PR to main or master, GitHub Actions:
- Builds PyInstaller bundles on Ubuntu, macOS, and Windows and uploads:
  - `strigil-<os>` – the `dist/strigil/` folder
  - `strigil-<os>-install` – install package: `strigil-win.zip`, `strigil-mac.zip`, or `strigil-linux.tar.gz`
- Builds the Docker image and runs a quick smoke test.
See .github/workflows/build.yml.
- Version check: `python scripts/check_version.py` verifies that pyproject.toml matches the CHANGELOG (run before release).
- Auto-release: python-semantic-release is configured. Use conventional commits (`feat:`, `fix:`, `BREAKING CHANGE:`) for automatic version bumps on push to main.