
Commit 439b24d

v0.3.0: speed, progress clarity, image extraction, file skip

- Skip HEAD when no min/max image size (faster downloads)
- Lower default delay 1.0→0.5s; more parallel workers (5 assets, 4 head)
- Clear progress: Found N, Downloading N assets, [i/N] per item
- Image extraction: preload links, background-image, path hints, lazy attrs
- File skip: canonical paths, skip already scraped resources
- Auto retry crawl cross-domain if same-domain empty
- GUI: Stop button, status parsing, last URL persisted
- Fix: tqdm unit spacing, multiprocessing warning

Co-authored-by: Cursor <cursoragent@cursor.com>

1 parent cacd8f0 commit 439b24d

9 files changed: 1040 additions & 286 deletions

.gitignore

Lines changed: 1 addition & 0 deletions

```diff
@@ -10,5 +10,6 @@ build/
 venv/
 env/
 output/
+output_*/
 *.log
 .DS_Store
```

CHANGELOG.md

Lines changed: 21 additions & 0 deletions

```diff
@@ -6,6 +6,27 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 
 ## [Unreleased]
 
+## [0.3.0] - 2025-02-04
+
+### Added
+- Clear progress output: "Found: X PDFs, Y images", "→ Downloading N assets", "[i/N]" per item.
+- Image extraction: `link[rel=preload][as=image]`, CSS `background-image: url()`, extension-less paths (e.g. `/image/`, `/thumb/`), more lazy-load attrs (`data-zoom-src`, `data-hires`, etc.).
+- GUI: Stop button, status parsing for mapping/download progress, `[i/N]` display.
+- Skip HEAD when no `--min-image-size` / `--max-image-size` (faster image downloads).
+- Auto retry crawl with cross-domain if same-domain returns no results.
+- File skip: check canonical paths and skip already-scraped images/PDFs/text.
+- Last URL persisted on GUI relaunch.
+- Multiprocessing semaphore warning suppression; tqdm unit spacing fix.
+
+### Changed
+- Default delay: 1.0s → 0.5s.
+- Workers: `SAFE_ASSET_WORKERS` 3→5, `SAFE_HEAD_WORKERS` 2→4; parallel HEAD threshold 8→4.
+- Crawl follows links by default; "Follow links" checkbox clarified.
+- README: min-image-size tip; `output_*/` in .gitignore.
+
+### Removed
+- GUI progress bar and spinner (replaced by clearer status text).
+
 ## [0.2.0] - 2025-02-04
 
 - GUI (tkinter) with file-type selector, image size filter, and Open folder button.
```
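As a rough illustration of the image-extraction rules named in the changelog (preload links, CSS `background-image`, extra lazy-load attrs), a minimal regex-based sketch follows; the function name and the exact matching logic are hypothetical and the real parser may differ.

```python
import re

# Hypothetical sketch: the lazy-load attribute list mirrors the changelog
# (`data-zoom-src`, `data-hires`, etc.); the scraper's actual list may differ.
LAZY_ATTRS = ("src", "data-src", "data-lazy-src", "data-zoom-src", "data-hires")

def extract_image_urls(html: str) -> set[str]:
    urls = set()
    # <link rel="preload" as="image" href="...">
    for link in re.findall(r"<link\b[^>]*>", html, re.I):
        if 'rel="preload"' in link and 'as="image"' in link:
            m = re.search(r'href="([^"]+)"', link)
            if m:
                urls.add(m.group(1))
    # CSS background-image: url(...)
    urls.update(re.findall(r'background-image\s*:\s*url\(["\']?([^"\')]+)', html))
    # <img> with src or lazy-load attributes
    for img in re.findall(r"<img\b[^>]*>", html, re.I):
        for attr in LAZY_ATTRS:
            m = re.search(rf'{attr}="([^"]+)"', img)
            if m:
                urls.add(m.group(1))
    return urls
```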

README.md

Lines changed: 12 additions & 1 deletion

````diff
@@ -32,7 +32,7 @@ This installs the package in editable mode and registers the `scrape` and `scrap
 scrape --url https://example.com/page [--out-dir output] [--delay 1] [--crawl] [--max-depth 2] [--same-domain-only]
 ```
 
-Filter images by file size (uses HEAD `Content-Length`): `--min-image-size 50k` and/or `--max-image-size 5m` (suffixes `k`/`m` for KB/MB).
+Filter images by file size (uses HEAD `Content-Length`): `--min-image-size 50k` and/or `--max-image-size 5m` (suffixes `k`/`m` for KB/MB). Use a low or zero minimum to capture thumbnails; a high minimum (e.g. `1m`) skips smaller images.
 
 Or open the simple GUI:
 
````
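The size filter the README hunk describes can be sketched roughly as below; `parse_size` and `passes_size_filter` are illustrative names, not the project's actual API, and the sketch assumes binary units (`k` = 1024 bytes).

```python
import urllib.request

def parse_size(spec: str) -> int:
    """Parse a size with optional k/m suffix: '50k' -> 51200, '5m' -> 5242880."""
    spec = spec.strip().lower()
    if spec.endswith("k"):
        return int(float(spec[:-1]) * 1024)
    if spec.endswith("m"):
        return int(float(spec[:-1]) * 1024 * 1024)
    return int(spec)

def passes_size_filter(url: str, min_size=None, max_size=None) -> bool:
    # v0.3.0 behavior: with no bounds set, skip the HEAD request entirely.
    if min_size is None and max_size is None:
        return True
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        length = int(resp.headers.get("Content-Length", 0))
    if min_size is not None and length < min_size:
        return False
    if max_size is not None and length > max_size:
        return False
    return True
```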
````diff
@@ -76,6 +76,17 @@ For **crawl** mode, the scraper auto-detects CPU count and caps parallel workers
 scrape --url https://example.com --crawl --workers 2
 ```
 
+### Iterations and auto timeout (single-page)
+
+On 403 or slow responses, the scraper retries automatically:
+
+- **Iterations:** Single-page runs retry up to `--max-iterations` (default 3). Each iteration uses a longer delay and timeout; if the first attempt gets a 403, the next iteration automatically uses the browser (`--js`) when Playwright is installed.
+- **Auto timeout:** The per-request timeout scales with each retry (30s → 60s → 120s, capped at 120s); override the base timeout in code if needed.
+
+```bash
+scrape --url https://strict.site/page --max-iterations 5
+```
+
 ## Building a standalone bundle
 
 To build a standalone folder with the CLI and GUI (no Python required on the target machine):
````
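The timeout scaling described in the README hunk (30s → 60s → 120s, capped at 120s) amounts to doubling per retry; a minimal sketch, with hypothetical names and assumed 1-based iteration numbering:

```python
# Illustrative constants matching the README's description, not the
# project's actual code.
BASE_TIMEOUT = 30   # seconds for the first attempt
MAX_TIMEOUT = 120   # hard cap

def timeout_for_iteration(iteration: int) -> int:
    """1 -> 30, 2 -> 60, 3 and beyond -> 120 (capped)."""
    return min(BASE_TIMEOUT * 2 ** (iteration - 1), MAX_TIMEOUT)
```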
