Fix resume — skip chapter download when content file already exists by apetros · Pull Request #2968 · lncrawl/lightnovel-crawler

apetros · 2026-05-13T12:13:50Z

Re-running lncrawl crawl --noin --all -f epub <URL> on a half-finished novel re-downloaded every single already-saved chapter instead of resuming. With 1016/1734 chapters on disk and a long crawl interrupted by Cloudflare, I was watching it start over from zero every time.

The skip condition in CrawlerService.fetch_chapter looks fine at first glance:

if (
    not refresh
    and chapter.is_available
    and chapter.extra.get("crawler_version") == crawler.version
):
    return chapter   # skip

…but crawler_version is a declared parameter on CrawlerChapter (lncrawl/core/models.py:123), and _ModelBox.get_extras() returns only kwargs beyond declared params. So when fetch_chapter does chapter.extra.update(model.get_extras()) at save time, crawler_version is filtered out — it's never written to the JSON extra blob. chapter.extra.get("crawler_version") is None forever, never equals the int crawler.version, and the skip never fires.

Verified empirically on the test DB: every completed chapter has extra = {}. Nothing was being written.
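To make the filtering concrete, here is a minimal sketch of the behaviour described above. `ModelBox` and `Chapter` are hypothetical stand-ins, not the real `_ModelBox` / `CrawlerChapter` classes, and the use of `inspect.signature` to find declared parameters is an assumption about how such filtering could be implemented.

```python
import inspect


class ModelBox:
    """Hypothetical stand-in for _ModelBox: remembers every kwarg it was given."""

    def __init__(self, **kwargs):
        self._data = dict(kwargs)

    def get_extras(self):
        # Anything named in the subclass __init__ signature counts as "declared"
        # and is therefore excluded from the extras.
        declared = set(inspect.signature(type(self).__init__).parameters)
        return {k: v for k, v in self._data.items() if k not in declared}


class Chapter(ModelBox):
    """Hypothetical stand-in for CrawlerChapter with crawler_version declared."""

    def __init__(self, id, title="", crawler_version=None, **kwargs):
        super().__init__(id=id, title=title, crawler_version=crawler_version, **kwargs)


model = Chapter(id=1, title="Prologue", crawler_version=1715602430, source_lang="en")

extra = {}
extra.update(model.get_extras())  # mirrors chapter.extra.update(model.get_extras())

print(extra)                                        # {'source_lang': 'en'}
print(extra.get("crawler_version") == 1715602430)   # False
```

Because `crawler_version` is a declared parameter, it never reaches `extra`, so the lookup returns `None` and the equality check cannot succeed.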

There's a deeper hole behind this — crawler.version is set to int(file.st_mtime) of the source crawler file (lncrawl/services/sources/helper.py:152), which changes on git checkout, OneDrive sync, an editor save, etc. Even if the persistence were fixed, the comparison would be fragile. Storing the version in a proper DAO column is the long-term answer, but that's a migration and out of scope for "make resume work."
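The fragility is easy to reproduce in isolation. The sketch below assumes only that the version is derived as `int(st_mtime)` of a file, as described above; `mtime_version` and the temporary file are illustrative, not code from the repository.

```python
import os
import tempfile


def mtime_version(path):
    # Illustrative helper: derive a "version" from the file's modification time.
    return int(os.stat(path).st_mtime)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_crawler.py")
    with open(path, "w") as f:
        f.write("class ExampleCrawler: pass\n")

    with open(path, "rb") as f:
        content_before = f.read()

    # Simulate a checkout or sync tool rewriting the timestamp only.
    os.utime(path, (1_700_000_000, 1_700_000_000))
    v1 = mtime_version(path)
    os.utime(path, (1_700_000_100, 1_700_000_100))
    v2 = mtime_version(path)

    with open(path, "rb") as f:
        content_after = f.read()

print(v1, v2)                           # 1700000000 1700000100
print(content_before == content_after)  # True
```

The file's bytes are identical, yet the derived version differs, so a stored version would stop matching even though the crawler code did not change.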

This PR keeps the change minimal: trust the on-disk file. If the chapter file exists, skip the download. Anyone who really wants to re-fetch already has --refresh.

if not refresh and chapter.is_available:
    return chapter
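For comparison, here is a small sketch of how the old and new conditions behave on a resume like the one described above (1734 chapters, 1016 on disk, `extra` always empty). The `Chapter` dataclass and the two helper functions are hypothetical; only the boolean conditions are taken from this PR.

```python
from dataclasses import dataclass, field


@dataclass
class Chapter:
    # Hypothetical stand-in for the stored chapter record.
    serial: int
    is_available: bool
    extra: dict = field(default_factory=dict)


def old_skip(chapter, refresh, crawler_version):
    # Previous condition: also requires a stored crawler_version to match.
    return (
        not refresh
        and chapter.is_available
        and chapter.extra.get("crawler_version") == crawler_version
    )


def new_skip(chapter, refresh):
    # Condition after this PR: trust the on-disk file.
    return not refresh and chapter.is_available


chapters = [Chapter(serial=i, is_available=(i <= 1016)) for i in range(1, 1735)]
crawler_version = 1715602430  # arbitrary example value

old_skipped = sum(old_skip(c, False, crawler_version) for c in chapters)
new_skipped = sum(new_skip(c, False) for c in chapters)
refreshed = sum(new_skip(c, True) for c in chapters)

print(old_skipped, new_skipped, refreshed)  # 0 1016 0
```

With `extra` empty, the old condition skips nothing, the new one skips exactly the chapters already on disk, and passing `refresh=True` still forces every chapter to be fetched again.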

Test plan

Run lncrawl crawl --noin --all -f <fmt> <URL> on a partially-downloaded novel. Confirm the progress bar zooms through the already-on-disk chapters (≈1000+ c/s) and only the missing ones actually hit the network.
Verified locally: a 1734-chapter novel with 1016 on-disk chapters processed the first 1020 in under a second, then slowed to live download speed for the rest.

The previous skip condition compared chapter.extra["crawler_version"] against the live crawler.version, but the version was never actually written into extra: get_extras() omits declared init params on CrawlerChapter, so update(model.get_extras()) added nothing. The check was always False, so every chapter was re-downloaded on every run. Trust file presence on disk for the skip instead. A user who really needs to re-download a stored chapter already has --refresh.

dipu-bd · 2026-05-13T13:02:28Z

-            and chapter.is_available
-            and chapter.extra.get("crawler_version") == crawler.version
-        ):
+        if not refresh and chapter.is_available:


If the crawler version does not match, the crawler could fetch different content, so the cached chapter should not be used. For example, a crawler can add an additional cleaner, and the chapter content can come out more polished on the next fetch.

apetros wants to merge 1 commit into lncrawl:dev from apetros:fix/chapter-resume-skip
