Fix resume — skip chapter download when content file already exists #2968

Open: apetros wants to merge 1 commit into lncrawl:dev from apetros:fix/chapter-resume-skip

@apetros (Contributor) commented May 13, 2026

Re-running lncrawl crawl --noin --all -f epub <URL> on a half-finished novel re-downloaded every single already-saved chapter instead of resuming. With 1016/1734 chapters on disk and a long crawl interrupted by Cloudflare, I was watching it start over from zero every time.

The skip condition in CrawlerService.fetch_chapter looks fine at first glance:

if (
    not refresh
    and chapter.is_available
    and chapter.extra.get("crawler_version") == crawler.version
):
    return chapter   # skip

…but crawler_version is a declared parameter on CrawlerChapter (lncrawl/core/models.py:123), and _ModelBox.get_extras() returns only kwargs beyond declared params. So when fetch_chapter does chapter.extra.update(model.get_extras()) at save time, crawler_version is filtered out — it's never written to the JSON extra blob. chapter.extra.get("crawler_version") is None forever, never equals the int crawler.version, and the skip never fires.
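
To make the filtering concrete, here is a minimal sketch of the pattern. `ModelBox` and `Chapter` are hypothetical stand-ins written for this illustration, not the real `_ModelBox` or `CrawlerChapter`; the only assumption carried over is that `get_extras()` returns just the kwargs that are not declared `__init__` parameters.

```python
import inspect


class ModelBox:
    """Hypothetical stand-in for a model base class (not lncrawl's real _ModelBox)."""

    def __init__(self, **kwargs):
        self._data = dict(kwargs)

    def get_extras(self):
        # Keep only the values whose names are NOT declared parameters of __init__.
        declared = set(inspect.signature(type(self).__init__).parameters)
        return {k: v for k, v in self._data.items() if k not in declared}


class Chapter(ModelBox):
    # crawler_version is a declared parameter, just like in the description above.
    def __init__(self, id, title="", crawler_version=None, **kwargs):
        super().__init__(id=id, title=title, crawler_version=crawler_version, **kwargs)


model = Chapter(id=1, title="Prologue", crawler_version=1715600000, source_note="x")

extra = {}
extra.update(model.get_extras())  # only the undeclared kwarg survives

# The comparison the old skip condition relied on can never be true.
skip_would_fire = extra.get("crawler_version") == 1715600000
```

In this sketch `extra` ends up holding only `source_note`, so a lookup of `crawler_version` returns `None` no matter what value the model was built with.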

Verified empirically on the test DB: every completed chapter has extra = {}. Nothing was being written.

There's a deeper hole behind this — crawler.version is set to int(file.st_mtime) of the source crawler file (lncrawl/services/sources/helper.py:152), which changes on git checkout, OneDrive sync, an editor save, etc. Even if the persistence were fixed, the comparison would be fragile. Storing the version in a proper DAO column is the long-term answer, but that's a migration and out of scope for "make resume work."
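
A small sketch of why an mtime-derived version is fragile. The helper names and the temporary file are made up for this example, and the content hash is shown only as a contrast; it is not what this PR proposes and not something lncrawl is known to do.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def mtime_version(path: Path) -> int:
    # Same idea as described above: the version is the file's mtime as an int.
    return int(path.stat().st_mtime)


def content_version(path: Path) -> str:
    # Hypothetical alternative for comparison: hash the file contents.
    return hashlib.sha256(path.read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "crawler.py"
    source.write_text("class Crawler: pass\n")

    # Pin the timestamp so the example is deterministic.
    os.utime(source, (1_700_000_000, 1_700_000_000))
    mtime_before, hash_before = mtime_version(source), content_version(source)

    # Simulate a checkout or sync tool touching the file without changing the code.
    os.utime(source, (1_700_000_100, 1_700_000_100))
    mtime_after, hash_after = mtime_version(source), content_version(source)

mtime_changed = mtime_before != mtime_after
hash_changed = hash_before != hash_after
```

The timestamp-based version changes even though the source is byte-for-byte identical, while a value derived from the contents stays the same.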

This PR keeps the change minimal: trust the on-disk file. If the chapter file exists, skip the download. Anyone who really wants to re-fetch already has --refresh.

if not refresh and chapter.is_available:
    return chapter
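
For illustration, the resulting behaviour can be sketched outside the codebase as follows. This `fetch_chapter` is a simplified, hypothetical function rather than `CrawlerService.fetch_chapter`; `path.is_file()` stands in for `chapter.is_available` on the assumption that availability means the content file exists, and `fake_download` replaces the real network call.

```python
import tempfile
from pathlib import Path


def fetch_chapter(path: Path, download, refresh: bool = False) -> str:
    """Return chapter text, downloading only if the file is missing or refresh is set."""
    if not refresh and path.is_file():
        # Trust the on-disk file and skip the network entirely.
        return path.read_text(encoding="utf-8")
    body = download()
    path.write_text(body, encoding="utf-8")
    return body


network_calls = []


def fake_download() -> str:
    # Records each simulated network hit.
    network_calls.append(1)
    return "chapter body"


with tempfile.TemporaryDirectory() as tmp:
    chapter_file = Path(tmp) / "00001.json"
    fetch_chapter(chapter_file, fake_download)                # file missing: downloads
    fetch_chapter(chapter_file, fake_download)                # file present: skipped
    fetch_chapter(chapter_file, fake_download, refresh=True)  # forced re-download
    calls = len(network_calls)
```

Only the first call and the explicit refresh reach the fake downloader; the resumed call is served from disk.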

Test plan

  • Run lncrawl crawl --noin --all -f <fmt> <URL> on a partially-downloaded novel. Confirm the progress bar zooms through the already-on-disk chapters (≈1000+ c/s) and only the missing ones actually hit the network.
  • Verified locally: a 1734-chapter novel with 1016 on-disk chapters processed the first 1020 in under a second, then slowed to live download speed for the rest.

The previous skip condition compared chapter.extra["crawler_version"]
against the live crawler.version, but the version was never actually
written into extra: get_extras() omits declared init params on
CrawlerChapter, so update(model.get_extras()) added nothing. The check
was always False, so every chapter was re-downloaded on every run.

Trust file presence on disk for the skip. A user who really needs to
re-download a stored chapter already has --refresh.
-     and chapter.is_available
-     and chapter.extra.get("crawler_version") == crawler.version
- ):
+ if not refresh and chapter.is_available:

@dipu-bd (Collaborator) commented May 13, 2026

If the crawler version does not match, the crawler could fetch different content, so the cache should not be used. For example, a crawler can add an additional cleaner, and the chapter content can be more polished with it on the next fetch.
