Fix resume: skip chapter download when content file already exists #2968
Open
apetros wants to merge 1 commit into
The previous skip condition compared `chapter.extra["crawler_version"]` against the live `crawler.version`, but the version was never actually written into `extra`: `get_extras()` omits declared init params on `CrawlerChapter`, so `update(model.get_extras())` added nothing. The check was always False, so every chapter was re-downloaded on every run. This change trusts file presence on disk for the skip. Users who really need to re-download a stored chapter already have `--refresh`.
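A minimal sketch of the failure mode described above. The classes here are hypothetical stand-ins, not the real lncrawl `_ModelBox` or `CrawlerChapter`; they only assume what the description states, namely that `get_extras()` returns kwargs beyond the declared parameters.

```python
from typing import Any, Dict, Tuple


class ModelBox:
    """Hypothetical stand-in: stores kwargs and separates declared fields from extras."""

    DECLARED: Tuple[str, ...] = ()

    def __init__(self, **kwargs: Any) -> None:
        self._data: Dict[str, Any] = dict(kwargs)

    def get_extras(self) -> Dict[str, Any]:
        # Only keys that are NOT declared parameters count as extras.
        return {k: v for k, v in self._data.items() if k not in self.DECLARED}


class Chapter(ModelBox):
    """Hypothetical chapter model where crawler_version is a declared parameter."""

    DECLARED = ("id", "title", "crawler_version")


LIVE_VERSION = 1_715_600_000  # assumed example value for the live crawler version

model = Chapter(id=1, title="Chapter 1", crawler_version=LIVE_VERSION, source_lang="en")

# Mirrors the save-time pattern: extra.update(model.get_extras())
stored_extra: Dict[str, Any] = {}
stored_extra.update(model.get_extras())

# crawler_version was declared, so it is filtered out and never stored.
version_matches = stored_extra.get("crawler_version") == LIVE_VERSION
print(stored_extra, version_matches)
```

Under these assumptions the stored version is always missing, so a skip condition that requires a version match can never be true.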
dipu-bd
reviewed
May 13, 2026
-    and chapter.is_available
-    and chapter.extra.get("crawler_version") == crawler.version
-):
+if not refresh and chapter.is_available:
Collaborator
If the crawler version does not match, the crawler could fetch different content, so the cache should not be used. For example, a crawler can add an additional cleaner, and the chapter content can be more polished with it on the next fetch.
Re-running `lncrawl crawl --noin --all -f epub <URL>` on a half-finished novel re-downloaded every single already-saved chapter instead of resuming. With 1016/1734 chapters on disk and a long crawl interrupted by Cloudflare, I was watching it start over from zero every time.

The skip condition in `CrawlerService.fetch_chapter` looks fine at first glance, but `crawler_version` is a declared parameter on `CrawlerChapter` (`lncrawl/core/models.py:123`), and `_ModelBox.get_extras()` returns only kwargs beyond declared params. So when `fetch_chapter` does `chapter.extra.update(model.get_extras())` at save time, `crawler_version` is filtered out and is never written to the JSON `extra` blob. `chapter.extra.get("crawler_version")` is `None` forever, never equals the int `crawler.version`, and the skip never fires.

Verified empirically on the test DB: every completed chapter has `extra = {}`. Nothing was being written.

There's a deeper hole behind this: `crawler.version` is set to `int(file.st_mtime)` of the source crawler file (`lncrawl/services/sources/helper.py:152`), which changes on `git checkout`, OneDrive sync, an editor save, etc. Even if the persistence were fixed, the comparison would be fragile. Storing the version in a proper DAO column is the long-term answer, but that's a migration and out of scope for "make resume work."

This PR keeps the change minimal: trust the on-disk file. If the chapter file exists, skip the download. Anyone who really wants to re-fetch already has `--refresh`.

Test plan

Run `lncrawl crawl --noin --all -f <fmt> <URL>` on a partially-downloaded novel. Confirm the progress bar zooms through the already-on-disk chapters (≈1000+ c/s) and only the missing ones actually hit the network.
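To illustrate why an mtime-derived version is fragile, here is a small sketch. The helper `mtime_version` is hypothetical and only mirrors the `int(file.st_mtime)` idea from the description; `os.utime` stands in for whatever a checkout, sync client, or editor save does to the timestamp.

```python
import os
import tempfile
from typing import Tuple


def mtime_version(path: str) -> int:
    """Hypothetical helper: derive a 'version' from the file's modification time."""
    return int(os.stat(path).st_mtime)


def demo() -> Tuple[int, int]:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "source.py")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write("# crawler source, content never changes in this demo\n")

        # Pin the modification time to a known value.
        os.utime(path, (1_700_000_000, 1_700_000_000))
        before = mtime_version(path)

        # Change only the timestamp; the file content is untouched.
        os.utime(path, (1_700_000_600, 1_700_000_600))
        after = mtime_version(path)
        return before, after


before, after = demo()
print(before, after, before != after)
```

The content is identical in both reads, yet the derived version differs, so any cached chapter tagged with the earlier value would look stale.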