LNCrawl 4.X Roadmap and Scope #2867
Replies: 3 comments 1 reply
-
**Migrating Sources to External Repositories - A Suggestion**

Loading dynamic modules isn't exactly something Python excels at; something like TypeScript source bundles may play a little nicer. In TypeScript you'd simply be able to transpile each individual source and get a single output which contains everything (including upper-level inherited structures) so that the root application can handle it all. Fortunately, Python can still handle this; you'd just need to be a little more careful with dynamic module loading. It would likely make sense to build a formalized toolchain for handling sources. My mind roughly goes to a system like the one below. In this example, assume a user requested to scrape something which belongs on fanfiction.net:

```mermaid
graph TD;
subgraph User Request;
URL[User provides URL];
end;
subgraph Core Application;
AbstractCrawler[Abstract Crawler Class];
SourcesService[Sources Service];
SecurityLayer[Security Boundary];
DynamicLoader[Dynamic Loader];
end;
subgraph GitHub Pages Remote Source Repository;
MasterIndex[Master Index];
subgraph fanfiction.net;
Meta1[metadata.json];
Crawler1[crawler.py];
end;
subgraph royalroad;
Meta2[metadata.json];
Crawler2[crawler.py];
end;
subgraph templates;
TMeta[template metadata.json];
TCode[template.py];
end;
end;
subgraph Runtime;
Instance[Crawler Instance];
Data[Novel Data];
end;
URL-->SourcesService;
SourcesService-->|1 Fetch index|MasterIndex;
MasterIndex-->|2 Match URL|Meta1;
Meta1-->|3 Get metadata|SourcesService;
SourcesService-->|4 Download|Crawler1;
Crawler1-->SecurityLayer;
SecurityLayer-->|5 Validate|DynamicLoader;
DynamicLoader-->|6 Instantiate|Instance;
AbstractCrawler-.->|implements|Instance;
Instance-->|7 Execute|Data;
```

I somewhat suspect this was not the direction you had in mind, though. What were you thinking for your source split? How would the root application handle retrieving the scraping payloads?
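To make the diagram a bit more concrete, here is a minimal sketch of what the SourcesService → SecurityLayer → DynamicLoader path could look like in Python. Everything specific here is an assumption for illustration only: the index URL, the metadata fields (`base_url`, `crawler_url`, `sha256`, `id`), and the convention that each source module exposes a `Crawler` class implementing the abstract crawler's API.

```python
# Rough sketch of the flow in the diagram above. All URLs, field names, and
# class conventions are hypothetical; the point is the dynamic-loading shape.
import hashlib
import importlib.util
import json
from urllib.request import urlopen

SOURCE_INDEX_URL = "https://example.github.io/sources/index.json"  # hypothetical

def fetch_json(url: str) -> dict:
    with urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))

def match_source(index: dict, novel_url: str) -> dict:
    # Steps 1-2: fetch the master index and find the entry whose base URL matches
    for entry in index["sources"]:
        if novel_url.startswith(entry["base_url"]):
            return entry
    raise LookupError(f"No source registered for {novel_url}")

def download_and_load(entry: dict):
    # Steps 3-4: read metadata, then download the crawler script itself
    code = urlopen(entry["crawler_url"]).read()
    # Step 5: security boundary - at minimum verify the checksum published in the index
    if hashlib.sha256(code).hexdigest() != entry["sha256"]:
        raise ValueError("Checksum mismatch - refusing to load crawler")
    # Step 6: load the downloaded code as an in-memory module
    spec = importlib.util.spec_from_loader(entry["id"], loader=None)
    module = importlib.util.module_from_spec(spec)
    exec(compile(code, entry["id"], "exec"), module.__dict__)
    return module.Crawler  # assumed convention: each source exposes a `Crawler` class

def scrape(novel_url: str):
    index = fetch_json(SOURCE_INDEX_URL)
    crawler_cls = download_and_load(match_source(index, novel_url))
    crawler = crawler_cls()                       # Step 7: instantiate and execute
    return crawler.read_novel_info(novel_url)     # hypothetical abstract-crawler API
```

A checksum check alone is obviously not a full security boundary; signing the index or pinning the repository would be closer to what step 5 implies.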
-
This seems like something the author should be posting; I'm not sure you should be airing private conversations out publicly. I'll put in my 2 cents and leave it at that.

If this software keeps moving in the current direction, it will die. As command-line software, it works great. But when I used the new web interface on the dev branch I was disappointed. It's evolved into some kind of central request software and even supports reading online. Sigh, I can already read online by going to the website; I want an EPUB so I can read offline and store it in my collection. The whole point is that when these sites get taken down, I still have a copy of the books. So a centralized system makes no sense; just look at the last Discord channel. I just want something I can run daily to pull new chapters of my novels. I have other tools for organizing ebooks and converting them to TTS audiobooks, and ebook readers that already do a great job. So the problem I have is downloading web novels into consumable formats for my other tools. That is the purpose of this tool, and its focus should not be lost.

That said, this tool is really nothing more than a job runner for executing brittle scripts that download novels. So the bulk of the effort must be put into executing those scripts reliably and updating them as quickly as possible when they break. But last I looked there are 400 scripts. That's just too much to maintain, especially if those scripts are not used by the author. So the best way to get a fast turnaround is to lower the script-authoring impediments. If you tell me I need to download some source code, install Python, get some special editor, and submit a change request to GitHub, I'm sorry, but I'm not doing that. If I can, I'll just fix it locally and run it myself; if not, I'll go to another tool. In fact, once this tool breaks, I just move that book over to the Automa Chrome extension and download chapters using that. As you can see, I'm technical enough to write a scraper; I just don't use Python and have no interest in getting an environment set up to run it. I won't even install Python on my PC to run this app; I just put it all in a Docker container and run it there.

So if you give me a way to fix broken scripts inside the application and a simple button to submit them remotely, then I would give it my best effort to fix any broken scripts I personally use. And maybe, given enough time, I might even write a new script for a different site I use. But I'm not a Python developer, and if you tell me I need to download all these tools, set up an environment, and do all the steps, I'm just not interested; I'll find another tool. I'll say this too: any tool you give me to fix broken scripts and test the result had better not affect the EPUBs I've already downloaded. It would also be nice to use my own hosted git repo, or even the local file system, while I'm testing, and to switch over to my own copy while waiting for the commit/PR approval. That would also be handy if I modify scripts in ways that don't make sense to make public, like replacing words or stuff like that.

As a side note, there is another piece of software that does something almost identical to this tool: Suwayomi, for manga downloads. It may make more sense to understand how that works when thinking about the future of this tool. Ultimately it has a server engine for downloading manga that exposes a GraphQL API; the web interface and any other clients all use that GraphQL interface.

You should remember there are many other tools out there, and if you try to do everything, like adding in reading, tracking, and Discord hosting, you'll just end up wasting time. A few people will love it, but most people just want a tool that works. And the fewer people you have, the fewer potential script writers you have, and the longer the turnaround on broken scripts.
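For the "use my own hosted git repo or local file system while testing" idea above, here is one purely hypothetical way it could be wired up: an environment variable or settings entry that overrides where the app looks for its source index. None of this exists in lncrawl today; the variable, file, and key names are made up for illustration.

```python
# Illustration only (not an existing lncrawl feature): resolve a user-supplied
# sources location first, and fall back to the official repository otherwise.
import json
import os
from pathlib import Path

OFFICIAL_INDEX = "https://example.github.io/lncrawl-sources/index.json"  # hypothetical

def resolve_sources_location() -> str:
    """Prefer a user-supplied repo (local directory or self-hosted URL) over the default."""
    override = os.environ.get("LNCRAWL_SOURCES")          # e.g. /home/me/my-sources
    if override:
        return override
    config = Path.home() / ".config" / "lncrawl" / "settings.json"
    if config.exists():
        custom = json.loads(config.read_text()).get("sources_repo")
        if custom:
            return custom                                  # e.g. https://git.example.com/me/sources
    return OFFICIAL_INDEX

# While testing a fixed script you would point LNCRAWL_SOURCES at a local checkout,
# then switch back to the official index once the change is merged upstream.
```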
-
The purpose of the web app is to give less tech-savvy users a way to use this app. Editing and adding sources via a good interface is also meant to let people quickly fix or add crawlers even without much programming knowledge. The CLI will always be there; the web app will just be an extension of it. Eventually this project will be divided into several subprojects to reduce bloat. I am on vacation at the moment; once I am back, I can provide more details. Also, feel free to discuss your suggestions, it will be very helpful in making the final design.
-
Headnote: The only LLM usage in writing this discussion post was to help create the mermaid diagram in my second post.
I've reached out to @dipu-bd over email with some introductions. There were some discussions regarding where LNCrawl 4.X was going to go, and I wanted to raise the conversation here as a GitHub Discussion in the spirit of open-source development!

Here is the context from those private conversations with dipu on his plan for handling sources:
Before any suggestions or work gets done towards this goal, I wanted to clarify a few of these points and ensure the scope for version 4.X is properly encapsulated. It sounds like there are a lot of changes going on all at once, teeing up to be a rather significant shift in how this application functions.

First, I'd like to make sure that there's a clear understanding of the goal you're aiming for here. I have a small list of questions I'd like to discuss, including how we keep the passing/failing tags up to date!

Quoting another point from our emails:
I think this item right here is a very good key update that could be provided. If this is done well, a single version of LNCrawl would be able to dynamically update itself against a remote repository (which is vetted and acts as the set of active sources).
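One possible, purely illustrative shape for that self-update step: compare a locally cached version marker for each source against the remote index and re-download only what changed. The index URL, cache layout, and `version` field below are assumptions on my part, not the actual design.

```python
# Sketch of "update itself against a remote repository": diff local cache
# markers against the remote index and refresh anything that changed.
import json
from pathlib import Path
from urllib.request import urlopen

INDEX_URL = "https://example.github.io/lncrawl-sources/index.json"  # hypothetical
CACHE_DIR = Path.home() / ".cache" / "lncrawl" / "sources"

def sync_sources() -> list[str]:
    """Download new or updated crawler scripts; return the ids that changed."""
    remote = json.loads(urlopen(INDEX_URL).read().decode("utf-8"))
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    updated = []
    for entry in remote["sources"]:
        marker = CACHE_DIR / f"{entry['id']}.version"
        local_version = marker.read_text().strip() if marker.exists() else None
        if local_version != entry["version"]:
            code = urlopen(entry["crawler_url"]).read()
            (CACHE_DIR / f"{entry['id']}.py").write_bytes(code)
            marker.write_text(entry["version"])
            updated.append(entry["id"])
    return updated
```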
There are probably many different ways to approach this; I am curious to hear what you had in mind, @dipu-bd.