LNCrawl 4.X Roadmap and Scope #2867
Replies: 3 comments 1 reply
-
**Migrating Sources to External Repositories - A Suggestion**

Loading dynamic modules isn't exactly something Python excels at; something like TypeScript source bundles may play a little nicer. In TypeScript you'd simply be able to transpile each individual source and get a single output which contains everything (including upper-level inherited structures) so that the root application can handle it all. Fortunately, Python can still handle this; you'd just need to be a little more careful with dynamic module loading. It would likely make sense to build a formalized toolchain for handling sources. My mind roughly goes to a system like the one below. In this example, assume a user requested to scrape something which belongs on fanfiction.net:

```mermaid
graph TD;
subgraph User Request;
URL[User provides URL];
end;
subgraph Core Application;
AbstractCrawler[Abstract Crawler Class];
SourcesService[Sources Service];
SecurityLayer[Security Boundary];
DynamicLoader[Dynamic Loader];
end;
subgraph GitHub Pages Remote Source Repository;
MasterIndex[Master Index];
subgraph fanfiction.net;
Meta1[metadata.json];
Crawler1[crawler.py];
end;
subgraph royalroad;
Meta2[metadata.json];
Crawler2[crawler.py];
end;
subgraph templates;
TMeta[template metadata.json];
TCode[template.py];
end;
end;
subgraph Runtime;
Instance[Crawler Instance];
Data[Novel Data];
end;
URL-->SourcesService;
SourcesService-->|1 Fetch index|MasterIndex;
MasterIndex-->|2 Match URL|Meta1;
Meta1-->|3 Get metadata|SourcesService;
SourcesService-->|4 Download|Crawler1;
Crawler1-->SecurityLayer;
SecurityLayer-->|5 Validate|DynamicLoader;
DynamicLoader-->|6 Instantiate|Instance;
AbstractCrawler-.->|implements|Instance;
Instance-->|7 Execute|Data;
```

I somewhat suspect this was not the direction you had in mind, though. What were you thinking for your source split? How would the root application handle retrieving the scraping payloads?
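To make the diagram a bit more concrete, here is a minimal sketch of what the SourcesService → SecurityLayer → DynamicLoader path could look like in Python. Everything specific here is an assumption for illustration only: the index URL, the metadata fields (`base_url`, `crawler_url`, `sha256`, `id`), and the convention that each source module exposes a `Crawler` class implementing the abstract crawler's API.

```python
# Rough sketch of the flow in the diagram above. All URLs, field names, and
# class conventions are hypothetical; the point is the dynamic-loading shape.
import hashlib
import importlib.util
import json
from urllib.request import urlopen

SOURCE_INDEX_URL = "https://example.github.io/sources/index.json"  # hypothetical

def fetch_json(url: str) -> dict:
    with urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))

def match_source(index: dict, novel_url: str) -> dict:
    # Steps 1-2: fetch the master index and find the entry whose base URL matches
    for entry in index["sources"]:
        if novel_url.startswith(entry["base_url"]):
            return entry
    raise LookupError(f"No source registered for {novel_url}")

def download_and_load(entry: dict):
    # Steps 3-4: read metadata, then download the crawler script itself
    code = urlopen(entry["crawler_url"]).read()
    # Step 5: security boundary - at minimum verify the checksum published in the index
    if hashlib.sha256(code).hexdigest() != entry["sha256"]:
        raise ValueError("Checksum mismatch - refusing to load crawler")
    # Step 6: load the downloaded code as an in-memory module
    spec = importlib.util.spec_from_loader(entry["id"], loader=None)
    module = importlib.util.module_from_spec(spec)
    exec(compile(code, entry["id"], "exec"), module.__dict__)
    return module.Crawler  # assumed convention: each source exposes a `Crawler` class

def scrape(novel_url: str):
    index = fetch_json(SOURCE_INDEX_URL)
    crawler_cls = download_and_load(match_source(index, novel_url))
    crawler = crawler_cls()                       # Step 7: instantiate and execute
    return crawler.read_novel_info(novel_url)     # hypothetical abstract-crawler API
```

A checksum check alone is obviously not a full security boundary; signing the index or pinning the repository would be closer to what step 5 implies.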
-
This seems like something the author should be posting; I'm not sure you should be airing private conversations out publicly. I'll put in my 2 cents and leave it at that.

If this software keeps moving in the current direction, it will die. As command-line software, it works great. But when I used the new web interface on the dev branch I was disappointed. It's evolved into some kind of central request software and even supports reading online. Sigh, I can already read online by going to the website; I want an EPUB so I can read offline and store it in my collection. The whole point is that when these sites get taken down, I still have a copy of the books. So a centralized system makes no sense; just look at the last Discord channel. I just want something I can run daily to pull new chapters of my novels. I have other tools for organizing ebooks and converting them to TTS audiobooks, and ebook readers that already do a great job. So the problem I have is downloading web novels into consumable formats for my other tools. That is the purpose of this tool, and its focus should not be lost.

That said, this tool is really nothing more than a job runner for executing brittle scripts that download novels. So the bulk of the effort must be put into executing those scripts reliably and updating them as quickly as possible when they break. But last I looked there are 400 scripts. That's just too much to maintain, especially if those scripts are not used by the author. So the best way to get a fast turnaround is to lower the script-authoring impediments. If you tell me I need to download some source code, install Python, get some special editor, and submit a change request to GitHub, I'm sorry, but I'm not doing that. If I can, I'll just fix it locally and run it myself; if not, I'll go to another tool. In fact, once this tool breaks, I just move that book over to the Automa Chrome extension and download chapters using that. As you can see, I'm technical enough to write a scraper; I just don't use Python and have no interest in getting an environment set up to run it. I won't even install Python on my PC to run this app; I just put it all in a Docker container and run it there.

So if you give me a way to fix broken scripts inside the application and a simple button to submit them remotely, then I would give it my best effort to fix any broken scripts I personally use. And maybe, given enough time, I might even write a new script for a different site I use. But I'm not a Python developer, and if you tell me I need to download all these tools, set up an environment, and do all the steps, I'm just not interested; I'll find another tool. I'll say this too: any tool you give me to fix broken scripts and test the result had better not affect the EPUBs I've already downloaded. It would also be nice to use my own hosted git repo, or even the local file system, while I'm testing, and to switch over to my own copy while waiting for the commit/PR approval. That would also be handy if I modify scripts in ways that don't make sense to make public, like replacing words or stuff like that.

As a side note, there is another piece of software that does something almost identical to this tool: Suwayomi, for manga downloads. It may make more sense to understand how that works when thinking about the future of this tool. Ultimately it has a server engine for downloading manga that exposes a GraphQL API; the web interface and any other clients all use that GraphQL interface.

You should remember there are many other tools out there, and if you try to do everything, like adding in reading, tracking, and Discord hosting, you'll just end up wasting time. A few people will love it, but most people just want a tool that works. And the fewer people you have, the fewer potential script writers you have, and the longer the turnaround on broken scripts.
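For the "use my own hosted git repo or local file system while testing" idea above, here is one purely hypothetical way it could be wired up: an environment variable or settings entry that overrides where the app looks for its source index. None of this exists in lncrawl today; the variable, file, and key names are made up for illustration.

```python
# Illustration only (not an existing lncrawl feature): resolve a user-supplied
# sources location first, and fall back to the official repository otherwise.
import json
import os
from pathlib import Path

OFFICIAL_INDEX = "https://example.github.io/lncrawl-sources/index.json"  # hypothetical

def resolve_sources_location() -> str:
    """Prefer a user-supplied repo (local directory or self-hosted URL) over the default."""
    override = os.environ.get("LNCRAWL_SOURCES")          # e.g. /home/me/my-sources
    if override:
        return override
    config = Path.home() / ".config" / "lncrawl" / "settings.json"
    if config.exists():
        custom = json.loads(config.read_text()).get("sources_repo")
        if custom:
            return custom                                  # e.g. https://git.example.com/me/sources
    return OFFICIAL_INDEX

# While testing a fixed script you would point LNCRAWL_SOURCES at a local checkout,
# then switch back to the official index once the change is merged upstream.
```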
-
The purpose of the web app is to give less tech-savvy users a way to use this app. Editing and adding sources via a good interface is also meant to let people quickly fix or add crawlers even without much programming knowledge. The CLI will always be there; the web app will just be an extension of it. Eventually this project will be divided into several subprojects to reduce bloat. I am on vacation at the moment; once I am back, I can provide more details. Also, feel free to discuss your suggestions, it will be very helpful in making the final design.
-
Headnote: The only LLM usage in writing this discussion post was to help create the mermaid diagram in my second post.
I've reached out to @dipu-bd over email with some introductions. There were some discussions regarding where LNCrawl 4.X was going to go, and I wanted to raise the conversation here as a GitHub Discussion in the spirit of open-source development!

Here is the context from those private conversations with dipu on his plan for handling sources:
Before any suggestions or work gets done towards this goal, I wanted to clarify a few of these points and ensure the scope for version 4.X is properly encapsulated. It sounds like there are a lot of changes going on all at once, teeing up to be a rather significant shift in how this application functions.

First, I'd like to make sure that there's a clear understanding of the goal you're aiming for here. I have a small list of questions I'd like to discuss, including how we keep the passing/failing tags up to date!

Quoting another point from our emails:
I think this item right here is a very good key update that could be provided. If this is done well, a single version of LNCrawl would be able to dynamically update itself against a remote repository (which is vetted and acts as the set of active sources).
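One possible, purely illustrative shape for that self-update step: compare a locally cached version marker for each source against the remote index and re-download only what changed. The index URL, cache layout, and `version` field below are assumptions on my part, not the actual design.

```python
# Sketch of "update itself against a remote repository": diff local cache
# markers against the remote index and refresh anything that changed.
import json
from pathlib import Path
from urllib.request import urlopen

INDEX_URL = "https://example.github.io/lncrawl-sources/index.json"  # hypothetical
CACHE_DIR = Path.home() / ".cache" / "lncrawl" / "sources"

def sync_sources() -> list[str]:
    """Download new or updated crawler scripts; return the ids that changed."""
    remote = json.loads(urlopen(INDEX_URL).read().decode("utf-8"))
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    updated = []
    for entry in remote["sources"]:
        marker = CACHE_DIR / f"{entry['id']}.version"
        local_version = marker.read_text().strip() if marker.exists() else None
        if local_version != entry["version"]:
            code = urlopen(entry["crawler_url"]).read()
            (CACHE_DIR / f"{entry['id']}.py").write_bytes(code)
            marker.write_text(entry["version"])
            updated.append(entry["id"])
    return updated
```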
There are probably many different ways to approach this; I am curious to hear what you had in mind, @dipu-bd.