feat: Add crawl4ai as an option for local scrape #11553
Draft
relic664 wants to merge 6 commits into danny-avila:main from
…updated agents, only merge once done
Pull Request Template
Summary
This PR adds crawl4ai as an option for local crawling in the web search pipeline, supplementing the existing firecrawl option. It depends on the addition of crawl4ai to agents in danny-avila/agents#51.
This PR builds on @lukolszewski's original work and adds fit as the default option for crawl4ai's /md endpoint, which uses adaptive filtering to filter the markdown; alternatively, the endpoint returns just the raw markdown (raw for fitStrategy).
I've tested this locally and it works fine. I have a Docker image for testing if anybody is interested (ghcr.io/relic664/librechat:latest). Scraper configuration should work via either the UI or the config file, although I haven't translated the UI strings into languages other than English.
It's worth noting that this is a very basic implementation meant to provide a simple self-hosted scrape option. It doesn't offer any scrape options beyond fit (filtered markdown) or raw (raw markdown). Given that there's currently only one self-hosted option for scrape, I thought it prudent to open an MVP PR for crawl4ai before a full-featured implementation with all the configuration knobs.
This feature was already discussed and notionally approved in #10474.
The documentation PR is LibreChat-AI/librechat.ai#494.
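To make the fit/raw choice described above concrete, here is a minimal sketch of how a request to crawl4ai's /md endpoint might be assembled. It is not the code from this PR. The names buildMdRequest, FitStrategy, and MdRequest are hypothetical, and the request body shape (the page address in `url`, the filter mode in `f`) is an assumption about the crawl4ai server API that should be checked against the crawl4ai documentation.

```typescript
// Hypothetical helper: not LibreChat's or crawl4ai's actual code.
type FitStrategy = "fit" | "raw";

interface MdRequest {
  endpoint: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

function buildMdRequest(
  baseUrl: string,
  pageUrl: string,
  fitStrategy: FitStrategy = "fit", // "fit" is the default, as described in the summary
): MdRequest {
  // Strip trailing slashes so the base URL and "/md" join cleanly.
  const endpoint = `${baseUrl.replace(/\/+$/, "")}/md`;
  return {
    endpoint,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // Assumption: /md accepts the page URL as `url` and the filter mode as `f`.
      body: JSON.stringify({ url: pageUrl, f: fitStrategy }),
    },
  };
}

// Example: the base URL is a placeholder for the value of CRAWL4AI_API_URL.
const example = buildMdRequest("http://localhost:11235", "https://example.com", "raw");
console.log(example.endpoint, example.init.body);
```

The returned pair could then be passed to fetch(example.endpoint, example.init). Keeping the request construction separate from the network call makes the strategy handling easy to check without a running crawl4ai instance.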
Change Type
Testing
I have a Docker image, based on v0.8.2-rc3, that can be used for testing (docker pull ghcr.io/relic664/librechat:sha-565506a).
Test Configuration:
Pull the image or build from my forks. If building, note that the modified @librechat/agents package from my agents fork is needed. There is a complementary PR to merge those changes here.
A simple quick start is to set the env var CRAWL4AI_API_URL to your instance, and in your librechat.config:
Checklist