feat: Add crawl4ai as an option for local scrape#11553

Draft
relic664 wants to merge 6 commits into danny-avila:main from
relic664:feature/crawl4ai-with-fit


relic664 commented Jan 28, 2026

Pull Request Template

Summary

This PR adds crawl4ai as an option for local crawling in the web search pipeline, supplementing the existing firecrawl option. It depends on the addition of crawl4ai to agents in danny-avila/agents#51.

This PR builds on @lukolszewski's original work and makes fit the default option for crawl4ai's /md endpoint. With fit, crawl4ai uses adaptive filtering to filter the markdown; setting fitStrategy to raw returns just the raw markdown instead.
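To make the fit/raw distinction concrete, here is a minimal sketch of how a request to the /md endpoint could be assembled. It is not the code from this PR: `buildMdRequest`, `MdRequest`, and the `FitStrategy` type are hypothetical names, and the `url` and `f` body fields are my assumption about the crawl4ai request shape, so check them against your crawl4ai version.

```typescript
// Hypothetical helper, not taken from this PR. The `url` and `f` body fields
// are assumptions about the crawl4ai /md request shape.
type FitStrategy = "fit" | "raw";

interface MdRequest {
  endpoint: string;
  body: { url: string; f: FitStrategy };
}

// Build the request for a single page scrape, defaulting to "fit".
function buildMdRequest(
  apiUrl: string,
  pageUrl: string,
  fitStrategy: FitStrategy = "fit",
): MdRequest {
  // Trim trailing slashes so the joined endpoint has exactly one separator.
  const base = apiUrl.replace(/\/+$/, "");
  return { endpoint: `${base}/md`, body: { url: pageUrl, f: fitStrategy } };
}
```

The returned object could then be sent as a JSON POST with any HTTP client; keeping the builder free of network calls makes the default-to-fit behaviour easy to unit test.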

I've tested this locally and it works. I have a Docker image for testing if anybody is interested (ghcr.io/relic664/librechat:latest). The scraper can be configured either via the UI or via the config file, although I have only added the UI strings in English and haven't translated them into other languages.

It's worth noting that this is a very basic implementation intended to provide a simple self-hosted scrape option. It doesn't expose any scrape options beyond fit (filtered markdown) or raw (raw markdown). Given that there's currently only one self-hosted option for scrape, I thought it was prudent to go ahead with an MVP PR for crawl4ai before a full-featured implementation with all the configuration knobs.

This feature was already discussed and notionally approved in #10474.

The documentation PR is LibreChat-AI/librechat.ai#494.

Change Type

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Testing

I have a docker image, based on v0.8.2-rc3, that can be used for testing (docker pull ghcr.io/relic664/librechat:sha-565506a).

Test Configuration:

Pull the image or build from my forks. If building, note that the modified @librechat/agents package from my agents fork is needed. There is a complementary PR to merge those changes here.

For a quick start, set the environment variable CRAWL4AI_API_URL to the URL of your crawl4ai instance and add the following to your librechat.yaml:

webSearch:
  crawl4aiApiUrl: "${CRAWL4AI_API_URL}"
  scraperProvider: "crawl4ai"
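The `${CRAWL4AI_API_URL}` placeholder in the config above is meant to be filled in from the environment. As a rough illustration of that kind of substitution (a sketch only, not LibreChat's actual config loader; `resolveEnvPlaceholders` is a hypothetical name), something like the following would resolve it and fail loudly when the variable is missing:

```typescript
// Hypothetical helper illustrating ${NAME} substitution; not LibreChat's loader.
function resolveEnvPlaceholders(
  value: string,
  env: Record<string, string | undefined>,
): string {
  // Match ${NAME}, where NAME looks like a conventional env var identifier.
  return value.replace(
    /\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g,
    (_match: string, name: string): string => {
      const resolved = env[name];
      // Treat unset and empty variables the same: both are misconfiguration.
      if (resolved === undefined || resolved === "") {
        throw new Error(`Environment variable ${name} is not set`);
      }
      return resolved;
    },
  );
}
```

In a Node.js process the second argument would typically be `process.env`. Throwing on a missing variable is a design choice for this sketch; it surfaces a forgotten CRAWL4AI_API_URL at startup rather than at the first scrape.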

Checklist

  • My code adheres to this project's style guidelines
  • I have performed a self-review of my own code
  • I have commented in any complex areas of my code
  • I have made pertinent documentation changes
  • My changes do not introduce new warnings
  • Local unit tests pass with my changes
  • Any changes dependent on mine have been merged and published in downstream modules.
  • A pull request for updating the documentation has been submitted.
