[Bug] PDF Content Incorrectly Dumped into HTML/Markdown Fields During Web Crawl #28

Chippers255 · 2025-01-13T00:52:37Z

Describe the Bug
When crawling websites containing PDF files, the crawler includes the raw PDF contents in both the HTML and markdown output fields. This causes several problems:

- Significantly increased response time when retrieving results
- Unnecessary storage space taken up in the results database
- Results that are potentially harder to parse and use effectively
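
One possible guard, as a rough sketch only: I have not checked how the crawler handles responses internally, so `is_pdf_response`, `build_record`, and the record keys (`url`, `html`, `markdown`, `skipped`) are hypothetical names. The idea is to detect a PDF from the `Content-Type` header or from the `%PDF-` file header and leave the text fields empty instead of dumping the raw bytes into them.

```python
from typing import Optional

PDF_MAGIC = b"%PDF-"  # PDF files start with this header, e.g. b"%PDF-1.7"


def is_pdf_response(content_type: Optional[str], body: bytes) -> bool:
    """Return True if the response looks like a PDF rather than an HTML page."""
    # Drop parameters such as "; charset=binary" before comparing the media type.
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    if media_type == "application/pdf":
        return True
    # Fall back to sniffing the body in case the server mislabels the file.
    return body[:1024].lstrip().startswith(PDF_MAGIC)


def build_record(url: str, content_type: Optional[str], body: bytes) -> dict:
    """Build a crawl result, keeping raw PDF bytes out of the text fields."""
    if is_pdf_response(content_type, body):
        # Hypothetical record shape: mark the page as skipped instead of storing it.
        return {"url": url, "html": "", "markdown": "", "skipped": "pdf"}
    html = body.decode("utf-8", errors="replace")
    # Markdown conversion is omitted in this sketch.
    return {"url": url, "html": html, "markdown": ""}
```

Whether PDFs should be skipped entirely or parsed into a separate field is a design decision for the maintainers; the sketch only shows the detection step.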

To Reproduce

1. Run the crawl command on a target website containing PDF files (e.g., https://becu.org/)
2. Observe the returned results in both the HTML and markdown fields (e.g., http://{HOST URL}/v1/crawl/{CRAWL ID})
3. Notice that PDF contents are dumped as raw text into these fields
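
To make the last step easier to check, here is a small sketch that scans crawl results for affected pages. It assumes the results have already been loaded into a list of dicts with `url`, `html`, and `markdown` keys; that shape and the helper names are my assumptions, not the documented response format. It also assumes the dumped text still contains the `%PDF-` header near its start.

```python
from typing import Iterable, List, Mapping, Optional

PDF_MAGIC = "%PDF-"  # PDF files start with this header, e.g. "%PDF-1.7"


def looks_like_raw_pdf(text: Optional[str]) -> bool:
    """Return True if a text field appears to hold raw PDF contents."""
    if not text:
        return False
    # Only inspect the start of the field; the header sits at the top of a PDF.
    return PDF_MAGIC in text[:1024]


def find_pdf_dumps(pages: Iterable[Mapping[str, Optional[str]]]) -> List[str]:
    """Return the URLs of pages whose html or markdown field looks like a PDF."""
    affected = []
    for page in pages:
        if any(looks_like_raw_pdf(page.get(field)) for field in ("html", "markdown")):
            affected.append(page.get("url") or "<unknown>")
    return affected
```

Calling `find_pdf_dumps(pages)` on the parsed results should list the URLs whose fields contain raw PDF data.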



Chippers255 added the "bug" label on Jan 13, 2025
