Describe the Bug
When crawling websites that contain PDF files, the crawler includes the raw PDF contents in both the HTML and markdown output fields. This hurts performance and usability by:
Significantly increasing response time when retrieving results
Taking up unnecessary storage space in the results database
Potentially making the results harder to parse and use effectively
To Reproduce
Run the crawl command against a website that contains PDF files (e.g., https://becu.org/)
Inspect the HTML and markdown fields of the returned results (e.g., http://{HOST URL}/v1/crawl/{CRAWL ID})
Notice that PDF contents are being dumped as raw text into these fields
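To make the last step easier to confirm, here is a minimal sketch that flags result entries whose fields look like raw PDF dumps. It assumes the crawl results have already been fetched and parsed into a list of dictionaries with `html` and `markdown` keys; that shape, and the helper names `looks_like_raw_pdf` and `find_pdf_dumps`, are illustrative assumptions and not part of the crawler's API. It relies on the fact that PDF files begin with a `%PDF-` header, so it will not catch cases where only the extracted text of a PDF ends up in the fields.

```python
from typing import Any, Dict, List, Optional

# PDF files start with this header (e.g. "%PDF-1.7").
PDF_MAGIC = "%PDF-"


def looks_like_raw_pdf(text: Optional[str]) -> bool:
    """Heuristic: True if the PDF header appears near the start of the text."""
    if not text:
        return False
    # Only scan the first 1024 characters; a dumped PDF keeps its header near the top.
    return PDF_MAGIC in text[:1024]


def find_pdf_dumps(pages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return one entry per page whose html/markdown field looks like a raw PDF.

    `pages` is a hypothetical, already-parsed list of crawl results; the
    "html" and "markdown" keys are assumed from the field names in this report.
    """
    flagged = []
    for index, page in enumerate(pages):
        fields = [
            name for name in ("html", "markdown")
            if looks_like_raw_pdf(page.get(name))
        ]
        if fields:
            flagged.append({
                "index": index,
                "fields": fields,
                # Total characters taken up by the suspicious fields.
                "size": sum(len(page[name]) for name in fields),
            })
    return flagged
```

Running `find_pdf_dumps` over the parsed results gives a quick count of affected pages and how much of the payload they account for, which helps quantify the response time and storage impact described above.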
Screenshots