Feature: remove textract and use claude api to extract files #6

atraining · 2025-09-03T07:40:11Z

Completed Changes

Refactored File Parser (server/utils/fileParser.ts)

Replaced @nosferatu500/textract with Claude API for text extraction
Added support for more file types including:
- Documents: PDFs, Word docs, Excel sheets, PowerPoint presentations
- Images: PNG, JPEG, GIF, WebP, BMP, TIFF with OCR capabilities
- Plain text: TXT, Markdown, JSON, code files, etc.
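The grouping above could be expressed as a small extension lookup. This is a minimal sketch, not the code in server/utils/fileParser.ts: the `FileCategory` type, the `categorizeFile` helper, and the exact extension sets are assumptions made for illustration.

```typescript
// Hypothetical categories mirroring the three groups listed in this PR.
type FileCategory = "document" | "image" | "text" | "unsupported";

// Assumed extension sets; the PR does not list the exact extensions it accepts.
const DOCUMENT_EXTENSIONS = new Set(["pdf", "doc", "docx", "xls", "xlsx", "ppt", "pptx"]);
const IMAGE_EXTENSIONS = new Set(["png", "jpg", "jpeg", "gif", "webp", "bmp", "tif", "tiff"]);
const TEXT_EXTENSIONS = new Set(["txt", "md", "json", "ts", "js", "py", "csv"]);

// Classify a file by its extension (case-insensitive).
function categorizeFile(filename: string): FileCategory {
  const dot = filename.lastIndexOf(".");
  // No extension, or a trailing dot with nothing after it.
  if (dot < 0 || dot === filename.length - 1) return "unsupported";
  const ext = filename.slice(dot + 1).toLowerCase();
  if (DOCUMENT_EXTENSIONS.has(ext)) return "document";
  if (IMAGE_EXTENSIONS.has(ext)) return "image";
  if (TEXT_EXTENSIONS.has(ext)) return "text";
  return "unsupported";
}
```

Plain-text files can presumably be decoded locally without an API call, so only the "document" and "image" categories would need to go through Claude.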

Enhanced Features

Better OCR: Claude's vision model provides superior text recognition from images
Document Processing: Native PDF and Office document text extraction
Error Handling: Comprehensive error messages for different failure scenarios
Performance Optimization: API instance caching and optimized prompts
File Size Validation: 32MB file size limit with clear error messages
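A size check like the one described could look roughly as follows. This is a sketch under assumptions: the helper name `assertWithinSizeLimit` is hypothetical, and the limit is taken to mean 32 * 1024 * 1024 bytes, which the PR does not state explicitly.

```typescript
// Assumption: "32MB" means 32 * 1024 * 1024 bytes.
const MAX_FILE_SIZE_BYTES = 32 * 1024 * 1024;

// Throws a descriptive error when the file exceeds the limit; otherwise does nothing.
function assertWithinSizeLimit(filename: string, sizeBytes: number): void {
  if (sizeBytes > MAX_FILE_SIZE_BYTES) {
    const sizeMb = (sizeBytes / (1024 * 1024)).toFixed(1);
    throw new Error(`"${filename}" is ${sizeMb} MB, which exceeds the 32 MB limit.`);
  }
}
```

Running the check before the file is encoded and sent avoids spending a request on an upload that would be rejected anyway.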

Code Optimizations

Caching of Claude API instances to improve performance
Better error handling with specific messages for different HTTP status codes
TypeScript improvements to eliminate compiler warnings
Optimized prompts for better text extraction quality
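The instance caching mentioned above might be implemented by memoizing clients per API key. The sketch below is generic so that it runs without the Anthropic SDK installed; `createClientCache`, `FakeClient`, and `getClient` are hypothetical names, and the PR's actual cache may be keyed or scoped differently.

```typescript
// Returns a getter that builds one client per API key and reuses it afterwards.
function createClientCache<T>(factory: (apiKey: string) => T): (apiKey: string) => T {
  const cache = new Map<string, T>();
  return (apiKey: string): T => {
    const existing = cache.get(apiKey);
    if (existing !== undefined) return existing;
    const created = factory(apiKey);
    cache.set(apiKey, created);
    return created;
  };
}

// Stand-in for a real SDK client, used only to demonstrate the cache.
interface FakeClient {
  apiKey: string;
}

let constructed = 0; // counts how many times the factory actually ran
const getClient = createClientCache<FakeClient>((apiKey) => {
  constructed += 1;
  return { apiKey };
});
```

With the real SDK, the factory would presumably be something like `(apiKey) => new Anthropic({ apiKey })`. Keying the cache by API key matters here because the upload endpoint passes a key into the parser, so different keys must not share a client.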

Dependencies & Documentation

Removed @nosferatu500/textract dependency from package.json
Updated README.md to reflect the new Claude API-powered text extraction
Removed system dependency requirements (no more poppler-utils needed)

Integration Updates

Updated file upload endpoint to pass the Anthropic API key to the parser
Maintained compatibility with existing database schema and token counting

Key Benefits

No System Dependencies: Everything is handled through the Claude API - no need for poppler-utils or other system packages
Better Accuracy: Claude's advanced vision and document processing capabilities provide superior text extraction
More File Types: Support for additional formats that weren't supported by textract
Better Error Handling: More informative error messages help users understand what went wrong
Performance: Caching and optimizations reduce API calls and improve response times
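As an illustration of status-specific error messages, a mapping along these lines is one option. The `describeApiError` helper and its wording are assumptions; the status codes are the ones the Anthropic API documents, but which of them the PR actually handles is not shown here.

```typescript
// Translate an HTTP status from the API into a user-facing message.
function describeApiError(status: number): string {
  switch (status) {
    case 400:
      return "The file could not be processed: the request was invalid.";
    case 401:
      return "Authentication failed: check the Anthropic API key.";
    case 403:
      return "The API key does not have permission for this request.";
    case 413:
      return "The file is too large for the API to accept.";
    case 429:
      return "Rate limit reached: please retry in a moment.";
    case 529:
      return "The API is temporarily overloaded: please retry later.";
    default:
      // Anything else is reported with its status so it can be diagnosed.
      return status >= 500
        ? `The API returned a server error (${status}).`
        : `Text extraction failed with status ${status}.`;
  }
}
```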

Ready to Use

The application builds successfully and is ready for deployment. The new implementation will provide better text extraction capabilities for your users when they upload PDFs, images, and other document types to the chat interface.

server/api/threads/[id]/files/index.post.ts

server/utils/fileParser.ts


atraining added 2 commits September 3, 2025 09:39

add: remove textract and use claude api to extract files

7f66579

fix: delete button

fef418b

atraining changed the title ~~Feautre: add: remove textract and use claude api to extract files~~ Feautre: remove textract and use claude api to extract files Sep 3, 2025

atraining changed the title ~~Feautre: remove textract and use claude api to extract files~~ Feature: remove textract and use claude api to extract files Sep 3, 2025

chihebnabil self-requested a review September 3, 2025 09:55

chihebnabil added the enhancement (New feature or request) label Sep 3, 2025

chihebnabil requested changes Sep 3, 2025

server/api/threads/[id]/files/index.post.ts (outdated)

server/utils/fileParser.ts (outdated)

fix: reuse selected model, see chihebnabil#6 (comment) and chihebnabil#6 (comment)

9202a06
