@atraining atraining commented Sep 3, 2025

Completed Changes

  1. Refactored File Parser (server/utils/fileParser.ts)
  • Replaced @nosferatu500/textract with Claude API for text extraction
  • Added support for more file types including:
    • Documents: PDFs, Word docs, Excel sheets, PowerPoint presentations
    • Images: PNG, JPEG, GIF, WebP, BMP, TIFF with OCR capabilities
    • Plain text: TXT, Markdown, JSON, code files, etc.
  2. Enhanced Features
  • Better OCR: Claude's vision model provides superior text recognition from images
  • Document Processing: Native PDF and Office document text extraction
  • Error Handling: Comprehensive error messages for different failure scenarios
  • Performance Optimization: API instance caching and optimized prompts
  • File Size Validation: 32MB file size limit with clear error messages
  3. Code Optimizations
  • Caching of Claude API instances to improve performance
  • Better error handling with specific messages for different HTTP status codes
  • TypeScript improvements to eliminate compiler warnings
  • Optimized prompts for better text extraction quality
  4. Dependencies & Documentation
  • Removed @nosferatu500/textract dependency from package.json
  • Updated README.md to reflect the new Claude API-powered text extraction
  • Removed system dependency requirements (no more poppler-utils needed)
  5. Integration Updates
  • Updated file upload endpoint to pass the Anthropic API key to the parser
  • Maintained compatibility with existing database schema and token counting
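
For reviewers, here is a minimal sketch of how the file type grouping and the 32MB size check described above could look. It is illustrative only and not the code in server/utils/fileParser.ts: the names `classifyFile`, `validateFileSize`, and `MAX_FILE_SIZE_BYTES`, as well as the exact extension lists, are assumptions made for this example.

```typescript
// Illustrative sketch only. All names and extension lists here are
// hypothetical and may differ from the actual fileParser.ts.
type FileCategory = "document" | "image" | "text" | "unsupported";

// 32MB limit, as stated in the change list above.
const MAX_FILE_SIZE_BYTES = 32 * 1024 * 1024;

// Assumed mapping from the file types listed in the PR to extensions.
const EXTENSIONS: Record<Exclude<FileCategory, "unsupported">, string[]> = {
  document: ["pdf", "doc", "docx", "xls", "xlsx", "ppt", "pptx"],
  image: ["png", "jpg", "jpeg", "gif", "webp", "bmp", "tiff"],
  text: ["txt", "md", "json", "ts", "js"],
};

// Returns the category for a filename based on its (case-insensitive) extension.
function classifyFile(filename: string): FileCategory {
  const dot = filename.lastIndexOf(".");
  if (dot < 0 || dot === filename.length - 1) return "unsupported";
  const ext = filename.slice(dot + 1).toLowerCase();
  for (const [category, exts] of Object.entries(EXTENSIONS)) {
    if (exts.includes(ext)) return category as FileCategory;
  }
  return "unsupported";
}

// Throws a descriptive error when the file exceeds the size limit.
function validateFileSize(sizeBytes: number): void {
  if (sizeBytes > MAX_FILE_SIZE_BYTES) {
    const sizeMb = (sizeBytes / (1024 * 1024)).toFixed(1);
    throw new Error(`File is ${sizeMb}MB, which exceeds the 32MB limit.`);
  }
}
```

Running both checks before any API call would let oversized or unsupported uploads fail fast with a clear message instead of a remote error.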

Key Benefits

  1. No System Dependencies: Everything is handled through the Claude API - no need for poppler-utils or other system packages
  2. Better Accuracy: Claude's advanced vision and document processing capabilities provide superior text extraction
  3. More File Types: Support for additional formats that weren't supported by textract
  4. Better Error Handling: More informative error messages help users understand what went wrong
  5. Performance: Caching and optimizations reduce API calls and improve response times
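
The instance caching and the status-specific error messages mentioned above can be sketched roughly as follows. This is an assumption-laden illustration, not the PR's implementation: `ApiClient`, `getClient`, and `describeHttpError` are hypothetical names, the client is a stand-in for whatever SDK object the parser actually caches, and the set of status codes handled is a guess based on common HTTP semantics.

```typescript
// Illustrative sketch only. ApiClient is a hypothetical stand-in for the
// SDK client object that the parser caches per API key.
interface ApiClient {
  apiKey: string;
}

// One client per API key, created lazily and reused on later calls.
const clientCache = new Map<string, ApiClient>();

function getClient(
  apiKey: string,
  create: (key: string) => ApiClient
): ApiClient {
  let client = clientCache.get(apiKey);
  if (!client) {
    client = create(apiKey);
    clientCache.set(apiKey, client);
  }
  return client;
}

// Maps an HTTP status code to a user-facing message.
// Which codes the PR actually handles is an assumption.
function describeHttpError(status: number): string {
  switch (status) {
    case 400:
      return "The file could not be processed (bad request).";
    case 401:
      return "Invalid API key.";
    case 413:
      return "The file is too large for the API.";
    case 429:
      return "Rate limit reached. Please retry later.";
    default:
      return status >= 500
        ? "The extraction service is temporarily unavailable."
        : `Unexpected error (HTTP ${status}).`;
  }
}
```

Note that a cache like this avoids re-creating client objects on each upload; it does not by itself reduce the number of extraction requests sent to the API.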

Ready to Use

The application builds successfully and is ready for deployment. The new implementation will provide better text extraction capabilities for your users when they upload PDFs, images, and other document types to the chat interface.

@atraining atraining changed the title from "Feautre: add: remove textract and use claude api to extract files" to "Feautre: remove textract and use claude api to extract files" Sep 3, 2025
@atraining atraining changed the title from "Feautre: remove textract and use claude api to extract files" to "Feature: remove textract and use claude api to extract files" Sep 3, 2025
@chihebnabil chihebnabil self-requested a review September 3, 2025 09:55
@chihebnabil chihebnabil added the enhancement label (New feature or request) Sep 3, 2025