This repository was archived by the owner on Jun 2, 2025. It is now read-only.

W2 Queries for Textract#89

Merged: halprin merged 20 commits into main from try-queries on Mar 24, 2025

Conversation

@halprin halprin commented Mar 19, 2025

Changes proposed in this pull request

This is the start of queries for text extraction. Supports W2 only for now.

  • I changed the OCR interface so that it only returns the raw text, which is closer to what it actually does. Our OCR implementations don't identify the document.
  • I added the iterator-chain dependency, which lets you chain operations on iterables in a way very similar to Java streams.
  • I added a Form interface, plus a W2 implementation that holds the information used to identify a W2 and the queries to run through the OCR implementation.
  • I added simple implementations for 1099-NEC and DD214 so that we can continue to identify those forms; they have no queries yet.
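
For readers without the diff open, here is a rough sketch of the shape the bullets above describe. It is not the code from this PR: the class names, method names, identifying phrases, and query strings are all hypothetical, and it uses a plain generator expression rather than the iterator-chain API.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List, Optional


class Form(ABC):
    """Hypothetical form description: how to recognize it and what to query."""

    @abstractmethod
    def name(self) -> str: ...

    @abstractmethod
    def identifying_phrases(self) -> List[str]: ...

    @abstractmethod
    def queries(self) -> List[str]: ...

    def matches(self, raw_text: str) -> bool:
        # Case-insensitive substring check against the raw OCR text.
        lowered = raw_text.lower()
        return any(p.lower() in lowered for p in self.identifying_phrases())


class W2(Form):
    def name(self) -> str:
        return "W2"

    def identifying_phrases(self) -> List[str]:
        return ["Wage and Tax Statement"]

    def queries(self) -> List[str]:
        # Illustrative natural-language questions; not the PR's actual queries.
        return [
            "What is the employee's social security number?",
            "What are the wages, tips, and other compensation?",
        ]


class Form1099NEC(Form):
    def name(self) -> str:
        return "1099-NEC"

    def identifying_phrases(self) -> List[str]:
        return ["Nonemployee Compensation"]

    def queries(self) -> List[str]:
        return []  # identification only, no queries yet


class DD214(Form):
    def name(self) -> str:
        return "DD214"

    def identifying_phrases(self) -> List[str]:
        return ["Certificate of Release or Discharge from Active Duty"]

    def queries(self) -> List[str]:
        return []  # identification only, no queries yet


def identify_form(raw_text: str, forms: Iterable[Form]) -> Optional[Form]:
    """Return the first form whose phrases appear in the raw text, else None."""
    return next((form for form in forms if form.matches(raw_text)), None)


ALL_FORMS: List[Form] = [W2(), Form1099NEC(), DD214()]
```

In this sketch the OCR layer only hands back raw text, and identification happens afterwards by asking each Form whether it matches. A form with an empty query list can still be identified, which is the situation described for 1099-NEC and DD214.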

Towards #72.

@halprin halprin mentioned this pull request Mar 19, 2025
@halprin halprin marked this pull request as ready for review March 24, 2025 15:16
@halprin halprin requested a review from a team as a code owner March 24, 2025 15:16
@halprin halprin merged commit 3f0c643 into main Mar 24, 2025
@halprin halprin deleted the try-queries branch March 24, 2025 19:10