W2 Queries for Textract #89

Merged
halprin merged 20 commits into main on Mar 24, 2025

Conversation

halprin
Member

@halprin commented Mar 19, 2025

Changes proposed in this pull request

This starts adding queries for text extraction. Only the W2 is supported for now.

  • I changed the OCR interface to return just the raw text, because that is closer to what it actually does. Our OCR implementations don't identify the document.
  • I added the iterator-chain dependency. It lets you chain operations over collections, much like Java streams.
  • I added a Form interface, plus a W2 implementation that holds both the information used to identify a W2 and the queries to run through the OCR implementation.
  • I added simple implementations for 1099-NEC and DD214 so we can continue to identify those forms; they have no queries yet.
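
To make the split described above concrete, here is a minimal sketch of one way a Form abstraction could separate identification from querying. It is not the code in this pull request: the language (Python) is chosen only for illustration, and every name in it (`Query`, `Form`, `markers`, `matches`, `queries`, `identify`, `KNOWN_FORMS`) is hypothetical, as are the marker strings and the example questions. It assumes the OCR layer has already returned the document's raw text as a plain string.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Query:
    """One question to put to the OCR service (hypothetical shape)."""
    alias: str  # key under which the answer would be stored
    text: str   # natural-language question


class Form(ABC):
    """A document type: how to recognize it and what to ask about it."""
    name: str = ""
    markers: Tuple[str, ...] = ()  # phrases expected in the raw text

    def matches(self, raw_text: str) -> bool:
        # Case-insensitive check that every marker appears in the text.
        lowered = raw_text.lower()
        return all(marker.lower() in lowered for marker in self.markers)

    @abstractmethod
    def queries(self) -> List[Query]:
        """Questions to run once the form has been identified."""


class W2(Form):
    name = "W2"
    markers = ("W-2", "Wage and Tax Statement")  # illustrative markers

    def queries(self) -> List[Query]:
        return [
            Query("employer_name", "What is the employer's name?"),
            Query("wages", "What are the wages, tips, other compensation?"),
        ]


class Form1099NEC(Form):
    name = "1099-NEC"
    markers = ("1099-NEC", "Nonemployee Compensation")

    def queries(self) -> List[Query]:
        return []  # identified only; no queries yet


class DD214(Form):
    name = "DD214"
    markers = ("DD Form 214",)

    def queries(self) -> List[Query]:
        return []  # identified only; no queries yet


KNOWN_FORMS: List[Form] = [W2(), Form1099NEC(), DD214()]


def identify(raw_text: str, forms: Iterable[Form]) -> Optional[Form]:
    # Filter-then-take-first, similar in spirit to a stream pipeline.
    return next((form for form in forms if form.matches(raw_text)), None)
```

With this shape, a caller would first ask the OCR layer for raw text, pass it to `identify`, and only then run the returned form's `queries()`. A form with an empty query list can still be recognized, which matches the 1099-NEC and DD214 behavior described above.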

Towards #72.

@halprin mentioned this pull request Mar 19, 2025
@halprin marked this pull request as ready for review March 24, 2025 15:16
@halprin requested a review from a team as a code owner March 24, 2025 15:16
@halprin merged commit 3f0c643 into main Mar 24, 2025
10 checks passed
@halprin deleted the try-queries branch March 24, 2025 19:10