
fix(cdk): upgrade unstructured from 0.10.27 to 0.18.32 #908

Open
Ryan Waskewich (rwask) wants to merge 7 commits into main from devin/1771425511-bump-unstructured-to-latest

Conversation


@rwask Ryan Waskewich (rwask) commented Feb 18, 2026

fix(cdk): upgrade unstructured from 0.10.27 to 0.18.32

Summary

Bumps the unstructured document parsing library from 0.10.27 to 0.18.32 in the CDK's file-based extra. This is a large version jump (8 minor versions) that required migrating several removed/changed APIs in unstructured_parser.py:

  • Removed dict-based lookups (EXT_TO_FILETYPE, FILETYPE_TO_MIMETYPE, STR_TO_FILETYPE) → replaced with FileType.from_extension(), filetype.mime_type, FileType.from_mime_type()
  • detect_filetype parameter renamed: filename= → file_path=
  • partition_pdf now requires unstructured_inference: wrapped import in try/except so DOCX/PPTX parsing still works without it
  • _get_filetype detection order changed: extension-based detection now runs before content sniffing (was the opposite)
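The dict-to-method migration can be illustrated with a toy stand-in for the new API. The enum members, values, and method bodies below are illustrative assumptions, not unstructured's actual implementation; only the method names (`from_extension`, `from_mime_type`, `mime_type`) come from the migration above.

```python
from enum import Enum
from typing import Optional


class FileType(Enum):
    # Toy stand-in for unstructured 0.18's FileType enum; members/values assumed.
    PDF = "application/pdf"
    DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

    @property
    def mime_type(self) -> str:
        # Replaces the removed FILETYPE_TO_MIMETYPE dict lookup
        return self.value

    @classmethod
    def from_extension(cls, ext: str) -> Optional["FileType"]:
        # Replaces the removed EXT_TO_FILETYPE dict lookup
        return {".pdf": cls.PDF, ".docx": cls.DOCX}.get(ext.lower())

    @classmethod
    def from_mime_type(cls, mime: str) -> Optional["FileType"]:
        # Replaces the removed STR_TO_FILETYPE dict; unknown MIME types yield None
        return next((ft for ft in cls if ft.value == mime), None)
```

The key behavioral point for callers is that both classmethods return `None` rather than raising for unknown inputs.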

Updates since last revision

  • Fixed FileType.from_mime_type() fallthrough: from_mime_type() returns None for unknown types (not ValueError as initially assumed). Added null check and FileType.UNK guard so files with ambiguous MIME types (e.g., application/octet-stream) correctly fall through to extension/content-based detection instead of returning None immediately.
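A minimal sketch of the corrected fall-through guard. The helper name is an assumption; `unk` stands in for unstructured's `FileType.UNK` sentinel.

```python
def filetype_from_mime(mime, from_mime_type, unk):
    # from_mime_type returns None for unknown MIME types in 0.18.x (no ValueError)
    detected = from_mime_type(mime)
    if detected is None or detected is unk:
        return None  # caller falls through to extension/content-based detection
    return detected
```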
  • Updated test mock targets: unstructured.partition.pdf can no longer be imported without unstructured_inference, so test @patch decorators now target the global variables in unstructured_parser instead of the source modules. Added _import_unstructured mock to prevent the real import from overwriting test mocks.
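The "patch where it's used" change can be shown with a minimal stand-in module; the module and attribute names here are hypothetical, mirroring the parser's module-level globals.

```python
import sys
import types
from unittest.mock import patch

# Minimal stand-in for unstructured_parser: the partition function lives in a
# module-level global, as in the CDK parser after the upgrade.
demo = types.ModuleType("demo_parser")
demo.unstructured_partition_pdf = lambda: "real result"
demo.read_pdf = lambda: demo.unstructured_partition_pdf()
sys.modules["demo_parser"] = demo

# Patch the consumer's global, not unstructured.partition.pdf (which may not
# even import without unstructured_inference installed).
with patch("demo_parser.unstructured_partition_pdf", return_value="mocked"):
    result = demo.read_pdf()
```

Because `read_pdf` looks the global up at call time, the patched attribute is what gets invoked, and the original is restored when the context manager exits.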
  • Removed pi-heif dependency: Per CodeRabbit feedback, removed the pi-heif optional dependency as it's not directly imported by the CDK.
  • Updated pdfminer.six pin: Changed from exact 20221105 to >=20231228 for compatibility with unstructured 0.18.32. Note: unstructured 0.18.32's PDF module imports from pdfminer.psexceptions which was added in pdfminer.six 20250327. If PDF parsing is needed, ensure pdfminer.six>=20250327 is installed (this happens automatically when unstructured[pdf] is installed).

Production Impact — Backward Compatibility Scope

Queried the production database to assess the blast radius. Of the original ~610 source actors flagged by a broad text search for document_file_type_handler, only 115 connections across 69 workspaces actually have streams configured with "filetype": "unstructured".

Connections by Connector (total):

  • Google Drive: 92 (80%)
  • S3: 14 (12%)
  • Azure Blob Storage: 4 (3.5%)
  • SharePoint Enterprise: 3 (2.6%)
  • GCS: 1 (0.9%)
  • SFTP Bulk: 1 (0.9%)

Sync Recency:

  • Active (0–1 days): 12 (10%)
  • Recent (2–7 days): 0 (0%)
  • Last month (8–30 days): 1 (1%)
  • Stale (31–90 days): 6 (5%)
  • Dormant (90+ days): 6 (5%)
  • Never synced successfully: 90 (78%)

⚠️ Real-world blast radius is extremely limited

Only 12 connections are actively syncing today with unstructured parsing. The other 103 connections either:

  • Never successfully synced (90 connections / 78%) — likely test/sandbox setups or abandoned configurations
  • Haven't synced in over a week (13 connections) — stale or dormant

Breaking Changes for Active Connections

For the ~12 active connections, the following will break:

| Change | Impact | Who is affected |
| --- | --- | --- |
| PDF parsing requires unstructured_inference | PDFs emit _ab_source_file_parse_error instead of content | Any connection parsing PDF files with local processing mode |
| DOCX output format changed | "# Content" → "Content" (markdown heading removed) | Downstream consumers expecting markdown headings in DOCX output |
| Connector image size +12GB | Images balloon from ~1.4GB to ~13.7GB when PDF support is added | All connectors that add the unstructured[pdf] extra |
| System library dependencies | libGL.so.1 and libglib2.0-0 required for PDF inference | Connector Dockerfiles need apt-get install libgl1-mesa-glx libglib2.0-0 |

Upgrade Path for Affected Customers

  1. For PDF parsing (local mode):

    • Connector images must install unstructured[pdf] instead of just unstructured[docx,pptx]
    • Add system deps: apt-get install -y libgl1-mesa-glx libglib2.0-0
    • Ensure pdfminer.six>=20250327 is installed
    • To minimize image size: Use CPU-only PyTorch: pip install torch --index-url https://download.pytorch.org/whl/cpu before installing unstructured (reduces ~10GB)
  2. For PDF parsing (API mode):

    • No changes needed — API mode doesn't require unstructured_inference
    • Recommend customers use API mode if image size is a concern
  3. For DOCX/PPTX only:

    • No changes needed — these work without unstructured_inference
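Put together, the local-mode path in item 1 might look like this in a connector Dockerfile. This is a sketch: the exact package list and version pins are assumptions drawn from the steps above, and install order matters (CPU-only torch must come first).

```shell
# System libraries needed by unstructured_inference for PDF processing
apt-get update && apt-get install -y libgl1-mesa-glx libglib2.0-0

# CPU-only PyTorch first, so the pdf extra doesn't pull CUDA wheels (~10GB)
pip install torch --index-url https://download.pytorch.org/whl/cpu

# The pdf extra brings in unstructured_inference; pin pdfminer.six explicitly
pip install 'unstructured[docx,pptx,pdf]==0.18.32' 'pdfminer.six>=20250327'
```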

Review & Testing Checklist for Human

  • ⚠️ PDF parsing requires unstructured_inference: This is a breaking change. PDFs now emit _ab_source_file_parse_error instead of content unless unstructured_inference is installed. Verify this is acceptable for downstream connectors (source-s3, source-gcs, source-sharepoint-enterprise, etc.). The scenario tests have been updated to expect parse errors for PDFs.
  • Verify pdfminer.six version compatibility: The pin is >=20231228 but pdfminer.psexceptions (required by unstructured 0.18.32's PDF module) was added in 20250327. If someone installs unstructured[pdf] with a pdfminer.six version between these, PDF parsing will fail. Consider tightening the pin to >=20250327.
  • Test with downstream file-based connectors to verify no regressions in actual document parsing output. No integration testing has been performed — only unit tests pass.
  • Verify _get_filetype detection order change: extension-based detection (FileType.from_extension) now runs before content sniffing (detect_filetype(file=...)). Confirm this doesn't change behavior for ambiguous files.
  • Verify DOCX content format change: Scenario tests show "# Content" → "Content" (markdown heading removed). Confirm this is expected behavior from the unstructured upgrade.
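The pdfminer.six concern in the checklist can also be checked at runtime. A sketch, with an assumed helper name, relying on pdfminer.six's date-based YYYYMMDD versioning:

```python
from importlib import metadata


def pdfminer_release_ok(version: str, min_release: int = 20250327) -> bool:
    # pdfminer.six versions are date-based ("20250327", sometimes "20250327.post1")
    try:
        return int(version.split(".")[0]) >= min_release
    except ValueError:
        return False  # unexpected version scheme; treat as unsupported


# At startup one could check the installed version, e.g.:
# pdfminer_release_ok(metadata.version("pdfminer.six"))
```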

Notes

  • There's an existing branch devin/1771342600-bump-unstructured-0.18.18 with similar changes targeting 0.18.18. This PR targets the latest (0.18.32) instead and includes additional fixes (correct from_mime_type handling, pdfminer.six pin update).
  • The partition_pdf import is now gracefully handled — if unstructured_inference isn't installed, PDF parsing will be unavailable but DOCX/PPTX will still work. This is a behavioral change from the old code which required all three partition functions to be available.
  • The poetry.lock diff is large due to new transitive dependencies (aiofiles, unstructured-client, webencodings, etc.)
  • Unit tests (27 in test_unstructured_parser.py) pass locally with the new version.
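The graceful partition_pdf handling described in the notes can be sketched as follows; the CDK's actual wrapper differs in naming and adds logging.

```python
# Lazy, optional import: PDF support degrades gracefully instead of failing
# the whole parser when unstructured_inference is missing.
try:
    from unstructured.partition.pdf import partition_pdf  # needs unstructured_inference
except ImportError:
    partition_pdf = None  # PDF parsing unavailable; DOCX/PPTX unaffected


def can_parse_pdf() -> bool:
    return partition_pdf is not None
```

A caller would check `can_parse_pdf()` before dispatching and emit a parse error for PDFs otherwise.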

Link to Devin run: https://app.devin.ai/sessions/c5bdff87617345b0bdbe574512f84953
Requested by: Ryan Waskewich (@rwask)

Summary by CodeRabbit

  • Improvements
    • Upgraded document parsing libraries for broader file-type support and more robust MIME/extension-based detection.
    • Detection now prefers MIME type and falls back to extension before content-based checks.
    • Per-file-type availability checks surface clearer, user-friendly parse errors when optional parsers are missing.
  • Bug Fixes
    • Remote multipart uploads now send correct MIME types.
  • Tests
    • Updated tests to reflect new parse-error behavior and revised import/mocking approach.

Co-Authored-By: Ryan Waskewich <ryan.waskewich@airbyte.io>
@devin-ai-integration

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@rwask Ryan Waskewich (rwask) marked this pull request as ready for review February 18, 2026 14:47
@github-actions

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

💡 Show Tips and Tricks

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1771425511-bump-unstructured-to-latest#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1771425511-bump-unstructured-to-latest

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /prerelease - Triggers a prerelease publish with default arguments
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment


coderabbitai bot commented Feb 18, 2026

📝 Walkthrough

Walkthrough

Refactors unstructured file parsing to use per-filetype availability checks and lazy PDF import failure handling, changes filetype detection order (MIME → extension → content), updates multipart upload MIME usage to use FileType.mime_type, bumps unstructured and pdfminer.six versions, and adjusts tests and test expectations to mock and reflect the new lazy-import and per-filetype parse-error behavior.

Changes

Cohort / File(s) Summary
Unstructured Parser Refactoring
airbyte_cdk/sources/file_based/file_types/unstructured_parser.py
Replaced legacy mappings with FileType/detect_filetype; prefer FileType.from_mime_type → extension → content in _get_filetype; removed global unstructured availability checks in favor of per-filetype guards; lazy-import partition_pdf with ImportError handled by disabling PDF parsing and logging; use filetype.mime_type for multipart uploads; added explicit parse errors when partition functions are unavailable.
Dependency Update
pyproject.toml
Bumped unstructured from 0.10.27 to 0.18.32 (extras ["docx","pptx"] unchanged) and relaxed pdfminer.six to >=20231228 for optional PDF support.
Unit Tests — Parser Patching
unit_tests/sources/file_based/file_types/test_unstructured_parser.py
Updated test patches to target the parser's internal wrappers (unstructured_partition_* and _import_unstructured); added mock_import_unstructured fixture/parameter to tests to stub lazy import behavior.
Unit Tests — Scenario Expectations
unit_tests/sources/file_based/scenarios/unstructured_scenarios.py
Updated expected outputs to surface _ab_source_file_parse_error for PDF inputs when inference package is absent; adjusted content expectations for some DOCX/PDF cases to reflect new per-filetype parse-error propagation or plain-text parsing differences.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Would you like a simple Mermaid sequence diagram showing the new detection and per-filetype availability flow? wdyt?

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 37.50%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: upgrading the unstructured package from 0.10.27 to 0.18.32 and fixing related API incompatibilities in the file-based parser. |
| Description check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
airbyte_cdk/sources/file_based/file_types/unstructured_parser.py (1)

420-427: detect_filetype(file_path=...) with a remote URI — handled by try/except.

Since remote_file.uri is a remote path (e.g., s3://...), detect_filetype will likely fail trying to access it locally. The broad except Exception: pass catches this gracefully and falls through to extension-based detection. This works, but the silent swallowing of all exceptions could hide unexpected failures. Would it be worth narrowing to except (FileNotFoundError, OSError) to surface truly unexpected errors, wdyt?
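A stand-in illustrating the narrowed handler the reviewer suggests. Since the real `detect_filetype` needs unstructured installed, a local-only byte sniffer plays its role here; the function name is an assumption.

```python
def sniff_magic_bytes(path: str):
    # Local-only stand-in for detect_filetype(file_path=...): remote URIs like
    # "s3://bucket/key.pdf" are not local files, so open() raises an OSError
    # (FileNotFoundError is an OSError subclass) and we fall through.
    try:
        with open(path, "rb") as f:
            return f.read(8)
    except OSError:
        return None  # fall through to extension/content-based detection
```

Catching only `OSError` (rather than bare `Exception`) lets genuinely unexpected failures propagate.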

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@airbyte_cdk/sources/file_based/file_types/unstructured_parser.py` around
lines 420 - 427, The try/except around
detect_filetype(file_path=remote_file.uri) currently swallows all exceptions
which can hide unexpected failures; replace the broad except Exception with a
narrower except (FileNotFoundError, OSError) to only ignore missing/local-path
errors when detect_filetype is called with a remote URI, and let other
exceptions propagate (or re-raise/log them) so unexpected errors in
detect_filetype are visible; locate the block using detect_filetype and
remote_file.uri in unstructured_parser.py and update the exception handling
accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@airbyte_cdk/sources/file_based/file_types/unstructured_parser.py`:
- Around line 432-435: The current extension extraction using
remote_file.uri.split(".")[-1] is brittle for URIs with dots in directory names
(e.g., "s3://bucket/folder.name/file") and can return incorrect values; update
the logic that computes extension (the lines assigning extension and calling
FileType.from_extension) to parse only the path portion of the URI and then use
os.path.splitext or pathlib.PurePosixPath to get the suffix, e.g., obtain the
path via urllib.parse.urlparse(remote_file.uri).path (or strip any
query/fragment), call os.path.splitext or PurePosixPath(path).suffix to get a
single leading dot extension (lowercased), then pass that to
FileType.from_extension and keep the existing return behavior.

---

Nitpick comments:
In `@airbyte_cdk/sources/file_based/file_types/unstructured_parser.py`:
- Around line 420-427: The try/except around
detect_filetype(file_path=remote_file.uri) currently swallows all exceptions
which can hide unexpected failures; replace the broad except Exception with a
narrower except (FileNotFoundError, OSError) to only ignore missing/local-path
errors when detect_filetype is called with a remote URI, and let other
exceptions propagate (or re-raise/log them) so unexpected errors in
detect_filetype are visible; locate the block using detect_filetype and
remote_file.uri in unstructured_parser.py and update the exception handling
accordingly.
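The extension-extraction fix in the duplicate comment above can be sketched with the standard library; the function name is assumed.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def extension_from_uri(uri: str) -> str:
    # Parse only the path portion so query strings and dotted directory names
    # (e.g. "s3://bucket/folder.name/file") don't leak into the suffix.
    path = urlparse(uri).path
    return PurePosixPath(path).suffix.lower()
```

An empty string result signals "no extension", letting the caller fall back to content-based detection.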

@github-actions
Copy link

github-actions bot commented Feb 18, 2026

PyTest Results (Fast)

3 869 tests ±0   3 857 ✅ ±0   12 💤 ±0   0 ❌ ±0
1 suite, 1 file, 6m 40s ⏱️ -1s

Results for commit 400b278. ± Comparison against base commit 2c7af15.

♻️ This comment has been updated with latest results.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
airbyte_cdk/sources/file_based/file_types/unstructured_parser.py (2)

352-352: Minor: redundant _import_unstructured() call.

_read_file_locally is only called from _read_file (line 210), which already calls _import_unstructured(). This second call is a no-op since the globals are already populated. It's harmless and defensive, but if you want to trim it for clarity, it could be removed — what do you think, wdyt?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@airbyte_cdk/sources/file_based/file_types/unstructured_parser.py` at line
352, Remove the redundant call to _import_unstructured() inside
_read_file_locally: since _read_file (the only caller of _read_file_locally)
already invokes _import_unstructured(), the second call is unnecessary; update
_read_file_locally to rely on the global state initialized by
_import_unstructured() and delete the extra invocation to keep the code clearer
and avoid duplicate imports.

420-427: The detect_filetype(file_path=...) call with a remote URI will likely always fail and fall through.

Since remote_file.uri is typically a remote path (e.g., s3://...), detect_filetype(file_path=...) will probably always raise and hit the pass. This means the extension-based and content-based fallbacks below do the real work. Not a bug per se — but have you considered whether this block is dead code in practice, or are there cases where the URI is a local path, wdyt?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@airbyte_cdk/sources/file_based/file_types/unstructured_parser.py` around
lines 420 - 427, The detect_filetype(file_path=remote_file.uri) call is invoking
a local-file detector on remote URIs (e.g., s3://...) and will generally always
raise and be a no-op; update the logic in unstructured_parser.py so
detect_filetype is only called for actual local paths: check remote_file.uri's
scheme (using urlparse) and/or verify filesystem existence (Path.exists() or
file:// scheme) before calling detect_filetype, otherwise skip straight to the
extension/content fallbacks; reference the detect_filetype call and the
remote_file.uri/file_type variables when making this change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@airbyte_cdk/sources/file_based/file_types/unstructured_parser.py`:
- Line 15: The import is incorrect: FileType is defined in
unstructured.file_utils.model while detect_filetype lives in
unstructured.file_utils.filetype; update the imports in unstructured_parser.py
to import detect_filetype from unstructured.file_utils.filetype and FileType
from unstructured.file_utils.model (i.e., split the single import so that
detect_filetype and FileType are imported from their correct modules).

---

Duplicate comments:
In `@airbyte_cdk/sources/file_based/file_types/unstructured_parser.py`:
- Around line 432-435: The current extension extraction using
remote_file.uri.split(".") can misparse URIs with dots in directory names;
update the logic that sets extension/extension_type (where extension and
ext_type are computed and FileType.from_extension is called) to safely extract
the path portion of remote_file.uri (e.g., with urllib.parse.urlparse(...).path)
and then use pathlib.PurePosixPath(path).suffix.lower() (or os.path.splitext on
the path) to derive the extension before calling FileType.from_extension, and
handle the case of an empty suffix by returning None or falling back
appropriately.

---

Nitpick comments:
In `@airbyte_cdk/sources/file_based/file_types/unstructured_parser.py`:
- Line 352: Remove the redundant call to _import_unstructured() inside
_read_file_locally: since _read_file (the only caller of _read_file_locally)
already invokes _import_unstructured(), the second call is unnecessary; update
_read_file_locally to rely on the global state initialized by
_import_unstructured() and delete the extra invocation to keep the code clearer
and avoid duplicate imports.
- Around line 420-427: The detect_filetype(file_path=remote_file.uri) call is
invoking a local-file detector on remote URIs (e.g., s3://...) and will
generally always raise and be a no-op; update the logic in
unstructured_parser.py so detect_filetype is only called for actual local paths:
check remote_file.uri's scheme (using urlparse) and/or verify filesystem
existence (Path.exists() or file:// scheme) before calling detect_filetype,
otherwise skip straight to the extension/content fallbacks; reference the
detect_filetype call and the remote_file.uri/file_type variables when making
this change.

@github-actions

github-actions bot commented Feb 18, 2026

PyTest Results (Full)

3 872 tests ±0   3 860 ✅ +1   12 💤 ±0   0 ❌ -1
1 suite, 1 file, 11m 20s ⏱️ +15s

Results for commit 400b278. ± Comparison against base commit 2c7af15.

♻️ This comment has been updated with latest results.

devin-ai-integration bot and others added 3 commits February 18, 2026 15:18

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
unit_tests/sources/file_based/scenarios/unstructured_scenarios.py (2)

465-521: ⚠️ Potential issue | 🟡 Minor

corrupted_file_scenario no longer exercises corrupted-file handling — could it be reframed or split?

With the new lazy-import guard, PDF parsing now short-circuits to the unstructured_inference missing error before the file bytes are ever read. This means corrupted_file_scenario and simple_unstructured_scenario both traverse the exact same code path for PDFs. The "___ corrupted file ___" bytes are completely irrelevant to the outcome, and this scenario provides zero additional coverage over the PDF case in simple_unstructured_scenario.

Two options to consider — wdyt about either of these?

  1. Rename / reframe the scenario to something like pdf_without_inference_scenario to accurately describe what it's actually testing now.
  2. Add a companion scenario (guarded by a check that unstructured_inference is available) that validates the truly-corrupted-file error path — otherwise that branch is untested.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@unit_tests/sources/file_based/scenarios/unstructured_scenarios.py` around
lines 465 - 521, The test `corrupted_file_scenario` now only hits the
unstructured_inference-missing path (same as `simple_unstructured_scenario`)
because PDF parsing short-circuits before reading bytes; either rename the
scenario to reflect that (e.g., `pdf_without_inference_scenario`) by updating
the TestScenarioBuilder instance name and description, or add a second scenario
that actually exercises the corrupted-file path by creating a guarded test that
only runs when `unstructured_inference` is importable (use the same
FileBasedSourceBuilder payload with corrupted bytes and check the parse-error
message for a real PDF parsing failure), and keep `corrupted_file_scenario` or
replace it accordingly so both code paths are covered.

13-14: ⚠️ Potential issue | 🟡 Minor

Update NLTK resource names to match NLTK 3.9.1 compatibility.

The test file downloads "punkt" and "averaged_perceptron_tagger" (lines 13-14), but your production code in airbyte_cdk/sources/file_based/file_types/unstructured_parser.py already uses the NLTK 3.9+ resource names: "punkt_tab" and "averaged_perceptron_tagger_eng". With NLTK 3.9.1 pinned in poetry.lock, consider updating the test file to match:

nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("averaged_perceptron_tagger_eng")

Or, sync the test setup with your production initialization pattern for consistency. The old resource names may download successfully but populate the wrong data directories, potentially causing lookup errors at test runtime. Wdyt?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@unit_tests/sources/file_based/scenarios/unstructured_scenarios.py` around
lines 13 - 14, Update the NLTK resources downloaded in the test setup to match
NLTK 3.9.1 names used in production: replace or extend the existing
nltk.download calls so that the tests download "punkt_tab" and
"averaged_perceptron_tagger_eng" (keep "punkt" if desired for compatibility).
Locate the nltk.download calls in the test initialization (the lines currently
calling nltk.download("punkt") and nltk.download("averaged_perceptron_tagger"))
and change them to download the new resource names to ensure the test data
directories match the production parser (unstructured_parser.py) expectations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@unit_tests/sources/file_based/scenarios/unstructured_scenarios.py`:
- Around line 465-521: The test `corrupted_file_scenario` now only hits the
unstructured_inference-missing path (same as `simple_unstructured_scenario`)
because PDF parsing short-circuits before reading bytes; either rename the
scenario to reflect that (e.g., `pdf_without_inference_scenario`) by updating
the TestScenarioBuilder instance name and description, or add a second scenario
that actually exercises the corrupted-file path by creating a guarded test that
only runs when `unstructured_inference` is importable (use the same
FileBasedSourceBuilder payload with corrupted bytes and check the parse-error
message for a real PDF parsing failure), and keep `corrupted_file_scenario` or
replace it accordingly so both code paths are covered.
- Around line 13-14: Update the NLTK resources downloaded in the test setup to
match NLTK 3.9.1 names used in production: replace or extend the existing
nltk.download calls so that the tests download "punkt_tab" and
"averaged_perceptron_tagger_eng" (keep "punkt" if desired for compatibility).
Locate the nltk.download calls in the test initialization (the lines currently
calling nltk.download("punkt") and nltk.download("averaged_perceptron_tagger"))
and change them to download the new resource names to ensure the test data
directories match the production parser (unstructured_parser.py) expectations.
