fix(cdk): upgrade unstructured from 0.10.27 to 0.18.32#908
fix(cdk): upgrade unstructured from 0.10.27 to 0.18.32#908Ryan Waskewich (rwask) wants to merge 7 commits intomainfrom
Conversation
Co-Authored-By: Ryan Waskewich <ryan.waskewich@airbyte.io>
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
Co-Authored-By: Ryan Waskewich <ryan.waskewich@airbyte.io>
👋 Greetings, Airbyte Team Member!Here are some helpful tips and reminders for your convenience. 💡 Show Tips and TricksTesting This CDK VersionYou can test this version of the CDK using the following: # Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1771425511-bump-unstructured-to-latest#egg=airbyte-python-cdk[dev]' --help
# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1771425511-bump-unstructured-to-latestPR Slash CommandsAirbyte Maintainers can execute the following slash commands on your PR:
|
…ck behavior Co-Authored-By: Ryan Waskewich <ryan.waskewich@airbyte.io>
📝 WalkthroughWalkthroughRefactors unstructured file parsing to use per-filetype availability checks and lazy PDF import failure handling, changes filetype detection order (MIME → extension → content), updates multipart upload MIME usage to use Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Would you like a simple Mermaid sequence diagram showing the new detection and per-filetype availability flow? wdyt? 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Co-Authored-By: Ryan Waskewich <ryan.waskewich@airbyte.io>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
airbyte_cdk/sources/file_based/file_types/unstructured_parser.py (1)
420-427:detect_filetype(file_path=...)with a remote URI — handled by try/except.Since
remote_file.uriis a remote path (e.g.,s3://...),detect_filetypewill likely fail trying to access it locally. The broadexcept Exception: passcatches this gracefully and falls through to extension-based detection. This works, but the silent swallowing of all exceptions could hide unexpected failures. Would it be worth narrowing toexcept (FileNotFoundError, OSError)to surface truly unexpected errors, wdyt?🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@airbyte_cdk/sources/file_based/file_types/unstructured_parser.py` around lines 420 - 427, The try/except around detect_filetype(file_path=remote_file.uri) currently swallows all exceptions which can hide unexpected failures; replace the broad except Exception with a narrower except (FileNotFoundError, OSError) to only ignore missing/local-path errors when detect_filetype is called with a remote URI, and let other exceptions propagate (or re-raise/log them) so unexpected errors in detect_filetype are visible; locate the block using detect_filetype and remote_file.uri in unstructured_parser.py and update the exception handling accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@airbyte_cdk/sources/file_based/file_types/unstructured_parser.py`:
- Around line 432-435: The current extension extraction using
remote_file.uri.split(".")[-1] is brittle for URIs with dots in directory names
(e.g., "s3://bucket/folder.name/file") and can return incorrect values; update
the logic that computes extension (the lines assigning extension and calling
FileType.from_extension) to parse only the path portion of the URI and then use
os.path.splitext or pathlib.PurePosixPath to get the suffix, e.g., obtain the
path via urllib.parse.urlparse(remote_file.uri).path (or strip any
query/fragment), call os.path.splitext or PurePosixPath(path).suffix to get a
single leading dot extension (lowercased), then pass that to
FileType.from_extension and keep the existing return behavior.
---
Nitpick comments:
In `@airbyte_cdk/sources/file_based/file_types/unstructured_parser.py`:
- Around line 420-427: The try/except around
detect_filetype(file_path=remote_file.uri) currently swallows all exceptions
which can hide unexpected failures; replace the broad except Exception with a
narrower except (FileNotFoundError, OSError) to only ignore missing/local-path
errors when detect_filetype is called with a remote URI, and let other
exceptions propagate (or re-raise/log them) so unexpected errors in
detect_filetype are visible; locate the block using detect_filetype and
remote_file.uri in unstructured_parser.py and update the exception handling
accordingly.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
airbyte_cdk/sources/file_based/file_types/unstructured_parser.py (2)
352-352: Minor: redundant_import_unstructured()call.
_read_file_locallyis only called from_read_file(line 210), which already calls_import_unstructured(). This second call is a no-op since the globals are already populated. It's harmless and defensive, but if you want to trim it for clarity, it could be removed — what do you think, wdyt?🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@airbyte_cdk/sources/file_based/file_types/unstructured_parser.py` at line 352, Remove the redundant call to _import_unstructured() inside _read_file_locally: since _read_file (the only caller of _read_file_locally) already invokes _import_unstructured(), the second call is unnecessary; update _read_file_locally to rely on the global state initialized by _import_unstructured() and delete the extra invocation to keep the code clearer and avoid duplicate imports.
420-427: Thedetect_filetype(file_path=...)call with a remote URI will likely always fail and fall through.Since
remote_file.uriis typically a remote path (e.g.,s3://...),detect_filetype(file_path=...)will probably always raise and hit thepass. This means the extension-based and content-based fallbacks below do the real work. Not a bug per se — but have you considered whether this block is dead code in practice, or are there cases where the URI is a local path, wdyt?🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@airbyte_cdk/sources/file_based/file_types/unstructured_parser.py` around lines 420 - 427, The detect_filetype(file_path=remote_file.uri) call is invoking a local-file detector on remote URIs (e.g., s3://...) and will generally always raise and be a no-op; update the logic in unstructured_parser.py so detect_filetype is only called for actual local paths: check remote_file.uri's scheme (using urlparse) and/or verify filesystem existence (Path.exists() or file:// scheme) before calling detect_filetype, otherwise skip straight to the extension/content fallbacks; reference the detect_filetype call and the remote_file.uri/file_type variables when making this change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@airbyte_cdk/sources/file_based/file_types/unstructured_parser.py`:
- Line 15: The import is incorrect: FileType is defined in
unstructured.file_utils.model while detect_filetype lives in
unstructured.file_utils.filetype; update the imports in unstructured_parser.py
to import detect_filetype from unstructured.file_utils.filetype and FileType
from unstructured.file_utils.model (i.e., split the single import so that
detect_filetype and FileType are imported from their correct modules).
---
Duplicate comments:
In `@airbyte_cdk/sources/file_based/file_types/unstructured_parser.py`:
- Around line 432-435: The current extension extraction using
remote_file.uri.split(".") can misparse URIs with dots in directory names;
update the logic that sets extension/extension_type (where extension and
ext_type are computed and FileType.from_extension is called) to safely extract
the path portion of remote_file.uri (e.g., with urllib.parse.urlparse(...).path)
and then use pathlib.PurePosixPath(path).suffix.lower() (or os.path.splitext on
the path) to derive the extension before calling FileType.from_extension, and
handle the case of an empty suffix by returning None or falling back
appropriately.
---
Nitpick comments:
In `@airbyte_cdk/sources/file_based/file_types/unstructured_parser.py`:
- Line 352: Remove the redundant call to _import_unstructured() inside
_read_file_locally: since _read_file (the only caller of _read_file_locally)
already invokes _import_unstructured(), the second call is unnecessary; update
_read_file_locally to rely on the global state initialized by
_import_unstructured() and delete the extra invocation to keep the code clearer
and avoid duplicate imports.
- Around line 420-427: The detect_filetype(file_path=remote_file.uri) call is
invoking a local-file detector on remote URIs (e.g., s3://...) and will
generally always raise and be a no-op; update the logic in
unstructured_parser.py so detect_filetype is only called for actual local paths:
check remote_file.uri's scheme (using urlparse) and/or verify filesystem
existence (Path.exists() or file:// scheme) before calling detect_filetype,
otherwise skip straight to the extension/content fallbacks; reference the
detect_filetype call and the remote_file.uri/file_type variables when making
this change.
…t changes Co-Authored-By: Ryan Waskewich <ryan.waskewich@airbyte.io>
…ompatibility Co-Authored-By: Ryan Waskewich <ryan.waskewich@airbyte.io>
Co-Authored-By: Ryan Waskewich <ryan.waskewich@airbyte.io>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
unit_tests/sources/file_based/scenarios/unstructured_scenarios.py (2)
465-521:⚠️ Potential issue | 🟡 Minor
corrupted_file_scenariono longer exercises corrupted-file handling — could it be reframed or split?With the new lazy-import guard, PDF parsing now short-circuits to the
unstructured_inferencemissing error before the file bytes are ever read. This meanscorrupted_file_scenarioandsimple_unstructured_scenarioboth traverse the exact same code path for PDFs. The"___ corrupted file ___"bytes are completely irrelevant to the outcome, and this scenario provides zero additional coverage over the PDF case insimple_unstructured_scenario.Two options to consider — wdyt about either of these?
- Rename / reframe the scenario to something like
pdf_without_inference_scenarioto accurately describe what it's actually testing now.- Add a companion scenario (guarded by a check that
unstructured_inferenceis available) that validates the truly-corrupted-file error path — otherwise that branch is untested.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@unit_tests/sources/file_based/scenarios/unstructured_scenarios.py` around lines 465 - 521, The test `corrupted_file_scenario` now only hits the unstructured_inference-missing path (same as `simple_unstructured_scenario`) because PDF parsing short-circuits before reading bytes; either rename the scenario to reflect that (e.g., `pdf_without_inference_scenario`) by updating the TestScenarioBuilder instance name and description, or add a second scenario that actually exercises the corrupted-file path by creating a guarded test that only runs when `unstructured_inference` is importable (use the same FileBasedSourceBuilder payload with corrupted bytes and check the parse-error message for a real PDF parsing failure), and keep `corrupted_file_scenario` or replace it accordingly so both code paths are covered.
13-14:⚠️ Potential issue | 🟡 MinorUpdate NLTK resource names to match NLTK 3.9.1 compatibility.
The test file downloads
"punkt"and"averaged_perceptron_tagger"(lines 13-14), but your production code inairbyte_cdk/sources/file_based/file_types/unstructured_parser.pyalready uses the NLTK 3.9+ resource names:"punkt_tab"and"averaged_perceptron_tagger_eng". With NLTK 3.9.1 pinned inpoetry.lock, consider updating the test file to match:nltk.download("punkt") nltk.download("punkt_tab") nltk.download("averaged_perceptron_tagger_eng")Or, sync the test setup with your production initialization pattern for consistency. The old resource names may download successfully but populate the wrong data directories, potentially causing lookup errors at test runtime. Wdyt?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@unit_tests/sources/file_based/scenarios/unstructured_scenarios.py` around lines 13 - 14, Update the NLTK resources downloaded in the test setup to match NLTK 3.9.1 names used in production: replace or extend the existing nltk.download calls so that the tests download "punkt_tab" and "averaged_perceptron_tagger_eng" (keep "punkt" if desired for compatibility). Locate the nltk.download calls in the test initialization (the lines currently calling nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")) and change them to download the new resource names to ensure the test data directories match the production parser (unstructured_parser.py) expectations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@unit_tests/sources/file_based/scenarios/unstructured_scenarios.py`:
- Around line 465-521: The test `corrupted_file_scenario` now only hits the
unstructured_inference-missing path (same as `simple_unstructured_scenario`)
because PDF parsing short-circuits before reading bytes; either rename the
scenario to reflect that (e.g., `pdf_without_inference_scenario`) by updating
the TestScenarioBuilder instance name and description, or add a second scenario
that actually exercises the corrupted-file path by creating a guarded test that
only runs when `unstructured_inference` is importable (use the same
FileBasedSourceBuilder payload with corrupted bytes and check the parse-error
message for a real PDF parsing failure), and keep `corrupted_file_scenario` or
replace it accordingly so both code paths are covered.
- Around line 13-14: Update the NLTK resources downloaded in the test setup to
match NLTK 3.9.1 names used in production: replace or extend the existing
nltk.download calls so that the tests download "punkt_tab" and
"averaged_perceptron_tagger_eng" (keep "punkt" if desired for compatibility).
Locate the nltk.download calls in the test initialization (the lines currently
calling nltk.download("punkt") and nltk.download("averaged_perceptron_tagger"))
and change them to download the new resource names to ensure the test data
directories match the production parser (unstructured_parser.py) expectations.
fix(cdk): upgrade unstructured from 0.10.27 to 0.18.32
Summary
Bumps the
unstructureddocument parsing library from 0.10.27 to 0.18.32 in the CDK'sfile-basedextra. This is a large version jump (8 minor versions) that required migrating several removed/changed APIs inunstructured_parser.py:EXT_TO_FILETYPE,FILETYPE_TO_MIMETYPE,STR_TO_FILETYPE) → replaced withFileType.from_extension(),filetype.mime_type,FileType.from_mime_type()detect_filetypeparameter renamed:filename=→file_path=partition_pdfnow requiresunstructured_inference: wrapped import in try/except so DOCX/PPTX parsing still works without it_get_filetypedetection order changed: extension-based detection now runs before content sniffing (was the opposite)Updates since last revision
FileType.from_mime_type()fallthrough:from_mime_type()returnsNonefor unknown types (notValueErroras initially assumed). Added null check andFileType.UNKguard so files with ambiguous MIME types (e.g.,application/octet-stream) correctly fall through to extension/content-based detection instead of returningNoneimmediately.unstructured.partition.pdfcan no longer be imported withoutunstructured_inference, so test@patchdecorators now target the global variables inunstructured_parserinstead of the source modules. Added_import_unstructuredmock to prevent the real import from overwriting test mocks.pi-heifdependency: Per CodeRabbit feedback, removed thepi-heifoptional dependency as it's not directly imported by the CDK.pdfminer.sixpin: Changed from exact20221105to>=20231228for compatibility with unstructured 0.18.32. Note: unstructured 0.18.32's PDF module imports frompdfminer.psexceptionswhich was added in pdfminer.six20250327. If PDF parsing is needed, ensurepdfminer.six>=20250327is installed (this happens automatically whenunstructured[pdf]is installed).Production Impact — Backward Compatibility Scope
Queried the production database to assess the blast radius. Of the original ~610 source actors flagged by a broad text search for
document_file_type_handler, only 115 connections across 69 workspaces actually have streams configured with"filetype": "unstructured".Connections by Connector (total):
Sync Recency:
Only 12 connections are actively syncing today with unstructured parsing. The other 103 connections either:
Breaking Changes for Active Connections
For the ~12 active connections, the following will break:
unstructured_inference_ab_source_file_parse_errorinstead ofcontent"# Content"→"Content"(markdown heading removed)unstructured[pdf]extralibGL.so.1andlibglib2.0-0required for PDF inferenceapt-get install libgl1-mesa-glx libglib2.0-0Upgrade Path for Affected Customers
For PDF parsing (local mode):
unstructured[pdf]instead of justunstructured[docx,pptx]apt-get install -y libgl1-mesa-glx libglib2.0-0pdfminer.six>=20250327is installedpip install torch --index-url https://download.pytorch.org/whl/cpubefore installing unstructured (reduces ~10GB)For PDF parsing (API mode):
unstructured_inferenceFor DOCX/PPTX only:
unstructured_inferenceReview & Testing Checklist for Human
unstructured_inference: This is a breaking change. PDFs now emit_ab_source_file_parse_errorinstead ofcontentunlessunstructured_inferenceis installed. Verify this is acceptable for downstream connectors (source-s3, source-gcs, source-sharepoint-enterprise, etc.). The scenario tests have been updated to expect parse errors for PDFs.pdfminer.sixversion compatibility: The pin is>=20231228butpdfminer.psexceptions(required by unstructured 0.18.32's PDF module) was added in20250327. If someone installsunstructured[pdf]with a pdfminer.six version between these, PDF parsing will fail. Consider tightening the pin to>=20250327._get_filetypedetection order change: extension-based detection (FileType.from_extension) now runs before content sniffing (detect_filetype(file=...)). Confirm this doesn't change behavior for ambiguous files."# Content"→"Content"(markdown heading removed). Confirm this is expected behavior from the unstructured upgrade.Notes
devin/1771342600-bump-unstructured-0.18.18with similar changes targeting 0.18.18. This PR targets the latest (0.18.32) instead and includes additional fixes (correctfrom_mime_typehandling,pdfminer.sixpin update).partition_pdfimport is now gracefully handled — ifunstructured_inferenceisn't installed, PDF parsing will be unavailable but DOCX/PPTX will still work. This is a behavioral change from the old code which required all three partition functions to be available.aiofiles,unstructured-client,webencodings, etc.)test_unstructured_parser.py) pass locally with the new version.Link to Devin run: https://app.devin.ai/sessions/c5bdff87617345b0bdbe574512f84953
Requested by: Ryan Waskewich (@rwask)
Summary by CodeRabbit