fix: always decompress gzip when GzipDecoder is explicitly selected#909
fix: always decompress gzip when GzipDecoder is explicitly selected#909Lucas Leadbetter (lleadbet) wants to merge 1 commit intomainfrom
Conversation
Co-Authored-By: lucas.leadbetter@airbyte.io <lucas.leadbetter@gmail.com>
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
👋 Greetings, Airbyte Team Member!Here are some helpful tips and reminders for your convenience. 💡 Show Tips and TricksTesting This CDK VersionYou can test this version of the CDK using the following: # Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1771447353-fix-gzip-decoder-always-decompress#egg=airbyte-python-cdk[dev]' --help
# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1771447353-fix-gzip-decoder-always-decompressPR Slash CommandsAirbyte Maintainers can execute the following slash commands on your PR:
|
📝 WalkthroughWalkthroughA single-line modification in the gzip decoder construction that changes the fallback parser argument from Changes
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~2 minutes One quick thought: Was this fallback behavior change intentional? The switch from using the inner parser to the outer parser might have implications for how decompression failures cascade—wdyt? It would be helpful to understand if there's specific test coverage validating this fallback path works as expected. 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Tip Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
2620-2646: Missing test for the fixed scenario — want me to draft one?The PR description itself flags this gap: there's no test covering a gzip-encoded response body without matching
Content-Encoding/Content-Typeheaders (the exact scenario that was broken). Given this is a targeted behavior fix, having at least one regression test would really help lock in the intent and prevent the bug from quietly returning. Would it be useful if I opened a follow-up issue or drafted the test inline?🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py` around lines 2620 - 2646, The PR lacks a regression test for create_gzip_decoder covering the case where a response body is gzip-compressed but the response headers do not include the expected Content-Encoding/Content-Type values; add a unit/integration test that constructs a gzip-compressed response body with missing or non-matching headers and verifies CompositeRawDecoder.by_headers (and the fallback_parser behavior of create_gzip_decoder / gzip_parser) correctly decodes the payload, ensuring the previously broken path is exercised and asserted.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py`:
- Around line 2642-2646: GzipParser.parse has a docstring that promises graceful
fallback to the inner parser for non-gzipped data but the implementation
directly wraps the input in gzip.GzipFile without handling BadGzipFile; change
GzipParser.parse to try reading/decompressing via gzip.GzipFile (or a small read
to trigger format validation) inside a try/except catching OSError/BadGzipFile,
and on exception reset the input pointer (call data.seek(0) or reopen the
original bytes stream) then call and return self.inner_parser.parse(data) so
non-gzip payloads are passed through as documented; keep raising other
unexpected exceptions.
---
Nitpick comments:
In `@airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py`:
- Around line 2620-2646: The PR lacks a regression test for create_gzip_decoder
covering the case where a response body is gzip-compressed but the response
headers do not include the expected Content-Encoding/Content-Type values; add a
unit/integration test that constructs a gzip-compressed response body with
missing or non-matching headers and verifies CompositeRawDecoder.by_headers (and
the fallback_parser behavior of create_gzip_decoder / gzip_parser) correctly
decodes the payload, ensuring the previously broken path is exercised and
asserted.
| return CompositeRawDecoder.by_headers( | ||
| [({"Content-Encoding", "Content-Type"}, _compressed_response_types, gzip_parser)], | ||
| stream_response=True, | ||
| fallback_parser=gzip_parser.inner_parser, | ||
| fallback_parser=gzip_parser, | ||
| ) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# Find GzipParser class and its parse method
rg -n 'class GzipParser' -A 30 --type pyRepository: airbytehq/airbyte-python-cdk
Length of output: 3227
🏁 Script executed:
# Find CompositeRawDecoder class and by_headers method
rg -n 'class CompositeRawDecoder' -A 50 --type pyRepository: airbytehq/airbyte-python-cdk
Length of output: 5560
🏁 Script executed:
# Check the by_headers implementation more thoroughly
rg -n 'def by_headers' -B 5 -A 40 --type pyRepository: airbytehq/airbyte-python-cdk
Length of output: 5320
🏁 Script executed:
cat -n airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py | sed -n '34,60p'Repository: airbytehq/airbyte-python-cdk
Length of output: 1183
Fix looks correct — but found a pre-existing issue in GzipParser worth addressing.
The change properly ensures that when GzipDecoder is explicitly configured, decompression is always attempted regardless of response headers. Logic is sound: fallback is now the full gzip_parser instead of gzip_parser.inner_parser.
One thing flagged during verification:
GzipParser.parse() has a docstring-implementation mismatch. The docstring claims "If the data is not gzipped, reset the pointer and pass the data to the inner parser as is" (lines 38-39 in composite_raw_decoder.py), but the actual implementation (lines 45-46) just calls gzip.GzipFile(fileobj=data, mode="rb") with no error handling or pointer reset. When non-gzip data hits this, it will raise OSError (BadGzipFile) on first read, not gracefully fallback.
This is pre-existing, but your change makes it more likely to surface: with fallback_parser=gzip_parser, non-gzip data now goes through GzipParser.parse() instead of bypassing it. If a user configures GzipDecoder against an API returning plain (non-gzip) responses, they'll hit this hard crash.
Could you either:
- Fix
GzipParser.parse()to actually reset the pointer and pass plain data through (matching the docstring), or - Update the docstring to reflect that it will raise
OSErroron non-gzip input?
Pagination support is fine: CompositeRawDecoder.by_headers() returns cls(fallback_parser, ...) which sets self.parser = fallback_parser, so the decoder's parser property will correctly expose gzip_parser with its inner_parser. The pagination check should pass as expected.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py` around
lines 2642 - 2646, GzipParser.parse has a docstring that promises graceful
fallback to the inner parser for non-gzipped data but the implementation
directly wraps the input in gzip.GzipFile without handling BadGzipFile; change
GzipParser.parse to try reading/decompressing via gzip.GzipFile (or a small read
to trigger format validation) inside a try/except catching OSError/BadGzipFile,
and on exception reset the input pointer (call data.seek(0) or reopen the
original bytes stream) then call and return self.inner_parser.parse(data) so
non-gzip payloads are passed through as documented; keep raising other
unexpected exceptions.
There was a problem hiding this comment.
Good catch on the docstring-implementation mismatch in GzipParser.parse() — that's a real pre-existing issue.
However, I'm keeping this PR scoped to the minimal fix (changing the fallback parser) rather than expanding into GzipParser changes, for a few reasons:
- The scenario this PR fixes: the user explicitly selects
GzipDecoderbecause the API always returns gzip — so the fallback hittingGzipParseris correct behavior and will decompress successfully. - The edge case you describe (API conditionally returning non-gzip while
GzipDecoderis selected) is a separate concern that existed before this PR — the previousby_headerspath already usedGzipParserwhen headers matched, so a response without gzip encoding was already handled by the fallback (which before just skipped decompression silently, potentially producing garbage). - Fixing
GzipParser.parse()to gracefully handle non-gzip input is a good improvement but would change behavior for allGzipParserconsumers and warrants its own review.
I'd suggest either fixing the GzipParser docstring or implementing the graceful fallback as a follow-up. Happy to defer to the human reviewer on whether they'd prefer it bundled here.
Summary
One-line fix in
create_gzip_decoderto usegzip_parser(instead ofgzip_parser.inner_parser) as the fallback when response headers don't match known gzip content types.Problem: When a user explicitly selects
GzipDecoderin the Connector Builder, but the API returns gzipped data without standard gzip headers (Content-Encoding: gzip, etc.), the fallback parser skipped decompression entirely and passed raw gzip bytes to the inner parser (e.g.,CsvParser), causing'utf-8' codec can't decode byte 0x8b in position 1errors.Fix: Change
fallback_parser=gzip_parser.inner_parser→fallback_parser=gzip_parserso that when the user explicitly configures gzip decoding, decompression is always attempted regardless of response headers.Review & Testing Checklist for Human
GzipParserhandles non-gzipped data gracefully. TheGzipParser.parse()docstring (line 39 ofcomposite_raw_decoder.py) claims "If the data is not gzipped, reset the pointer and pass the data to the inner parser as is" — but the actual implementation has no try/except or fallback logic. If an API conditionally returns uncompressed data while the user hasGzipDecoderselected, this change could turn a working sync into agzip.BadGzipFileerror. This is the highest-risk item.GzipDecoderwhere the API does NOT set gzip headers — this is the scenario the fix targets. Confirm that the sync now succeeds instead of throwing the utf-8 decode error.GzipDecoderwhere the API DOES set proper gzip headers — confirm no regression in the happy path (headers match →gzip_parseris selected viaby_headers, not the fallback).Notes
_emit_connector_builder_messages=True, lines 2636-2640) is unchanged — that path has a separate mechanism that bypassesGzipParserentirely and relies onrequests.contentauto-decompression.Summary by CodeRabbit