fix: always decompress gzip when GzipDecoder is explicitly selected#909

Open
Lucas Leadbetter (lleadbet) wants to merge 1 commit into main from devin/1771447353-fix-gzip-decoder-always-decompress

Conversation

@lleadbet Lucas Leadbetter (lleadbet) commented Feb 18, 2026

Summary

One-line fix in create_gzip_decoder to use gzip_parser (instead of gzip_parser.inner_parser) as the fallback when response headers don't match known gzip content types.

Problem: When a user explicitly selects GzipDecoder in the Connector Builder, but the API returns gzipped data without standard gzip headers (Content-Encoding: gzip, etc.), the fallback parser skipped decompression entirely and passed raw gzip bytes to the inner parser (e.g., CsvParser), causing 'utf-8' codec can't decode byte 0x8b in position 1 errors.

Fix: Change fallback_parser=gzip_parser.inner_parser to fallback_parser=gzip_parser so that when the user explicitly configures gzip decoding, decompression is always attempted regardless of response headers.
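
The failure described under Problem can be reproduced with the standard library alone. The sketch below is illustrative and is not CDK code: `decode_as_text` is a hypothetical stand-in for an inner parser that expects UTF-8 text, and the CSV payload is made up.

```python
import gzip
import io

csv_text = "id,name\n1,alice\n"
compressed = gzip.compress(csv_text.encode("utf-8"))


def decode_as_text(body: bytes) -> str:
    """Hypothetical stand-in for an inner parser that expects UTF-8 text."""
    return body.decode("utf-8")


def decompress_then_decode(body: bytes) -> str:
    """Decompress first, then hand the plain bytes to the text decoder."""
    with gzip.GzipFile(fileobj=io.BytesIO(body), mode="rb") as gz:
        return decode_as_text(gz.read())


# Every gzip stream starts with the magic bytes 0x1f 0x8b. 0x1f is valid
# ASCII, but 0x8b is not a valid UTF-8 start byte, so decoding fails at
# position 1 when decompression is skipped.
try:
    decode_as_text(compressed)
    failure = None
except UnicodeDecodeError as exc:
    failure = exc

print(failure)
print(decompress_then_decode(compressed) == csv_text)
```

The first call mirrors the old fallback (raw gzip bytes handed straight to the inner parser); the second mirrors the new one, where decompression happens before decoding.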

Review & Testing Checklist for Human

  • Verify GzipParser handles non-gzipped data gracefully. The GzipParser.parse() docstring (line 39 of composite_raw_decoder.py) claims "If the data is not gzipped, reset the pointer and pass the data to the inner parser as is" — but the actual implementation has no try/except or fallback logic. If an API conditionally returns uncompressed data while the user has GzipDecoder selected, this change could turn a working sync into a gzip.BadGzipFile error. This is the highest-risk item.
  • Test with a real connector that uses GzipDecoder where the API does NOT set gzip headers — this is the scenario the fix targets. Confirm that the sync now succeeds instead of throwing the utf-8 decode error.
  • Test with a real connector that uses GzipDecoder where the API DOES set proper gzip headers — confirm no regression in the happy path (headers match → gzip_parser is selected via by_headers, not the fallback).
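
The risk in the first checklist item can be checked in isolation. This is a minimal sketch using only the standard library; `read_as_gzip` and `classify` are hypothetical helpers, not part of the CDK. It shows that constructing `gzip.GzipFile` does not validate the input and that the error only appears on the first read.

```python
import gzip
import io


def read_as_gzip(body: bytes) -> bytes:
    # Hypothetical helper: wrap the bytes the way a gzip-only parser would.
    gz = gzip.GzipFile(fileobj=io.BytesIO(body), mode="rb")  # no error yet
    try:
        return gz.read()  # the gzip header is validated on this first read
    finally:
        gz.close()


def classify(body: bytes) -> str:
    """Report whether the body decompressed or was rejected as non-gzip."""
    try:
        read_as_gzip(body)
        return "decompressed"
    except gzip.BadGzipFile:  # a subclass of OSError
        return "BadGzipFile"


print(classify(gzip.compress(b"id,name\n1,alice\n")))
print(classify(b"id,name\n1,alice\n"))
```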

Notes

  • The Connector Builder testing path (_emit_connector_builder_messages=True, lines 2636-2640) is unchanged — that path has a separate mechanism that bypasses GzipParser entirely and relies on requests.content auto-decompression.
  • No new tests were added. Consider adding a test for the specific scenario: gzip-compressed body with no matching response headers.
  • Requested by: Lucas Leadbetter (@lleadbet)

Summary by CodeRabbit

  • Bug Fixes
    • Improved gzip decompression fallback behavior to handle edge cases more robustly.

Co-Authored-By: lucas.leadbetter@airbyte.io <lucas.leadbetter@gmail.com>
@devin-ai-integration

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@github-actions

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1771447353-fix-gzip-decoder-always-decompress#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1771447353-fix-gzip-decoder-always-decompress

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /prerelease - Triggers a prerelease publish with default arguments
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment

@coderabbitai

coderabbitai bot commented Feb 18, 2026

Walkthrough

A single-line modification in the gzip decoder construction that changes the fallback parser argument from gzip_parser.inner_parser to gzip_parser within the CompositeRawDecoder.by_headers call. This affects the fallback decompression behavior when handling gzip content.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Gzip Decoder Fallback Logic (airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py) | Modified the create_gzip_decoder function to use the outer GzipParser as fallback instead of its inner parser when decompressing gzip content via CompositeRawDecoder.by_headers. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes


One quick thought: Was this fallback behavior change intentional? The switch from using the inner parser to the outer parser might have implications for how decompression failures cascade—wdyt? It would be helpful to understand if there's specific test coverage validating this fallback path works as expected.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and clearly describes the main change: switching the fallback parser in gzip decoder from inner_parser to the outer parser to ensure decompression happens when GzipDecoder is explicitly selected. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

2620-2646: Missing test for the fixed scenario — want me to draft one?

The PR description itself flags this gap: there's no test covering a gzip-encoded response body without matching Content-Encoding/Content-Type headers (the exact scenario that was broken). Given this is a targeted behavior fix, having at least one regression test would really help lock in the intent and prevent the bug from quietly returning. Would it be useful if I opened a follow-up issue or drafted the test inline?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py` around
lines 2620 - 2646, The PR lacks a regression test for create_gzip_decoder
covering the case where a response body is gzip-compressed but the response
headers do not include the expected Content-Encoding/Content-Type values; add a
unit/integration test that constructs a gzip-compressed response body with
missing or non-matching headers and verifies CompositeRawDecoder.by_headers (and
the fallback_parser behavior of create_gzip_decoder / gzip_parser) correctly
decodes the payload, ensuring the previously broken path is exercised and
asserted.
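
One possible shape for the missing regression test is sketched below. It is a toy model and does not use the CDK's classes: `parse_text`, `parse_gzip`, `select_parser`, and `GZIP_VALUES` are hypothetical stand-ins for the inner parser, GzipParser, the header-matching step of CompositeRawDecoder.by_headers, and _compressed_response_types respectively. The actual contents of _compressed_response_types are not shown in this PR, so the values here are assumptions.

```python
import gzip
import io
from typing import BinaryIO, Callable, Mapping

Parser = Callable[[BinaryIO], str]

# Assumed header values; the real _compressed_response_types may differ.
GZIP_VALUES = {"gzip", "application/gzip"}


def parse_text(stream: BinaryIO) -> str:
    """Stand-in for an inner parser such as a CSV parser."""
    return stream.read().decode("utf-8")


def parse_gzip(stream: BinaryIO) -> str:
    """Stand-in for a gzip parser: decompress, then delegate."""
    with gzip.GzipFile(fileobj=stream, mode="rb") as gz:
        return parse_text(gz)


def select_parser(headers: Mapping[str, str], fallback: Parser) -> Parser:
    """Pick the gzip parser when a header matches, else the fallback."""
    for name in ("Content-Encoding", "Content-Type"):
        if headers.get(name) in GZIP_VALUES:
            return parse_gzip
    return fallback


def test_gzip_body_without_gzip_headers() -> None:
    expected = "id,name\n1,alice\n"
    body = gzip.compress(expected.encode("utf-8"))
    headers = {"Content-Type": "text/csv"}  # no gzip hint at all

    # New behavior: the fallback still decompresses.
    new_parser = select_parser(headers, fallback=parse_gzip)
    assert new_parser(io.BytesIO(body)) == expected

    # Old behavior: the fallback skips decompression and fails to decode.
    old_parser = select_parser(headers, fallback=parse_text)
    try:
        old_parser(io.BytesIO(body))
    except UnicodeDecodeError:
        pass
    else:
        raise AssertionError("expected the old fallback to fail")


test_gzip_body_without_gzip_headers()
```

A real test would build the decoder through create_gzip_decoder and feed it a response object with the same header and body combination.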
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py`:
- Around line 2642-2646: GzipParser.parse has a docstring that promises graceful
fallback to the inner parser for non-gzipped data but the implementation
directly wraps the input in gzip.GzipFile without handling BadGzipFile; change
GzipParser.parse to try reading/decompressing via gzip.GzipFile (or a small read
to trigger format validation) inside a try/except catching OSError/BadGzipFile,
and on exception reset the input pointer (call data.seek(0) or reopen the
original bytes stream) then call and return self.inner_parser.parse(data) so
non-gzip payloads are passed through as documented; keep raising other
unexpected exceptions.

Comment on lines 2642 to 2646
     return CompositeRawDecoder.by_headers(
         [({"Content-Encoding", "Content-Type"}, _compressed_response_types, gzip_parser)],
         stream_response=True,
-        fallback_parser=gzip_parser.inner_parser,
+        fallback_parser=gzip_parser,
     )

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Find GzipParser class and its parse method
rg -n 'class GzipParser' -A 30 --type py

Repository: airbytehq/airbyte-python-cdk

Length of output: 3227


🏁 Script executed:

# Find CompositeRawDecoder class and by_headers method
rg -n 'class CompositeRawDecoder' -A 50 --type py

Repository: airbytehq/airbyte-python-cdk

Length of output: 5560


🏁 Script executed:

# Check the by_headers implementation more thoroughly
rg -n 'def by_headers' -B 5 -A 40 --type py

Repository: airbytehq/airbyte-python-cdk

Length of output: 5320


🏁 Script executed:

cat -n airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py | sed -n '34,60p'

Repository: airbytehq/airbyte-python-cdk

Length of output: 1183


Fix looks correct — but found a pre-existing issue in GzipParser worth addressing.

The change properly ensures that when GzipDecoder is explicitly configured, decompression is always attempted regardless of response headers. Logic is sound: fallback is now the full gzip_parser instead of gzip_parser.inner_parser.

One thing flagged during verification:

GzipParser.parse() has a docstring-implementation mismatch. The docstring claims "If the data is not gzipped, reset the pointer and pass the data to the inner parser as is" (lines 38-39 in composite_raw_decoder.py), but the actual implementation (lines 45-46) just calls gzip.GzipFile(fileobj=data, mode="rb") with no error handling or pointer reset. When non-gzip data hits this, it will raise OSError (BadGzipFile) on first read, not gracefully fallback.

This is pre-existing, but your change makes it more likely to surface: with fallback_parser=gzip_parser, non-gzip data now goes through GzipParser.parse() instead of bypassing it. If a user configures GzipDecoder against an API returning plain (non-gzip) responses, they'll hit this hard crash.

Could you either:

  • Fix GzipParser.parse() to actually reset the pointer and pass plain data through (matching the docstring), or
  • Update the docstring to reflect that it will raise OSError on non-gzip input?
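
The first option could look roughly like the sketch below. It is not the CDK's implementation: `parse_with_gzip_fallback` and `parse_text` are hypothetical names, and the sketch assumes a seekable input and buffers the whole decompressed payload in memory. A streamed HTTP body is generally not seekable, so a real change would need to buffer or peek at the leading bytes instead.

```python
import gzip
import io
from typing import BinaryIO, Callable


def parse_text(stream: BinaryIO) -> str:
    """Hypothetical inner parser that expects UTF-8 text."""
    return stream.read().decode("utf-8")


def parse_with_gzip_fallback(
    data: BinaryIO, inner_parse: Callable[[BinaryIO], str]
) -> str:
    """Decompress if the data is gzip, otherwise pass it through unchanged."""
    start = data.tell()
    try:
        with gzip.GzipFile(fileobj=data, mode="rb") as gz:
            decompressed = gz.read()  # assumption: payload fits in memory
    except gzip.BadGzipFile:
        # Not gzip: reset the pointer and hand the original bytes on.
        data.seek(start)
        return inner_parse(data)
    return inner_parse(io.BytesIO(decompressed))


payload = b"id,name\n1,alice\n"
print(parse_with_gzip_fallback(io.BytesIO(gzip.compress(payload)), parse_text))
print(parse_with_gzip_fallback(io.BytesIO(payload), parse_text))
```

One caveat with this design: a corrupt gzip stream that fails its CRC check also raises `gzip.BadGzipFile`, so a stricter version would fall back only when the leading magic bytes `0x1f 0x8b` are absent.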

Pagination support is fine: CompositeRawDecoder.by_headers() returns cls(fallback_parser, ...) which sets self.parser = fallback_parser, so the decoder's parser property will correctly expose gzip_parser with its inner_parser. The pagination check should pass as expected.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py` around
lines 2642 - 2646, GzipParser.parse has a docstring that promises graceful
fallback to the inner parser for non-gzipped data but the implementation
directly wraps the input in gzip.GzipFile without handling BadGzipFile; change
GzipParser.parse to try reading/decompressing via gzip.GzipFile (or a small read
to trigger format validation) inside a try/except catching OSError/BadGzipFile,
and on exception reset the input pointer (call data.seek(0) or reopen the
original bytes stream) then call and return self.inner_parser.parse(data) so
non-gzip payloads are passed through as documented; keep raising other
unexpected exceptions.

Contributor

Good catch on the docstring-implementation mismatch in GzipParser.parse() — that's a real pre-existing issue.

However, I'm keeping this PR scoped to the minimal fix (changing the fallback parser) rather than expanding into GzipParser changes, for a few reasons:

  1. The scenario this PR fixes: the user explicitly selects GzipDecoder because the API always returns gzip — so the fallback hitting GzipParser is correct behavior and will decompress successfully.
  2. The edge case you describe (API conditionally returning non-gzip while GzipDecoder is selected) is a separate concern that existed before this PR — the previous by_headers path already used GzipParser when headers matched, so a response without gzip encoding was already handled by the fallback (which before just skipped decompression silently, potentially producing garbage).
  3. Fixing GzipParser.parse() to gracefully handle non-gzip input is a good improvement but would change behavior for all GzipParser consumers and warrants its own review.

I'd suggest either fixing the GzipParser docstring or implementing the graceful fallback as a follow-up. Happy to defer to the human reviewer on whether they'd prefer it bundled here.


Devin session

@github-actions
Copy link

PyTest Results (Fast)

3 869 tests  ±0   3 857 ✅ ±0   6m 45s ⏱️ +4s
    1 suites ±0      12 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit 9a3abd4. ± Comparison against base commit 2c7af15.

@github-actions
Copy link

PyTest Results (Full)

3 872 tests  ±0   3 860 ✅ +1   11m 5s ⏱️ ±0s
    1 suites ±0      12 💤 ±0 
    1 files   ±0       0 ❌  - 1 

Results for commit 9a3abd4. ± Comparison against base commit 2c7af15.
