fix: always decompress gzip when GzipDecoder is explicitly selected by lleadbet · Pull Request #909 · airbytehq/airbyte-python-cdk

Lucas Leadbetter (lleadbet) · 2026-02-18T20:44:37Z

Summary

One-line fix in create_gzip_decoder to use gzip_parser (instead of gzip_parser.inner_parser) as the fallback when response headers don't match known gzip content types.

Problem: When a user explicitly selects GzipDecoder in the Connector Builder, but the API returns gzipped data without standard gzip headers (Content-Encoding: gzip, etc.), the fallback parser skipped decompression entirely and passed raw gzip bytes to the inner parser (e.g., CsvParser), causing 'utf-8' codec can't decode byte 0x8b in position 1 errors.

Fix: Change fallback_parser=gzip_parser.inner_parser → fallback_parser=gzip_parser so that when the user explicitly configures gzip decoding, decompression is always attempted regardless of response headers.

Review & Testing Checklist for Human

Verify GzipParser handles non-gzipped data gracefully. The GzipParser.parse() docstring (line 39 of composite_raw_decoder.py) claims "If the data is not gzipped, reset the pointer and pass the data to the inner parser as is" — but the actual implementation has no try/except or fallback logic. If an API conditionally returns uncompressed data while the user has GzipDecoder selected, this change could turn a working sync into a gzip.BadGzipFile error. This is the highest-risk item.
Test with a real connector that uses GzipDecoder where the API does NOT set gzip headers — this is the scenario the fix targets. Confirm that the sync now succeeds instead of throwing the utf-8 decode error.
Test with a real connector that uses GzipDecoder where the API DOES set proper gzip headers — confirm no regression in the happy path (headers match → gzip_parser is selected via by_headers, not the fallback).

Notes

The Connector Builder testing path (_emit_connector_builder_messages=True, lines 2636-2640) is unchanged — that path has a separate mechanism that bypasses GzipParser entirely and relies on requests.content auto-decompression.
No new tests were added. Consider adding a test for the specific scenario: gzip-compressed body with no matching response headers.
Requested by: Lucas Leadbetter (@lleadbet)
Link to Devin run

Summary by CodeRabbit

Bug Fixes
- Improved gzip decompression fallback behavior to handle edge cases more robustly.

Co-Authored-By: lucas.leadbetter@airbyte.io <lucas.leadbetter@gmail.com>

devin-ai-integration · 2026-02-18T20:44:42Z

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

Disable automatic comment and CI monitoring

github-actions · 2026-02-18T20:45:08Z

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

💡 Show Tips and Tricks

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1771447353-fix-gzip-decoder-always-decompress#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1771447353-fix-gzip-decoder-always-decompress

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

/autofix - Fixes most formatting and linting issues
/poetry-lock - Updates poetry.lock file
/test - Runs connector tests with the updated CDK
/prerelease - Triggers a prerelease publish with default arguments
/poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
/poe <command> - Runs any poe command in the CDK environment

📚 Show Repo Guidance

Helpful Resources

CDK API Reference

📝 Edit this welcome message.

coderabbitai · 2026-02-18T20:49:00Z

📝 Walkthrough

Walkthrough

A single-line modification in the gzip decoder construction that changes the fallback parser argument from gzip_parser.inner_parser to gzip_parser within the CompositeRawDecoder.by_headers call. This affects the fallback decompression behavior when handling gzip content.

Changes

Cohort / File(s)	Summary
Gzip Decoder Fallback Logic `airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py`	Modified the `create_gzip_decoder` function to use the outer `GzipParser` as fallback instead of its inner parser when decompressing gzip content via `CompositeRawDecoder.by_headers`.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

One quick thought: Was this fallback behavior change intentional? The switch from using the inner parser to the outer parser might have implications for how decompression failures cascade—wdyt? It would be helpful to understand if there's specific test coverage validating this fallback path works as expected.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly and clearly describes the main change: switching the fallback parser in gzip decoder from inner_parser to the outer parser to ensure decompression happens when GzipDecoder is explicitly selected.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch devin/1771447353-fix-gzip-decoder-always-decompress

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
2620-2646: Missing test for the fixed scenario — want me to draft one?

The PR description itself flags this gap: there's no test covering a gzip-encoded response body without matching Content-Encoding/Content-Type headers (the exact scenario that was broken). Given this is a targeted behavior fix, having at least one regression test would really help lock in the intent and prevent the bug from quietly returning. Would it be useful if I opened a follow-up issue or drafted the test inline?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py` around
lines 2620 - 2646, The PR lacks a regression test for create_gzip_decoder
covering the case where a response body is gzip-compressed but the response
headers do not include the expected Content-Encoding/Content-Type values; add a
unit/integration test that constructs a gzip-compressed response body with
missing or non-matching headers and verifies CompositeRawDecoder.by_headers (and
the fallback_parser behavior of create_gzip_decoder / gzip_parser) correctly
decodes the payload, ensuring the previously broken path is exercised and
asserted.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py`:
- Around line 2642-2646: GzipParser.parse has a docstring that promises graceful
fallback to the inner parser for non-gzipped data but the implementation
directly wraps the input in gzip.GzipFile without handling BadGzipFile; change
GzipParser.parse to try reading/decompressing via gzip.GzipFile (or a small read
to trigger format validation) inside a try/except catching OSError/BadGzipFile,
and on exception reset the input pointer (call data.seek(0) or reopen the
original bytes stream) then call and return self.inner_parser.parse(data) so
non-gzip payloads are passed through as documented; keep raising other
unexpected exceptions.

---

Nitpick comments:
In `@airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py`:
- Around line 2620-2646: The PR lacks a regression test for create_gzip_decoder
covering the case where a response body is gzip-compressed but the response
headers do not include the expected Content-Encoding/Content-Type values; add a
unit/integration test that constructs a gzip-compressed response body with
missing or non-matching headers and verifies CompositeRawDecoder.by_headers (and
the fallback_parser behavior of create_gzip_decoder / gzip_parser) correctly
decodes the payload, ensuring the previously broken path is exercised and
asserted.

coderabbitai · 2026-02-18T20:49:03Z

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py

        return CompositeRawDecoder.by_headers(
            [({"Content-Encoding", "Content-Type"}, _compressed_response_types, gzip_parser)],
            stream_response=True,
-            fallback_parser=gzip_parser.inner_parser,
+            fallback_parser=gzip_parser,
        )


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Find GzipParser class and its parse method rg -n 'class GzipParser' -A 30 --type py

Repository: airbytehq/airbyte-python-cdk

Length of output: 3227

🏁 Script executed:

# Find CompositeRawDecoder class and by_headers method rg -n 'class CompositeRawDecoder' -A 50 --type py

Repository: airbytehq/airbyte-python-cdk

Length of output: 5560

🏁 Script executed:

# Check the by_headers implementation more thoroughly rg -n 'def by_headers' -B 5 -A 40 --type py

Repository: airbytehq/airbyte-python-cdk

Length of output: 5320

🏁 Script executed:

cat -n airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py | sed -n '34,60p'

Repository: airbytehq/airbyte-python-cdk

Length of output: 1183

Fix looks correct — but found a pre-existing issue in GzipParser worth addressing.

The change properly ensures that when GzipDecoder is explicitly configured, decompression is always attempted regardless of response headers. Logic is sound: fallback is now the full gzip_parser instead of gzip_parser.inner_parser.

One thing flagged during verification:

GzipParser.parse() has a docstring-implementation mismatch. The docstring claims "If the data is not gzipped, reset the pointer and pass the data to the inner parser as is" (lines 38-39 in composite_raw_decoder.py), but the actual implementation (lines 45-46) just calls gzip.GzipFile(fileobj=data, mode="rb") with no error handling or pointer reset. When non-gzip data hits this, it will raise OSError (BadGzipFile) on first read, not gracefully fallback.

This is pre-existing, but your change makes it more likely to surface: with fallback_parser=gzip_parser, non-gzip data now goes through GzipParser.parse() instead of bypassing it. If a user configures GzipDecoder against an API returning plain (non-gzip) responses, they'll hit this hard crash.

Could you either:

Fix GzipParser.parse() to actually reset the pointer and pass plain data through (matching the docstring), or

Update the docstring to reflect that it will raise OSError on non-gzip input?

Pagination support is fine: CompositeRawDecoder.by_headers() returns cls(fallback_parser, ...) which sets self.parser = fallback_parser, so the decoder's parser property will correctly expose gzip_parser with its inner_parser. The pagination check should pass as expected.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py` around lines 2642 - 2646, GzipParser.parse has a docstring that promises graceful fallback to the inner parser for non-gzipped data but the implementation directly wraps the input in gzip.GzipFile without handling BadGzipFile; change GzipParser.parse to try reading/decompressing via gzip.GzipFile (or a small read to trigger format validation) inside a try/except catching OSError/BadGzipFile, and on exception reset the input pointer (call data.seek(0) or reopen the original bytes stream) then call and return self.inner_parser.parse(data) so non-gzip payloads are passed through as documented; keep raising other unexpected exceptions.

Good catch on the docstring-implementation mismatch in GzipParser.parse() — that's a real pre-existing issue.

However, I'm keeping this PR scoped to the minimal fix (changing the fallback parser) rather than expanding into GzipParser changes, for a few reasons:

The scenario this PR fixes: the user explicitly selects GzipDecoder because the API always returns gzip — so the fallback hitting GzipParser is correct behavior and will decompress successfully.

The edge case you describe (API conditionally returning non-gzip while GzipDecoder is selected) is a separate concern that existed before this PR — the previous by_headers path already used GzipParser when headers matched, so a response without gzip encoding was already handled by the fallback (which before just skipped decompression silently, potentially producing garbage).

Fixing GzipParser.parse() to gracefully handle non-gzip input is a good improvement but would change behavior for all GzipParser consumers and warrants its own review.

I'd suggest either fixing the GzipParser docstring or implementing the graceful fallback as a follow-up. Happy to defer to the human reviewer on whether they'd prefer it bundled here.

Devin session

github-actions · 2026-02-18T20:52:42Z

PyTest Results (Fast)

3 869 tests ±0 3 857 ✅ ±0 6m 45s ⏱️ +4s
1 suites ±0 12 💤 ±0
1 files ±0 0 ❌ ±0

Results for commit 9a3abd4. ± Comparison against base commit 2c7af15.

github-actions · 2026-02-18T20:56:58Z

PyTest Results (Full)

3 872 tests ±0 3 860 ✅ +1 11m 5s ⏱️ ±0s
1 suites ±0 12 💤 ±0
1 files ±0 0 ❌ - 1

Results for commit 9a3abd4. ± Comparison against base commit 2c7af15.

fix: always decompress gzip when GzipDecoder is explicitly selected

9a3abd4

Co-Authored-By: lucas.leadbetter@airbyte.io <lucas.leadbetter@gmail.com>

devin-ai-integration bot assigned Lucas Leadbetter (lleadbet) Feb 18, 2026

coderabbitai bot requested changes Feb 18, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

fix: always decompress gzip when GzipDecoder is explicitly selected#909

fix: always decompress gzip when GzipDecoder is explicitly selected#909
Lucas Leadbetter (lleadbet) wants to merge 1 commit intomainfrom
devin/1771447353-fix-gzip-decoder-always-decompress

Lucas Leadbetter (lleadbet) commented Feb 18, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

devin-ai-integration bot commented Feb 18, 2026

Uh oh!

github-actions bot commented Feb 18, 2026

Testing This CDK Version

PR Slash Commands

Helpful Resources

Uh oh!

coderabbitai bot commented Feb 18, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Feb 18, 2026

Uh oh!

devin-ai-integration bot Feb 18, 2026

Uh oh!

github-actions bot commented Feb 18, 2026

Uh oh!

github-actions bot commented Feb 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Comments

Conversation

Lucas Leadbetter (lleadbet) commented Feb 18, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Review & Testing Checklist for Human

Notes

Summary by CodeRabbit

Uh oh!

devin-ai-integration bot commented Feb 18, 2026

🤖 Devin AI Engineer

Uh oh!

github-actions bot commented Feb 18, 2026

👋 Greetings, Airbyte Team Member!

Testing This CDK Version

PR Slash Commands

Helpful Resources

Uh oh!

coderabbitai bot commented Feb 18, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration bot Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Feb 18, 2026

PyTest Results (Fast)

Uh oh!

github-actions bot commented Feb 18, 2026

PyTest Results (Full)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Lucas Leadbetter (lleadbet) commented Feb 18, 2026 •

edited by coderabbitai bot

Loading