feat: add intermediate state checkpointing during pagination#915

Draft
devin-ai-integration[bot] wants to merge 3 commits into main from
devin/1771602439-intermediate-state-checkpoint

@devin-ai-integration
Contributor

@devin-ai-integration devin-ai-integration bot commented Feb 20, 2026

feat: add intermediate state checkpointing during pagination

Summary

When a stream paginates through many pages within a single slice/partition, state is currently only emitted when the partition closes. If the sync fails mid-pagination (e.g., rate limits, 504 errors), all progress is lost.

This PR adds intermediate state checkpointing to the CDK: when ConcurrentCursor detects that records are arriving in ascending cursor order, it will emit a state checkpoint every N pages. On the next sync, the stream resumes from the last checkpoint rather than restarting from the beginning of the slice.

Motivation: airbytehq/oncall#11335. The source-zendesk-support ticket_comments stream loses ~25k records of progress on each failure because no state is emitted during pagination.

Changes:

  • Declarative schema (declarative_component_schema.yaml) — Added pages_per_checkpoint_interval (optional integer) to both DatetimeBasedCursor and IncrementingCountCursor definitions. Defaults to disabled (no intermediate checkpointing unless explicitly configured).
  • Generated models (declarative_component_schema.py) — Updated DatetimeBasedCursor and IncrementingCountCursor model classes with the new pages_per_checkpoint_interval field.
  • ConcurrentCursor.emit_intermediate_state(stream_slice) — New method that adds a partial [start, cursor_value] slice to state and emits a state message, but only when _is_ascending_order is True. Handles both streams with and without slice_boundary_fields.
  • PaginationTracker — Extended with checkpoint_cursor and pages_per_checkpoint_interval params. New on_page_complete() method increments a page counter and triggers intermediate checkpoint when the interval is reached.
  • SimpleRetriever._read_pages() — Calls pagination_tracker.on_page_complete(stream_slice) after each successful page.
  • model_to_component_factory._create_pagination_tracker_factory() — Now reads pages_per_checkpoint_interval from the incremental sync model (if present) and passes it through to PaginationTracker. The feature is only active when a ConcurrentCursor is present AND the schema value is set.
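
The page-counting behaviour described for PaginationTracker.on_page_complete() can be sketched roughly as below. This is a minimal illustration, not the code in this PR: the class name PaginationTrackerSketch and the `checkpoint` callback are hypothetical, and firing on a modulo of the page counter is an assumption about how "every N pages" is detected.

```python
from typing import Callable, Optional


class PaginationTrackerSketch:
    """Hypothetical, simplified stand-in for the PaginationTracker changes."""

    def __init__(
        self,
        checkpoint: Optional[Callable[[dict], None]] = None,
        pages_per_checkpoint_interval: Optional[int] = None,
    ) -> None:
        # In the PR the callback role is played by the cursor's
        # emit_intermediate_state(stream_slice); here it is any callable.
        self._checkpoint = checkpoint
        self._interval = pages_per_checkpoint_interval
        self._page_count = 0

    def on_page_complete(self, stream_slice: dict) -> None:
        # Disabled unless both a callback and a positive interval are configured.
        if self._checkpoint is None or not self._interval:
            return
        self._page_count += 1
        # Assumption: a checkpoint fires whenever the counter hits a multiple of N.
        if self._page_count % self._interval == 0:
            self._checkpoint(stream_slice)


emitted = []
tracker = PaginationTrackerSketch(checkpoint=emitted.append, pages_per_checkpoint_interval=2)
for _ in range(5):
    tracker.on_page_complete({"start": "2026-01-01"})
# Pages 2 and 4 trigger a checkpoint; page 5 does not.
```

With no interval configured, on_page_complete() in this sketch returns immediately, which mirrors the "disabled by default" behaviour described in the schema bullet.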

Safety: The feature is a no-op when records are not in ascending order (the cursor tracks this via _is_ascending_order). The merge_intervals call ensures intermediate slices are correctly merged with the final partition close. When not configured in the schema, behavior is unchanged from before this PR.
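
To make the safety argument concrete, here is a small sketch of the two guards: the ascending-order check and the interval merge. The function names carry a `_sketch` suffix because they are illustrative only, plain integers stand in for cursor values, and the merge logic is a generic interval merge rather than the CDK's own merge_intervals implementation.

```python
from typing import List, Tuple

Interval = Tuple[int, int]


def merge_intervals_sketch(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching (start, end) intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it instead of appending.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def emit_intermediate_state_sketch(
    slices: List[Interval], start: int, cursor_value: int, is_ascending_order: bool
) -> List[Interval]:
    """Add a partial [start, cursor_value] slice only when order is ascending."""
    if not is_ascending_order:
        return slices  # no-op: records were seen out of order
    return merge_intervals_sketch(slices + [(start, cursor_value)])


state: List[Interval] = []
state = emit_intermediate_state_sketch(state, 0, 40, True)  # first checkpoint
state = emit_intermediate_state_sketch(state, 0, 75, True)  # second checkpoint
state = merge_intervals_sketch(state + [(0, 100)])          # partition close
```

After the final merge, `state` holds a single interval covering the whole slice, so the intermediate checkpoints leave no duplicate entries behind in this simplified model.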

Review & Testing Checklist for Human

  • Interaction with close_partition for intermediate slices — When emit_intermediate_state adds a partial slice, then close_partition also adds a slice for the same range. The merge should combine them correctly, but this interaction is only unit-tested in isolation. Verify end-to-end that state doesn't get corrupted or duplicated after a full partition lifecycle with intermediate checkpoints.
  • Thread safety of _is_ascending_order check: emit_intermediate_state() reads self._is_ascending_order outside the lock, and observe() writes it without a lock. Likely benign (the flag only transitions True→False), but worth verifying that no race exists.
  • _page_count never resets — Unlike _record_count, the _page_count in PaginationTracker is never reset (even in _reset()). This means checkpoint intervals span pagination resets. Is this intended?
  • Schema field discoverability — Connector developers need to understand that pages_per_checkpoint_interval only works when records are in ascending cursor order. The schema description mentions this, but consider whether additional documentation is needed.
  • Generated models were manually updated — The Python models in declarative_component_schema.py were manually edited rather than regenerated via bin/generate_component_manifest_files.py. Verify the manual changes match what the code generator would produce.
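
The _page_count question above is easier to reason about with a toy model. The helper below is hypothetical and only counts pages; it compares a counter that survives a pagination reset with one that is cleared, assuming a checkpoint fires when the counter reaches a multiple of the interval.

```python
from typing import List


def checkpoint_pages(
    pages_per_run: List[int], interval: int, reset_between_runs: bool
) -> List[int]:
    """Return the overall 1-based page numbers at which a checkpoint would fire.

    Each entry of pages_per_run is the number of pages read before the next
    pagination reset (hypothetical model, not the CDK's actual reset logic).
    """
    fired: List[int] = []
    page_count = 0
    overall = 0
    for run_index, pages in enumerate(pages_per_run):
        if reset_between_runs and run_index > 0:
            page_count = 0  # variant where a reset also clears the counter
        for _ in range(pages):
            overall += 1
            page_count += 1
            if page_count % interval == 0:
                fired.append(overall)
    return fired


# 3 pages, a pagination reset, then 4 more pages, with an interval of 5.
spanning = checkpoint_pages([3, 4], interval=5, reset_between_runs=False)
cleared = checkpoint_pages([3, 4], interval=5, reset_between_runs=True)
```

In this model the non-resetting counter fires once, two pages into the second run, while the resetting counter never fires. Which of the two is desirable is the open question for the reviewer.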

Recommended test plan:

  1. ✅ Unit tests for ConcurrentCursor.emit_intermediate_state() — covering ascending/non-ascending order, with/without boundary fields
  2. ✅ Unit tests for PaginationTracker.on_page_complete() — verifying page counting and checkpoint triggering
  3. Integration test: Mock a stream that paginates 20 pages with pages_per_checkpoint_interval: 5, verify state is emitted at pages 5, 10, 15, 20
  4. Manual test with source-zendesk-support or similar connector: configure pages_per_checkpoint_interval in the manifest, verify state advances during pagination, and that a mid-pagination failure resumes from the last checkpoint
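
A rough shape for the integration check in step 3, and for the failure-and-resume check in step 4, is sketched below. It does not use the CDK test utilities: `run_sync` is a hypothetical simulation, pages are plain lists of ascending integer cursor values, and a failure is modelled by stopping before the failing page is processed.

```python
from typing import List, Optional, Tuple


def run_sync(
    pages: List[List[int]], interval: int, fail_on_page: Optional[int] = None
) -> List[Tuple[int, int]]:
    """Return (page_number, cursor_value) for every checkpoint emitted."""
    checkpoints: List[Tuple[int, int]] = []
    for page_number, records in enumerate(pages, start=1):
        if fail_on_page is not None and page_number == fail_on_page:
            break  # simulated mid-pagination failure, e.g. a 504
        if page_number % interval == 0 and records:
            # Records are ascending, so the last one is the highest cursor seen.
            checkpoints.append((page_number, records[-1]))
    return checkpoints


# 20 pages of 10 records each, cursor values 0..199 in ascending order.
pages = [[p * 10 + i for i in range(10)] for p in range(20)]

full_run = run_sync(pages, interval=5)
failed_run = run_sync(pages, interval=5, fail_on_page=13)
resume_from = failed_run[-1][1] if failed_run else None
```

Here the full run checkpoints at pages 5, 10, 15 and 20, and the run that fails on page 13 would resume from the cursor value recorded at page 10 rather than from the start of the slice.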

Notes

  • This feature is now opt-in via declarative schema. Connectors must explicitly set pages_per_checkpoint_interval on their incremental sync cursor to enable intermediate checkpointing. When not set, behavior is unchanged.
  • The feature is safe by design: if records aren't sorted, it's a no-op. If there's any issue with intermediate checkpoints, the final close_partition still emits the full slice state.
  • Lambda closure in _create_pagination_tracker_factory captures the actual cursor (not a copy), so all PaginationTracker instances share the same cursor reference. This is intended — the lock in emit_intermediate_state handles concurrent access.
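
The opt-in rule from the first note can be illustrated as follows. This is a sketch under assumptions: the manifest fragment is abbreviated and shown as a Python dict rather than YAML, and `resolve_checkpoint_interval` is a hypothetical helper, not a function in model_to_component_factory.

```python
from typing import Any, Mapping, Optional


def resolve_checkpoint_interval(
    incremental_sync_model: Optional[Mapping[str, Any]],
    has_concurrent_cursor: bool,
) -> Optional[int]:
    """Return the configured interval, or None when the feature is inactive."""
    # Inactive unless a concurrent cursor exists AND the schema value is set.
    if not has_concurrent_cursor or not incremental_sync_model:
        return None
    return incremental_sync_model.get("pages_per_checkpoint_interval")


# Abbreviated, illustrative incremental_sync fragments (other fields omitted).
opted_in = {
    "type": "DatetimeBasedCursor",
    "cursor_field": "updated_at",
    "pages_per_checkpoint_interval": 5,
}
not_configured = {"type": "DatetimeBasedCursor", "cursor_field": "updated_at"}

enabled = resolve_checkpoint_interval(opted_in, has_concurrent_cursor=True)
unchanged = resolve_checkpoint_interval(not_configured, has_concurrent_cursor=True)
no_cursor = resolve_checkpoint_interval(opted_in, has_concurrent_cursor=False)
```

Only the first call yields an interval; the other two return None, which corresponds to behaviour being unchanged for connectors that do not set the field.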

Devin session
Requested by: gl_anatolii.yatsuk@airbyte.io

When records are sorted in ascending order by cursor field, the CDK
will now emit state checkpoints every N pages (default: 5) during
pagination within a partition. This prevents loss of all progress
when a sync fails mid-pagination due to rate limits or errors.

Changes:
- Add emit_intermediate_state() to ConcurrentCursor
- Extend PaginationTracker with page counting and checkpoint triggering
- Call on_page_complete() in SimpleRetriever._read_pages()
- Wire up checkpoint cursor in model_to_component_factory

Co-Authored-By: gl_anatolii.yatsuk@airbyte.io <gl_anatolii.yatsuk@airbyte.io>
@devin-ai-integration
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@github-actions

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1771602439-intermediate-state-checkpoint#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1771602439-intermediate-state-checkpoint

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /prerelease - Triggers a prerelease publish with default arguments
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment

Co-Authored-By: gl_anatolii.yatsuk@airbyte.io <gl_anatolii.yatsuk@airbyte.io>
@github-actions

github-actions bot commented Feb 20, 2026

PyTest Results (Fast)

3 881 tests  +12   3 869 ✅ +12   6m 23s ⏱️ -14s
    1 suites ± 0      12 💤 ± 0 
    1 files   ± 0       0 ❌ ± 0 

Results for commit e0ef3eb. ± Comparison against base commit cd7e369.

♻️ This comment has been updated with latest results.

…remental sync cursors

Co-Authored-By: gl_anatolii.yatsuk@airbyte.io <gl_anatolii.yatsuk@airbyte.io>
@github-actions

PyTest Results (Full)

3 884 tests   3 872 ✅  10m 49s ⏱️
    1 suites     12 💤
    1 files        0 ❌

Results for commit e0ef3eb.
