Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: expose peek next commit function to python #1937

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

PengLiVectra
Copy link

@PengLiVectra PengLiVectra commented Dec 4, 2023

Description

Expose peek_next_commit to python. Can be used for streaming delta commit changes. peek_next_commit will return actions in the commit and commit version of next commit. If current version is the latest version, it will return None (actions will be None) and the current version. If there is a missed commit, it will skip the missed commit. For example, we have commits 0, 1, 2, 4, 5, peek_next_commit(2) will return commit 4, skip the missed commit 3.

An example of usage:
actions, next_version = delta_table.peek_next_commit(current_version)
If current_version is the latest version, the return will be like None, current_version.

Related Issue(s)

#1886

Testing

Added unit tests that passed.

Breaking-Change

Not a breaking change.

@github-actions github-actions bot added binding/python Issues for the Python package binding/rust Issues for the Rust crate crate/core labels Dec 4, 2023
@PengLiVectra PengLiVectra force-pushed the expose-peek-next-commit-to-python branch from 167e954 to 359306a Compare December 4, 2023 14:25
@roeap
Copy link
Collaborator

roeap commented Dec 4, 2023

Haven't looked at the actual PR yet, however there is one thing that we should consider.

We are currently in the process of re-writing the log-replay logic and the new approach does not work in the same way. There will eventually be the need to also have some concept of incrementally updating the state, so some notion of peeking at the next commit may still be present, but may look somewhat different.

I guess what I am saying is that if we expose this, there is not much of a guarantee it can survive the next couple of releases...

@PengLiVectra
Copy link
Author

Haven't looked at the actual PR yet, however there is one thing that we should consider.

We are currently in the process of re-writing the log-replay logic and the new approach does not work in the same way. There will eventually be the need to also have some concept of incrementally updating the state, so some notion of peeking at the next commit may still be present, but may look somewhat different.

I guess what I am saying is that if we expose this, there is not much of a guarantee it can survive the next couple of releases...

I see, thanks for letting me know. Is there a document to explain the re-writing?

@roeap
Copy link
Collaborator

roeap commented Dec 4, 2023

Is there a document to explain the re-writing?

Not a publicly accessible one yet. The basic logic is quite simple though.

  1. We create a LogSegment, which contains the latest checkpoint files if present, and list all log entry files with commit numbers greater then the last checkpoint.
  2. Order the commit entries in reverse order.

After that we iterate through all actions and keep track of the "seen" actions.

@PengLiVectra PengLiVectra force-pushed the expose-peek-next-commit-to-python branch from 359306a to b402f3b Compare December 6, 2023 18:05
@roeap
Copy link
Collaborator

roeap commented Dec 7, 2023

@PengLiVectra - about to start reviving this. To keep things smaller could we update the test table to not contain so many files and only the commits we need for that specific scenario?

@PengLiVectra PengLiVectra force-pushed the expose-peek-next-commit-to-python branch from b402f3b to 005d52a Compare December 11, 2023 15:42
@roeap
Copy link
Collaborator

roeap commented Dec 12, 2023

For example, we have commits 0, 1, 2, 4, 5, peek_next_commit(2) will return commit 4, skip the missed commit 3.

this would be considered a seriously corrupted table. Did you encounter this scenario using delta-rs? Or any other writer for that matter?

@PengLiVectra
Copy link
Author

For example, we have commits 0, 1, 2, 4, 5, peek_next_commit(2) will return commit 4, skip the missed commit 3.

this would be considered a seriously corrupted table. Did you encounter this scenario using delta-rs? Or any other writer for that matter?

Sometimes it might be delete accidentally, in this case, we don't want it to break streaming, so we skip it.

@PengLiVectra
Copy link
Author

@PengLiVectra - about to start reviving this. To keep things smaller could we update the test table to not contain so many files and only the commits we need for that specific scenario?

Deleted most of the files to keep the testing data small.

@rtyler rtyler self-assigned this Feb 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/python Issues for the Python package binding/rust Issues for the Rust crate crate/core
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants