Merged
32 commits
- `f8bc074` new Snapshot::new_from() API (zachschuermann, Nov 27, 2024)
- `3d37288` Merge remote-tracking branch 'upstream/main' into snapshot-from-snapshot (zachschuermann, Mar 18, 2025)
- `5480711` incremental snapshot update, log segment, table config (zachschuermann, Mar 18, 2025)
- `2686fb3` incremental log segment (zachschuermann, Mar 20, 2025)
- `0daf1e9` Merge remote-tracking branch 'upstream/main' into snapshot-from-snapshot (zachschuermann, Mar 20, 2025)
- `692c9d5` tests (zachschuermann, Mar 21, 2025)
- `f18e380` Merge remote-tracking branch 'upstream/main' into snapshot-from-snapshot (zachschuermann, Mar 21, 2025)
- `8d3357b` refactor (zachschuermann, Mar 21, 2025)
- `3bf3d67` nits (zachschuermann, Mar 21, 2025)
- `453db1b` not pleased with these tests (zachschuermann, Mar 25, 2025)
- `ce31322` fix (zachschuermann, Mar 25, 2025)
- `f1578fb` Merge remote-tracking branch 'upstream/main' into snapshot-from-snapshot (zachschuermann, Mar 25, 2025)
- `ee61b75` docs (zachschuermann, Mar 25, 2025)
- `fea0f76` incremental table config test (zachschuermann, Mar 26, 2025)
- `81f61ae` comment (zachschuermann, Mar 26, 2025)
- `7b0bd1c` fix to include old checkpoint parts (zachschuermann, Mar 26, 2025)
- `691b23a` Merge branch 'main' into snapshot-from-snapshot (zachschuermann, Mar 26, 2025)
- `27c4a95` Merge branch 'main' into snapshot-from-snapshot (zachschuermann, Mar 27, 2025)
- `66e44d2` Merge remote-tracking branch 'upstream/main' into snapshot-from-snapshot (zachschuermann, Mar 28, 2025)
- `46ab944` do list_from last checkpoint version (zachschuermann, Mar 28, 2025)
- `5a582d6` quick nit (zachschuermann, Mar 28, 2025)
- `592667b` few nits (zachschuermann, Apr 4, 2025)
- `75b6178` fix (zachschuermann, Apr 4, 2025)
- `5cdde76` Merge remote-tracking branch 'upstream/main' into snapshot-from-snapshot (zachschuermann, Apr 4, 2025)
- `a94ddef` clippy (zachschuermann, Apr 4, 2025)
- `e8e327d` redo the 'no new commits' check (zachschuermann, Apr 7, 2025)
- `d5bcb67` match instead of if chain (zachschuermann, Apr 7, 2025)
- `be88b88` nits and use tableconfig::try_new in try_new_from (zachschuermann, Apr 7, 2025)
- `d396953` Merge remote-tracking branch 'upstream/main' into snapshot-from-snapshot (zachschuermann, Apr 7, 2025)
- `6e49bc2` fix merging main (zachschuermann, Apr 7, 2025)
- `032d72b` fix deref (zachschuermann, Apr 7, 2025)
- `cb06927` Merge remote-tracking branch 'upstream/main' into snapshot-from-snapshot (zachschuermann, Apr 7, 2025)
42 changes: 41 additions & 1 deletion kernel/src/snapshot.rs
@@ -53,7 +53,8 @@ impl Snapshot {
///
/// - `table_root`: url pointing at the table root (where `_delta_log` folder is located)
/// - `engine`: Implementation of [`Engine`] apis.
/// - `version`: target version of the [`Snapshot`]
/// - `version`: target version of the [`Snapshot`]. None will create a snapshot at the latest
/// version of the table.
pub fn try_new(
table_root: Url,
engine: &dyn Engine,
@@ -71,6 +72,26 @@ impl Snapshot {
Self::try_new_from_log_segment(table_root, log_segment, engine)
}

/// Create a new [`Snapshot`] instance from an existing [`Snapshot`]. This is useful when you
/// already have a [`Snapshot`] lying around and want to do the minimal work to 'update' the
/// snapshot to a later version.
Collaborator: Just to clarify, is this API only for versions later than the existing snapshot?

Collaborator Author: Yep, for now I'm proposing that we allow an old snapshot but just return a new snapshot (no incrementalization). Maybe `warn!` in that case? Or I suppose we could disallow it?

Collaborator: Is there any valid scenario where a caller could legitimately pass a newer snapshot than the one they're asking for? I guess time travel? But if they know they're time traveling, why would they pass a newer snapshot in the first place?

Either way, we should publicly document whether a too-new starting snapshot is an error or merely a useless hint, so callers don't have to wonder.

Collaborator Author: I can't think of any optimization that's available (let alone useful) if the caller passes in a newer snapshot as the hint.

If that's true, then the question is: do we prohibit this behavior or just let it degenerate to the usual try_new the client should have done anyway?

Member: I would vote for returning an error in that case. It's unlikely the engine meant to get into that situation, so let's let them know they are doing something wrong.

Collaborator Author: Updated to be an error now! I agree :)

///
/// # Parameters
///
/// - `existing_snapshot`: reference to an existing [`Snapshot`]
/// - `engine`: Implementation of [`Engine`] apis.
/// - `version`: target version of the [`Snapshot`]. None will create a snapshot at the latest
/// version of the table.
pub fn new_from(
existing_snapshot: &Snapshot,
engine: &dyn Engine,
version: Option<Version>,
) -> DeltaResult<Self> {
Collaborator: Seems like the method should take and return `Arc<Snapshot>`, so we have the option to return the same snapshot if we determine it is still fresh?

Collaborator: Maybe even do

`pub fn refresh(self: &Arc<Self>, ...) -> DeltaResult<Arc<Self>>`

(This would have slightly different intuition than new_from: refresh specifically assumes I want a newer snapshot, if available, and attempting to request an older version may not even be legal. I'm not sure it would even make sense to pass an upper-bound version for a refresh operation.)

Collaborator Author: I've modified it to take and return `Arc<Snapshot>`, but I've avoided calling it refresh since that feels to me like it implies mutability. I'm in favor of new_from since that says you get a new snapshot, just 'from' an older one. Let me know if you agree with that thinking!

// TODO(zach): for now we just pass through to the old API. We should instead optimize this
// to avoid replaying overlapping LogSegments.
Self::try_new(existing_snapshot.table_root.clone(), engine, version)
}
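The review thread above converged on two behaviors that the pass-through body doesn't yet show: the method takes and returns `Arc<Snapshot>`, and a target version older than the hint snapshot is an error. A minimal, self-contained sketch of that control flow (the `Snapshot`, `Error`, and `new_from` names below are stand-ins for illustration, not the kernel's actual types):

```rust
use std::sync::Arc;

type Version = u64;

// Stand-in for the kernel's Snapshot; only the version matters here.
#[derive(Debug)]
pub struct Snapshot {
    pub version: Version,
}

#[derive(Debug, PartialEq)]
pub enum Error {
    // Requested version is older than the hint snapshot: a caller bug.
    NonMonotonicHint { hint: Version, target: Version },
}

pub fn new_from(
    existing: Arc<Snapshot>,
    target: Option<Version>,
) -> Result<Arc<Snapshot>, Error> {
    match target {
        // Older than the hint: error out, per the review decision.
        Some(v) if v < existing.version => Err(Error::NonMonotonicHint {
            hint: existing.version,
            target: v,
        }),
        // Already at the target version: reuse the hint (cheap Arc clone).
        Some(v) if v == existing.version => Ok(existing),
        // Newer target (or None, meaning latest): build a fresh snapshot.
        // The PR currently delegates to try_new; the TODO is to extend the
        // hint's log segment incrementally. Here we just fabricate a result.
        _ => Ok(Arc::new(Snapshot {
            version: target.unwrap_or(existing.version),
        })),
    }
}
```

With this shape, a caller holding an `Arc<Snapshot>` can refresh repeatedly without paying for a rebuild when the table hasn't moved past the requested version.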

/// Create a new [`Snapshot`] instance.
pub(crate) fn try_new_from_log_segment(
location: Url,
@@ -250,6 +271,25 @@ mod tests {
assert_eq!(snapshot.schema(), &expected);
}

#[test]
fn test_snapshot_new_from() {
let path =
std::fs::canonicalize(PathBuf::from("./tests/data/table-with-dv-small/")).unwrap();
let url = url::Url::from_directory_path(path).unwrap();

let engine = SyncEngine::new();
let old_snapshot = Snapshot::try_new(url, &engine, Some(0)).unwrap();
let snapshot = Snapshot::new_from(&old_snapshot, &engine, Some(0)).unwrap();

let expected =
Protocol::try_new(3, 7, Some(["deletionVectors"]), Some(["deletionVectors"])).unwrap();
assert_eq!(snapshot.protocol(), &expected);

let schema_string = r#"{"type":"struct","fields":[{"name":"value","type":"integer","nullable":true,"metadata":{}}]}"#;
let expected: StructType = serde_json::from_str(schema_string).unwrap();
assert_eq!(snapshot.schema(), &expected);
}

#[test]
fn test_read_table_with_last_checkpoint() {
let path = std::fs::canonicalize(PathBuf::from(
2 changes: 1 addition & 1 deletion kernel/src/table_changes/mod.rs
@@ -90,7 +90,7 @@ impl TableChanges {
// supported for every protocol action in the CDF range.
let start_snapshot =
Snapshot::try_new(table_root.as_url().clone(), engine, Some(start_version))?;
let end_snapshot = Snapshot::try_new(table_root.as_url().clone(), engine, end_version)?;
let end_snapshot = Snapshot::new_from(&start_snapshot, engine, end_version)?;
Collaborator: This opens an interesting question... if we knew that new_from would reuse the log checkpoint and just "append" any new commit .json files to the log segment, then we could almost (**) reuse that log segment for the CDF replay by just stripping out its checkpoint files. But that's pretty CDF-specific; in the normal case we want a refresh to use the newest checkpoint available, because it makes data-skipping log replay cheaper. Maybe the CDF case needs a completely different way of creating the end_snapshot, unrelated to this optimization.

(**) Almost, because the start version might have a checkpoint, in which case stripping the checkpoint out of the log segment would also remove the start version. But then again, do we actually want the older snapshot to be the start version? Or the previous version which the start version is making changes to? Or maybe we should restrict the checkpoint search to versions before the start version, specifically so that this optimization can work.

Collaborator:

> do we actually want the older snapshot to be the start version?

It would be sufficient to have the older snapshot be start_version - 1, as long as we also have access to the commit at start_version. With these, we would start P&M (protocol and metadata) replay at start_version, then continue it on the older snapshot if we don't find anything.

I guess this would look like: `snapshot(start_version - 1).refresh_with_commits(end_version)`

After all, the goal of the start_snapshot is just to ensure that CDF is enabled.
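The fallback this reply describes (replay protocol and metadata over only the new commits, keeping the older snapshot's already-resolved P&M when none of them touch it) can be modeled in a few lines. Everything here is hypothetical illustration; `refresh_pm` and these types are not kernel APIs:

```rust
#[derive(Clone, Debug, PartialEq)]
pub struct ProtocolMetadata {
    pub min_reader_version: u32,
    pub cdf_enabled: bool,
}

pub struct BaseSnapshot {
    pub version: u64,
    pub pm: ProtocolMetadata,
}

/// Each new commit is (version, P&M action found in that commit, if any).
pub fn refresh_pm(
    base: &BaseSnapshot,
    new_commits: &[(u64, Option<ProtocolMetadata>)],
) -> ProtocolMetadata {
    // Newest commit wins; if none of the new commits carry a P&M action,
    // fall back to the base snapshot's cached P&M instead of re-reading
    // the older portion of the log.
    new_commits
        .iter()
        .rev()
        .find_map(|(_, pm)| pm.clone())
        .unwrap_or_else(|| base.pm.clone())
}
```

The key property is that the older snapshot's log replay is never repeated: the scan touches only commits after `base.version`.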


// Verify CDF is enabled at the beginning and end of the interval to fail early. We must
// still check that CDF is enabled for every metadata action in the CDF range.