[Kernel] Integrate TableFeature framework in read and write path #4159

Conversation
  }
}

// Important note: uses the default implementation of `equals` and `hashCode` methods.
Makes sense, but would it make sense to move this comment to the class javadoc?
done
private static class CheckpointV2TableFeature extends TableFeature.ReaderWriterFeature
    implements FeatureAutoEnabledByMetadata {
  CheckpointV2TableFeature() {
    super("v2Checkpoint", /* minReaderVersion = */ 3, /* minWriterVersion = */ 7);
Nit: would it make sense to use the constants like TABLE_FEATURES_MIN_READER_VERSION etc.?
Hmm, I think keeping the literal versions is fine here, given some of the legacy protocol features (e.g. invariants) have no constants.
 *
 * @param <V>
 */
public class CaseInsensitiveMap<V> implements Map<String, V> {
Would it make sense to call it something like CaseInsensitiveStringMap, since the key is always a string?
Sounds good. Will follow up in a separate PR. I am also wondering if this is really needed.
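For illustration, a case-insensitive string-keyed map along the lines being discussed can be sketched by normalizing keys on the way in. This is only a sketch of the idea, not Kernel's actual implementation; the class and method set here are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Minimal sketch: a map whose String keys compare case-insensitively.
// Keys are normalized to lower case on both insertion and lookup.
class CaseInsensitiveStringMap<V> {
  private final Map<String, V> delegate = new HashMap<>();

  private static String normalize(String key) {
    return key.toLowerCase(Locale.ROOT);
  }

  public V put(String key, V value) {
    return delegate.put(normalize(key), value);
  }

  public V get(String key) {
    return delegate.get(normalize(key));
  }

  public boolean containsKey(String key) {
    return delegate.containsKey(normalize(key));
  }

  public int size() {
    return delegate.size();
  }
}
```

Normalizing at the boundary keeps the delegate a plain HashMap, at the cost of losing the original key casing on iteration.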
 *
 * @param <V>
 */
public class CaseInsensitiveMap<V> implements Map<String, V> {
Would it make sense to implement equals and hashCode?
Same comment as above.
Looks great! Left some minor comments and asked some questions.
        .withFeatures(extractAutomaticallyEnabledFeatures(metadata, current))
        .normalized();

if (!required.canUpgradeTo(current)) {
Wouldn't we want to see if the currProtocol can be upgraded to the requiredProtocol?
LGTM
@@ -231,30 +231,6 @@ class RowTrackingSuite extends DeltaTableWriteSuiteBase with ParquetSuiteBase {
    }
  }

- test("Fail if row tracking is supported but domain metadata is not supported") {
cc @qiyuandong-db @johanl-db Removing this based on the offline discussion. This scenario is not possible with the new table feature framework.
@@ -208,7 +209,11 @@ private static class RowTrackingFeature extends TableFeature.WriterFeature

  @Override
  public boolean metadataRequiresFeatureToBeEnabled(Protocol protocol, Metadata metadata) {
-   return TableConfig.ROW_TRACKING_ENABLED.fromMetadata(metadata);
+   if (TableConfig.ROW_TRACKING_ENABLED.fromMetadata(metadata)) {
cc @qiyuandong-db @johanl-db Disabling this based on the offline discussion, given Kernel doesn't support enabling row tracking through metadata props.
To clarify, this means we don't support enabling on an existing table, right? But we do support enabling on a new table? Can we maybe restate this so it's more user-friendly? I don't think users will understand enabling a feature through metadata.
Even on a new table it is not supported. Currently the way it is set is by passing a protocol action in the Txn.commit along with the file actions.
Okay, so it would be correct to say, for users, that enabling row tracking is not supported?
@qiyuandong-db @johanl-db please chime in on if there is any other way to enable row tracking currently.
updated the error message
Sorry now I'm confused. We support writing to tables with rowTracking already enabled right? We just don't support enabling rowTracking. But maybe I'm misunderstanding the current status?
I will leave it up to @qiyuandong-db and @johanl-db to update the status. It doesn't make sense to go back and forth on the error message and block this PR.
okay... @johanl-db @qiyuandong-db can you please create a ticket to follow up on this?
Yes, we already have a ticket to track this and we'll want to follow up on it this quarter.
Looking at the test code now
        break;
      default:
        throw unsupportedWriterProtocol(tablePath, minWriterVersion);

public static Optional<Tuple2<Protocol, Set<TableFeature>>> autoUpgradeProtocolBasedOnMetadata(
Is this what we currently do in Delta Spark? Automatically upgrade to (3, 7) when enabling something like column mapping?
I'm wondering, is it overkill, to do this every write operation? Even if there is no metadata change? (just wondering, I know it's what we are already doing before)
> Is this what we currently do in Delta Spark? Automatically upgrade to (3, 7) when enabling something like column mapping?

The goal is to set the minimal protocol version that supports all the supported features or identity columns. If we just set column mapping and not the table-features form at (3, 7), then the protocol version is going to be (2, 5), as that's the lowest version that supports column mapping.

> I'm wondering, is it overkill, to do this every write operation? Even if there is no metadata change? (just wondering, I know it's what we are already doing before)

There are fewer than 20 table features, and running through them takes a few ms. Also, it is not easy to find which metadata changes we are interested in (this info is encapsulated in the TableFeature definitions). If we wanted to avoid calling this, we would need to repeat the same code (checking whether the metadata we care about has changed) here.
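The per-write scan described above can be modeled as a loop over all known features, asking each metadata-driven feature whether the new metadata requires it. The types and names below are simplified placeholders, not the actual Kernel API.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Simplified model: a "feature" is a name plus a predicate over the table
// configuration that says whether the metadata requires the feature.
final class FeatureCheck {
  final String name;
  final Predicate<Map<String, String>> requiredByMetadata;

  FeatureCheck(String name, Predicate<Map<String, String>> requiredByMetadata) {
    this.name = name;
    this.requiredByMetadata = requiredByMetadata;
  }

  // Scan every known feature against the new metadata. With fewer than
  // 20 features this is cheap enough to run on every write, whether or
  // not the metadata actually changed.
  static Set<String> autoEnabledFeatures(
      List<FeatureCheck> allFeatures, Map<String, String> config) {
    Set<String> enabled = new HashSet<>();
    for (FeatureCheck f : allFeatures) {
      if (f.requiredByMetadata.test(config)) {
        enabled.add(f.name);
      }
    }
    return enabled;
  }
}
```

The point of the design is that each feature encapsulates its own "am I required by this metadata?" logic, so the caller never needs to know which properties matter.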
I assume we have existing tests for all these upgrades somewhere? Do you know where that is?
> Goal is to set the minimal protocol version that supports all the supported features or identity columns. If we just set column mapping and not feature from (3,7), then the protocol version is going to be (2, 5) as that's the lowest version that supports column mapping.

Maybe I'm confused, but I was thinking that's not what it looks like we do? We have

Protocol required =
    new Protocol(TABLE_FEATURES_MIN_READER_VERSION, TABLE_FEATURES_MIN_WRITER_VERSION)
        .withFeatures(extractAutomaticallyEnabledNewFeatures(newMetadata, currentProtocol))
        .normalized();

which we use to merge, and I see the docs for that are

/**
 * Merge this protocol with multiple `protocols` to have the highest reader and writer versions
 * plus all explicitly and implicitly supported features.
 */
We start with a protocol that has table feature support, add all the features, and then normalize it down to the lowest needed protocol version.
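A toy model of that normalization step: the normalized writer version is the highest minWriterVersion among the supported features (likewise for the reader version), so a table enabling only columnMapping lands on (2, 5) rather than staying at the table-features protocol (3, 7). The version table below is illustrative only; see TableFeatures.java for the real logic, which also handles features that have no legacy version.

```java
import java.util.Map;
import java.util.Set;

final class ProtocolSketch {
  // Illustrative (minReader, minWriter) requirements per legacy feature.
  static final Map<String, int[]> MIN_VERSIONS = Map.of(
      "appendOnly", new int[] {1, 2},
      "checkConstraints", new int[] {1, 3},
      "columnMapping", new int[] {2, 5});

  // Normalize: take the max required reader/writer version across all
  // supported features, starting from the baseline protocol (1, 1).
  static int[] normalized(Set<String> features) {
    int reader = 1;
    int writer = 1;
    for (String f : features) {
      int[] v = MIN_VERSIONS.get(f);
      reader = Math.max(reader, v[0]);
      writer = Math.max(writer, v[1]);
    }
    return new int[] {reader, writer};
  }
}
```

So normalizing `{columnMapping}` yields (2, 5); in the real implementation, any feature with no legacy version keeps the protocol at the table-features versions.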
It seems like that will keep the table-features protocol, though, for a scenario like just enabling columnMapping? Since we don't have the other features in the legacy versions. I'm totally fine with this being the behavior; I'm just wondering if this is what we do in delta-spark, and I want to make sure we test our upgrades.
Added unit tests to clarify the expected output through test examples.
 */
class DeltaTableFeaturesSuite extends DeltaTableWriteSuiteBase {

  ///////////////////////////////////////////////////////////////////////////
Do we have tests for things like if I try to enable CDF we will not be able to write?
Added unit tests in TableFeaturesSuite.scala for each feature's (supported, not enabled) and (supported, enabled) cases. I haven't added integration tests; I feel the unit tests give good enough coverage.
yeah those are very thorough! Can we add 1 test for 1 feature that's enabled by metadata but not supported by Kernel just to check the E2E integration & path through TransactionBuilderImpl? Like just this CDF enable case will fail
I don't see tests for metadata-based protocol upgrades unless they're just feature-by-feature?
…delta commit files (delta-io#4182)

Description: See delta-io#3888 for details. Some versions of Delta-Spark wrote tables with commit info containing arbitrary value types in the `string -> string` map. This has existed for a while. Update the Kernel default delta commit file reader to always try to parse the value as a `string` if the expected type is `string`. This is not ideal, but there is no other easy way to handle it.

How was this patch tested? UT.
Just a few minor comments. Also, still wondering if we already have tests for the metadata-based automatic upgrades? If not, we should test our upgrade logic.
+1 let's make this error message clear and user-friendly
        add("v2Checkpoint");
      }
    });

Set<TableFeature> autoEnabledFeatures =
nit: we should decide if this is only newAutoEnabledFeatures (in which case we should only add domain_metadata below if it's not already there) or all autoEnabledFeatures, and update the name accordingly. I think this is fine to be all autoEnabledFeatures, right? Maybe update extractAutomaticallyEnabledNewFeatures to not remove the existing features.
Makes more semantic sense: create the protocol we require for this table.
Thanks, the autoUpgradeProtocolBasedOnMetadata tests help a ton and are super thorough! Sorry, a few more questions.
LGTM
@@ -429,7 +429,7 @@ public static Optional<Tuple2<Protocol, Set<TableFeature>>> autoUpgradeProtocolB

  Set<TableFeature> allNeededTableFeatures =
      extractAllNeededTableFeatures(newMetadata, currentProtocol);
- if (needDomainMetadataSupport && !allNeededTableFeatures.contains(DOMAIN_METADATA_W_FEATURE)) {
+ if (needDomainMetadataSupport) {
    allNeededTableFeatures =
Not blocking, but why not just allNeededTableFeatures.add(domainMetadata)?
It may not be a mutable Set, given we are building it with Streams.
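That concern is real: a set produced by a stream collector is not guaranteed to support add(), so mutating it in place can throw. A safe pattern is to copy into a fresh set before adding, sketched here with plain strings standing in for TableFeature.

```java
import java.util.HashSet;
import java.util.Set;

class FeatureSetUnion {
  // Rather than calling add() on a possibly-immutable collected set,
  // copy it and add the extra element to the copy.
  static Set<String> withDomainMetadata(Set<String> needed) {
    Set<String> result = new HashSet<>(needed);
    result.add("domainMetadata");
    return result;
  }
}
```

Collectors.toSet() makes no mutability promise either way, and Collectors.toUnmodifiableSet() definitely rejects add(), so the copy avoids an UnsupportedOperationException regardless of which collector produced the input.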
if (tableProperties.isPresent()) {
  Map<String, String> validatedProperties =
      TableConfig.validateDeltaProperties(tableProperties.get());
  Map<String, String> newProperties =
      metadata.filterOutUnchangedProperties(validatedProperties);

  ColumnMapping.verifyColumnMappingChange(
      metadata.getConfiguration(), newProperties, isNewTable);
I'm a bit late to this PR, but was the change to run this logic even when there wasn't a table-properties change intentional? I think it'll fail the verifyColumnMappingChange check below when it shouldn't, in the case where a user didn't define table properties on a subsequent transaction: the new properties will be empty, and the column mapping mode will then be considered null even though it's set on the current state of the table. Unless I'm missing something.
Basically, TL;DR: if I'm not mistaken, I think we still need this validation only under the condition that tableProperties.isPresent().
We do want to run through the protocol changes even when the table properties are empty, because there could be schema changes that we want to consider (e.g. the schema may have timestamp_ntz, in which case we need to enable the timestampNtz table feature).

However, ColumnMapping.verifyColumnMappingChange(metadata.getConfiguration(), newProperties, isNewTable) should take the entire new configuration instead of just the changed properties. This was the case even before this PR. It is possible that the new table properties may not contain the column mapping mode.

I will make a PR to fix this. I also need to check why the tests didn't catch it.
Description

This integrates the end-to-end TableFeature framework in the read and write path. It also adds support for appending into a table where deletionVectors, v2Checkpoint, and timestampNtz are supported.

How was this patch tested?

Existing and new tests.