
[Kernel] Integrate TableFeature framework in read and write path #4159

Merged
merged 21 commits into from
Feb 26, 2025

Conversation

vkorukanti
Collaborator

@vkorukanti vkorukanti commented Feb 14, 2025

Description

This integrates the end-to-end TableFeature framework into the read and write paths.

It also adds support for appending into tables with the `deletionVectors`, `v2Checkpoint`, and `timestampNtz` features enabled.

How was this patch tested?

Existing and new tests.

@vkorukanti vkorukanti changed the title Tf integration [Kernel] Integrate TableFeature framework in read and write path Feb 14, 2025
}
}

// Important note: uses the default implementation of `equals` and `hashCode` methods.
Collaborator

Makes sense, but would it make sense to move this comment to the class javadoc?

Collaborator Author

done

private static class CheckpointV2TableFeature extends TableFeature.ReaderWriterFeature
implements FeatureAutoEnabledByMetadata {
CheckpointV2TableFeature() {
super("v2Checkpoint", /* minReaderVersion = */ 3, /* minWriterVersion = */ 7);
Collaborator

Nit: would it make sense to use the constants like TABLE_FEATURES_MIN_READER_VERSION etc?

Collaborator Author

Hmm, I think keeping the literal values is fine here, given some of the legacy protocol versions (e.g. for invariants) have no named constants.

*
* @param <V>
*/
public class CaseInsensitiveMap<V> implements Map<String, V> {
Collaborator

Would it make sense to call it something like CaseInsensitiveStringMap since the key is always a string?

Collaborator Author

Sounds good. Will follow up in a separate PR. I am also wondering if this is really needed.

*
* @param <V>
*/
public class CaseInsensitiveMap<V> implements Map<String, V> {
Collaborator

Would it make sense to implement equals and hashCode?

Collaborator Author

Same comment as above.
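To make the review suggestion concrete, here is a minimal sketch (not the Kernel class, which implements the full `Map<String, V>` interface) of what value-based `equals`/`hashCode` could look like for a case-insensitive string-keyed map. With the default identity-based implementations, two maps holding the same entries would compare unequal. The class name follows the rename suggested above and is hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Sketch only: keys are normalized to lowercase on insert and lookup, and
// equals/hashCode delegate to the inner map so two maps with the same entries
// compare equal regardless of the key casing used by callers.
class CaseInsensitiveStringMap<V> {
  private final Map<String, V> inner = new HashMap<>();

  public V put(String key, V value) {
    return inner.put(key.toLowerCase(Locale.ROOT), value);
  }

  public V get(String key) {
    return inner.get(key.toLowerCase(Locale.ROOT));
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof CaseInsensitiveStringMap)) return false;
    return inner.equals(((CaseInsensitiveStringMap<?>) o).inner);
  }

  @Override
  public int hashCode() {
    return inner.hashCode();
  }
}
```

With this, `put("Foo", 1)` and `put("FOO", 1)` produce maps that are equal and hash to the same value, which is what callers storing these maps in sets or test assertions would expect.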

Collaborator

@scottsand-db scottsand-db left a comment

Looks great! Left some minor comments and asked some questions.

.withFeatures(extractAutomaticallyEnabledFeatures(metadata, current))
.normalized();

if (!required.canUpgradeTo(current)) {
Collaborator

Wouldn't we want to see if the currProtocol can be upgraded to the requiredProtocol?

Collaborator

@anoopj anoopj left a comment

LGTM

@@ -231,30 +231,6 @@ class RowTrackingSuite extends DeltaTableWriteSuiteBase with ParquetSuiteBase {
}
}

test("Fail if row tracking is supported but domain metadata is not supported") {
Collaborator Author

cc @qiyuandong-db @johanl-db Removing this based on the offline discussion. This scenario is not possible with the new table feature framework.

@@ -208,7 +209,11 @@ private static class RowTrackingFeature extends TableFeature.WriterFeature

@Override
public boolean metadataRequiresFeatureToBeEnabled(Protocol protocol, Metadata metadata) {
return TableConfig.ROW_TRACKING_ENABLED.fromMetadata(metadata);
if (TableConfig.ROW_TRACKING_ENABLED.fromMetadata(metadata)) {
Collaborator Author

cc @qiyuandong-db @johanl-db Disabling this based on the offline discussion, given Kernel doesn't support enabling row tracking through metadata props.

Collaborator

To clarify, this means we don't support enabling on an existing table, right? But we do support enabling on a new table? Can we maybe restate this so it's more user-friendly? I don't think users will understand enabling a feature through metadata.

Collaborator Author

It is not supported even on a new table. Currently the way it is set is by passing a protocol action in `Txn.commit` along with the file actions.

Collaborator

Okay so it would be correct to say - for users - enabling row tracking is not supported?

Collaborator Author

@qiyuandong-db @johanl-db please chime in on if there is any other way to enable row tracking currently.

Collaborator Author

updated the error message

Collaborator

Sorry now I'm confused. We support writing to tables with rowTracking already enabled right? We just don't support enabling rowTracking. But maybe I'm misunderstanding the current status?

Collaborator Author

I will leave it up to @qiyuandong-db and @johanl-db to update the status. It doesn't make sense to go back and forth on the error message and block this PR.

Collaborator

okay... @johanl-db @qiyuandong-db can you please create a ticket to follow up on this?

Contributor

Yes, we already have a ticket to track this and we'll want to follow up on it this quarter.

Collaborator

@allisonport-db allisonport-db left a comment

Looking at the test code now

break;
default:
throw unsupportedWriterProtocol(tablePath, minWriterVersion);
public static Optional<Tuple2<Protocol, Set<TableFeature>>> autoUpgradeProtocolBasedOnMetadata(
Collaborator

Is this what we currently do in Delta Spark? Automatically upgrade to (3, 7) when enabling something like column mapping?

Collaborator

I'm wondering, is it overkill, to do this every write operation? Even if there is no metadata change? (just wondering, I know it's what we are already doing before)

Collaborator Author

Is this what we currently do in Delta Spark? Automatically upgrade to (3, 7) when enabling something like column mapping?

The goal is to set the minimal protocol version that supports all the supported features or identity columns. If we just set column mapping and no feature requiring (3, 7), then the protocol version is going to be (2, 5), as that's the lowest version that supports column mapping.

I'm wondering, is it overkill, to do this every write operation? Even if there is no metadata change? (just wondering, I know it's what we are already doing before)

There are fewer than 20 table features, and running through them takes a few ms. Also, it is not easy to find which metadata changes we are interested in (this info is encapsulated in the `TableFeatures` definitions). If we want to avoid calling this, then we need to repeat the same code (checking whether the metadata of interest has changed) here.

Collaborator

I assume we have existing tests for all these upgrades somewhere? Do you know where that is?

Collaborator

The goal is to set the minimal protocol version that supports all the supported features or identity columns. If we just set column mapping and no feature requiring (3, 7), then the protocol version is going to be (2, 5), as that's the lowest version that supports column mapping.

Maybe I'm confused but I was thinking that's not what it looks like we do? We have

    Protocol required =
        new Protocol(TABLE_FEATURES_MIN_READER_VERSION, TABLE_FEATURES_MIN_WRITER_VERSION)
            .withFeatures(extractAutomaticallyEnabledNewFeatures(newMetadata, currentProtocol))
            .normalized();

which we use to merge and I see the docs for that are

  /**
   * Merge this protocol with multiple `protocols` to have the highest reader and writer versions
   * plus all explicitly and implicitly supported features.
   */

Collaborator Author

We start with a protocol with table-feature support, add all the features, and then normalize it to the lowest needed protocol version.
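The normalization step described here can be sketched as follows. This is an illustrative toy (the names and shapes are hypothetical, not the Kernel API): each feature carries a minimum (reader, writer) version pair, and "normalize" just takes the component-wise maximum, which yields the lowest protocol version that still supports every collected feature.

```java
import java.util.List;

// Toy sketch of protocol normalization: collect the
// (minReaderVersion, minWriterVersion) required by each enabled feature, then
// take the component-wise maximum as the normalized protocol version.
class ProtocolNormalizeSketch {
  static int[] normalize(List<int[]> featureMinVersions) {
    int reader = 1;
    int writer = 1;
    for (int[] v : featureMinVersions) {
      reader = Math.max(reader, v[0]);
      writer = Math.max(writer, v[1]);
    }
    return new int[] {reader, writer};
  }
}
```

For example, if column mapping (requiring (2, 5)) is the only feature, normalization brings the protocol from the table-features starting point of (3, 7) down to (2, 5); adding any feature that itself requires (3, 7) keeps it there.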

Collaborator

It seems like that will keep the table-features protocol though for a scenario like just enabling columnMapping? Since we don't have the other features in the legacy versions.

I'm totally fine with this being the behavior; I'm just wondering if this is what we do in delta-spark? And want to make sure we test our upgrades.

Collaborator Author

Added unit tests to clarify the expected output through test examples.


*/
class DeltaTableFeaturesSuite extends DeltaTableWriteSuiteBase {

///////////////////////////////////////////////////////////////////////////
Collaborator

Do we have tests for things like if I try to enable CDF we will not be able to write?

Collaborator Author

Added unit tests in TableFeaturesSuite.scala for each feature, covering the (supported, not enabled) and (supported, enabled) cases.

I haven't added integration tests. I feel the unit tests provide good enough coverage.

Collaborator

yeah those are very thorough! Can we add 1 test for 1 feature that's enabled by metadata but not supported by Kernel just to check the E2E integration & path through TransactionBuilderImpl? Like just this CDF enable case will fail

Collaborator

@allisonport-db allisonport-db left a comment

I don't see tests for metadata-based protocol upgrades unless they're just feature-by-feature?

vkorukanti and others added 2 commits February 25, 2025 00:46
…delta commit files (delta-io#4182)

## Description
See delta-io#3888 for details.

Some versions of Delta-Spark have written tables with commit info containing arbitrary value types in the `string -> string` map. This has existed for a while. Update the Kernel default delta commit file reader to always try to parse the value as a `string` when the expected type is `string`. This is not ideal, but there is no other easy way to handle this.

## How was this patch tested?
UT.
Collaborator

@allisonport-db allisonport-db left a comment

Just a few minor comments.

Also, still wondering if we already have tests for the metadata-based automatic upgrades? And if not, we should test our upgrade logic.

@@ -208,7 +209,11 @@ private static class RowTrackingFeature extends TableFeature.WriterFeature

@Override
public boolean metadataRequiresFeatureToBeEnabled(Protocol protocol, Metadata metadata) {
return TableConfig.ROW_TRACKING_ENABLED.fromMetadata(metadata);
if (TableConfig.ROW_TRACKING_ENABLED.fromMetadata(metadata)) {
Collaborator

+1 let's make this error message clear and user-friendly


add("v2Checkpoint");
}
});
Set<TableFeature> autoEnabledFeatures =
Collaborator

nit: we should decide if this is only newAutoEnabledFeatures (in which case we should only add domain_metadata below if it's not already there) or all autoEnabledFeatures, and update the name as well.

I think this is fine to be all autoEnabledFeatures, right? Maybe update extractAutomaticallyEnabledNewFeatures to not remove the existing features.

Collaborator

makes more semantic sense - create the protocol we require for this table


Collaborator

@allisonport-db allisonport-db left a comment

Thanks, the autoUpgradeProtocolBasedOnMetadata tests help a ton and are super thorough! Sorry, a few more questions.


Collaborator

@allisonport-db allisonport-db left a comment

LGTM

@@ -429,7 +429,7 @@ public static Optional<Tuple2<Protocol, Set<TableFeature>>> autoUpgradeProtocolB

Set<TableFeature> allNeededTableFeatures =
extractAllNeededTableFeatures(newMetadata, currentProtocol);
if (needDomainMetadataSupport && !allNeededTableFeatures.contains(DOMAIN_METADATA_W_FEATURE)) {
if (needDomainMetadataSupport) {
allNeededTableFeatures =
Collaborator

Not blocking, but why not just allNeededTableFeatures.add(domainMetadata)?

Collaborator Author

It may not be a mutable Set, given we are using Streams.
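To illustrate the point: `Collectors.toSet()` makes no guarantee about the mutability of the returned set, and `Collectors.toUnmodifiableSet()` is outright immutable, so calling `add()` directly can throw. A minimal sketch of the safer rebuild-then-add pattern (names here are hypothetical, not the Kernel code):

```java
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch: rather than mutating a stream-collected set, take a defensive
// mutable copy before adding the extra feature.
class FeatureSetSketch {
  static Set<String> withExtraFeature(Set<String> features, String extra) {
    Set<String> copy = new HashSet<>(features); // defensive mutable copy
    copy.add(extra);
    return copy;
  }

  static Set<String> collectFeatures() {
    // Unmodifiable on purpose, to demonstrate why add() is unsafe here.
    return Stream.of("rowTracking").collect(Collectors.toUnmodifiableSet());
  }
}
```

Calling `collectFeatures().add("domainMetadata")` throws `UnsupportedOperationException`, while `withExtraFeature` returns a fresh set containing both entries.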


@vkorukanti vkorukanti merged commit 602edad into delta-io:master Feb 26, 2025
21 checks passed
Comment on lines -168 to -175
if (tableProperties.isPresent()) {
Map<String, String> validatedProperties =
TableConfig.validateDeltaProperties(tableProperties.get());
Map<String, String> newProperties =
metadata.filterOutUnchangedProperties(validatedProperties);

ColumnMapping.verifyColumnMappingChange(
metadata.getConfiguration(), newProperties, isNewTable);
Contributor

@amogh-jahagirdar amogh-jahagirdar Feb 27, 2025

I'm a bit late to this PR, but was the change to run this logic even when there wasn't a table properties change intentional? I think it will fail the verifyColumnMappingChange check below when it shouldn't, in the case where a user didn't define table properties on a subsequent transaction (the new properties will be empty, and then the column mapping mode will be considered null even though it's set on the current state of the table). Unless I'm missing something.

Contributor

Basically, TL;DR: if I'm not mistaken, I think we still need this validation, but only under the condition that `tableProperties.isPresent()`.

Collaborator Author

We do want to run through the protocol changes even when the table properties are empty, because there could be schema changes that we want to consider (e.g. the schema may have timestamp_ntz, in which case we need to enable the timestampNtz table feature).

However, `ColumnMapping.verifyColumnMappingChange(metadata.getConfiguration(), newProperties, isNewTable)` should take the entire new configuration instead of just the changed properties. This was the case even before this PR. It is possible that the new table properties may not contain the column mapping mode.

I will make a PR to fix this. Also need to check why the tests didn't catch this.
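The fix being described can be sketched as follows (the method name is hypothetical, not the Kernel API): validate column mapping against the full post-transaction configuration, i.e. the current table configuration overlaid with the newly supplied properties, rather than only the changed-property subset, which may omit the column mapping mode entirely when the user supplies no properties.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: build the full post-transaction configuration so that validation
// sees properties already set on the table, not just the ones changed in
// this transaction.
class ConfigMergeSketch {
  static Map<String, String> mergedConfiguration(
      Map<String, String> current, Map<String, String> newProperties) {
    Map<String, String> merged = new HashMap<>(current);
    merged.putAll(newProperties); // newly supplied values win over current ones
    return merged;
  }
}
```

With this merge, a transaction that supplies no properties still sees `delta.columnMapping.mode` from the table's current configuration, so the validation no longer treats the mode as null.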

