Kernel: Support schema evolution through existing withSchema API on T… #4196

amogh-jahagirdar · 2025-02-27T07:38:05Z

…ransactionBuilder

Which Delta project/connector is this regarding?

Description

How was this patch tested?

Does this PR introduce any user-facing changes?

amogh-jahagirdar · 2025-02-27T07:38:40Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/util/ColumnMapping.java

+        throw new IllegalArgumentException(
+            "Map field " + field.getName() + " must have exactly two nested IDs");
+      }


Add tests for this.

amogh-jahagirdar · 2025-02-27T07:38:51Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/util/ColumnMapping.java

+        throw new IllegalArgumentException(
+            "Map field " + field.getName() + " cannot contain duplicate nested IDs");
+      }


Add tests for this.

amogh-jahagirdar · 2025-02-27T11:14:19Z

kernel/kernel-defaults/src/test/scala/io/delta/kernel/defaults/DeltaTableWritesSuite.scala

+        == FieldMetadata.builder().putLong("map.key", 5).putLong("map.value", 6).build())
+    }
+  }
+


Add some tests with struct of struct, struct with a map/array field, array of struct/map, etc.

amogh-jahagirdar · 2025-02-27T11:15:09Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/util/SchemaUtils.java

+    Map<Integer, StructField> newSchemaIdToField = idToField(newSchema);
+    Map<Integer, StructField> currentSchemaIdToField = idToField(currentSchema);
+    for (Map.Entry<Integer, StructField> newFieldEntry : newSchemaIdToField.entrySet()) {
+      if (!currentSchemaIdToField.containsKey(newFieldEntry.getKey())) {


@tdas This validates that a non-nullable field cannot be added.

amogh-jahagirdar · 2025-02-27T11:15:30Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/util/SchemaUtils.java

+      } else {
+        StructField currentField = currentSchemaIdToField.get(newFieldEntry.getKey());
+        StructField newField = newSchemaIdToField.get(newFieldEntry.getKey());
+        if (newField.getDataType() != currentField.getDataType()) {


Validates that no type promotion happens. This will need to be loosened as we add support for those features.
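
Separately from type promotion, the hunk above compares the two types with `!=`, which in Java tests reference identity rather than structural equality. Whether that matters depends on how Kernel constructs its `DataType` instances, which this note does not assume. The sketch below only illustrates the difference, using a hypothetical `SketchType` stand-in (not a Kernel class) and an `equals()`-based check.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical stand-in for a data type; NOT the io.delta.kernel.types API.
final class SketchType {
  final String name;
  final List<SketchType> children;

  SketchType(String name, SketchType... children) {
    this.name = name;
    this.children = Arrays.asList(children);
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof SketchType)) {
      return false;
    }
    SketchType other = (SketchType) o;
    return name.equals(other.name) && children.equals(other.children);
  }

  @Override
  public int hashCode() {
    return Objects.hash(name, children);
  }

  @Override
  public String toString() {
    return children.isEmpty() ? name : name + children;
  }
}

final class TypeChangeCheckSketch {
  /** Throws if an existing field's type changed; compares with equals(), not identity. */
  static void checkTypeUnchanged(String fieldName, SketchType current, SketchType updated) {
    if (!current.equals(updated)) {
      throw new IllegalArgumentException(
          String.format(
              "Cannot change existing field %s from %s to %s", fieldName, current, updated));
    }
  }
}
```

With this shape, two separately constructed but structurally identical types pass the check, whereas a `!=` comparison would treat them as different.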

amogh-jahagirdar · 2025-02-27T11:18:29Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/util/ColumnMapping.java

+    fieldIdToPhysicalName.put(columnId, physicalName);
+
+    if (field.getDataType() instanceof MapType) {
+      if (!hasNestedColumnIds(field)) {


Technically the Delta protocol doesn't specify that delta.columnMapping.nested.ids needs to be set, but I believe it's a requirement for Iceberg compat.

amogh-jahagirdar · 2025-02-27T11:19:01Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/util/ColumnMapping.java

+    for (Map.Entry<Long, String> field : updatedFieldIdToPhysicalName.entrySet()) {
+      String existingPhysicalName = currentFieldIdToPhysicalName.get(field.getKey());
+      // Found an existing field, verify the physical name is preserved
+      if (existingPhysicalName != null && !existingPhysicalName.equals(field.getValue())) {


Validation that physical names for existing fields are preserved between updates

amogh-jahagirdar · 2025-02-27T11:27:43Z

kernel/kernel-defaults/src/test/scala/io/delta/kernel/defaults/DeltaTableWritesSuite.scala

+              FieldMetadata.builder().putLong(ColumnMapping.COLUMN_MAPPING_ID_KEY, 4)
+                .putString(ColumnMapping.COLUMN_MAPPING_PHYSICAL_NAME_KEY, "d").build())
+            .add("e", IntegerType.INTEGER, true,
+              FieldMetadata.builder().putLong(ColumnMapping.COLUMN_MAPPING_ID_KEY, 5)
+                .putString(ColumnMapping.COLUMN_MAPPING_PHYSICAL_NAME_KEY, "e").build()), true,


Maybe move some of this field metadata building into a helper so it's easier to read.
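
One possible shape for such a helper, sketched in Java with a plain `Map` standing in for `FieldMetadata` so the example is self-contained. The helper name `columnMappingMetadata` is hypothetical. The two key strings are the column mapping metadata keys from the Delta protocol, and are assumed here to be what `ColumnMapping.COLUMN_MAPPING_ID_KEY` and `ColumnMapping.COLUMN_MAPPING_PHYSICAL_NAME_KEY` resolve to.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FieldMetadataHelperSketch {
  // Assumption: these match the constants the test reads from ColumnMapping.
  static final String ID_KEY = "delta.columnMapping.id";
  static final String PHYSICAL_NAME_KEY = "delta.columnMapping.physicalName";

  /** Hypothetical helper: builds the per-field column mapping metadata in one call. */
  static Map<String, Object> columnMappingMetadata(long id, String physicalName) {
    Map<String, Object> metadata = new LinkedHashMap<>();
    metadata.put(ID_KEY, id);
    metadata.put(PHYSICAL_NAME_KEY, physicalName);
    return metadata;
  }
}
```

In the real suite the body would presumably wrap the `FieldMetadata.builder().putLong(...).putString(...).build()` chain shown in the hunk, so each `.add(...)` call shrinks to one short line.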

amogh-jahagirdar · 2025-02-27T11:37:33Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/util/ColumnMapping.java

+    if (!hasPhysicalName(field)) {
+      throw new IllegalArgumentException(
+          String.format(
+              "Column mapping mode is name and field %s is missing physical name",
+              field.getName()));
+    }
+
+    if (!hasColumnId(field)) {
+      throw new IllegalArgumentException(
+          String.format(
+              "Column mapping mode is name and field %s is missing column id", field.getName()));
+    }


Validation that physical names and column ids are defined

Need to fix the message, since it applies to both column mapping modes.

nicklan

flushing a few small things

nicklan · 2025-02-27T23:42:52Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/TransactionBuilderImpl.java

@@ -204,7 +214,8 @@ public Transaction build(Engine engine) {
        shouldUpdateProtocol,
        maxRetries,
        table.getClock(),
-        getDomainMetadatasToCommit(snapshot));
+        getDomainMetadatasToCommit(snapshot),
+        !isNewTable && updatedSchema);


Can you add a comment saying what this parameter is (preserveFieldIds)?
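
A minimal sketch of two common ways to make a trailing boolean readable at the call site: bind it to a named local first, and annotate the argument with an inline comment. `describe` and `build` are hypothetical stand-ins for the transaction constructor and `build(Engine)`; the expression mirrors the one in the hunk, and its intended semantics are for the PR author to confirm.

```java
final class NamedBooleanSketch {
  /** Hypothetical stand-in for a constructor that ends in an unnamed boolean. */
  static String describe(int maxRetries, boolean preserveFieldIds) {
    return "maxRetries=" + maxRetries + ", preserveFieldIds=" + preserveFieldIds;
  }

  static String build(boolean isNewTable, boolean schemaWasSet) {
    // Mirrors `!isNewTable && updatedSchema` from the hunk; naming the value documents it.
    boolean preserveFieldIds = !isNewTable && schemaWasSet;
    return describe(/* maxRetries= */ 3, /* preserveFieldIds= */ preserveFieldIds);
  }
}
```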

nicklan · 2025-02-27T23:45:36Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/util/SchemaUtils.java

@@ -78,6 +79,59 @@ public static void validateSchema(StructType schema, boolean isColumnMappingEnab
    validateSupportedType(schema);
  }

+  public static void validateUpdatedSchema(StructType currentSchema, StructType schema) {


nit: can we rename schema -> newSchema so it's more clear

nicklan · 2025-02-27T23:48:53Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/util/SchemaUtils.java

+        if (newField.getDataType() != currentField.getDataType()) {
+          throw new IllegalArgumentException(
+              String.format(
+                  "Cannot change existing field %s from %s to %s",


Suggested change

-                  "Cannot change existing field %s from %s to %s",
+                  "Cannot change the type of existing field %s from %s to %s",

nicklan · 2025-02-27T23:49:35Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/util/ColumnMapping.java

+    }
+  }
+
+  private static void validateColumnIds(


Please add a comment describing what this function does.

allisonport-db

Skimmed it a bit. Can you please add more method docs throughout so it's easier to understand what's happening :)

Also still would like to have a more concrete list of what we are validating, and what we explicitly aren't validating (and why we aren't and the consequences of not doing so).

Thank you!!

allisonport-db · 2025-02-28T01:46:33Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/TransactionBuilderImpl.java

@@ -79,6 +80,7 @@ public TransactionBuilderImpl(TableImpl table, String engineInfo, Operation oper
  @Override
  public TransactionBuilder withSchema(Engine engine, StructType newSchema) {


I thought we were also discussing restricting this update schema method to a different non-public API? So that the public withSchema method always fails if it's not a new table

allisonport-db · 2025-02-28T01:46:45Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/TransactionBuilderImpl.java

@@ -59,6 +59,7 @@ public class TransactionBuilderImpl implements TransactionBuilder {
  private final Map<String, DomainMetadata> domainMetadatasAdded = new HashMap<>();
  private final Set<String> domainMetadatasRemoved = new HashSet<>();
  private Optional<StructType> schema = Optional.empty();
+  private boolean updatedSchema;


I don't think we need this? If schema is non-empty it was updated

allisonport-db · 2025-02-28T01:48:14Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/TransactionBuilderImpl.java

@@ -204,7 +214,8 @@ public Transaction build(Engine engine) {
        shouldUpdateProtocol,
        maxRetries,
        table.getClock(),
-        getDomainMetadatasToCommit(snapshot));
+        getDomainMetadatasToCommit(snapshot),
+        !isNewTable && updatedSchema);


+1 pls; unnamed boolean params are confusing

allisonport-db · 2025-02-28T01:48:56Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/TransactionBuilderImpl.java

-    Map<String, String> validatedProperties =
-        TableConfig.validateDeltaProperties(tableProperties.orElse(Collections.emptyMap()));
-    Map<String, String> newProperties = metadata.filterOutUnchangedProperties(validatedProperties);

-    ColumnMapping.verifyColumnMappingChange(metadata.getConfiguration(), newProperties, isNewTable);
+    if (tableProperties.isPresent()) {
+      Map<String, String> validatedProperties =
+              TableConfig.validateDeltaProperties(tableProperties.orElse(Collections.emptyMap()));
+      Map<String, String> newProperties = metadata.filterOutUnchangedProperties(validatedProperties);

-    if (!newProperties.isEmpty()) {
+      ColumnMapping.verifyColumnMappingChange(metadata.getConfiguration(), newProperties, isNewTable);
+
+      if (!newProperties.isEmpty()) {
+        shouldUpdateMetadata = true;
+        metadata = metadata.withNewConfiguration(newProperties);
+      }
+    }
+


what are all these changes for?

allisonport-db · 2025-02-28T01:50:23Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/TransactionBuilderImpl.java

+            isNewTable
+                ? tableProperties.orElse(Collections.emptyMap())
+                : snapshot.getMetadata().getConfiguration());


Should we pass the new configuration to this method? I'm not sure this is safe, since we need to combine any new properties with the existing ones in the case of a table property update.
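
The concern can be illustrated with a small sketch: any check that depends on table configuration would see the existing configuration overlaid with the newly requested properties, rather than one or the other. `effectiveConfiguration` is a hypothetical name, not a Kernel method.

```java
import java.util.HashMap;
import java.util.Map;

final class EffectiveConfigSketch {
  /** Hypothetical helper: overlays requested properties on the current configuration. */
  static Map<String, String> effectiveConfiguration(
      Map<String, String> current, Map<String, String> requested) {
    Map<String, String> merged = new HashMap<>(current);
    merged.putAll(requested); // requested values win on key collisions
    return merged;
  }
}
```

For example, if the same transaction both sets `delta.columnMapping.mode` and supplies a schema, the merged map carries the new mode while keeping every untouched existing property.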

allisonport-db · 2025-02-28T01:52:19Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/TransactionImpl.java

-              metadata,
-              ColumnMapping.getColumnMappingMode(metadata.getConfiguration()),
-              isNewTable);
+      if (!preserveFieldIds) {


I don't really understand why we need to predicate on this?

If the fieldIds are present -- we should never update/override them right? that would be incorrect?

Can you explain what preserveFieldIds means and why it is computed as !isNewTable && updatedSchema? I think it's just not clear in the code, i.e. when we should and shouldn't preserve them.

allisonport-db · 2025-02-28T01:53:59Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/util/ColumnMapping.java

@@ -156,6 +156,111 @@ static int findMaxColumnId(StructType schema) {
    return maxColumnId;
  }

+  static void validateColumnIds(StructType currentSchema, StructType updatedSchema) {


method docs for all of these please; we should be able to know what this does without reading the code

allisonport-db · 2025-02-28T01:54:55Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/util/ColumnMapping.java

@vkorukanti do we need to validate somewhere the specific requirements for fieldIds for IcebergCompatV2? Like for complex types. Do we already do that somewhere else?

allisonport-db · 2025-02-28T01:56:26Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/util/ColumnMapping.java

+    fieldIdToPhysicalName.put(columnId, physicalName);
+
+    if (field.getDataType() instanceof MapType) {
+      if (!hasNestedColumnIds(field)) {


I think i just commented about this :) yeah we need to do this for IcebergCompat I think (but we should only do this when it's enabled!)

allisonport-db · 2025-02-28T01:56:46Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/util/SchemaUtils.java

+    ColumnMapping.validateColumnIds(currentSchema, schema);
+    validateUpdatedSchemaCompatibility(currentSchema, schema);
+  }
+


scottsand-db

Looks great! Left some comments and questions!

scottsand-db · 2025-02-28T01:51:33Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/TransactionBuilderImpl.java

@@ -79,6 +80,7 @@ public TransactionBuilderImpl(TableImpl table, String engineInfo, Operation oper
  @Override
  public TransactionBuilder withSchema(Engine engine, StructType newSchema) {
    this.schema = Optional.of(newSchema); // will be verified as part of the build() call
+    this.updatedSchema = true;


can we remove this and just do this.schema.isPresent?

OR, you can move this to a private def isSchemaUpdate that uses this.schema.isPresent -- but let's not introduce a new member variable that represents coupled state
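
A minimal sketch of that suggestion: the flag is derived from the `Optional` on demand, so it cannot drift out of sync with it. `BuilderStateSketch` is a hypothetical stand-in for the builder, and a `String` replaces `StructType` to keep the example self-contained. The method is package-private here only so the sketch can be exercised from outside the class.

```java
import java.util.Optional;

final class BuilderStateSketch {
  // Stand-in for `Optional<StructType> schema` in the real builder.
  private Optional<String> schema = Optional.empty();

  BuilderStateSketch withSchema(String newSchema) {
    this.schema = Optional.of(newSchema);
    return this;
  }

  /** Derived state: true exactly when withSchema has been called. */
  boolean isSchemaUpdate() {
    return schema.isPresent();
  }
}
```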

scottsand-db · 2025-02-28T01:54:51Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/TransactionBuilderImpl.java

-                + "Schema can only be set on a new table.");
+      boolean columnMappingEnabled = isColumnMappingModeEnabled(mappingMode);
+      if (!columnMappingEnabled && updatedSchema) {
+        throw new IllegalArgumentException(


Is this a temporary restriction or a permanent restriction?

Also -- this should be a KernelException -- the user is performing an action that is invalid against the current table state.

scottsand-db · 2025-02-28T01:57:24Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/util/ColumnMapping.java

@@ -156,6 +156,111 @@ static int findMaxColumnId(StructType schema) {
    return maxColumnId;
  }

+  static void validateColumnIds(StructType currentSchema, StructType updatedSchema) {


(1) is this intentionally package private?
(2) can you add some method docs that say what "validate" means? I'd prefer to read a short method comment than read all this code to get the gist of this

scottsand-db · 2025-02-28T02:06:42Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/util/ColumnMapping.java

+    fieldIdToPhysicalName.put(columnId, physicalName);
+
+    if (field.getDataType() instanceof MapType) {
+      if (!hasNestedColumnIds(field)) {


Can you please clearly highlight in the code these sorts of irregularities? Leave a TODO

i.e. if this is a requirement for iceberg compat but not for column mapping .. that's good to know, we may want to update this code in the future to enable better column mapping compatibility, right?

scottsand-db · 2025-02-28T16:48:58Z

kernel/kernel-defaults/src/test/scala/io/delta/kernel/defaults/DeltaTableWritesSuite.scala

@@ -1321,6 +1321,369 @@ class DeltaTableWritesSuite extends DeltaTableWriteSuiteBase with ParquetSuiteBa
    }
  }

+  test("Test set schema on existing table") {


nit: for brevity you can omit the "Test" in your test name

scottsand-db · 2025-02-28T16:52:33Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/util/ColumnMapping.java

+    } else if (field.getDataType() instanceof ArrayType) {
+      if (!hasNestedColumnIds(field)) {
+        throw new IllegalArgumentException(
+            String.format("Array field %s must have exactly 1 nested ID", field.getName()));


Can you make sure we have unit tests for each of these error cases?

scottsand-db · 2025-02-28T16:54:19Z

kernel/kernel-defaults/src/test/scala/io/delta/kernel/defaults/DeltaTableWritesSuite.scala

+    }
+  }
+
+  test("Test updating schema with adding an array and map type") {


So this test and the first test, "Test set schema on existing table", are the only "positive" cases covered. Are there others we should cover here?

Should renaming, moving, etc. be covered here?

scottsand-db · 2025-02-28T16:58:26Z

kernel/kernel-defaults/src/test/scala/io/delta/kernel/defaults/DeltaTableWritesSuite.scala

@@ -1321,6 +1321,369 @@ class DeltaTableWritesSuite extends DeltaTableWriteSuiteBase with ParquetSuiteBa
    }
  }

+  test("Test set schema on existing table") {


I think we should split these tests off to their own suite -- what do you think?

Also -- @amogh-jahagirdar -- if you look at the classdocs for this test suite, "Transaction commit in this suite IS REQUIRED TO use commitTransaction than .commit" -- the fact that you are not doing this and things are still working fine makes me (a) think these should be in a new suite, and (b) wonder why we need to use commitTransaction.

cc @vkorukanti

scottsand-db · 2025-02-28T16:59:19Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/util/ColumnMapping.java

@@ -156,6 +156,111 @@ static int findMaxColumnId(StructType schema) {
    return maxColumnId;
  }

+  static void validateColumnIds(StructType currentSchema, StructType updatedSchema) {


Would some sort of schema visitor help here?
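
As a rough illustration of what a visitor could look like, the sketch below walks a simplified, hypothetical schema model (`VisitField`, not the Kernel `StructType`/`StructField` API) depth-first and hands each field to a callback along with its dotted path. ID and physical-name checks could then be written as small visitors instead of hand-rolled recursion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified schema node; NOT a Kernel class.
final class VisitField {
  final String name;
  final long id;
  final List<VisitField> children;

  VisitField(String name, long id, VisitField... children) {
    this.name = name;
    this.id = id;
    this.children = Arrays.asList(children);
  }
}

interface FieldVisitor {
  void visit(String path, VisitField field);
}

final class SchemaVisitorSketch {
  /** Depth-first, pre-order walk that reports each field with its dotted path. */
  static void walk(String parentPath, List<VisitField> fields, FieldVisitor visitor) {
    for (VisitField field : fields) {
      String path = parentPath.isEmpty() ? field.name : parentPath + "." + field.name;
      visitor.visit(path, field);
      walk(path, field.children, visitor);
    }
  }

  /** Example visitor: collects "path=id" for every field in the schema. */
  static List<String> collectIds(List<VisitField> schema) {
    List<String> out = new ArrayList<>();
    walk("", schema, (path, field) -> out.add(path + "=" + field.id));
    return out;
  }
}
```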

allisonport-db

Commenting a few things that I think need to be checked but didn't see specifically (but didn't parse the code line-by-line so could be missing something)

1) We need to forbid tightening nullability for existing columns. I think you mentioned this was implemented but didn't see it commented so not 100% sure.
2) We need to forbid dropping partition columns. I tried this in Spark SQL and it is not allowed (makes sense!).

(in the future)
3) No new generated columns AND generation expression is unchanged for existing columns.
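
The first two checks could be sketched roughly as follows. Everything here is hypothetical: `FieldInfo` is a stand-in summary keyed by column ID, and how partition columns are resolved to IDs is deliberately left out. It is meant to pin down what each rule rejects, not to mirror the PR's implementation.

```java
import java.util.Map;
import java.util.Set;

final class EvolutionChecksSketch {
  /** Hypothetical per-field summary: just the name and nullability. */
  static final class FieldInfo {
    final String name;
    final boolean nullable;

    FieldInfo(String name, boolean nullable) {
      this.name = name;
      this.nullable = nullable;
    }
  }

  /** Rejects changing an existing field (matched by ID) from nullable to non-nullable. */
  static void checkNullabilityNotTightened(
      Map<Integer, FieldInfo> current, Map<Integer, FieldInfo> updated) {
    for (Map.Entry<Integer, FieldInfo> entry : updated.entrySet()) {
      FieldInfo before = current.get(entry.getKey());
      if (before != null && before.nullable && !entry.getValue().nullable) {
        throw new IllegalArgumentException(
            "Cannot tighten nullability of existing field " + before.name);
      }
    }
  }

  /** Rejects an updated schema that no longer contains every partition column ID. */
  static void checkPartitionColumnsKept(Set<Integer> partitionColumnIds, Set<Integer> updatedIds) {
    for (Integer id : partitionColumnIds) {
      if (!updatedIds.contains(id)) {
        throw new IllegalArgumentException("Cannot drop partition column with id " + id);
      }
    }
  }
}
```

Loosening nullability (non-nullable to nullable) and adding new fields pass the first check untouched; only the tightening direction is rejected.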

allisonport-db · 2025-02-28T22:33:26Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/util/ColumnMapping.java

+    for (Map.Entry<Long, String> field : updatedFieldIdToPhysicalName.entrySet()) {
+      String existingPhysicalName = currentFieldIdToPhysicalName.get(field.getKey());
+      // Found an existing field, verify the physical name is preserved
+      if (existingPhysicalName != null && !existingPhysicalName.equals(field.getValue())) {


Not sure if this is already done here, but I think we should also validate that for complex types the nested fieldIds are consistent (unchanged) as well
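
A minimal sketch of such a check, assuming the nested IDs of a complex field are available as a map from entries like "map.key" and "map.value" to IDs, in the style of the nested-ID metadata built in the test hunk elsewhere in this review. `checkNestedIdsUnchanged` is a hypothetical name, and the error message is illustrative only.

```java
import java.util.Map;

final class NestedIdCheckSketch {
  /** Rejects an update in which an existing complex field's nested IDs differ. */
  static void checkNestedIdsUnchanged(
      String fieldName, Map<String, Long> currentNestedIds, Map<String, Long> updatedNestedIds) {
    if (!currentNestedIds.equals(updatedNestedIds)) {
      throw new IllegalArgumentException(
          String.format(
              "Nested IDs of existing field %s changed from %s to %s",
              fieldName, currentNestedIds, updatedNestedIds));
    }
  }
}
```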

allisonport-db · 2025-02-28T22:34:33Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/util/ColumnMapping.java

+    if (!hasPhysicalName(field)) {
+      throw new IllegalArgumentException(
+          String.format(
+              "Column mapping mode is name and field %s is missing physical name",
+              field.getName()));
+    }
+
+    if (!hasColumnId(field)) {
+      throw new IllegalArgumentException(
+          String.format(
+              "Column mapping mode is name and field %s is missing column id", field.getName()));
+    }


Do we do this after we populate them for the non-DBI connector? We don't expect other connectors to populate their own physical names/column ids

(Referring to for table creation. For this updateSchema API I still think we should maybe make it a separate internal API and restrict it to just usage by DBI -- not opening up schema evolution for all connectors yet)

amogh-jahagirdar force-pushed the with-schema-for-existing-tables branch from 031ba66 to 93b22a1 Compare February 27, 2025 09:33

Kernel: Support schema evolution through existing withSchema API on T…

4278745

…ransactionBuilder

amogh-jahagirdar force-pushed the with-schema-for-existing-tables branch from 93b22a1 to 4278745 Compare February 27, 2025 10:52

nicklan reviewed Feb 27, 2025

allisonport-db reviewed Feb 28, 2025

scottsand-db requested changes Feb 28, 2025

nicklan requested a review from vkorukanti February 28, 2025 22:09

allisonport-db reviewed Feb 28, 2025
