
[SPARK-50883][SQL] Support altering multiple columns in the same command #49559

Closed
wants to merge 3 commits

Conversation

@ctring (Contributor) commented Jan 17, 2025

What changes were proposed in this pull request?

We propose the following new syntax for altering multiple columns at the same time:

```
ALTER TABLE table_name ALTER COLUMN {
  { column_identifier | field_name }
  { COMMENT comment |
    { FIRST | AFTER identifier } |
    { SET | DROP } NOT NULL |
    TYPE data_type |
    SET DEFAULT clause |
    DROP DEFAULT }
} [, ...]
```

For example:

```
ALTER TABLE test_table ALTER COLUMN
  a COMMENT "new comment",
  b TYPE BIGINT,
  x.y.z FIRST
```

This new syntax is backward compatible with the current syntax. To bound the complexity of the initial support for this syntax, we place the following restrictions:

  • Altering the same column multiple times is not allowed.
  • Altering a parent and a child column (for nested data types) is not allowed.
  • Altering v1 tables with this new syntax is not allowed.
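The first two restrictions boil down to a check over the altered columns' name paths. Below is a minimal sketch of that check, written in Python rather than the PR's Scala purely to illustrate the idea; the function name `validate_paths` and the tuple representation are hypothetical, not Spark's actual code:

```python
def validate_paths(paths):
    """Reject duplicates and parent/child overlaps among altered columns.

    Each path is a tuple of name parts, e.g. ("x", "y", "z") for the
    nested field x.y.z.
    """
    if len(set(paths)) != len(paths):
        raise ValueError("altering the same column multiple times is not allowed")
    for p in paths:
        for q in paths:
            # q is a strict ancestor of p when p begins with all of q's parts
            if p != q and p[:len(q)] == q:
                raise ValueError(
                    f"cannot alter both {'.'.join(q)} and its nested field {'.'.join(p)}"
                )

# The example command's columns pass the check:
validate_paths([("a",), ("b",), ("x", "y", "z")])
```

Because the comparison is over whole name parts, sibling columns whose names merely share a string prefix are still allowed.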

In terms of implementation, we replace the current `AlterColumn` logical plan with `AlterColumns`, which takes multiple columns together with their `AlterColumnSpec`s.

All `AlterColumnSpec`s are checked during the analysis phase, so if any of them is invalid (e.g., a non-existent column, an illegal type conversion, etc.), the entire command fails.

The `AlterColumnSpec`s are transformed into `TableChange`s, which are passed to the `TableCatalog::alterTable` method. Therefore, the semantics of the new command (atomic vs. non-atomic) depend on the implementation of that method.

`V2SessionCatalog::alterTable` currently applies all table changes to the catalog table and then sends the result to the catalog in one request. As a result, column changes are by default applied to the catalog (HMS) atomically: either all changes are made or none are.
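The all-or-nothing behavior described above follows from applying every change to an in-memory copy before committing anything. A hedged, non-Spark sketch of that pattern (Python; `alter_table` and the dict-based table metadata are illustrative stand-ins, not Spark's API):

```python
from copy import deepcopy

def alter_table(table, changes):
    """Apply all column changes to a copy of the table metadata, then
    return the copy as a single "commit". If any change raises, the
    original table is left untouched, mirroring a one-request catalog
    update that either fully succeeds or fully fails."""
    updated = deepcopy(table)
    for change in changes:   # any failure here aborts before commit
        change(updated)
    return updated           # commit: swap in the fully updated copy

table = {"columns": {"a": {"type": "int", "comment": None},
                     "b": {"type": "int", "comment": None}}}
new = alter_table(table, [
    lambda t: t["columns"]["a"].update(comment="new comment"),
    lambda t: t["columns"]["b"].update(type="bigint"),
])
```

A catalog whose `alterTable` instead applied changes one request at a time would give non-atomic semantics for the same command, which is why the PR leaves atomicity up to the catalog implementation.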

For example, the above command produces the following plans:

```
== Parsed Logical Plan ==
'AlterColumns [unresolvedfieldname(a), unresolvedfieldname(b), unresolvedfieldname(x, y, z)], [AlterColumnSpec(None,None,Some(new comment),None,None), AlterColumnSpec(Some(LongType),None,None,None,None), AlterColumnSpec(None,None,None,Some(unresolvedfieldposition(FIRST)),None)]
+- 'UnresolvedTable [test_table], ALTER TABLE ... ALTER COLUMN

== Analyzed Logical Plan ==
AlterColumns [resolvedfieldname(StructField(a,IntegerType,true)), resolvedfieldname(StructField(b,IntegerType,true)), resolvedfieldname(x, y, StructField(z,IntegerType,true))], [AlterColumnSpec(None,None,Some(new comment),None,None), AlterColumnSpec(Some(LongType),None,None,None,None), AlterColumnSpec(None,None,None,Some(resolvedfieldposition(FIRST)),None)]
+- ResolvedTable org.apache.spark.sql.delta.catalog.DeltaCatalog@6d89c923, default.test_table, DeltaTableV2(...)),Some(default.test_table),None,Map()), [a#163, b#164, x#165]

== Physical Plan ==
AlterTable org.apache.spark.sql.delta.catalog.DeltaCatalog@6d89c923, default.test_table, [org.apache.spark.sql.connector.catalog.TableChange$UpdateColumnComment@ff58ec42, org.apache.spark.sql.connector.catalog.TableChange$UpdateColumnType@7e7c730c, org.apache.spark.sql.connector.catalog.TableChange$UpdateColumnPosition@bc842915]
```

Why are the changes needed?

The current ALTER TABLE ... ALTER COLUMN syntax allows altering only one column at a time. For a large table with many columns, a command must be run for each column, which can be slow due to the repeated preprocessing and I/O costs. A new syntax that enables specifying multiple columns could allow these costs to be shared across multiple column changes.

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

New unit tests

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Jan 17, 2025
@ctring ctring force-pushed the bulk-alter-column branch from 7524bee to 18cc4bb Compare January 17, 2025 20:41
@ctring ctring changed the title [SPARK-50883] Support altering multiple columns in the same command [SPARK-50883][SQL] Support altering multiple columns in the same command Jan 19, 2025
@ctring ctring force-pushed the bulk-alter-column branch from 18cc4bb to b69f332 Compare January 20, 2025 07:23
@MaxGekk (Member) left a comment:

@ctring Need to explicitly define the semantics of the new command: is it atomic or not? If it fails on one of the columns, does it leave the already-modified columns as is? Please describe this in the PR's description.

@ctring ctring force-pushed the bulk-alter-column branch from b69f332 to c24c380 Compare January 21, 2025 06:14
@ctring (Contributor, Author) commented Jan 21, 2025:

@MaxGekk I updated the PR.

@ctring ctring requested a review from MaxGekk January 21, 2025 18:36
@scovich left a comment:

Not a committer... but I've seen enough workloads with wide tables and frequent tool-generated schema changes that this looks appealing.

```scala
if (a.nullable.isDefined) {
  if (!a.nullable.get && col.field.nullable) {
```

```scala
groupedColumns.keys.foreach { name =>
  val child = groupedColumns.keys.find(child => child != name && child.startsWith(name))
```
@scovich commented:

I think this is trying to handle e.g. `x.y.z` vs. `x.y`, but wouldn't this check also flag two (non-nested) fields where one happens to be a prefix of the other? e.g. `customer` and `customer_status`

@ctring (Contributor, Author) replied:

`name` and `child` are `Seq`s of the name parts, so `startsWith` matches a prefix of whole name parts, not a character prefix.
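The distinction the author is making can be shown concretely. In this Python sketch (tuples stand in for the Scala `Seq` of name parts; `is_name_prefix` is an illustrative name, not code from the PR), a whole-part prefix check accepts `x.y` as a parent of `x.y.z` but does not confuse `customer` with `customer_status`:

```python
def is_name_prefix(parent, child):
    """True when parent's name parts form a strict prefix of child's,
    i.e. parent is an ancestor of child in the nested schema."""
    return len(parent) < len(child) and child[:len(parent)] == parent

# Whole parts must match, so a shared string prefix is not enough:
assert not is_name_prefix(("customer",), ("customer_status",))
# But a genuine parent/child pair is detected:
assert is_name_prefix(("x", "y"), ("x", "y", "z"))
```

A character-level `startsWith` on the dotted names would have flagged the `customer` / `customer_status` pair, which is exactly the false positive the reviewer was worried about.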

```scala
override protected def withNewChildInternal(newChild: LogicalPlan): LogicalPlan =
  copy(table = newChild)
override protected def withNewChildInternal(newChild: LogicalPlan): LogicalPlan = copy(newChild)
```
@scovich commented:

how does this change work? Is there some overload of `copy` available that only takes a table?

@ctring (Contributor, Author) replied:

I didn't intend to write it this way and will fix it, but I think it still works because the child argument happens to be at the first position.

@ctring ctring requested a review from scovich January 21, 2025 19:53
```scala
val newDataType = dataType.getOrElse(column.asInstanceOf[ResolvedFieldName].field.dataType)
ResolveDefaultColumns.analyze(column.name.last, newDataType, newDefaultExpression,
  "ALTER TABLE ALTER COLUMN")
```

```scala
assert(columns.size == specs.size)
```
@cloud-fan (Contributor) commented:

Shall we include `FieldName` in `AlterColumnSpec` to make it more type-safe? The `AlterColumnSpec` can be an expression as well, similar to `MergeAction`.

@ctring (Contributor, Author) commented Jan 22, 2025:

It would be much nicer that way. I didn't know making `AlterColumnSpec` an expression was an option. I've updated the PR with this change.

@ctring ctring requested a review from cloud-fan January 22, 2025 22:43
@cloud-fan (Contributor) commented:
thanks, merging to master/4.0!

@cloud-fan cloud-fan closed this in 4b35282 Jan 23, 2025
cloud-fan pushed a commit that referenced this pull request Jan 23, 2025
Closes #49559 from ctring/bulk-alter-column.

Authored-by: Cuong Nguyen <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 4b35282)
Signed-off-by: Wenchen Fan <[email protected]>
@ctring ctring deleted the bulk-alter-column branch January 23, 2025 05:32