[SPARK-53771][PYTHON][ARROW] Add support for large list type in Arrow conversion #52498
Conversation
sql/catalyst/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala (review thread: outdated, resolved)
Force-pushed from 2c746b1 to 3d88a0b
      errorOnDuplicatedFieldNames: Boolean,
-     largeVarTypes: Boolean) {
+     largeVarTypes: Boolean,
+     largeListType: Boolean = false) {
Maybe we shouldn't use the default value to avoid unexpected calls?
I have removed it.
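For context, here is a minimal, hypothetical Scala sketch (not the actual ArrowWriter code) of the pitfall the reviewer is pointing at: a default value lets stale call sites keep compiling instead of being forced to choose.

```scala
object DefaultParamPitfall {
  // Hypothetical stand-in for the real config-carrying signature.
  case class WriterConfig(
      errorOnDuplicatedFieldNames: Boolean,
      largeVarTypes: Boolean,
      largeListType: Boolean = false)

  // A pre-existing call site still compiles and silently gets
  // largeListType = false, even where true may have been intended:
  val silentlyOutdated = WriterConfig(
    errorOnDuplicatedFieldNames = true,
    largeVarTypes = false)

  // Removing the default turns every such call site into a compile error,
  // so each caller has to pass the new flag explicitly.
}
```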
      errorOnDuplicatedFieldNames: Boolean,
-     largeVarTypes: Boolean): Array[Byte] = {
+     largeVarTypes: Boolean,
+     largeListType: Boolean = false): Array[Byte] = {
ditto.
removed.
assert(array4.getInt(2) === 8)

// Verify that the underlying vector is a LargeListVector
import org.apache.arrow.vector.complex.LargeListVector
Move this import to the import group in the header?
moved to the header!
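As a self-contained illustration of that kind of check, here is a sketch against the Arrow Java API (assuming `LargeListVector.empty` mirrors `ListVector.empty`; the suite itself asserts on the vector produced by `ArrowWriter`, not one built directly):

```scala
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.complex.LargeListVector

object LargeListVectorCheck extends App {
  val allocator = new RootAllocator(Long.MaxValue)
  // Build a large-list vector directly, then assert on its runtime type,
  // which is the shape of the check quoted above.
  val vector = LargeListVector.empty("arr", allocator)
  assert(vector.isInstanceOf[LargeListVector])
  vector.close()
  allocator.close()
}
```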
What changes were proposed in this pull request?

Adds a new config `spark.sql.execution.arrow.useLargeListType` that uses the `LargeList` Arrow type for array columns in Arrow-based operations (UDFs, Pandas conversions, etc.). With the config enabled, Arrow uses `LargeListVector` instead of the regular `ListVector`. The config is disabled by default to maintain the current behavior.
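As a usage sketch (the config key comes from this PR; the rest is standard `SparkSession` API):

```scala
// Opt in to 8-byte list offsets for Arrow-based operations on this session.
spark.conf.set("spark.sql.execution.arrow.useLargeListType", "true")
// Subsequent Arrow conversions write array columns as LargeListVector
// instead of ListVector.
```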
Why are the changes needed?

`ListVector` has a size limit of 2 GiB for a single array column in a record batch, because it uses 4-byte integers to track the offsets of each array value in the vector. Operations on large or deeply nested arrays can hit this limit; the most affected scenario is `applyInPandas` with array columns, since the entire group is sent as a single RecordBatch. The `LargeListVector` type uses an 8-byte long to track array value offsets, removing the 2 GiB limit per array column.
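The arithmetic behind that limit, as a short sketch:

```scala
// ListVector offsets are 4-byte signed ints, so a single array column in
// a record batch is capped at:
val listLimitBytes: Long = Int.MaxValue       // 2^31 - 1 bytes, about 2 GiB
// LargeListVector offsets are 8-byte longs, lifting the cap to:
val largeListLimitBytes: Long = Long.MaxValue // 2^63 - 1 bytes
```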
Does this PR introduce any user-facing change?

Yes, it adds a new configuration option, `spark.sql.execution.arrow.useLargeListType`, that lets users work around what currently surfaces as an `IndexOutOfBoundsException` or a segmentation fault when processing large array columns. Hitting that exception is a symptom of this limitation and a cue to switch to the large list type.
How was this patch tested?

- New tests in `ArrowUtilsSuite` for schema conversion with large list types
- New tests in `ArrowWriterSuite` for writing data with `LargeListVector`
- A new test in `ArrowTestsMixin` (`test_large_list_type_config`) that verifies the config works correctly with Pandas conversion

Was this patch authored or co-authored using generative AI tooling?
Yes, tests generated by Copilot.