Conversation

@Yicong-Huang (Contributor) commented Oct 1, 2025

What changes were proposed in this pull request?

Adds a new config, spark.sql.execution.arrow.useLargeListType, that makes Arrow-based operations (UDFs, Pandas conversions, etc.) use the LargeList Arrow type for array columns, backed by LargeListVector instead of the regular ListVector. The config is disabled by default to preserve current behavior.

Why are the changes needed?

ListVector has a size limit of 2 GiB for a single array column in a record batch. This is because it uses 4-byte integers to track the offsets of each array value in the vector. During certain operations with large or deeply nested arrays, it is possible to hit this limit. The most affected scenarios include:

  • applyInPandas operations with array columns, since the entire group is sent as a single RecordBatch
  • Deeply nested array structures (arrays within arrays within structs)
  • Operations with large array values that cannot be chunked smaller than the entire group
  • Other map and UDF operations with array data that would benefit from removing this limit

The LargeListVector type uses an 8-byte long to track array value offsets, removing the 2 GiB limit per array column.
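The arithmetic behind the two limits can be sketched in plain Python (illustrative only, not Spark or Arrow code):

```python
# Illustrative arithmetic only: why 4-byte offsets cap a single array
# column at roughly 2 GiB, while 8-byte offsets effectively remove it.

INT32_MAX = 2**31 - 1  # largest offset a ListVector's int32 buffer can hold
INT64_MAX = 2**63 - 1  # largest offset a LargeListVector's int64 buffer can hold

# The final offset in the buffer equals the total size of the child data,
# so the child data can never exceed the offset type's maximum value.
print(f"ListVector limit:      ~{INT32_MAX / 2**30:.0f} GiB")  # ~2 GiB
print(f"LargeListVector limit: ~{INT64_MAX / 2**60:.0f} EiB")  # ~8 EiB
```

In other words, the 8-byte variant trades 4 extra bytes per list entry for a bound (~8 EiB) that is unreachable in practice.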

Does this PR introduce any user-facing change?

Yes. It adds a new configuration option, spark.sql.execution.arrow.useLargeListType, that lets users work around failures that currently surface as an IndexOutOfBoundsException or a segmentation fault when processing large array columns; hitting that exception is a sign of this limitation, and the suggested remedy is to enable the large list type.
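A rough sketch of how a user might enable the option follows; the session setup and app name are illustrative, and only the config key comes from this PR:

```python
from pyspark.sql import SparkSession

# Illustrative session setup; only the config key below comes from this PR.
spark = (
    SparkSession.builder
    .appName("large-arrays")  # hypothetical app name
    # Use Arrow's LargeList type (8-byte offsets) for array columns in
    # Arrow-based paths such as toPandas() and pandas UDFs.
    .config("spark.sql.execution.arrow.useLargeListType", "true")
    .getOrCreate()
)
```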

How was this patch tested?

  • Added new unit tests in ArrowUtilsSuite for schema conversion with large list types
  • Added new unit tests in ArrowWriterSuite for writing data with LargeListVector
  • Added integration test in ArrowTestsMixin (test_large_list_type_config) that verifies the config works correctly with Pandas conversion

Was this patch authored or co-authored using generative AI tooling?

Yes, tests generated by Copilot.

@Yicong-Huang Yicong-Huang changed the title [WIP][SPARK-53771][PYTHON][ARROW] Add support for large list type in Arrow conversion [SPARK-53771][PYTHON][ARROW] Add support for large list type in Arrow conversion Oct 1, 2025
@github-actions github-actions bot added the R label Oct 1, 2025
@Yicong-Huang Yicong-Huang force-pushed the SPARK-53771/feat/add-useLargeListType-config branch from 2c746b1 to 3d88a0b on October 3, 2025 at 03:46
      errorOnDuplicatedFieldNames: Boolean,
-     largeVarTypes: Boolean) {
+     largeVarTypes: Boolean,
+     largeListType: Boolean = false) {
A reviewer (Member) commented:
Maybe we shouldn't use the default value to avoid unexpected calls?

@Yicong-Huang (Contributor, Author) replied:
I have removed it.

      errorOnDuplicatedFieldNames: Boolean,
-     largeVarTypes: Boolean): Array[Byte] = {
+     largeVarTypes: Boolean,
+     largeListType: Boolean = false): Array[Byte] = {
A reviewer (Member) commented:
ditto.

@Yicong-Huang (Contributor, Author) replied:
removed.

assert(array4.getInt(2) === 8)

// Verify that the underlying vector is a LargeListVector
import org.apache.arrow.vector.complex.LargeListVector
A reviewer (Member) commented:
Move this import to the import group in the header?

@Yicong-Huang (Contributor, Author) replied:
moved to the header!
