[SPARK-53771][PYTHON][ARROW] Add support for large list type in Arrow conversion #52498
Conversation
sql/catalyst/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala (review thread: outdated, resolved)
Force-pushed from 2c746b1 to 3d88a0b
      errorOnDuplicatedFieldNames: Boolean,
-     largeVarTypes: Boolean) {
+     largeVarTypes: Boolean,
+     largeListType: Boolean = false) {
Maybe we shouldn't use the default value to avoid unexpected calls?
I have removed it.
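For context, here is a minimal, hypothetical Scala sketch (not the actual ArrowWriter code) of the pitfall the reviewer is pointing at: a default value lets stale call sites keep compiling instead of being forced to choose.

```scala
object DefaultParamPitfall {
  // Hypothetical stand-in for the real config-carrying signature.
  case class WriterConfig(
      errorOnDuplicatedFieldNames: Boolean,
      largeVarTypes: Boolean,
      largeListType: Boolean = false)

  // A pre-existing call site still compiles and silently gets
  // largeListType = false, even where true may have been intended:
  val silentlyOutdated = WriterConfig(
    errorOnDuplicatedFieldNames = true,
    largeVarTypes = false)

  // Removing the default turns every such call site into a compile error,
  // so each caller has to pass the new flag explicitly.
}
```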
      errorOnDuplicatedFieldNames: Boolean,
-     largeVarTypes: Boolean): Array[Byte] = {
+     largeVarTypes: Boolean,
+     largeListType: Boolean = false): Array[Byte] = {
ditto.
removed.
assert(array4.getInt(2) === 8)

// Verify that the underlying vector is a LargeListVector
import org.apache.arrow.vector.complex.LargeListVector
Move this import to the import group in the header?
moved to the header!
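As a self-contained illustration of that kind of check, here is a sketch against the Arrow Java API (assuming `LargeListVector.empty` mirrors `ListVector.empty`; the suite itself asserts on the vector produced by `ArrowWriter`, not one built directly):

```scala
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.complex.LargeListVector

object LargeListVectorCheck extends App {
  val allocator = new RootAllocator(Long.MaxValue)
  // Build a large-list vector directly, then assert on its runtime type,
  // which is the shape of the check quoted above.
  val vector = LargeListVector.empty("arr", allocator)
  assert(vector.isInstanceOf[LargeListVector])
  vector.close()
  allocator.close()
}
```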
What changes were proposed in this pull request?

Adds a new config `spark.sql.execution.arrow.useLargeListType` that uses the `LargeList` Arrow type for array columns in Arrow-based operations (UDFs, Pandas conversions, etc.). With the config enabled, Arrow uses `LargeListVector` instead of the regular `ListVector`. The config is disabled by default to maintain the current behavior.
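As a usage sketch (the config key comes from this PR; the rest is standard `SparkSession` API):

```scala
// Opt in to 8-byte list offsets for Arrow-based operations on this session.
spark.conf.set("spark.sql.execution.arrow.useLargeListType", "true")
// Subsequent Arrow conversions write array columns as LargeListVector
// instead of ListVector.
```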
Why are the changes needed?

`ListVector` has a size limit of 2 GiB for a single array column in a record batch, because it uses 4-byte integers to track the offsets of each array value in the vector. Operations on large or deeply nested arrays can hit this limit; the most affected scenario is `applyInPandas` with array columns, since the entire group is sent as a single RecordBatch. The `LargeListVector` type uses an 8-byte long to track array value offsets, removing the 2 GiB limit per array column.
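The arithmetic behind that limit, as a short sketch:

```scala
// ListVector offsets are 4-byte signed ints, so a single array column in
// a record batch is capped at:
val listLimitBytes: Long = Int.MaxValue       // 2^31 - 1 bytes, about 2 GiB
// LargeListVector offsets are 8-byte longs, lifting the cap to:
val largeListLimitBytes: Long = Long.MaxValue // 2^63 - 1 bytes
```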
Does this PR introduce any user-facing change?

Yes, it adds a new configuration option, `spark.sql.execution.arrow.useLargeListType`, that lets users work around what currently surfaces as an `IndexOutOfBoundsException` or a segmentation fault when processing large array columns. Hitting that exception is a symptom of this limitation and a cue to switch to the large list type.
How was this patch tested?

- New tests in `ArrowUtilsSuite` for schema conversion with large list types
- New tests in `ArrowWriterSuite` for writing data with `LargeListVector`
- A new test in `ArrowTestsMixin` (`test_large_list_type_config`) that verifies the config works correctly with Pandas conversion

Was this patch authored or co-authored using generative AI tooling?
Yes, tests generated by Copilot.