[SPARK-53614][PYTHON] Add Iterator[pandas.DataFrame] support to applyInPandas
#52716
base: master
Conversation
```python
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.pandas.typehints import infer_group_pandas_eval_type_from_func
from pyspark.sql.pandas.functions import PythonEvalType
import warnings
```
No need to re-import `PythonEvalType` and `warnings`.
Removed.
```python
# Yield the generator for this group
# The generator must be fully consumed before the next group is processed
yield series_batches_gen
```
Let's keep this in line with:
spark/python/pyspark/sql/pandas/serializers.py
Lines 1123 to 1146 in 7bd18e3
```python
import pyarrow as pa

def process_group(batches: "Iterator[pa.RecordBatch]"):
    for batch in batches:
        struct = batch.column(0)
        yield pa.RecordBatch.from_arrays(struct.flatten(), schema=pa.schema(struct.type))

dataframes_in_group = None
while dataframes_in_group is None or dataframes_in_group > 0:
    dataframes_in_group = read_int(stream)
    if dataframes_in_group == 1:
        batch_iter = process_group(ArrowStreamSerializer.load_stream(self, stream))
        yield batch_iter
        # Make sure the batches are fully iterated before getting the next group
        for _ in batch_iter:
            pass
    elif dataframes_in_group != 0:
        raise PySparkValueError(
            errorClass="INVALID_NUMBER_OF_DATAFRAMES_IN_GROUP",
            messageParameters={"dataframes_in_group": str(dataframes_in_group)},
        )
```
Aligned.
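For reference, here is a sketch of what the aligned pandas-side loop might look like, assuming the new serializer mirrors the arrow version quoted above and reuses the same helpers from serializers.py (`read_int`, `ArrowStreamSerializer`, `PySparkValueError`); the actual code in the PR may differ:

```python
import pyarrow as pa

def load_stream(self, stream):
    # Same group framing as the arrow serializer: an int prefix per group
    # (1 = a group of batches follows, 0 = end of stream)
    def process_group(batches):
        for batch in batches:
            struct = batch.column(0)
            flat = pa.RecordBatch.from_arrays(struct.flatten(), schema=pa.schema(struct.type))
            # Convert each arrow batch to a pandas DataFrame for the UDF
            yield flat.to_pandas()

    dataframes_in_group = None
    while dataframes_in_group is None or dataframes_in_group > 0:
        dataframes_in_group = read_int(stream)
        if dataframes_in_group == 1:
            frame_iter = process_group(ArrowStreamSerializer.load_stream(self, stream))
            yield frame_iter
            # Drain whatever the UDF left unconsumed before the next group
            for _ in frame_iter:
                pass
        elif dataframes_in_group != 0:
            raise PySparkValueError(
                errorClass="INVALID_NUMBER_OF_DATAFRAMES_IN_GROUP",
                messageParameters={"dataframes_in_group": str(dataframes_in_group)},
            )
```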
```python
)

# Yield the generator for this group
# The generator must be fully consumed before the next group is processed
```
This is not true; a UDF can partially consume a group. Please refer to `GroupArrowUDFSerializer` and the test:
spark/python/pyspark/sql/tests/arrow/test_arrow_grouped_map.py
Lines 327 to 343 in 85c9fd1
```python
def test_apply_in_arrow_partial_iteration(self):
    with self.sql_conf({"spark.sql.execution.arrow.maxRecordsPerBatch": 2}):

        def func(group: Iterator[pa.RecordBatch]) -> Iterator[pa.RecordBatch]:
            first = next(group)
            yield pa.RecordBatch.from_pylist(
                [{"value": r.as_py() % 4} for r in first.column(0)]
            )

        df = self.spark.range(20)
        grouped_df = df.groupBy((col("id") % 4).cast("int"))
        # Should get two records for each group
        expected = [Row(value=x) for x in [0, 0, 1, 1, 2, 2, 3, 3]]
        actual = grouped_df.applyInArrow(func, "value long").collect()
        self.assertEqual(actual, expected)
```
That's true. Updated the comment and added a partial-consumption test.
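A pandas-side analogue of that arrow test, sketched here under the assumption that the new test mirrors the arrow version (the test actually added in this PR may use a different name and body):

```python
from typing import Iterator

import pandas as pd
from pyspark.sql import Row
from pyspark.sql.functions import col

def test_apply_in_pandas_iterator_partial_iteration(self):
    with self.sql_conf({"spark.sql.execution.arrow.maxRecordsPerBatch": 2}):

        def func(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
            # Consume only the first batch; the serializer must drain the rest
            first = next(batches)
            yield pd.DataFrame({"value": first["id"] % 4})

        df = self.spark.range(20)
        grouped_df = df.groupBy((col("id") % 4).cast("int"))
        # maxRecordsPerBatch=2 caps the first batch, so two records per group
        expected = [Row(value=x) for x in [0, 0, 1, 1, 2, 2, 3, 3]]
        actual = grouped_df.applyInPandas(func, "value long").collect()
        self.assertEqual(actual, expected)
```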
What changes were proposed in this pull request?
This PR adds support for the `Iterator[pandas.DataFrame]` API in `groupBy().applyInPandas()`, enabling batch-by-batch processing of grouped data for improved memory efficiency and scalability.

Key Changes:

- New PythonEvalType: Added `SQL_GROUPED_MAP_PANDAS_ITER_UDF` (216) to distinguish iterator-based UDFs from standard grouped map UDFs
- Type Inference: Implemented automatic detection of iterator signatures:
  - `Iterator[pd.DataFrame] -> Iterator[pd.DataFrame]`
  - `Tuple[Any, ...], Iterator[pd.DataFrame] -> Iterator[pd.DataFrame]`
- Streaming Serialization: Created `GroupPandasIterUDFSerializer`, which streams results without materializing all DataFrames in memory
- Configuration Change: Updated `FlatMapGroupsInPandasExec`, which was hardcoding `pythonEvalType = 201` instead of extracting it from the UDF expression (mirrored fix from `FlatMapGroupsInArrowExec`)

Why are the changes needed?
The existing `applyInPandas()` API loads entire groups into memory as single DataFrames. For large groups, this can cause OOM errors. The iterator API allows:

- batch-by-batch processing of grouped data, without materializing the whole group at once
- consistency with the `applyInArrow()` iterator API design

Does this PR introduce any user-facing changes?
Yes, this PR adds a new API variant for `applyInPandas()`:

Before (existing API, still supported):
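A minimal sketch of the existing whole-group usage, for illustration (the function name `subtract_mean`, the sample data, and the active `spark` session are assumptions, not taken from this PR):

```python
import pandas as pd

def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    # The entire group arrives as one DataFrame, held fully in memory
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())

df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0)], ("id", "v"))
df.groupBy("id").applyInPandas(subtract_mean, schema="id long, v double").show()
```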
After (new iterator API):
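Based on the `Iterator[pd.DataFrame] -> Iterator[pd.DataFrame]` signature described above, a sketch of the new variant (the function name is illustrative; `df` is reused from the previous sketch):

```python
from typing import Iterator
import pandas as pd

def double_v(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Each group arrives as an iterator of DataFrame batches; process and
    # yield batch by batch instead of materializing the whole group
    for pdf in batches:
        yield pdf.assign(v=pdf["v"] * 2)

df.groupBy("id").applyInPandas(double_v, schema="id long, v double").show()
```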
With Grouping Keys:
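Based on the `Tuple[Any, ...], Iterator[pd.DataFrame] -> Iterator[pd.DataFrame]` signature described above, a sketch of the keyed variant (again illustrative, reusing `df`):

```python
from typing import Any, Iterator, Tuple
import pandas as pd

def tag_with_key(
    key: Tuple[Any, ...], batches: Iterator[pd.DataFrame]
) -> Iterator[pd.DataFrame]:
    # `key` is a tuple of the grouping column values for this group
    (group_id,) = key
    for pdf in batches:
        yield pdf.assign(id=group_id)

df.groupBy("id").applyInPandas(tag_with_key, schema="id long, v double").show()
```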
Backward Compatibility: The existing DataFrame-to-DataFrame API is fully preserved and continues to work without changes.
How was this patch tested?
- `test_apply_in_pandas_iterator_basic` - Basic functionality test
- `test_apply_in_pandas_iterator_with_keys` - Test with grouping keys
- `test_apply_in_pandas_iterator_batch_slicing` - Pressure test with 10M rows, 20 columns
- `test_apply_in_pandas_iterator_with_keys_batch_slicing` - Pressure test with keys

Was this patch authored or co-authored using generative AI tooling?
Yes, the tests were generated by Cursor.