Conversation

@Yicong-Huang (Contributor) commented Oct 23, 2025

What changes were proposed in this pull request?

This PR adds support for the Iterator[pandas.DataFrame] API in groupBy().applyInPandas(), enabling batch-by-batch processing of grouped data for improved memory efficiency and scalability.

Key Changes:

  1. New PythonEvalType: Added SQL_GROUPED_MAP_PANDAS_ITER_UDF (216) to distinguish iterator-based UDFs from standard grouped map UDFs

  2. Type Inference: Implemented automatic detection of iterator signatures (see the sketch after this list):

    • Iterator[pd.DataFrame] -> Iterator[pd.DataFrame]
    • Tuple[Any, ...], Iterator[pd.DataFrame] -> Iterator[pd.DataFrame]
  3. Streaming Serialization: Created GroupPandasIterUDFSerializer that streams results without materializing all DataFrames in memory

  4. Configuration Change: Updated FlatMapGroupsInPandasExec, which was hardcoding pythonEvalType = 201 instead of extracting it from the UDF expression (mirroring the fix already made in FlatMapGroupsInArrowExec)
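
For illustration, here is a minimal standalone sketch of how such type-hint-based detection could work. The PR's diff references a helper named infer_group_pandas_eval_type_from_func in pyspark.sql.pandas.typehints; the names below (uses_iterator_api, _is_iter_of_pandas_df) are hypothetical and this is an approximation, not the PR's actual implementation:

import collections.abc
import inspect
from typing import Any, Iterator, get_args, get_origin, get_type_hints

import pandas as pd

def _is_iter_of_pandas_df(hint: Any) -> bool:
    # typing.Iterator[...] resolves its origin to collections.abc.Iterator
    return get_origin(hint) is collections.abc.Iterator and get_args(hint) == (pd.DataFrame,)

def uses_iterator_api(func) -> bool:
    # Hypothetical check: True if the UDF's hints match one of the two iterator signatures
    hints = get_type_hints(func)
    params = list(inspect.signature(func).parameters)
    if not _is_iter_of_pandas_df(hints.get("return")):
        return False
    if len(params) == 1:
        # Iterator[pd.DataFrame] -> Iterator[pd.DataFrame]
        return _is_iter_of_pandas_df(hints.get(params[0]))
    if len(params) == 2:
        # (Tuple[Any, ...], Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]
        return _is_iter_of_pandas_df(hints.get(params[1]))
    return False

In this sketch, a UDF matching neither form would be treated as the existing DataFrame-to-DataFrame variant.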

Why are the changes needed?

The existing applyInPandas() API loads entire groups into memory as single DataFrames. For large groups, this can cause OOM errors. The iterator API allows:

  • Memory Efficiency: Process data batch-by-batch instead of materializing entire groups
  • Scalability: Handle arbitrarily large groups that don't fit in memory
  • Consistency: Mirrors the existing applyInArrow() iterator API design

Does this PR introduce any user-facing changes?

Yes, this PR adds a new API variant for applyInPandas():

Before (existing API, still supported):

import pandas as pd

def normalize(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.assign(v=(pdf.v - pdf.v.mean()) / pdf.v.std())

df.groupBy("id").applyInPandas(normalize, schema="id long, v double")

After (new iterator API):

from typing import Iterator

import pandas as pd

def normalize(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Process data batch-by-batch
    for batch in batches:
        yield batch.assign(v=(batch.v - batch.v.mean()) / batch.v.std())

df.groupBy("id").applyInPandas(normalize, schema="id long, v double")

With Grouping Keys:

from typing import Iterator, Tuple, Any

import pandas as pd

def sum_by_key(key: Tuple[Any, ...], batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    total = 0
    for batch in batches:
        total += batch['v'].sum()
    yield pd.DataFrame({"id": [key[0]], "total": [total]})

df.groupBy("id").applyInPandas(sum_by_key, schema="id long, total double")

Backward Compatibility: The existing DataFrame-to-DataFrame API is fully preserved and continues to work without changes.

How was this patch tested?

  • Added test_apply_in_pandas_iterator_basic - Basic functionality test
  • Added test_apply_in_pandas_iterator_with_keys - Test with grouping keys
  • Added test_apply_in_pandas_iterator_batch_slicing - Pressure test with 10M rows, 20 columns
  • Added test_apply_in_pandas_iterator_with_keys_batch_slicing - Pressure test with keys

Was this patch authored or co-authored using generative AI tooling?

Yes, tests generated by Cursor.

@Yicong-Huang changed the title [WIP][SPARK-53614] Add applyInPandas [WIP][SPARK-53614] Add Iterator[pandas.DataFrame] support to applyInPandas Oct 27, 2025
@Yicong-Huang changed the title [WIP][SPARK-53614] Add Iterator[pandas.DataFrame] support to applyInPandas [SPARK-53614] Add Iterator[pandas.DataFrame] support to applyInPandas Oct 27, 2025
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.pandas.typehints import infer_group_pandas_eval_type_from_func
from pyspark.sql.pandas.functions import PythonEvalType
import warnings
Contributor

no need to re-import PythonEvalType and warnings

Contributor Author

removed


# Yield the generator for this group
# The generator must be fully consumed before the next group is processed
yield series_batches_gen
Contributor

let's keep in line with

import pyarrow as pa

def process_group(batches: "Iterator[pa.RecordBatch]"):
    for batch in batches:
        struct = batch.column(0)
        yield pa.RecordBatch.from_arrays(struct.flatten(), schema=pa.schema(struct.type))

dataframes_in_group = None
while dataframes_in_group is None or dataframes_in_group > 0:
    dataframes_in_group = read_int(stream)
    if dataframes_in_group == 1:
        batch_iter = process_group(ArrowStreamSerializer.load_stream(self, stream))
        yield batch_iter
        # Make sure the batches are fully iterated before getting the next group
        for _ in batch_iter:
            pass
    elif dataframes_in_group != 0:
        raise PySparkValueError(
            errorClass="INVALID_NUMBER_OF_DATAFRAMES_IN_GROUP",
            messageParameters={"dataframes_in_group": str(dataframes_in_group)},
        )

Contributor Author

aligned

)

# Yield the generator for this group
# The generator must be fully consumed before the next group is processed
Contributor

this is not true; a UDF can partially consume a group. Please refer to GroupArrowUDFSerializer and the test:

def test_apply_in_arrow_partial_iteration(self):
    with self.sql_conf({"spark.sql.execution.arrow.maxRecordsPerBatch": 2}):

        def func(group: Iterator[pa.RecordBatch]) -> Iterator[pa.RecordBatch]:
            first = next(group)
            yield pa.RecordBatch.from_pylist(
                [{"value": r.as_py() % 4} for r in first.column(0)]
            )

        df = self.spark.range(20)
        grouped_df = df.groupBy((col("id") % 4).cast("int"))
        # Should get two records for each group
        expected = [Row(value=x) for x in [0, 0, 1, 1, 2, 2, 3, 3]]
        actual = grouped_df.applyInArrow(func, "value long").collect()
        self.assertEqual(actual, expected)

Contributor Author

that's true. updated the comment and added a partial test.
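
For reference, a sketch of what such a partial-consumption test could look like for the pandas iterator API, mirroring test_apply_in_arrow_partial_iteration above and assuming the same test-class context and imports (Iterator, pd, col, Row); this is illustrative only, not necessarily the exact test added in the PR:

def test_apply_in_pandas_partial_iteration(self):
    with self.sql_conf({"spark.sql.execution.arrow.maxRecordsPerBatch": 2}):

        def func(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
            # Consume only the first batch of each group, then stop
            first = next(batches)
            yield pd.DataFrame({"value": first["id"] % 4})

        df = self.spark.range(20)
        grouped_df = df.groupBy((col("id") % 4).cast("int"))
        # Each group emits only its first (two-row) batch
        expected = [Row(value=x) for x in [0, 0, 1, 1, 2, 2, 3, 3]]
        actual = grouped_df.applyInPandas(func, "value long").collect()
        self.assertEqual(actual, expected)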

@zhengruifeng changed the title [SPARK-53614] Add Iterator[pandas.DataFrame] support to applyInPandas [SPARK-53614][PYTHON] Add Iterator[pandas.DataFrame] support to applyInPandas Oct 29, 2025