What is the problem the feature request solves?
Spark provides multiple ways to run arrow-backed UDFs. The current 3.5 supports `mapInArrow`; the future 4.0 will also add `applyInArrow`.

My understanding of how it works in Spark under the hood is quite limited, so correct me if I'm wrong. At the moment, if Spark sees `PythonMapInArrow` in the plan, it will internally convert the rows to arrow batches, which are a columnar representation of the data: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/python/MapInBatchExec.scala#L36
That is a minimal example of running `mapInArrow` in Spark 3.4. If I try to run it with Comet enabled, it generates the following physical plan:
If I understand it right, the following happens:

1. Comet converts the arrow-backed columnar data to row-oriented JVM data
2. Spark internally converts the row-oriented JVM data back to arrow-backed columnar data
3. The columnar data is passed to `PythonMapInArrow`
It seems to me that points 2-3 are redundant: the arrow batches required for `mapInArrow` could be created directly from the Comet arrow-backed columns, and that operation should be more or less zero-copy. And actually the reverse conversion from a Spark columnar batch to a Comet columnar batch may be zero-copy too, so in theory Comet does not need to fall back to Spark in this case, right?

Describe the potential solution
I do not know an exact solution; it is mostly a question.
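To illustrate the zero-copy intuition behind the question, here is a stdlib-only analogy (not Comet or Arrow code): re-exposing an existing buffer through a view costs nothing, while the row round-trip described above materializes every value twice.

```python
# Stdlib-only analogy for the zero-copy idea: a memoryview re-exposes the
# same underlying buffer without copying, the way an Arrow-to-Arrow handoff
# between Comet and the Python worker could in principle work.
import array

col = array.array("q", [1, 2, 3, 4])  # "columnar" buffer of 64-bit ints
view = memoryview(col)                # zero-copy view over the same memory

# Writing through the original is visible through the view: proof that
# creating the view copied nothing.
col[0] = 42
print(view[0])  # → 42

# By contrast, a row-oriented round-trip materializes every value into
# new per-row objects and then back into a fresh columnar buffer:
rows = [(v,) for v in col]                     # columnar -> rows (copy)
back = array.array("q", (r[0] for r in rows))  # rows -> columnar (copy)
```

In the real system the buffers would be Arrow arrays rather than `array.array`, but the cost asymmetry is the same: the view is O(1), the round-trip is O(n) twice plus per-row object overhead.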
Additional context
I'm willing to implement this myself and I'm ready to work on it, but I need guidance and help with the overall design of how it should be done (if it is feasible).
Native support for arrow-backed UDFs opens a lot of new cool ways of using Comet. I see that it could give a huge boost to most ML/MLOps tasks that are typically done in Spark via arrow-backed UDFs (`pandas`, `polars`, `pyarrow`, or even Rust code built with `maturin` into a Python module).