This issue is both a feature proposal and a question.
Currently, the process from a SQL string to a calculated dataframe is as follows:

- Use Calcite for SQL parsing and (non-optimized) relational algebra creation
- Optimize with Calcite using rule-based optimizations, resulting in a plan of `Logical*` nodes. Those relational plans are quite generic and do not know anything about dask, distributed processing, or the data
- Transform each relational algebra node one-by-one into dask API calls
- Let dask do another round of optimizations on the dask graph and execute it
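To make the rule-based step above more concrete, here is a toy sketch of a logical plan and a single rewrite rule. All class and function names here are illustrative only, not the actual Calcite or dask-sql internals:

```python
from dataclasses import dataclass

# Toy relational operators standing in for Calcite's Logical* nodes
# (illustrative only -- not the real dask-sql / Calcite classes).
@dataclass
class Scan:
    table: str

@dataclass
class Filter:
    predicate: str
    child: object

@dataclass
class Project:
    columns: list
    child: object

def merge_filters(plan):
    """A toy rule-based optimization: collapse nested Filters into one."""
    if isinstance(plan, Filter) and isinstance(plan.child, Filter):
        merged = Filter(f"({plan.predicate}) AND ({plan.child.predicate})",
                        plan.child.child)
        return merge_filters(merged)
    if isinstance(plan, Filter):
        return Filter(plan.predicate, merge_filters(plan.child))
    if isinstance(plan, Project):
        return Project(plan.columns, merge_filters(plan.child))
    return plan

# "SELECT a FROM data WHERE a > 0 AND b < 5" after naive parsing:
plan = Project(["a"], Filter("a > 0", Filter("b < 5", Scan("data"))))
print(merge_filters(plan))
```

The important point is that nothing in such a rule needs to know whether the plan will later run on a single machine or on a cluster, which is exactly the limitation discussed below.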
While this works quite well and is also used e.g. at blazingSQL, there are at least two problems with it:
- Certain optimizations are only possible if you know that distributed processing is involved: basically, you can always decide whether to do something in parallel or sequentially, and you need to define when to shuffle (or if you need shuffling at all). An example SQL query would be `SELECT ROW_NUMBER() OVER (...), 1 + SUM(x) OVER (...) FROM data`. It is much more optimal to do the shuffling needed for the two windows only once, and then just add the 1 to one of the columns in place. Currently, however, there is no way of knowing that, and the two window functions are calculated separately.
- Calcite does not know about the data, which means there cannot be any cost-based optimizations (which typically lead to much better results)
- We cannot properly implement predicate pushdown on data sources which understand it (e.g. Hive + parquet, see "Use the correct hive partition type information (hacky solution)", gallamine/dask-sql#1)
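The window-function example from the first point can be illustrated with a toy sketch. Assuming both windows share the same `PARTITION BY` key (a simplification), the naive translation shuffles once per window function, while a fused plan shuffles once in total and applies the scalar `+ 1` in place afterwards. The helper names here are made up for illustration:

```python
from collections import defaultdict

def shuffle_by_key(rows, key, counter):
    """Toy stand-in for a distributed shuffle: repartition rows by key.

    `counter` tracks how many shuffles were performed."""
    counter[0] += 1
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return partitions

def windows_naive(rows, counter):
    """Evaluate each window function with its own shuffle (two in total)."""
    row_numbers = {
        part: list(range(1, len(grp) + 1))                 # ROW_NUMBER()
        for part, grp in shuffle_by_key(rows, "k", counter).items()
    }
    sums = {
        part: 1 + sum(r["x"] for r in grp)                 # 1 + SUM(x)
        for part, grp in shuffle_by_key(rows, "k", counter).items()
    }
    return row_numbers, sums

def windows_fused(rows, counter):
    """Shuffle once; evaluate both windows on the same partitions and
    add the constant 1 in place afterwards."""
    row_numbers, sums = {}, {}
    for part, grp in shuffle_by_key(rows, "k", counter).items():
        row_numbers[part] = list(range(1, len(grp) + 1))
        sums[part] = 1 + sum(r["x"] for r in grp)
    return row_numbers, sums

rows = [{"k": "a", "x": 1}, {"k": "a", "x": 2}, {"k": "b", "x": 3}]
n_counter, f_counter = [0], [0]
assert windows_naive(rows, n_counter) == windows_fused(rows, f_counter)
print(n_counter[0], f_counter[0])  # 2 shuffles vs. 1
```

The results are identical; only an optimizer that knows shuffles exist (and are expensive) can choose the fused variant.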
One possibility to solve these problems (and therefore maybe boost the speed) would be to bring distributed operations like "Shuffle" to Calcite, let it decide when to shuffle, and just copy its decision to dask. The resulting optimized plan would then look similar to e.g. Spark's output. I was hoping that we could maybe "copy" the procedure from other Apache Calcite projects and just execute their optimized physical plan, but so far I have not been able to find a good project.
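As a sketch of what such a physical plan could look like once the shuffle is explicit, here is a toy plan tree rendered roughly like Spark's `EXPLAIN` output. The node names are invented for illustration and do not follow actual Calcite or Spark conventions:

```python
from dataclasses import dataclass, field

# Hypothetical physical operators, including an explicit distributed
# Exchange (shuffle) node -- names are illustrative, not Calcite's.
@dataclass
class PhysicalNode:
    name: str
    children: list = field(default_factory=list)

def explain(node, indent=0):
    """Render the plan tree as an indented list of operator lines."""
    lines = ["  " * indent + node.name]
    for child in node.children:
        lines.extend(explain(child, indent + 1))
    return lines

scan = PhysicalNode("TableScan(data)")
# A single Exchange feeds both window computations, so dask would only
# have to shuffle once when copying this decision from the optimizer.
exchange = PhysicalNode("Exchange(hash partitioning by window key)", [scan])
windows = PhysicalNode("Window(ROW_NUMBER(), SUM(x))", [exchange])
project = PhysicalNode("Project(row_number, 1 + sum_x)", [windows])

print("\n".join(explain(project)))
```

The key property is that the plan contains exactly one `Exchange` node for both window functions; dask would then only need to translate the operators one-by-one, since the shuffle decision has already been made.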