This issue is both a feature proposal and a question.
Currently, the process from a SQL string to a calculated dataframe is as follows:

- Use Calcite for SQL parsing and (non-optimized) relational algebra creation
- Optimize with Calcite using rule-based optimizations, resulting in a plan of `Logical*` nodes. Those relational plans are quite generic and do not know anything about dask, distributed processing, or the data
- Transform each relational algebra node one-by-one into dask API calls
- Let dask do another round of optimizations on the dask graph and execute it
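To make the rule-based step above more concrete, here is a toy sketch of a logical plan and a single rewrite rule. All class and function names here are illustrative only, not the actual Calcite or dask-sql internals:

```python
from dataclasses import dataclass

# Toy relational operators standing in for Calcite's Logical* nodes
# (illustrative only -- not the real dask-sql / Calcite classes).
@dataclass
class Scan:
    table: str

@dataclass
class Filter:
    predicate: str
    child: object

@dataclass
class Project:
    columns: list
    child: object

def merge_filters(plan):
    """A toy rule-based optimization: collapse nested Filters into one."""
    if isinstance(plan, Filter) and isinstance(plan.child, Filter):
        merged = Filter(f"({plan.predicate}) AND ({plan.child.predicate})",
                        plan.child.child)
        return merge_filters(merged)
    if isinstance(plan, Filter):
        return Filter(plan.predicate, merge_filters(plan.child))
    if isinstance(plan, Project):
        return Project(plan.columns, merge_filters(plan.child))
    return plan

# "SELECT a FROM data WHERE a > 0 AND b < 5" after naive parsing:
plan = Project(["a"], Filter("a > 0", Filter("b < 5", Scan("data"))))
print(merge_filters(plan))
```

The important point is that nothing in such a rule needs to know whether the plan will later run on a single machine or on a cluster, which is exactly the limitation discussed below.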
While this works quite well and is also used e.g. at blazingSQL, there are at least two problems with it:
- Certain optimizations are only possible if you know that distributed processing is involved: basically, you can always decide whether to do something in parallel or sequentially, and you need to define when to shuffle (or if you need shuffling at all). An example SQL query would be `SELECT ROW_NUMBER() OVER (...), 1 + SUM(x) OVER (...) FROM data`. It is much more optimal to do the shuffling needed for the two windows only once, and then just add the 1 to one of the columns in place. Currently, however, there is no way of knowing that, and the two window functions are calculated separately.
- Calcite does not know about the data, which means there cannot be any cost-based optimizations (which typically lead to much better results)
- We cannot properly implement predicate pushdown on data sources which understand it (e.g. Hive + parquet, see "Use the correct hive partition type information (hacky solution)", gallamine/dask-sql#1)
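The window-function example from the first point can be illustrated with a toy sketch. Assuming both windows share the same `PARTITION BY` key (a simplification), the naive translation shuffles once per window function, while a fused plan shuffles once in total and applies the scalar `+ 1` in place afterwards. The helper names here are made up for illustration:

```python
from collections import defaultdict

def shuffle_by_key(rows, key, counter):
    """Toy stand-in for a distributed shuffle: repartition rows by key.

    `counter` tracks how many shuffles were performed."""
    counter[0] += 1
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return partitions

def windows_naive(rows, counter):
    """Evaluate each window function with its own shuffle (two in total)."""
    row_numbers = {
        part: list(range(1, len(grp) + 1))                 # ROW_NUMBER()
        for part, grp in shuffle_by_key(rows, "k", counter).items()
    }
    sums = {
        part: 1 + sum(r["x"] for r in grp)                 # 1 + SUM(x)
        for part, grp in shuffle_by_key(rows, "k", counter).items()
    }
    return row_numbers, sums

def windows_fused(rows, counter):
    """Shuffle once; evaluate both windows on the same partitions and
    add the constant 1 in place afterwards."""
    row_numbers, sums = {}, {}
    for part, grp in shuffle_by_key(rows, "k", counter).items():
        row_numbers[part] = list(range(1, len(grp) + 1))
        sums[part] = 1 + sum(r["x"] for r in grp)
    return row_numbers, sums

rows = [{"k": "a", "x": 1}, {"k": "a", "x": 2}, {"k": "b", "x": 3}]
n_counter, f_counter = [0], [0]
assert windows_naive(rows, n_counter) == windows_fused(rows, f_counter)
print(n_counter[0], f_counter[0])  # 2 shuffles vs. 1
```

The results are identical; only an optimizer that knows shuffles exist (and are expensive) can choose the fused variant.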
One possibility to solve these problems (and therefore maybe boost the speed) would be to bring distributed operations like "Shuffle" to Calcite, let it decide when to shuffle, and just copy its decision to dask. The resulting optimized plan would then look similar to e.g. Spark's output. I was hoping that we could maybe "copy" the procedure from other Apache Calcite projects and just execute their optimized physical plan, but so far I have not been able to find a good project.
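As a sketch of what such a physical plan could look like once the shuffle is explicit, here is a toy plan tree rendered roughly like Spark's `EXPLAIN` output. The node names are invented for illustration and do not follow actual Calcite or Spark conventions:

```python
from dataclasses import dataclass, field

# Hypothetical physical operators, including an explicit distributed
# Exchange (shuffle) node -- names are illustrative, not Calcite's.
@dataclass
class PhysicalNode:
    name: str
    children: list = field(default_factory=list)

def explain(node, indent=0):
    """Render the plan tree as an indented list of operator lines."""
    lines = ["  " * indent + node.name]
    for child in node.children:
        lines.extend(explain(child, indent + 1))
    return lines

scan = PhysicalNode("TableScan(data)")
# A single Exchange feeds both window computations, so dask would only
# have to shuffle once when copying this decision from the optimizer.
exchange = PhysicalNode("Exchange(hash partitioning by window key)", [scan])
windows = PhysicalNode("Window(ROW_NUMBER(), SUM(x))", [exchange])
project = PhysicalNode("Project(row_number, 1 + sum_x)", [windows])

print("\n".join(explain(project)))
```

The key property is that the plan contains exactly one `Exchange` node for both window functions; dask would then only need to translate the operators one-by-one, since the shuffle decision has already been made.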