
perf: Apply DataFusion's projection pushdown rule #907

Draft · wants to merge 9 commits into main

Conversation

@andygrove (Member) commented on Sep 3, 2024:

Which issue does this PR close?

Closes #908

Rationale for this change

Improve the performance (and reduce the memory overhead) of HashJoinExec by pushing the projection down into the join, leveraging a rule that already exists in DataFusion.

Here is an example from TPC-DS q3.

Before

Note the two instances of ProjectionExec and that no projection is displayed in the HashJoinExec.

 AggregateExec: mode=Partial, gby=[col_0@0 as col_0, col_3@3 as col_1, col_2@2 as col_2], aggr=[sum]
  ProjectionExec: expr=[col_0@0 as col_0, col_2@2 as col_1, col_1@4 as col_2, col_2@5 as col_3]
    ProjectionExec: expr=[col_0@3 as col_0, col_1@4 as col_1, col_2@5 as col_2, col_0@0 as col_0, col_1@1 as col_1, col_2@2 as col_2]
      HashJoinExec: mode=Partitioned, join_type=Inner, on=[(col_0@0, col_1@1)]

After

Now there is only one ProjectionExec and a projection of 4 columns has been pushed into the HashJoinExec. I suspect that we could remove the remaining ProjectionExec with some additional work (in a separate PR).

 AggregateExec: mode=Partial, gby=[col_0@0 as col_0, col_3@3 as col_1, col_2@2 as col_2], aggr=[sum]
  ProjectionExec: expr=[col_0@2 as col_0, col_2@3 as col_1, col_1@0 as col_2, col_2@1 as col_3]
    HashJoinExec: mode=Partitioned, join_type=Inner, on=[(col_0@0, col_1@1)], projection=[col_1@1, col_2@2, col_0@3, col_2@5]

What changes are included in this PR?

Enable the physical optimizer with the projection pushdown rule, which can potentially fuse the projection into other operators.
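
A minimal sketch of how the rule can be wired up via SessionStateBuilder, mirroring the builder calls in the diff shown later in this conversation. The helper name build_state is hypothetical and the exact import paths depend on the DataFusion version in use:

    use std::sync::Arc;

    use datafusion::execution::runtime_env::RuntimeEnv;
    use datafusion::execution::session_state::{SessionState, SessionStateBuilder};
    use datafusion::physical_optimizer::projection_pushdown::ProjectionPushdown;
    use datafusion::prelude::SessionConfig;

    // Sketch: build a SessionState whose physical optimizer runs only the
    // ProjectionPushdown rule, mirroring the change in this PR.
    fn build_state(session_config: SessionConfig, runtime: RuntimeEnv) -> SessionState {
        SessionStateBuilder::new()
            .with_config(session_config)
            .with_runtime_env(Arc::new(runtime))
            .with_default_features()
            // Override the default physical optimizer rules with just the
            // projection pushdown rule from DataFusion.
            .with_physical_optimizer_rules(vec![Arc::new(ProjectionPushdown::new())])
            .build()
    }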

How are these changes tested?

Existing tests.

Comment on lines -141 to -151
impl Default for PhysicalPlanner {
    fn default() -> Self {
        let session_ctx = Arc::new(SessionContext::new());
        let execution_props = ExecutionProps::new();
        Self {
            exec_context_id: TEST_EXEC_CONTEXT_ID,
            execution_props,
            session_ctx,
        }
    }
}

@andygrove (Member, Author):

This code was only used in tests, so I moved it into the test module.

andygrove marked this pull request as ready for review on September 4, 2024.

 Ok(Arc::new(CopyExec {
-    input: new_input,
+    input: Arc::clone(&children[0]),

@andygrove (Member, Author):

Adding the optimizer highlighted that the previous implementation of CopyExec::with_new_children was incorrect.
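
To illustrate the pattern, here is a minimal stand-in sketch. This is not the DataFusion ExecutionPlan trait or Comet's real CopyExec; PlanNode and CopyNode are hypothetical, simplified types used only to show why the node must be rebuilt from the children the optimizer supplies:

    use std::sync::Arc;

    // Hypothetical, simplified stand-in for an execution-plan node trait.
    trait PlanNode {
        fn children(&self) -> Vec<Arc<dyn PlanNode>>;
        fn with_new_children(self: Arc<Self>, children: Vec<Arc<dyn PlanNode>>) -> Arc<dyn PlanNode>;
    }

    // Simplified stand-in for Comet's CopyExec; the real operator has more fields.
    struct CopyNode {
        input: Arc<dyn PlanNode>,
    }

    impl PlanNode for CopyNode {
        fn children(&self) -> Vec<Arc<dyn PlanNode>> {
            vec![Arc::clone(&self.input)]
        }

        fn with_new_children(self: Arc<Self>, children: Vec<Arc<dyn PlanNode>>) -> Arc<dyn PlanNode> {
            // Correct: rebuild the node from the child supplied by the optimizer.
            // The bug was to reuse `self.input` here, silently discarding the
            // rewritten child.
            Arc::new(CopyNode {
                input: Arc::clone(&children[0]),
            })
        }
    }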

@comphead (Contributor) left a comment:

Thanks @andygrove
Makes sense to me. Would you mind sharing how big the performance benefit is?

let state = SessionStateBuilder::new()
    .with_config(session_config)
    .with_runtime_env(Arc::new(runtime))
    .with_default_features()

@comphead (Contributor):

default features needed to use the planner?

@andygrove (Member, Author):

Let me see if I can remove that

@andygrove (Member, Author):

Removing this causes test failures. We were previously calling the following method to create the context, and it also enabled default features.

    pub fn new_with_config_rt(config: SessionConfig, runtime: Arc<RuntimeEnv>) -> Self {
        let state = SessionStateBuilder::new()
            .with_config(config)
            .with_runtime_env(runtime)
            .with_default_features()
            .build();
        Self::new_with_state(state)
    }

@comphead (Contributor):

Since we are only using a planner, perhaps this would be enough?

    .with_query_planner(SessionStateDefaults::default_expr_planners())

@andygrove (Member, Author):

> Thanks @andygrove. Makes sense to me. Would you mind sharing how big the performance benefit is?

I don't expect it to have much impact, but I am running some benchmarks now. I will post results later today.

@andygrove (Member, Author) commented on Sep 4, 2024:

The benchmark results are not very exciting, and the improvements could just be noise. However, there is some correlation with the improvements in q18 and q21 that were noted in apache/datafusion#9236 (comment).

[Chart: tpch_queries_speedup_abs (TPC-H per-query absolute speedups)]

    .with_config(session_config)
    .with_runtime_env(Arc::new(runtime))
    .with_default_features()
    .with_physical_optimizer_rules(vec![Arc::new(ProjectionPushdown::new())])


I wonder if other rules can be considered?

@andygrove (Member, Author):

Possibly. We are starting out with a physical plan that has already been optimized by Spark, though, so many optimizations have already been applied. We are also running queries in single partitions within a distributed cluster, so we cannot leverage anything that uses RepartitionExec.

@viirya (Member) commented on Sep 10, 2024:

> The benchmark results are not very exciting, and the improvements could just be noise. However, there is some correlation with the improvements in q18 and q21 that were noted in apache/datafusion#9236 (comment).

It makes sense. Since the projection happens after the join inside the HashJoin operator, this is effectively an early projection pulled down from the upper Projection operator. I don't expect it to gain much performance on the join itself.

andygrove marked this pull request as a draft on October 7, 2024.

Successfully merging this pull request may close the issue "Push projections into hash joins".

4 participants