WIP Add partitioning push down #23432

Open
wants to merge 10 commits into
base: master
@dain dain commented Sep 16, 2024

Description

Add partitioning push down to table scan which a connector can use to activate optional partitioning, or choose between multiple partitioning strategies. This replaces the existing Metadata makeCompatiblePartitioning and getCommonPartitioningHandle methods used exclusively by Hive with a more generic applyPartitioning method.
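To illustrate the shape of the push down, here is a minimal sketch of the negotiation between the engine and a connector. All types and the method signature below are hypothetical stand-ins written for this example; they are not the actual Trino SPI, and the real applyPartitioning method may look quite different.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical, simplified stand-ins; these are NOT the Trino SPI types.
class PartitioningPushDownSketch {
    // A table handle that records which partitioning, if any, the scan exposes.
    record TableHandle(String table, Optional<List<String>> activePartitioning) {}

    // Partitionings the connector could expose for a table; each stays optional until activated.
    record ConnectorTable(String name, List<List<String>> availablePartitionings) {}

    // Hypothetical analogue of applyPartitioning: the engine passes the columns it
    // would like the scan partitioned on, and the connector returns a new handle
    // only if one of its partitionings can satisfy the request.
    static Optional<TableHandle> applyPartitioning(ConnectorTable table, TableHandle handle, List<String> desiredColumns) {
        for (List<String> candidate : table.availablePartitionings()) {
            // A partitioning on a subset of the desired columns still keeps rows
            // with equal desired-column values in the same partition.
            if (!candidate.isEmpty() && desiredColumns.containsAll(candidate)) {
                return Optional.of(new TableHandle(handle.table(), Optional.of(candidate)));
            }
        }
        // Nothing usable: the scan stays unpartitioned and the plan is unchanged.
        return Optional.empty();
    }

    public static void main(String[] args) {
        ConnectorTable orders = new ConnectorTable("orders", List.of(List.of("customer_id")));
        TableHandle scan = new TableHandle("orders", Optional.empty());
        System.out.println(applyPartitioning(orders, scan, List.of("customer_id", "order_date")));
        System.out.println(applyPartitioning(orders, scan, List.of("order_date")));
    }
}
```

Returning an empty Optional when nothing matches mirrors the idea that the partitioning is optional: the connector only activates it when the engine asks for something it can provide.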

Hive has been updated to the new system, and now only applies bucketed execution when it is actually used in the coordinator. This can improve performance when parallelism is limited by the bucketing and the bucketing isn't necessary for the query.

Iceberg has also been updated to support bucketed execution. This applies the same optimizations available to Hive, which allow the engine to eliminate unnecessary redistribution of tables. Additionally, since Iceberg supports multiple independent partitioning functions, a table can effectively have multiple distributions, which makes the optimization even more effective.
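The "multiple distributions" idea can be sketched as follows: a table with several bucket transforms can serve as a pre-distributed input for a join on any one of the bucketed columns. The record and method names are hypothetical and chosen for this example only. The sketch also assumes that two sides are co-located only when their bucket counts match, since mismatched buckets are listed as follow-up work; it ignores column type compatibility.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical model of Iceberg-style bucket transforms; not connector code.
class MultiDistributionSketch {
    // One bucket transform: bucket(column, bucketCount).
    record BucketTransform(String column, int bucketCount) {}

    // Pick the bucket transform on the given join key, if the table has one.
    static Optional<BucketTransform> distributionFor(List<BucketTransform> transforms, String joinKey) {
        return transforms.stream()
                .filter(transform -> transform.column().equals(joinKey))
                .findFirst();
    }

    // In this sketch, a join needs no redistribution only when both sides are
    // bucketed on their join key with the same bucket count.
    static boolean colocated(List<BucketTransform> left, String leftKey, List<BucketTransform> right, String rightKey) {
        Optional<BucketTransform> l = distributionFor(left, leftKey);
        Optional<BucketTransform> r = distributionFor(right, rightKey);
        return l.isPresent() && r.isPresent() && l.get().bucketCount() == r.get().bucketCount();
    }

    public static void main(String[] args) {
        // "orders" is bucketed independently on two columns, so it has two usable distributions.
        List<BucketTransform> orders = List.of(new BucketTransform("customer_id", 16), new BucketTransform("product_id", 32));
        List<BucketTransform> customers = List.of(new BucketTransform("id", 16));
        List<BucketTransform> products = List.of(new BucketTransform("id", 32));
        System.out.println(colocated(orders, "customer_id", customers, "id"));
        System.out.println(colocated(orders, "product_id", products, "id"));
    }
}
```

With a single fixed bucketing, only one of the two joins in `main` could skip redistribution; with independent bucket transforms, the engine can pick whichever distribution matches the join at hand.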

Iceberg bucket execution can be controlled with the iceberg.bucket-execution-mode configuration property and the bucket_execution session property. The mode can be set to NEVER, AUTO, or ALWAYS. AUTO is the default and enables bucketed execution when the bucket count is equal to or greater than the current node count.
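The mode semantics described above can be written down as a small decision function. This is a sketch of the stated rule only; the enum and method names are assumptions made for this example and may not match the names used in the connector.

```java
// Hypothetical sketch of the NEVER / AUTO / ALWAYS rule; not the connector's code.
class BucketExecutionModeSketch {
    enum BucketExecutionMode { NEVER, AUTO, ALWAYS }

    // Decide whether a scan should use bucketed execution.
    static boolean useBucketedExecution(BucketExecutionMode mode, int bucketCount, int nodeCount) {
        return switch (mode) {
            case NEVER -> false;
            case ALWAYS -> true;
            // AUTO: only bucket when there are at least as many buckets as nodes,
            // so bucketing does not leave nodes idle.
            case AUTO -> bucketCount >= nodeCount;
        };
    }

    public static void main(String[] args) {
        System.out.println(useBucketedExecution(BucketExecutionMode.AUTO, 16, 8));
        System.out.println(useBucketedExecution(BucketExecutionMode.AUTO, 4, 8));
    }
}
```

The AUTO threshold reflects the concern raised for Hive: when there are fewer buckets than nodes, bucketed execution caps parallelism at the bucket count.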

TODO

  • Iceberg tests similar to the Hive tests
  • Change Hive bucket execution configuration and session properties to match the new Iceberg properties

Follow-up Work

  • AddExchanges does not propagate preferred partitioning through joins, which reduces the effectiveness of the compatible partitioning used in Hive and Iceberg
  • Iceberg support for mismatched buckets
  • Add stable node-bucket assignments in system assigned bucketing to improve file system caching

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(X) Release notes are required, with the following suggested text:

# SPI
* Add partitioning push down. ({issue}`issuenumber`)

# Iceberg
* Add bucketed execution which can improve performance when running a join or aggregation on a bucketed table.

@cla-bot cla-bot bot added the cla-signed label Sep 16, 2024
@github-actions github-actions bot added iceberg Iceberg connector hive Hive connector labels Sep 16, 2024
Add partitioning push down to table scan which a connector can use to
activate optional partitioning, or choose between multiple partitioning
strategies.

This replaces the existing Metadata makeCompatiblePartitioning and
getCommonPartitioningHandle methods used exclusively by Hive.

Add support for pushing plan partitioning into Iceberg when Iceberg
tables use hash bucketed partitioning. This enables co-located joins which
can be significantly more efficient. Additionally, since Iceberg
supports multiple independent partitioning functions, a table can
effectively have multiple distributions, which makes the optimization
more effective.

This feature can be controlled with the iceberg.bucket-execution-mode
configuration property and the bucket_execution session property. Mode
can be set to NEVER, AUTO, or ALWAYS. AUTO is the default and enables
bucketed execution when the bucket count is equal to or greater than the
current node count.