[ENH] Create an optimizer rule to remove redundant Distinct calls #925

Open
ChrisJar opened this issue Nov 17, 2022 · 0 comments · May be fixed by #1008
Labels: enhancement (New feature or request), needs triage (Awaiting triage by a dask-sql maintainer)

ChrisJar commented Nov 17, 2022

Is your feature request related to a problem? Please describe.
Performing an INTERSECT adds a Distinct operation to the query plan. When Distinct has already been applied to the inputs of the INTERSECT, the added Distinct is redundant. For example, the query:

import pandas as pd
from dask_sql import Context

df = pd.DataFrame({"a":[1,2,2,3,3], "b":[2,3,3,4,4]})
c = Context()
c.create_table("df", df)

c.explain("SELECT DISTINCT a FROM df INTERSECT SELECT DISTINCT b FROM df")

results in the explain plan:

'LeftSemi Join: df.a = df.b
  Distinct:
    Distinct:
      TableScan: df projection=[a, b]
  Distinct:
    TableScan: df projection=[a, b]'

This plan contains three Distinct operations, two of them stacked directly on top of each other on the left input of the join.
Another example is query 38, where the query:

select count(*) from (
    select distinct c_last_name, c_first_name, d_date
    from store_sales, date_dim, customer
    where store_sales.ss_sold_date_sk = date_dim.d_date_sk
      and store_sales.ss_customer_sk = customer.c_customer_sk
      and d_month_seq between 1189 and 1189 + 11
  intersect
    select distinct c_last_name, c_first_name, d_date
    from catalog_sales, date_dim, customer
    where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
      and catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
      and d_month_seq between 1189 and 1189 + 11
  intersect
    select distinct c_last_name, c_first_name, d_date
    from web_sales, date_dim, customer
    where web_sales.ws_sold_date_sk = date_dim.d_date_sk
      and web_sales.ws_bill_customer_sk = customer.c_customer_sk
      and d_month_seq between 1189 and 1189 + 11
) hot_cust
limit 100

leads to five Distinct operations, as seen in the explain plan:

Limit: skip=0, fetch=100
  Projection: COUNT(UInt8(1))
    Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1))]]
      LeftSemi Join: customer.c_last_name = customer.c_last_name, customer.c_first_name = customer.c_first_name, date_dim.d_date = date_dim.d_date
        Distinct:
          LeftSemi Join: customer.c_last_name = customer.c_last_name, customer.c_first_name = customer.c_first_name, date_dim.d_date = date_dim.d_date
            Distinct:
              Distinct:
                Projection: customer.c_last_name, customer.c_first_name, date_dim.d_date
                  Inner Join: store_sales.ss_customer_sk = customer.c_customer_sk
                    Inner Join: store_sales.ss_sold_date_sk = date_dim.d_date_sk
                      Filter: store_sales.ss_customer_sk IS NOT NULL AND store_sales.ss_sold_date_sk IS NOT NULL
                        TableScan: store_sales projection=[ss_sold_date_sk, ss_customer_sk], partial_filters=[store_sales.ss_customer_sk IS NOT NULL, store_sales.ss_sold_date_sk IS NOT NULL]
                      Filter: date_dim.d_date_sk IS NOT NULL AND date_dim.d_month_seq >= Int64(1189) AND date_dim.d_month_seq <= Int64(1200)
                        TableScan: date_dim projection=[d_date_sk, d_date, d_month_seq], partial_filters=[date_dim.d_date_sk IS NOT NULL]
                    Filter: customer.c_customer_sk IS NOT NULL
                      TableScan: customer projection=[c_customer_sk, c_first_name, c_last_name], partial_filters=[customer.c_customer_sk IS NOT NULL]
            Distinct:
              Projection: customer.c_last_name, customer.c_first_name, date_dim.d_date
                Inner Join: catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
                  Inner Join: catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
                    Filter: catalog_sales.cs_bill_customer_sk IS NOT NULL AND catalog_sales.cs_sold_date_sk IS NOT NULL
                      TableScan: catalog_sales projection=[cs_sold_date_sk, cs_bill_customer_sk], partial_filters=[catalog_sales.cs_bill_customer_sk IS NOT NULL, catalog_sales.cs_sold_date_sk IS NOT NULL]
                    Filter: date_dim.d_date_sk IS NOT NULL AND date_dim.d_month_seq >= Int64(1189) AND date_dim.d_month_seq <= Int64(1200)
                      TableScan: date_dim projection=[d_date_sk, d_date, d_month_seq], partial_filters=[date_dim.d_date_sk IS NOT NULL]
                  Filter: customer.c_customer_sk IS NOT NULL
                    TableScan: customer projection=[c_customer_sk, c_first_name, c_last_name], partial_filters=[customer.c_customer_sk IS NOT NULL]
        Distinct:
          Projection: customer.c_last_name, customer.c_first_name, date_dim.d_date
            Inner Join: web_sales.ws_bill_customer_sk = customer.c_customer_sk
              Inner Join: web_sales.ws_sold_date_sk = date_dim.d_date_sk
                Filter: web_sales.ws_bill_customer_sk IS NOT NULL AND web_sales.ws_sold_date_sk IS NOT NULL
                  TableScan: web_sales projection=[ws_sold_date_sk, ws_bill_customer_sk], partial_filters=[web_sales.ws_bill_customer_sk IS NOT NULL, web_sales.ws_sold_date_sk IS NOT NULL]
                Filter: date_dim.d_date_sk IS NOT NULL AND date_dim.d_month_seq >= Int64(1189) AND date_dim.d_month_seq <= Int64(1200)
                  TableScan: date_dim projection=[d_date_sk, d_date, d_month_seq], partial_filters=[date_dim.d_date_sk IS NOT NULL]
              Filter: customer.c_customer_sk IS NOT NULL
                TableScan: customer projection=[c_customer_sk, c_first_name, c_last_name], partial_filters=[customer.c_customer_sk IS NOT NULL]
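Stacked Distinct nodes like the ones in these plans can be spotted mechanically in the text that `c.explain(...)` returns. The helper below is a rough sketch, not part of dask-sql: `find_stacked_distincts` is a hypothetical name, and it assumes the plan text uses indentation to show nesting, as in the output above. It is shown on the smaller plan from the first example.

```python
def find_stacked_distincts(plan_text):
    """Return (total Distinct nodes, Distinct nodes whose direct child is also Distinct).

    Hypothetical helper: assumes each plan node is on its own line and that
    a child is indented further than its parent, as in the explain output above.
    """
    lines = [ln for ln in plan_text.splitlines() if ln.strip()]
    # Pair each line's indentation width with its stripped text.
    parsed = [(len(ln) - len(ln.lstrip()), ln.strip()) for ln in lines]

    total = sum(1 for _, text in parsed if text.startswith("Distinct:"))

    stacked = 0
    # Distinct has a single input, so a deeper-indented next line is its child.
    for (indent, text), (next_indent, next_text) in zip(parsed, parsed[1:]):
        if (
            text.startswith("Distinct:")
            and next_text.startswith("Distinct:")
            and next_indent > indent
        ):
            stacked += 1
    return total, stacked


small_plan = """LeftSemi Join: df.a = df.b
  Distinct:
    Distinct:
      TableScan: df projection=[a, b]
  Distinct:
    TableScan: df projection=[a, b]"""

print(find_stacked_distincts(small_plan))  # (3, 1)
```

This only catches a Distinct sitting directly on another Distinct. It would not flag the Distinct above the inner LeftSemi Join in the query 38 plan.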

Describe the solution you'd like
I would like to get rid of these redundant Distinct operations with an optimizer rule.
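To make the idea concrete, here is a minimal sketch of what such a rule might do, written against a toy plan tree rather than the real planner. Everything in it (`Node`, `produces_distinct_rows`, `remove_redundant_distinct`) is hypothetical and for illustration only. It assumes two things: a Distinct over an input that is already duplicate-free can be dropped, and a left semi join keeps its left input duplicate-free, since it only filters left rows.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Toy logical-plan node (hypothetical, not a dask-sql class)."""
    op: str
    inputs: List["Node"] = field(default_factory=list)


def produces_distinct_rows(node):
    """True if this node's output is known to contain no duplicate rows."""
    if node.op == "Distinct":
        return True
    # Assumption: a left semi join only filters rows of its left input,
    # so it preserves that input's distinctness.
    if node.op == "LeftSemiJoin":
        return produces_distinct_rows(node.inputs[0])
    return False


def remove_redundant_distinct(node):
    """Rewrite the plan bottom-up, dropping Distinct over already-distinct input."""
    inputs = [remove_redundant_distinct(child) for child in node.inputs]
    if node.op == "Distinct" and produces_distinct_rows(inputs[0]):
        return inputs[0]
    return Node(node.op, inputs)


def count_distinct(node):
    """Count Distinct nodes in the tree."""
    return (node.op == "Distinct") + sum(count_distinct(c) for c in node.inputs)


# Shape of the first explain plan in this issue.
plan = Node("LeftSemiJoin", [
    Node("Distinct", [Node("Distinct", [Node("TableScan")])]),
    Node("Distinct", [Node("TableScan")]),
])
optimized = remove_redundant_distinct(plan)
print(count_distinct(plan), count_distinct(optimized))  # 3 2
```

On a tree shaped like the query 38 plan, the same sketch would go from five Distinct nodes to three, one per `select distinct` branch. Whether the remaining Distinct on the right side of each join could also be dropped is a separate question that this sketch does not address.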
