You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Is your feature request related to a problem? Please describe.
Performing an intersect operation adds a Distinct operation to the query plan, however in situations where Distinct has already been applied to the values undergoing the intersect operation, this adds a redundant Distinct operation. For example, the query:
importpandasaspdfromdask_sqlimportContextdf=pd.DataFrame({"a":[1,2,2,3,3], "b":[2,3,3,4,4]})
c=Context()
c.create_table("df", df)
c.explain("SELECT DISTINCT a FROM df INTERSECT SELECT DISTINCT b FROM df")
where there are 3 Distinct operations present.
Another example of this is in query 38 where the query:
select count(*) from (
select distinct c_last_name, c_first_name, d_date
from store_sales, date_dim, customer
where store_sales.ss_sold_date_sk = date_dim.d_date_sk
and store_sales.ss_customer_sk = customer.c_customer_sk
and d_month_seq between 1189 and 1189 + 11
intersect
select distinct c_last_name, c_first_name, d_date
from catalog_sales, date_dim, customer
where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
and catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
and d_month_seq between 1189 and 1189 + 11
intersect
select distinct c_last_name, c_first_name, d_date
from web_sales, date_dim, customer
where web_sales.ws_sold_date_sk = date_dim.d_date_sk
and web_sales.ws_bill_customer_sk = customer.c_customer_sk
and d_month_seq between 1189 and 1189 + 11
) hot_cust
limit 100
leads to 5 distinct calls as seen in the explain plan:
Limit: skip=0, fetch=100
Projection: COUNT(UInt8(1))
Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1))]]
LeftSemi Join: customer.c_last_name = customer.c_last_name, customer.c_first_name = customer.c_first_name, date_dim.d_date = date_dim.d_date
Distinct:
LeftSemi Join: customer.c_last_name = customer.c_last_name, customer.c_first_name = customer.c_first_name, date_dim.d_date = date_dim.d_date
Distinct:
Distinct:
Projection: customer.c_last_name, customer.c_first_name, date_dim.d_date
Inner Join: store_sales.ss_customer_sk = customer.c_customer_sk
Inner Join: store_sales.ss_sold_date_sk = date_dim.d_date_sk
Filter: store_sales.ss_customer_sk IS NOT NULL AND store_sales.ss_sold_date_sk IS NOT NULL
TableScan: store_sales projection=[ss_sold_date_sk, ss_customer_sk], partial_filters=[store_sales.ss_customer_sk IS NOT NULL, store_sales.ss_sold_date_sk IS NOT NULL]
Filter: date_dim.d_date_sk IS NOT NULL AND date_dim.d_month_seq >= Int64(1189) AND date_dim.d_month_seq <= Int64(1200)
TableScan: date_dim projection=[d_date_sk, d_date, d_month_seq], partial_filters=[date_dim.d_date_sk IS NOT NULL]
Filter: customer.c_customer_sk IS NOT NULL
TableScan: customer projection=[c_customer_sk, c_first_name, c_last_name], partial_filters=[customer.c_customer_sk IS NOT NULL]
Distinct:
Projection: customer.c_last_name, customer.c_first_name, date_dim.d_date
Inner Join: catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
Inner Join: catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
Filter: catalog_sales.cs_bill_customer_sk IS NOT NULL AND catalog_sales.cs_sold_date_sk IS NOT NULL
TableScan: catalog_sales projection=[cs_sold_date_sk, cs_bill_customer_sk], partial_filters=[catalog_sales.cs_bill_customer_sk IS NOT NULL, catalog_sales.cs_sold_date_sk IS NOT NULL]
Filter: date_dim.d_date_sk IS NOT NULL AND date_dim.d_month_seq >= Int64(1189) AND date_dim.d_month_seq <= Int64(1200)
TableScan: date_dim projection=[d_date_sk, d_date, d_month_seq], partial_filters=[date_dim.d_date_sk IS NOT NULL]
Filter: customer.c_customer_sk IS NOT NULL
TableScan: customer projection=[c_customer_sk, c_first_name, c_last_name], partial_filters=[customer.c_customer_sk IS NOT NULL]
Distinct:
Projection: customer.c_last_name, customer.c_first_name, date_dim.d_date
Inner Join: web_sales.ws_bill_customer_sk = customer.c_customer_sk
Inner Join: web_sales.ws_sold_date_sk = date_dim.d_date_sk
Filter: web_sales.ws_bill_customer_sk IS NOT NULL AND web_sales.ws_sold_date_sk IS NOT NULL
TableScan: web_sales projection=[ws_sold_date_sk, ws_bill_customer_sk], partial_filters=[web_sales.ws_bill_customer_sk IS NOT NULL, web_sales.ws_sold_date_sk IS NOT NULL]
Filter: date_dim.d_date_sk IS NOT NULL AND date_dim.d_month_seq >= Int64(1189) AND date_dim.d_month_seq <= Int64(1200)
TableScan: date_dim projection=[d_date_sk, d_date, d_month_seq], partial_filters=[date_dim.d_date_sk IS NOT NULL]
Filter: customer.c_customer_sk IS NOT NULL
TableScan: customer projection=[c_customer_sk, c_first_name, c_last_name], partial_filters=[customer.c_customer_sk IS NOT NULL]
Describe the solution you'd like
I would like to get rid of these redundant Distinct operrations with an optimizer rule.
The text was updated successfully, but these errors were encountered:
Is your feature request related to a problem? Please describe.
Performing an intersect operation adds a
Distinct
operation to the query plan, however in situations where Distinct has already been applied to the values undergoing the intersect operation, this adds a redundantDistinct
operation. For example, the query:results in the explain plan:
where there are 3
Distinct
operations present.Another example of this is in query 38 where the query:
leads to 5 distinct calls as seen in the explain plan:
Describe the solution you'd like
I would like to get rid of these redundant
Distinct
operrations with an optimizer rule.The text was updated successfully, but these errors were encountered: