[SPARK-49557][SQL] Add SQL pipe syntax for the WHERE operator #48091

Closed
wants to merge 16 commits into from

Contributor

@dtenedor dtenedor commented Sep 12, 2024

What changes were proposed in this pull request?

This PR adds SQL pipe syntax support for the WHERE operator.

For example:

CREATE TABLE t(x INT, y STRING) USING CSV;
INSERT INTO t VALUES (0, 'abc'), (1, 'def');

CREATE TABLE other(a INT, b INT) USING JSON;
INSERT INTO other VALUES (1, 1), (1, 2), (2, 4);

TABLE t
|> WHERE x + LENGTH(y) < 4;

0	abc

TABLE t
|> WHERE (SELECT ANY_VALUE(a) FROM other WHERE x = a LIMIT 1) = 1;

1	def

TABLE t
|> WHERE SUM(x) = 1;

Error: aggregate functions are not allowed in the pipe operator |> WHERE clause
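
To check the first two results by hand, here is a minimal sketch in plain Scala collections that models what those queries compute. It is not Spark code: `TRow`, `OtherRow`, `anyValueA`, and `PipeWhereModel` are hypothetical names introduced for illustration, and SQL NULL is modeled with `Option`.

```scala
// Plain-Scala model of the first two example queries; this is not Spark code.
// TRow, OtherRow, anyValueA, and PipeWhereModel are hypothetical names.
object PipeWhereModel {
  final case class TRow(x: Int, y: String)
  final case class OtherRow(a: Int, b: Int)

  // Same rows as the INSERT statements above.
  val t: List[TRow] = List(TRow(0, "abc"), TRow(1, "def"))
  val other: List[OtherRow] = List(OtherRow(1, 1), OtherRow(1, 2), OtherRow(2, 4))

  // TABLE t |> WHERE x + LENGTH(y) < 4
  val q1: List[TRow] = t.filter(r => r.x + r.y.length < 4)

  // Models (SELECT ANY_VALUE(a) FROM other WHERE x = a LIMIT 1).
  // The correlated scalar subquery yields NULL (None here) when no row matches.
  def anyValueA(x: Int): Option[Int] =
    other.collectFirst { case o if o.a == x => o.a }

  // TABLE t |> WHERE (<subquery>) = 1
  // A comparison against NULL is not true, so rows without a match are dropped.
  val q2: List[TRow] = t.filter(r => anyValueA(r.x).contains(1))

  def main(args: Array[String]): Unit = {
    println(q1)
    println(q2)
  }
}
```

The first filter keeps only `(0, 'abc')` because 0 + 3 < 4 while 1 + 3 is not; the second keeps only `(1, 'def')` because `other` has no row with `a = 0`.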

Why are the changes needed?

The SQL pipe operator syntax will let users compose queries in a more flexible fashion.

Does this PR introduce any user-facing change?

Yes, see above.

How was this patch tested?

This PR adds a few unit test cases but relies mostly on golden file test coverage. The golden files verify that query results stay correct as the feature is implemented, and they also let us inspect the analyzer output plans to confirm that they look right.

Was this patch authored or co-authored using generative AI tooling?

No

Commits:

exclude flaky ThriftServerQueryTestSuite for new golden file
switch to expression
switch to expression
switch to expression
moving error checking to checkanalysis
@github-actions github-actions bot added the SQL label Sep 12, 2024
@dtenedor dtenedor changed the title [WIP][SPARK-49557][SQL] Add SQL pipe syntax for the WHERE operator [SPARK-49557][SQL] Add SQL pipe syntax for the WHERE operator Sep 16, 2024
@dtenedor dtenedor marked this pull request as ready for review September 16, 2024 15:22
@dtenedor
Contributor Author

cc @cloud-fan @gengliangwang: here is the next operator, WHERE. The implementation is relatively simple, and I tried to think of as many test cases as possible.

}.getOrElse(Option(ctx.whereClause).map { c =>
// Add a table subquery boundary between the new filter and the input plan if one does not
// already exist. This helps the analyzer behave as if we had added the WHERE clause after a
// table subquery containing the input plan.
Contributor

This is great! This skips the tricky aggregate function pushdown stuff from Filter/Sort which complicates the analyzer quite a bit. We also don't need this with pipe syntax, as it's quite easy for users to filter on the aggregated query.

Contributor

That being said, it seems we don't need to add a subquery alias if the child plan is an UnresolvedRelation. We don't need to isolate the table scan node here.

Contributor Author

@dtenedor dtenedor Sep 19, 2024

We talked offline and found that updating the UnresolvedRelation pattern match to this fixes the problem:

        case u: UnresolvedRelation =>
          u

In this way we don't add another redundant SubqueryAlias when ResolveRelations will already add one. Looking at the commit that performs this update, we see the analyzer plans improve accordingly.
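
The boundary rule discussed in this thread can be sketched with a toy plan tree. This is only an illustration under stated assumptions: the case classes below are hypothetical stand-ins that borrow Catalyst's names, `withBoundary` and `pipeWhere` are invented helper names, and the alias string is a placeholder. It is not the code from this PR.

```scala
// Toy model of the subquery-boundary rule; these are NOT the real Catalyst
// classes, only hypothetical stand-ins with similar names.
object SubqueryBoundarySketch {
  sealed trait Plan
  final case class UnresolvedRelation(name: String) extends Plan
  final case class SubqueryAlias(alias: String, child: Plan) extends Plan
  final case class Project(columns: List[String], child: Plan) extends Plan
  final case class Filter(condition: String, child: Plan) extends Plan

  // Add a boundary only when the input does not already have one and will not
  // receive one later. Per the thread, ResolveRelations already adds a
  // SubqueryAlias for a relation, so a relation is returned unchanged.
  def withBoundary(input: Plan): Plan = input match {
    case s: SubqueryAlias      => s
    case u: UnresolvedRelation => u
    case other                 => SubqueryAlias("placeholder_alias", other)
  }

  // Model of `input |> WHERE condition`: the filter sits above the boundary.
  def pipeWhere(condition: String, input: Plan): Plan =
    Filter(condition, withBoundary(input))
}
```

In this model, `pipeWhere("x = 1", UnresolvedRelation("t"))` places the filter directly on the relation, while a `Project` input is wrapped in exactly one alias first.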

Contributor

ah, it also fixes a regression. We can add a test for table t |> where spark_catalog.default.t.x = 1, which didn't work before this fix.

Contributor Author

Sounds good, this is done.

@gengliangwang
Member

Thanks, merging to master

Contributor

@allisonwang-db allisonwang-db left a comment

Awesome feature!

-- Aggregations are allowed within expression subqueries in the pipe operator WHERE clause as long
-- as no aggregate functions exist in the top-level expression predicate.
table t
|> where (select any_value(a) from other where x = a limit 1) = 1;
Contributor

it also supports correlated subqueries!
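
The rule in the test comment above (aggregates are fine inside an expression subquery but rejected in the top-level predicate) can be modeled with a small expression tree. This is a hedged sketch, not the actual CheckAnalysis logic: every type and function name below is hypothetical, and only the error message text is taken from the PR description.

```scala
// Toy model of the "no top-level aggregates" check; all names are hypothetical
// and this is not the analyzer code from the PR.
object PipeWhereCheckSketch {
  sealed trait Expr
  final case class Column(name: String) extends Expr
  final case class Literal(value: Int) extends Expr
  final case class Equals(left: Expr, right: Expr) extends Expr
  final case class AggregateCall(function: String, argument: Expr) extends Expr
  // A scalar subquery is treated as opaque: aggregates inside it belong to the
  // subquery, not to the WHERE predicate that contains it.
  final case class ScalarSubquery(body: Expr) extends Expr

  def hasTopLevelAggregate(e: Expr): Boolean = e match {
    case _: AggregateCall       => true
    case _: ScalarSubquery      => false // do not descend into the subquery
    case Equals(l, r)           => hasTopLevelAggregate(l) || hasTopLevelAggregate(r)
    case _: Column | _: Literal => false
  }

  val errorMessage: String =
    "aggregate functions are not allowed in the pipe operator |> WHERE clause"

  // Left carries the error; Right returns the accepted predicate unchanged.
  def check(predicate: Expr): Either[String, Expr] =
    if (hasTopLevelAggregate(predicate)) Left(errorMessage) else Right(predicate)
}
```

Under this model, `SUM(x) = 1` is rejected, while a predicate comparing a subquery that contains `ANY_VALUE(a)` against `1` is accepted.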
