Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-49556][SQL] Add SQL pipe syntax for the SELECT operator #48047

Closed
wants to merge 11 commits into from

Conversation

dtenedor
Copy link
Contributor

@dtenedor dtenedor commented Sep 10, 2024

What changes were proposed in this pull request?

This PR adds SQL pipe syntax support for the SELECT operator.

For example:

CREATE TABLE t(x INT, y STRING) USING CSV;
INSERT INTO t VALUES (0, 'abc'), (1, 'def');

TABLE t
|> SELECT x, y

0	abc
1	def

TABLE t
|> SELECT x, y
|> SELECT x + LENGTH(y) AS z

3
4

(SELECT * FROM t UNION ALL SELECT * FROM t)
|> SELECT x + LENGTH(y) AS result

3
3
4
4

TABLE t
|> SELECT sum(x) AS result

Error: aggregate functions are not allowed in the pipe operator |> SELECT clause; please use the |> AGGREGATE clause instead

Why are the changes needed?

The SQL pipe operator syntax will let users compose queries in a more flexible fashion.

Does this PR introduce any user-facing change?

Yes, see above.

How was this patch tested?

This PR adds a few unit test cases, but mostly relies on golden file test coverage. I did this to make sure the answers are correct as this feature is implemented and also so we can look at the analyzer output plans to ensure they look right as well.

Was this patch authored or co-authored using generative AI tooling?

No

commit

commit
@github-actions github-actions bot added the SQL label Sep 10, 2024
@dtenedor
Copy link
Contributor Author

Copy link
Contributor Author

@dtenedor dtenedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gengliangwang for your review!

Copy link
Contributor Author

@dtenedor dtenedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gengliangwang for your review again!

switch to expression

switch to expression

switch to expression

moving error checking to checkanalysis
Copy link
Contributor Author

@dtenedor dtenedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @gengliangwang for your careful attention

@@ -604,6 +604,7 @@ queryTerm
operator=INTERSECT setQuantifier? right=queryTerm #setOperation
| left=queryTerm {!legacy_setops_precedence_enabled}?
operator=(UNION | EXCEPT | SETMINUS) setQuantifier? right=queryTerm #setOperation
| left=queryTerm OPERATOR_PIPE operatorPipeRightSide #operatorPipeStatement
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a hard time understanding this recursive parser rule. How does it match continuous pipe operators? And what is the Operator Precedence with mixed classic SQL query syntax and the new pipe syntax?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with ANTLR enough. So this recursive parser rule matches the SQL string from the end? e.g. it finds the first operatorPipeRightSide from the end, and then tries to match a chain of pipe operators.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, no problem, I can try to explain it.

ANTLR tokenizes each SQL query it receives, converting the input string into a sequence of tokens (using SqlBaseLexer.g4). Then the parser's job (in this file) is to convert that sequence of tokens into an initial unresolved logical plan representing the parse tree.

To do so, the parser checks each rule in the listed sequence, one-by-one, comparing the provided tokens at the current index in the sequence with the required tokens from the rule. If the rule matches, wherein all keywords and other components in the rule map to corresponding input tokens, then the parser generates the rule's unresolved logical plan tree using the logic in AstBuilder.scala.

In this case, we define the new token OPERATOR_PIPE: '|>'; in SqlBaseLexer.g4. Then we add a new option to the existing queryTerm rule to allow any syntax matching an existing queryTerm to appear on the left side of this |> token and the syntax of operatorPipeRightSide on the right side (which in this PR is limited to only a selectClause).

ANTLR grammar allows left-recursive rules wherein any alternative may begin with a reference to the same rule, so the queryTerm on the left side may match any valid existing syntax for a queryTerm such as TABLE t, a table subquery, etc. Since we are extending queryTerm to also match against queryTerm OPERATOR_PIPE operatorPipeRightSide, this alternative implements the recursion wherein we may chain multiple pipe operators together. For example, in TABLE t |> SELECT x |> LIMIT 2, TABLE t matches a queryTerm, then TABLE t |> SELECT x matches another, and finally the entire query (using the new recursive #operatorPipeStatement alternative two times).

Otherwise, if the rule does not match, then the parser moves on to try the next rule in the sequence, and so on, similar to a Scala pattern-match. This defines the precedence of the rules amongst each other: the ones appearing first in the list in SqlBaseParser.g4 apply first.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the parser generates a basic parse tree, and AstBuilder.scala transforms that into an unresolved logical plan? Thanks for the clear and detailed explanation! I'm adding SQL syntax too and this is very helpful.

respond to code review comments

respond to code review comments
Copy link
Contributor Author

@dtenedor dtenedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cloud-fan for your review!!

@@ -880,4 +881,20 @@ class SparkSqlParserSuite extends AnalysisTest with SharedSparkSession {
parser.parsePlan("SELECT\u30001") // Unicode ideographic space
}
// scalastyle:on

test("Operator pipe SQL syntax") {
withSQLConf(SQLConf.OPERATOR_PIPE_SYNTAX_ENABLED.key -> "true") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't need this now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried removing it, but the test failed :) it seems like SparkSqlParserSuite is not triggering Utils.isTesting for some reason. (It does seem to work for SQLQueryTestSuite.)

image

@@ -103,7 +103,8 @@ class ThriftServerQueryTestSuite extends SQLQueryTestSuite with SharedThriftServ
// SPARK-42921
"timestampNTZ/datetime-special-ansi.sql",
// SPARK-47264
"collations.sql"
"collations.sql",
"pipe-operators.sql"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why it doesn't work in thrift-server?

Copy link
Contributor Author

@dtenedor dtenedor Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question; previously I found it was flaky and failing sometimes because it wasn't sorting the output result rows for some reason.

But it again now with a python script running it 25 times locally, it seems to be passing now. I can re-enable the test here.

Copy link
Contributor Author

@dtenedor dtenedor Sep 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update on this, the test flaked out again on CI:

image

It seems this test suite is not manually sorting the output result rows in the absence of an ORDER BY clause, like the other SQL test suite variants. I'm adding it back to the blocklist for now.

Copy link
Contributor Author

@dtenedor dtenedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks all for your reviews :)

@@ -880,4 +881,20 @@ class SparkSqlParserSuite extends AnalysisTest with SharedSparkSession {
parser.parsePlan("SELECT\u30001") // Unicode ideographic space
}
// scalastyle:on

test("Operator pipe SQL syntax") {
withSQLConf(SQLConf.OPERATOR_PIPE_SYNTAX_ENABLED.key -> "true") {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried removing it, but the test failed :) it seems like SparkSqlParserSuite is not triggering Utils.isTesting for some reason. (It does seem to work for SQLQueryTestSuite.)

image

@@ -103,7 +103,8 @@ class ThriftServerQueryTestSuite extends SQLQueryTestSuite with SharedThriftServ
// SPARK-42921
"timestampNTZ/datetime-special-ansi.sql",
// SPARK-47264
"collations.sql"
"collations.sql",
"pipe-operators.sql"
Copy link
Contributor Author

@dtenedor dtenedor Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question; previously I found it was flaky and failing sometimes because it wasn't sorting the output result rows for some reason.

But it again now with a python script running it 25 times locally, it seems to be passing now. I can re-enable the test here.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 017b0ea Sep 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants