
fix: CometScanExec on Spark 3.5.2 #915

Merged: 7 commits into apache:main on Sep 24, 2024

Conversation

Kimahriman
Contributor

@Kimahriman Kimahriman commented Sep 5, 2024

Which issue does this PR close?

Closes #912

Rationale for this change

Fixes CometScanExec running on Spark 3.5.2+. Currently it fails with a runtime exception, and fails to compile when building against 3.5.2, with:

[ERROR] /Users/abinford/projects/arrow-datafusion-comet/spark/src/main/scala/org/apache/spark/sql/comet/CometScanExec.scala:144: value OP_ID_TAG is not a member of object org.apache.spark.sql.catalyst.plans.QueryPlan

This is because OP_ID_TAG was removed in Spark 3.5.2+, and operator ID tracking was replaced with a separate internal map of plan -> ID, so there's no way to manually pass the ID on to a delegating plan. Instead, this change simply copies the implementation of DataSourceScanExec's method.
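As a rough illustration (plain Scala, no Spark dependencies; every name and value below is a stand-in, not the actual CometScanExec fields), the copied approach amounts to building the verbose string directly from the node's own metadata rather than delegating to a wrapped plan:

```scala
// Minimal stand-alone sketch of building a verbose-string block the way
// DataSourceScanExec does. Names and values are illustrative only.
object VerboseStringSketch {
  val operatorId = 1
  val nodeName = "CometScan parquet"
  val metadata: Map[String, String] = Map(
    "Batched" -> "true",
    "ReadSchema" -> "struct<_1:boolean,_2:tinyint>")

  def verboseStringWithOperatorId(): String = {
    // Sort the metadata entries so the output is deterministic.
    val metadataStr = metadata.toSeq.sorted.map { case (key, value) =>
      s"$key: $value"
    }
    s"($operatorId) $nodeName\n" + metadataStr.mkString("\n")
  }

  def main(args: Array[String]): Unit =
    println(verboseStringWithOperatorId())
}
```

Because the node formats itself, the header line carries its own name ("CometScan parquet") instead of whatever the wrapped scan would print.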

What changes are included in this PR?

The only effect of the change is on the verbose string output for CometScanExec. Instead of delegating to the underlying DataSourceScanExec, the implementation is copied over. This is in line with other Comet operators that implement their own verbose string, and makes more sense in the formatted explain since the operator names line up.

Before:

== Physical Plan ==
* ColumnarToRow (2)
+- CometScan parquet  (1)


(1) Scan parquet 
Output [20]: [_1#146, _2#147, _3#148, _4#149, _5#150L, _6#151, _7#152, _8#153, _9#154, _10#155, _11#156L, _12#157, _13#158, _14#159, _15#160, _16#161, _17#162, _18#163, _19#164, _20#165]
Batched: true
Location: InMemoryFileIndex [file:.../arrow-datafusion-comet/spark/target/tmp/spark-a0b56a3c-caa2-4631-ad15-ec83e8522948/test.parquet]
ReadSchema: struct<_1:boolean,_2:tinyint,_3:smallint,_4:int,_5:bigint,_6:float,_7:double,_8:string,_9:smallint,_10:int,_11:bigint,_12:decimal(20,0),_13:string,_14:binary,_15:decimal(5,2),_16:decimal(18,10),_17:decimal(38,37),_18:timestamp,_19:timestamp,_20:date>

(2) ColumnarToRow [codegen id : 1]
Input [20]: [_1#146, _2#147, _3#148, _4#149, _5#150L, _6#151, _7#152, _8#153, _9#154, _10#155, _11#156L, _12#157, _13#158, _14#159, _15#160, _16#161, _17#162, _18#163, _19#164, _20#165]

After:

== Physical Plan ==
* ColumnarToRow (2)
+- CometScan parquet  (1)


(1) CometScan parquet 
Output [20]: [_1#146, _2#147, _3#148, _4#149, _5#150L, _6#151, _7#152, _8#153, _9#154, _10#155, _11#156L, _12#157, _13#158, _14#159, _15#160, _16#161, _17#162, _18#163, _19#164, _20#165]
Batched: true
Location: InMemoryFileIndex [file:.../arrow-datafusion-comet/spark/target/tmp/spark-36667e9a-75e0-4f90-8442-5f47ecd1cf4b/test.parquet]
ReadSchema: struct<_1:boolean,_2:tinyint,_3:smallint,_4:int,_5:bigint,_6:float,_7:double,_8:string,_9:smallint,_10:int,_11:bigint,_12:decimal(20,0),_13:string,_14:binary,_15:decimal(5,2),_16:decimal(18,10),_17:decimal(38,37),_18:timestamp,_19:timestamp,_20:date>

(2) ColumnarToRow [codegen id : 1]
Input [20]: [_1#146, _2#147, _3#148, _4#149, _5#150L, _6#151, _7#152, _8#153, _9#154, _10#155, _11#156L, _12#157, _13#158, _14#159, _15#160, _16#161, _17#162, _18#163, _19#164, _20#165]

How are these changes tested?

Manually verified by building with Spark 3.5.2: ./mvnw clean package -Pspark-3.5 -Dspark.version=3.5.2 -DskipTests.

@Kimahriman
Contributor Author

Oof, this breaks a lot of explain plan comparison tests. If this change is OK, I can work on updating them.

@kazuyukitanimura
Contributor

Is the change in the metadata? If so, should we fix the metadata instead?

@kazuyukitanimura
Contributor

kazuyukitanimura commented Sep 11, 2024

And perhaps we can try to upgrade the Comet dependency to Spark 3.5.2 (separately)

@Kimahriman
Contributor Author

Is the change in the metadata? If so, should we fix the metadata instead?

Are you referring to the compilation error or the change in explain output?

And perhaps we can try to upgrade the Comet dependency to Spark 3.5.2 (separately)

Agreed. I thought about doing that here, but I wasn't sure of the best way to go about updating the diff for the Spark SQL tests.

@kazuyukitanimura
Contributor

kazuyukitanimura commented Sep 13, 2024

Is the change in the metadata? If so, should we fix the metadata instead?

Are you referring to the compilation error or the change in explain output?

I meant CometScanExec.metadata inherited from DataSourceScanExec

@@ -141,8 +141,30 @@ case class CometScanExec(
     if (wrapped == null) Map.empty else wrapped.metadata
 
   override def verboseStringWithOperatorId(): String = {
-    getTagValue(QueryPlan.OP_ID_TAG).foreach(id => wrapped.setTagValue(QueryPlan.OP_ID_TAG, id))
-    wrapped.verboseStringWithOperatorId()
+    val metadataStr = metadata.toSeq.sorted
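For context, the removed delegation relied on Spark's Catalyst tree-node tag mechanism. A minimal stand-alone sketch of that tag-forwarding pattern, with simplified stand-in types rather than the actual Spark classes:

```scala
// Simplified model of Catalyst tree-node tags, enough to show the
// pre-3.5.2 pattern of forwarding an operator ID to a wrapped plan.
final case class TreeNodeTag[T](name: String)

class PlanNode {
  private val tags = scala.collection.mutable.Map.empty[TreeNodeTag[_], Any]
  def setTagValue[T](tag: TreeNodeTag[T], value: T): Unit = tags(tag) = value
  def getTagValue[T](tag: TreeNodeTag[T]): Option[T] =
    tags.get(tag).map(_.asInstanceOf[T])
}

object OpIdForwarding {
  val OP_ID_TAG: TreeNodeTag[Int] = TreeNodeTag[Int]("operatorId")

  // Mirrors the removed line:
  //   getTagValue(QueryPlan.OP_ID_TAG).foreach(id =>
  //     wrapped.setTagValue(QueryPlan.OP_ID_TAG, id))
  def forward(outer: PlanNode, wrapped: PlanNode): Unit =
    outer.getTagValue(OP_ID_TAG).foreach(id => wrapped.setTagValue(OP_ID_TAG, id))
}
```

In Spark 3.5.2+ the operator ID lives in an internal plan -> ID map rather than in a tag on the node, so there is no tag left to forward, which is why the formatting logic is copied instead.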
Contributor

Since the original issue was that OP_ID_TAG has been removed, is it not sufficient to just remove the offending line

Contributor Author

Just removing the OP_ID_TAG assignment leads to a formatted explain of:

== Physical Plan ==
* ColumnarToRow (2)
+- CometScan parquet  (1)


(unknown) Scan parquet 
Output [20]: [_1#0, _2#1, _3#2, _4#3, _5#4L, _6#5, _7#6, _8#7, _9#8, _10#9, _11#10L, _12#11, _13#12, _14#13, _15#14, _16#15, _17#16, _18#17, _19#18, _20#19]
Batched: true
Location: InMemoryFileIndex [file:/Users/abinford/projects/arrow-datafusion-comet/spark/target/tmp/spark-c0b82b7c-3de1-431b-96bf-56cb37f3a463/test.parquet]
ReadSchema: struct<_1:boolean,_2:tinyint,_3:smallint,_4:int,_5:bigint,_6:float,_7:double,_8:string,_9:smallint,_10:int,_11:bigint,_12:decimal(20,0),_13:string,_14:binary,_15:decimal(5,2),_16:decimal(18,10),_17:decimal(38,37),_18:timestamp,_19:timestamp,_20:date>

(2) ColumnarToRow [codegen id : 1]
Input [20]: [_1#0, _2#1, _3#2, _4#3, _5#4L, _6#5, _7#6, _8#7, _9#8, _10#9, _11#10L, _12#11, _13#12, _14#13, _15#14, _16#15, _17#16, _18#17, _19#18, _20#19]

which is the whole reason it was added in the first place: to fix the unknown operator ID (bc35fa5).

Contributor Author

Plus, from what I could see, every other operator is prefixed with Comet in the formatted explain, so it's odd for Scan to be the one thing that doesn't actually match up with the physical plan.

Contributor

I see, that makes sense.
On an unrelated note, I wonder why CometScanExec extends DataSourceScanExec instead of FileSourceScanExec (if it had, we would have gotten the verboseString for free).

Contributor

@parthchandra parthchandra left a comment

lgtm

Member

@andygrove andygrove left a comment

LGTM. Thanks @Kimahriman

@andygrove
Member

@Kimahriman would you be able to rebase this PR so that we can merge it?

@Kimahriman
Contributor Author

@Kimahriman would you be able to rebase this PR so that we can merge it?

Oof, 95 conflicts I would have to resolve manually. Let me just regenerate all of these plans again tonight.

@Kimahriman
Contributor Author

OK, it wasn't too bad to fix the issues again with find/replace; we'll see if I messed anything up in CI.

@codecov-commenter

Codecov Report

Attention: Patch coverage is 70.58824% with 5 lines in your changes missing coverage. Please review.

Project coverage is 34.05%. Comparing base (fa275f1) to head (2757186).
Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
...ala/org/apache/spark/sql/comet/CometScanExec.scala 70.58% 0 Missing and 5 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #915      +/-   ##
============================================
- Coverage     34.16%   34.05%   -0.12%     
+ Complexity      880      879       -1     
============================================
  Files           112      112              
  Lines         43286    43301      +15     
  Branches       9572     9578       +6     
============================================
- Hits          14789    14745      -44     
- Misses        25478    25518      +40     
- Partials       3019     3038      +19     

@Kimahriman
Contributor Author

Done. I think the CI failures are unrelated; they look like failures downloading dependencies in the Hive tests.

@andygrove
Member

Done. I think the CI failures are unrelated; they look like failures downloading dependencies in the Hive tests.

Thanks. I am re-running the failed jobs now.

@andygrove andygrove merged commit 5b3f7bc into apache:main Sep 24, 2024
74 checks passed
Successfully merging this pull request may close these issues.

CometScanExec fails on Spark 3.5.2
5 participants