-
Notifications
You must be signed in to change notification settings - Fork 193
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: make parquet native scan schema case insensitive #1575
Conversation
sql("create table test (A long) using parquet options (path '" + path + "')") | ||
val df = sql("select A from test where A > 5") | ||
checkSparkAnswer(df) | ||
// TODO: pushed down filters do not used schema adapter in datafusion, will cause empty result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may need improvement in datafusion, possibly relevant code: https://github.com/apache/datafusion/blob/18feb8b2702b96a8a77ec4bc52fb67571e857d4d/datafusion/datasource-parquet/src/opener.rs#L86
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should fix this in the parquet reader (parquet by itself does not specify whether the field names are case sensitive/insensitive).
Comet is the right place to fix this, I feel.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1575 +/- ##
============================================
+ Coverage 56.12% 58.49% +2.37%
- Complexity 976 977 +1
============================================
Files 119 122 +3
Lines 11743 12231 +488
Branches 2251 2278 +27
============================================
+ Hits 6591 7155 +564
+ Misses 4012 3945 -67
+ Partials 1140 1131 -9 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -110,6 +110,7 @@ fn get_options(session_timezone: &str) -> (TableParquetOptions, SparkParquetOpti | |||
let mut spark_parquet_options = | |||
SparkParquetOptions::new(EvalMode::Legacy, session_timezone, false); | |||
spark_parquet_options.allow_cast_unsigned_ints = true; | |||
spark_parquet_options.case_sensitive = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if it makes sense to take the value from spark.sql.caseSensitive
although this is an internal config and false by default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed there is a TODO comment above, maybe we can make them configurable in the future
// TODO: Maybe these are configs? |
Merged, thanks @wForget |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the test and fix @wForget
sql("create table test (A long) using parquet options (path '" + path + "')") | ||
val df = sql("select A from test where A > 5") | ||
checkSparkAnswer(df) | ||
// TODO: pushed down filters do not used schema adapter in datafusion, will cause empty result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should fix this in the parquet reader (parquet by itself does not specify whether the field names are case sensitive/insensitive).
Comet is the right place to fix this, I feel.
Which issue does this PR close?
Part of #1574.
Rationale for this change
The data schema of spark parquet datasource scan may not be consistent with the actual schema of file. This may cause the
projection/filter
to not behave as expected.What changes are included in this PR?
This PR makes
data schema
andfile schema
case insensitive in schema adapter, but it does not affectpruning_predicate
andpage_pruning_predicate
.How are these changes tested?
added unit test