
[SPARK-49723][SQL] Add Variant metrics to the JSON File Scan node #48172

Open
wants to merge 3 commits into master
Conversation

harshmotw-db (Contributor)

What changes were proposed in this pull request?

This pull request adds the following metrics to JSON file scan nodes, collected for the variants constructed as part of the scan (a hedged sketch of how such metrics could be declared follows the list):

variant top-level - total count
variant top-level - total byte size
variant top-level - total number of paths
variant top-level - total number of scalar values
variant top-level - max depth
variant nested - total count
variant nested - total byte size
variant nested - total number of paths
variant nested - total number of scalar values
variant nested - max depth
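
For illustration, here is a minimal sketch of how metrics like these could be declared with Spark's SQLMetrics helpers inside a scan node. The metric keys, the helper name variantBuilderMetrics, and the choice of metric kinds are assumptions for this sketch, not the PR's exact code.

import org.apache.spark.SparkContext
import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}

// Hypothetical declaration of the metrics listed above; keys and metric kinds are assumed.
// Note: createMetric sums task-level values, so a true "max depth" metric would need
// different aggregation; this sketch glosses over that.
def variantBuilderMetrics(sc: SparkContext): Map[String, SQLMetric] = Map(
  "variantTopLevelCount" -> SQLMetrics.createMetric(sc, "variant top-level - total count"),
  "variantTopLevelByteSize" -> SQLMetrics.createSizeMetric(sc, "variant top-level - total byte size"),
  "variantTopLevelNumPaths" -> SQLMetrics.createMetric(sc, "variant top-level - total number of paths"),
  "variantTopLevelNumScalars" -> SQLMetrics.createMetric(sc, "variant top-level - total number of scalar values"),
  "variantTopLevelMaxDepth" -> SQLMetrics.createMetric(sc, "variant top-level - max depth"),
  "variantNestedCount" -> SQLMetrics.createMetric(sc, "variant nested - total count"),
  "variantNestedByteSize" -> SQLMetrics.createSizeMetric(sc, "variant nested - total byte size"),
  "variantNestedNumPaths" -> SQLMetrics.createMetric(sc, "variant nested - total number of paths"),
  "variantNestedNumScalars" -> SQLMetrics.createMetric(sc, "variant nested - total number of scalar values"),
  "variantNestedMaxDepth" -> SQLMetrics.createMetric(sc, "variant nested - max depth")
)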

Top-level and nested variant metrics are reported separately because they can have different usage patterns. A variant counts as top-level when it comes from a singleVariantColumn scan, or from a user-provided schema column whose declared type is variant itself (not a variant nested inside a struct/array/map); variants nested inside other data types count as nested variants.
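
A hedged sketch of that classification, assuming Spark's VariantType and the standard complex types; the helper names (isVariant, containsNestedVariant, classify) are illustrative, not taken from the PR.

import org.apache.spark.sql.types._

// A column is "top-level" when its declared type is variant itself;
// a variant reachable inside a struct/array/map counts as "nested".
def isVariant(dt: DataType): Boolean = dt.isInstanceOf[VariantType]

def containsNestedVariant(dt: DataType): Boolean = dt match {
  case s: StructType => s.fields.exists(f => isVariant(f.dataType) || containsNestedVariant(f.dataType))
  case a: ArrayType => isVariant(a.elementType) || containsNestedVariant(a.elementType)
  case m: MapType => isVariant(m.valueType) || containsNestedVariant(m.valueType)
  case _ => false
}

// Split a read schema's column names into (top-level variant columns, columns with nested variants).
def classify(schema: StructType): (Seq[String], Seq[String]) = {
  val topLevel = schema.fields.collect { case f if isVariant(f.dataType) => f.name }.toSeq
  val nested = schema.fields.collect { case f if containsNestedVariant(f.dataType) => f.name }.toSeq
  (topLevel, nested)
}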

Why are the changes needed?

This change allows users to collect metrics on variant usage to better monitor their data/workloads.

Does this PR introduce any user-facing change?

Yes. Users can now see variant metrics in JSON scan nodes; these metrics were not available before.

How was this patch tested?

Comprehensive unit tests in VariantEndToEndSuite.scala
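
As a rough illustration of the kind of end-to-end check such a suite can perform; the test body and the metric-key lookup below are assumptions, not the PR's actual test code, and withTempPath is assumed to come from the suite's shared test helpers.

import org.apache.spark.sql.execution.FileSourceScanExec

test("JSON scan node exposes variant metrics") {
  withTempPath { dir =>
    // Write a few JSON lines, then read them back as a single variant column.
    spark.range(3).selectExpr("to_json(named_struct('a', id)) AS v")
      .write.text(dir.getAbsolutePath)
    val df = spark.read
      .option("singleVariantColumn", "var")
      .json(dir.getAbsolutePath)
    df.collect() // run the scan so the metrics are populated

    val scan = df.queryExecution.executedPlan.collect {
      case s: FileSourceScanExec => s
    }.head
    // The real tests would assert exact values per metric; this key check is only illustrative.
    assert(scan.metrics.keys.exists(_.toLowerCase.contains("variant")))
  }
}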

Was this patch authored or co-authored using generative AI tooling?

Yes, some help was used for Scala syntax.
Generated by: ChatGPT 4o, GitHub Copilot.

github-actions bot added the SQL label on Sep 19, 2024

/** Only report variant metrics if the data source file format is JSON. */
override lazy val metrics: Map[String, SQLMetric] = super.metrics ++ {
  if (relation.fileFormat.isInstanceOf[JsonFileFormat]) variantBuilderMetrics
  else Map.empty[String, SQLMetric] // else branch assumed; not shown in the review hunk
}
Reviewer comment (Contributor):

Does a JSON scan always produce variants? If not, we should only display these metrics when the scan will produce a variant.
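
A hedged sketch of that suggestion, reusing the isVariant / containsNestedVariant helpers sketched earlier; the producesVariant guard and the use of relation.options / requiredSchema here are illustrative, not the PR's code.

// Register the variant metrics only when the scan can actually produce variant values.
override lazy val metrics: Map[String, SQLMetric] = super.metrics ++ {
  val producesVariant =
    relation.options.contains("singleVariantColumn") ||
      requiredSchema.fields.exists(f => isVariant(f.dataType) || containsNestedVariant(f.dataType))
  if (relation.fileFormat.isInstanceOf[JsonFileFormat] && producesVariant) {
    variantBuilderMetrics
  } else {
    Map.empty[String, SQLMetric]
  }
}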

val readFile: (PartitionedFile) => Iterator[InternalRow] = {
  val hadoopConf =
    relation.sparkSession.sessionState.newHadoopConfWithOptions(relation.options)
  relation.fileFormat match {
    case f: JsonFileFormat =>
      // ... (remaining body and cases not shown in the review hunk)
Reviewer comment (Contributor):

We should probably make it more general and allow FileFormat implementations to report additional metrics.
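
One way that generalization could look. This is purely a sketch: no such API exists in Spark today, and the trait and method names are made up.

import org.apache.spark.SparkContext
import org.apache.spark.sql.execution.datasources.FileFormat
import org.apache.spark.sql.execution.metric.SQLMetric

// A file format opts in by overriding this hook; the scan node merges the result
// into its own metrics map instead of special-casing JsonFileFormat.
trait FileFormatWithScanMetrics extends FileFormat {
  def additionalScanMetrics(sparkContext: SparkContext): Map[String, SQLMetric] = Map.empty
}

// The scan node's format-specific check would then become:
//   override lazy val metrics: Map[String, SQLMetric] = super.metrics ++ {
//     relation.fileFormat match {
//       case f: FileFormatWithScanMetrics => f.additionalScanMetrics(sparkContext)
//       case _ => Map.empty[String, SQLMetric]
//     }
//   }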
