Conversation

@jayantdb (Contributor) commented Sep 4, 2025

What changes were proposed in this pull request?

This PR fixes an issue where inputRowsPerSecond and processedRowsPerSecond in the streaming progress
metrics JSON were rendered in scientific notation (e.g., 1.9871777605776876E8). The fix uses safe
Decimal casting so that the values are rendered in a more human-readable plain-decimal format.
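As a rough, dependency-free sketch of the idea (not the exact patch — the helper name and the one-decimal rounding scale here are illustrative, chosen to match the "After" output below), converting the Double to a BigDecimal before serializing forces plain decimal rendering:

```scala
import scala.math.BigDecimal.RoundingMode

// Illustrative helper (hypothetical name): render a rate in plain decimal form.
def toPlainDecimal(value: Double): String =
  BigDecimal(value).setScale(1, RoundingMode.HALF_UP).toString

// Double.toString uses scientific notation at this magnitude...
assert(6.923076923076923e7.toString == "6.923076923076923E7")
// ...while the BigDecimal round-trip renders the same value plainly.
assert(toPlainDecimal(6.923076923076923e7) == "69230769.2")
```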

Results

Before the change

{
  "id" : "9b512179-ea36-4b98-9d79-049d13813894",
  "runId" : "f85e2894-9582-493d-9b94-ce03e5490241",
  "name" : "TestFormatting",
  "timestamp" : "2025-09-04T10:57:02.897Z",
  "batchId" : 0,
  "batchDuration" : 1410,
  "numInputRows" : 900000,
  "inputRowsPerSecond" : 6.923076923076923E7,
  "processedRowsPerSecond" : 638297.8723404256,
  "durationMs" : {
    "addBatch" : 1101,
    "commitOffsets" : 157,
    "getBatch" : 0,
    "latestOffset" : 0,
    "queryPlanning" : 3,
    "triggerExecution" : 1410,
    "walCommit" : 149
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "MemoryStream[value#133]",
    "startOffset" : null,
    "endOffset" : 0,
    "latestOffset" : null,
    "numInputRows" : 900000,
    "inputRowsPerSecond" : 6.923076923076923E7,
    "processedRowsPerSecond" : 638297.8723404256
  } ],
  "sink" : {
    "description" : "MemorySink",
    "numOutputRows" : 900000
  }
} 

After the change

{
  "id" : "03497c93-7ab7-4e14-ba5f-dadbfc8a4bf6",
  "runId" : "3933cdde-f99d-4a29-8bb8-d13bbb5df425",
  "name" : "TestFormatting",
  "timestamp" : "2025-09-04T15:50:45.500Z",
  "batchId" : 0,
  "batchDuration" : 1444,
  "numInputRows" : 900000,
  "inputRowsPerSecond" : 69230769.2,
  "processedRowsPerSecond" : 623268.7,
  "durationMs" : {
    "addBatch" : 1147,
    "commitOffsets" : 152,
    "getBatch" : 0,
    "latestOffset" : 0,
    "queryPlanning" : 3,
    "triggerExecution" : 1444,
    "walCommit" : 142
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "MemoryStream[value#133]",
    "startOffset" : null,
    "endOffset" : 0,
    "latestOffset" : null,
    "numInputRows" : 900000,
    "inputRowsPerSecond" : 69230769.2,
    "processedRowsPerSecond" : 623268.7
  } ],
  "sink" : {
    "description" : "MemorySink",
    "numOutputRows" : 900000
  }
}

Why are the changes needed?

Improves the readability of Spark Structured Streaming progress metrics JSON.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Run this Maven test:

./build/mvn -pl sql/core,sql/api -am test \
  -DwildcardSuites=org.apache.spark.sql.streaming.StreamingQueryStatusAndProgressSuite \
  -DwildcardTestName="SPARK-53491"

Results:

Run completed in 10 seconds, 680 milliseconds.
Total number of tests run: 12
Suites: completed 2, aborted 0
Tests: succeeded 12, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

Was this patch authored or co-authored using generative AI tooling?

No

@jayantdb jayantdb changed the title [SPARK-53491][SS] Fix exponential formatting of inputRowsPerSecond an… [SPARK-53491][SS] Fix exponential formatting of inputRowsPerSecond and processedRowsPerSecond in progress metrics JSON Sep 4, 2025
@jayantdb (Contributor, Author) commented Sep 4, 2025

@anishshri-db could you please review this PR? Thanks!

@@ -400,6 +400,35 @@ class StreamingQueryStatusAndProgressSuite extends StreamTest with Eventually {
assert(data(0).getAs[Timestamp](0).equals(validValue))
}

test("SPARK-53491: `inputRowsPerSecond` and `processedRowsPerSecond` " +
Review comment from a Member:
nit: what is the reason to use "`"?


print(progress)

assert(!(progress \ "inputRowsPerSecond").values.toString.contains("E"))
Review comment from a Member:
nit: will it be better to use matchers instead of assert?

val df = inputData.toDF()
val query = df.writeStream
.format("memory")
.queryName("TestFormatting")
Review comment from a Contributor:
nit: let's use a different name here?


val progress = query.lastProgress.jsonValue

(progress \ "inputRowsPerSecond").values.toString should not include "E"
Review comment from a Contributor:
What is this doing exactly? Maybe add some comments?

@anishshri-db (Contributor):

@jayantdb - please look into CI failures here - https://github.com/jayantdb/spark/actions/runs/17472105263/job/49622983276 ?

@anishshri-db (Contributor):

Can you also paste the new output for the progress metrics with your change ?

@jayantdb (Contributor, Author) commented Sep 5, 2025

> Can you also paste the new output for the progress metrics with your change?

> @jayantdb - please look into CI failures here - https://github.com/jayantdb/spark/actions/runs/17472105263/job/49622983276 ?

@anishshri-db The CI pipeline is failing due to Scala linter with this message:

Scalastyle checks passed.
The scalafmt check failed on sql/connect or sql/connect at following occurrences:

org.apache.maven.plugin.MojoExecutionException: Scalafmt: Unformatted files found
Error:  Failed to execute goal org.antipathy:mvn-scalafmt_2.13:1.1.1713302731.c3d0074:format (default-cli) on project spark-sql-api_2.13: Error formatting Scala files: Scalafmt: Unformatted files found -> [Help 1]

Before submitting your change, please make sure to format your code using the following command:
./build/mvn scalafmt:format -Dscalafmt.skip=false -Dscalafmt.validateOnly=false -Dscalafmt.changedOnly=false -pl sql/api -pl sql/connect/common -pl sql/connect/server -pl sql/connect/shims -pl sql/connect/client/jvm
Error: Process completed with exit code 1.

The failure seems to be due to the formatting in the sql/connect packages.

Upon running the following check, I can see 1000+ files are marked as unformatted:

./build/mvn scalafmt:format \
  -Dscalafmt.skip=false \
  -Dscalafmt.validateOnly=true \
  -Dscalafmt.changedOnly=false \
  -pl sql/core
For example:

[INFO] - Requires formatting: PrunedScanSuite.scala
[INFO] - Requires formatting: ResolvedDataSourceSuite.scala
[INFO] - Requires formatting: DisableUnnecessaryBucketedScanSuite.scala
[INFO] - Requires formatting: SaveLoadSuite.scala
[INFO] - Requires formatting: DDLSourceLoadSuite.scala
[INFO] - Requires formatting: fakeExternalSources.scala
[INFO] - Requires formatting: InsertSuite.scala
[INFO] - Requires formatting: PathOptionSuite.scala
[INFO] - Requires formatting: TableScanSuite.scala
[INFO] - Requires formatting: BucketedWriteSuite.scala
[INFO] - Requires formatting: PartitionedWriteSuite.scala
[INFO] - Requires formatting: FiltersSuite.scala
[INFO] - Requires formatting: DataSourceAnalysisSuite.scala
[....]
[INFO] - Formatted: TPCBase.scala
[....]
[INFO] - Requires formatting: VariantShreddingSuite.scala
[INFO] - Requires formatting: DataFrameTableValuedFunctionsSuite.scala
[INFO] - Requires formatting: IntegratedUDFTestUtils.scala
[INFO] - Requires formatting: DeprecatedAPISuite.scala
[INFO] - Requires formatting: ReplaceIntegerLiteralsWithOrdinalsSqlSuite.scala
[INFO] - Requires formatting: SubquerySuite.scala
[INFO] - Requires formatting: DataFrameAggregateSuite.scala
[INFO] - Requires formatting: TPCHBase.scala
[ERROR] 
org.apache.maven.plugin.MojoExecutionException: Scalafmt: Unformatted files found
    at org.antipathy.mvn_scalafmt.FormatMojo.execute (FormatMojo.java:91)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:126)

I didn't touch any of these thousands of files, so I am unsure if I should do anything or not.

Kindly check and advise.

@jayantdb (Contributor, Author) commented Sep 5, 2025

> Can you also paste the new output for the progress metrics with your change?

@anishshri-db , you can find the output of my code change at the comment in the JIRA: https://issues.apache.org/jira/browse/SPARK-53491

Pasting the output here as well for your reference:

{
  "id" : "03497c93-7ab7-4e14-ba5f-dadbfc8a4bf6",
  "runId" : "3933cdde-f99d-4a29-8bb8-d13bbb5df425",
  "name" : "TestFormatting",
  "timestamp" : "2025-09-04T15:50:45.500Z",
  "batchId" : 0,
  "batchDuration" : 1444,
  "numInputRows" : 900000,
  "inputRowsPerSecond" : 69230769.2,
  "processedRowsPerSecond" : 623268.7,
  "durationMs" : {
    "addBatch" : 1147,
    "commitOffsets" : 152,
    "getBatch" : 0,
    "latestOffset" : 0,
    "queryPlanning" : 3,
    "triggerExecution" : 1444,
    "walCommit" : 142
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "MemoryStream[value#133]",
    "startOffset" : null,
    "endOffset" : 0,
    "latestOffset" : null,
    "numInputRows" : 900000,
    "inputRowsPerSecond" : 69230769.2,
    "processedRowsPerSecond" : 623268.7
  } ],
  "sink" : {
    "description" : "MemorySink",
    "numOutputRows" : 900000
  }
}

def safeDecimalToJValue(value: Double): JValue = {
  if (value.isNaN || value.isInfinity) {
    JNothing
  }
Review comment from a Member:
nit: while not enforced, a single-line `} else {` is more commonly used in Spark, AFAIK.

Review comment from a Contributor:
+1 - can you please fix the formatting here?

/** Convert BigDecimal to JValue while handling empty or infinite values */
def safeDecimalToJValue(value: Double): JValue = {
  if (value.isNaN || value.isInfinity) {
    JNothing
Review comment from a Member:
Is there a corresponding test case for isNaN?
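For context, the NaN/Infinity guard can be exercised with a dependency-free sketch (Option stands in for json4s's JValue/JNothing here, and `safeDecimalOpt` is a hypothetical name; the real suite would assert on the rendered JSON instead):

```scala
// Sketch of the NaN/Infinity guard. Without it, BigDecimal(Double.NaN)
// would throw a NumberFormatException during serialization.
def safeDecimalOpt(value: Double): Option[BigDecimal] =
  if (value.isNaN || value.isInfinity) None
  else Some(BigDecimal(value))

assert(safeDecimalOpt(Double.NaN).isEmpty)
assert(safeDecimalOpt(Double.PositiveInfinity).isEmpty)
assert(safeDecimalOpt(1.5).contains(BigDecimal(1.5)))
```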

@jayantdb (Contributor, Author) commented Sep 9, 2025

> (quotes the Sep 5 comment above about the scalafmt CI failure in full)

@anishshri-db Update: All tests have now passed in the build.

@vrozov (Member) left a comment:

LGTM

@anishshri-db (Contributor):

Would this break parsing for streaming query listeners that might be parsing these values ?

cc - @HeartSaVioR to confirm

@jayantdb (Contributor, Author):

> Would this break parsing for streaming query listeners that might be parsing these values?
>
> cc - @HeartSaVioR to confirm

No, @anishshri-db, it won't break anything.

I have kept the core implementation of inputRowsPerSecond and processedRowsPerSecond unchanged; they are still exposed as Double, since Double is a standard data type and Decimal is not.

Accessing them via query.lastProgress.inputRowsPerSecond or query.lastProgress.processedRowsPerSecond still returns Double values, which may render in scientific notation when stringified. Downstream apps using these individual metrics (mostly through streaming listeners) get the same behavior as before, and can cast the exponential values to Decimal if they need plain formatting.
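For instance (a hypothetical consumer-side snippet; the literal stands in for the Double a listener would receive from the progress object), casting to Decimal on the consumer side yields plain formatting:

```scala
// Stand-in for the Double a listener gets from progress.inputRowsPerSecond.
val inputRowsPerSecond: Double = 6.923076923076923e7

// The raw Double stringifies with an exponent at this magnitude...
assert(inputRowsPerSecond.toString.contains("E"))
// ...but a BigDecimal cast renders it in plain decimal form.
val plain = BigDecimal(inputRowsPerSecond).toString
assert(!plain.contains("E"))
```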

@HeartSaVioR (Contributor) left a comment:

+1

Could you please still update the PR description to contain the specific event before the fix vs after the fix?

@jayantdb (Contributor, Author):

> +1
>
> Could you please still update the PR description to contain the specific event before the fix vs after the fix?

@HeartSaVioR, it's done. I have added the results to the Description.

@HeartSaVioR (Contributor):

Thanks! Merging to master.
