Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-51097] [SS] Re-introduce RocksDB state store's last uploaded snapshot version instance metrics #50195

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

zecookiez
Copy link
Contributor

@zecookiez zecookiez commented Mar 6, 2025

What changes were proposed in this pull request?

SPARK-51097

#50161 recently had to revert the changes in #49816 due to instance metrics showing up on SparkUI, causing excessive clutter. This PR aims to re-introduce the new instance metrics while incorporating the fix in #50157.

The main difference between this and the original PR are:

  • Distinguishing metrics and instance metrics as a whole, so that SparkPlan does not accidentally pick up hundreds of instance metrics
  • Adding a test to the suite to verify that no nodes in the execution plan contain the instance metrics we just introduced. Since SparkPlan nodes store metrics as strings, distinguishing between metric types isn't possible in this context. Thus, this test will only check for the specific metric we just introduced.
  • Small edit to another test to verify instance metric behavior as well (see here)

Line-by-line:

Before the fix (note that metrics are sorted lexicographically):
image

After including the fix:
image

Why are the changes needed?

From #49816:
There's currently a lack of observability into state store specific maintenance information, notably metrics of the last snapshot version uploaded. This affects the ability to identify performance degradation issues behind maintenance tasks and more as described in SPARK-51097.

Does this PR introduce any user-facing change?

From #49816:
There will be some new metrics displayed from StreamingQueryProgress:

Streaming query made progress: {
  ...
  "stateOperators" : [ {
    ...
    "customMetrics" : {
      ...
      "SnapshotLastUploaded.partition_0_default" : 2,
      "SnapshotLastUploaded.partition_12_default" : 10,
      "SnapshotLastUploaded.partition_8_default" : 10,
      ...
    }
  } ],
  "sources" : ...,
  "sink" : ...
}

How was this patch tested?

Five tests are added to RocksDBStateStoreIntegrationSuite:

I additionally manually verified these metrics on SparkUI (refer to screenshots above with and without the fix)

Was this patch authored or co-authored using generative AI tooling?

No.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant