[SPARK-51981][SS] Add JobTags to queryStartedEvent #50780

Closed
wants to merge 4 commits

Conversation


@gjxdxh gjxdxh commented May 1, 2025

What changes were proposed in this pull request?

Adding a new jobTags parameter to QueryStartedEvent so that the event can be connected to the Spark Connect command that actually triggered the streaming query. Besides adding the parameter, a fix has been applied to the timestamp, because the JSON handling previously read the wrong argument.

Why are the changes needed?

Without this, there is no way to tell where a streaming query originated from.

Does this PR introduce any user-facing change?

No

How was this patch tested?

A unit test is added.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the PYTHON label May 2, 2025
@anishshri-db
Contributor

@gjxdxh - just put SS in the PR title - thx !

Comment on lines 195 to 199
jobTags = set()
javaIterator = jevent.jobTags().iterator()
while javaIterator.hasNext():
    jobTags.add(javaIterator.next().toString())
Author

Not sure what the best way is to convert a Java Set object to a Python set. I tried something like set(jevent.jobTags().toArray()) but it didn't work; this iterator approach seems to be working.

@anishshri-db
Contributor

@HeartSaVioR - PTAL, thanks !

@gjxdxh gjxdxh changed the title [SPARK-51981][Strcutured Streaming]Add JobTags to queryStartedEvent [SPARK-51981][Strcutured Streaming][SS]Add JobTags to queryStartedEvent May 2, 2025
@gjxdxh gjxdxh force-pushed the lingkai-kong_data/SPARK-51981 branch from c3306ba to 14406b5 Compare May 5, 2025 21:01
@HyukjinKwon HyukjinKwon changed the title [SPARK-51981][Strcutured Streaming][SS]Add JobTags to queryStartedEvent [SPARK-51981][SS] Add JobTags to queryStartedEvent May 6, 2025
  * @since 2.1.0
  */
 @Evolving
 class QueryStartedEvent private[sql] (
     val id: UUID,
     val runId: UUID,
     val name: String,
-    val timestamp: String)
+    val timestamp: String,
+    val jobTags: Set[String] = Set())
Member

This isn't actually binary compatible. We should define another constructor.

Author

Are you suggesting that we leave this constructor unchanged and create a new constructor that takes the new parameter? Do we have any examples of how to do it in a binary-compatible way?

Contributor

Do we have a MiMa check for this class? If this is not binary compatible, it would be ideal for the linter to fail.

Author

Updated the code to add another constructor that keeps the original signature; let me know if that's the right approach. Also, just to confirm, this would only be needed for Scala code, right?

Member

This isn't compatible in Scala either ..

Author

Do you mind sharing how to make it compatible, then? Should I move jobTags out of the primary constructor and have it as a var instead?

Member

I meant that adding a parameter with a default value isn't binary compatible .. it has to be done with a new constructor. See https://github.com/databricks/scala-style-guide?tab=readme-ov-file#default-parameter-values
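
For illustration only, here is a minimal sketch of the auxiliary-constructor pattern being referred to; it is not the merged code, and the access modifiers of the real class are dropped so the snippet compiles standalone:

import java.util.UUID

// Sketch of preserving binary compatibility by adding a constructor rather
// than a default parameter value. Default arguments are resolved at the call
// site against synthetic methods, so adding one changes the existing
// constructor's signature for already-compiled callers.
class QueryStartedEvent(
    val id: UUID,
    val runId: UUID,
    val name: String,
    val timestamp: String,
    val jobTags: Set[String]) {

  // Auxiliary constructor keeping the original four-argument signature,
  // so code compiled against the old constructor still links.
  def this(id: UUID, runId: UUID, name: String, timestamp: String) =
    this(id, runId, name, timestamp, Set.empty[String])
}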

Member

This isn't compatible in Scala either, not only Java.

Member

ohhhh it's private[sql]. Okay seems fine. The constructor is private. My bad :-).

Author

No worries, I removed that default value anyway since it's not being used.

Contributor

@HeartSaVioR HeartSaVioR left a comment

Looks OK to me except comments from @HyukjinKwon and the expected size of job tags.

runId,
name,
progressReporter.formatTimestamp(startTimestamp),
sparkSession.sparkContext.getJobTags()
Contributor

How many elements are we anticipating to have here? The size of the event should be reasonably small.

Author

Yeah, it should be quite small; we already have the job tags in other listener events, for example here.
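
For context, the tags returned by getJobTags() above are whatever tags are currently attached to the SparkContext; a minimal sketch of setting and reading them follows, with an illustrative tag value (Spark Connect attaches its own per-command tags, which is what the PR description refers to):

import org.apache.spark.sql.SparkSession

object JobTagSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("job-tag-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Tags apply to jobs submitted from this thread until cleared.
    // "my-command-tag" is an illustrative value only.
    sc.addJobTag("my-command-tag")
    println(sc.getJobTags()) // Set(my-command-tag)

    sc.clearJobTags()
    spark.stop()
  }
}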

Contributor

Thanks for the explanation.
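
To show where the new field surfaces for users, here is a hedged sketch of a listener reading event.jobTags; the listener class name and the println are illustrative, and jobTags is the field this PR adds:

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Illustrative listener: onQueryStarted can now correlate the started query
// with the job tags (e.g. the Spark Connect command tag) that were active
// when the query was started.
class TagLoggingListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"query ${event.id} started with tags: ${event.jobTags.mkString(", ")}")

  override def onQueryProgress(event: QueryProgressEvent): Unit = ()

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
}

// Registration is unchanged: spark.streams.addListener(new TagLoggingListener)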

@gjxdxh gjxdxh requested review from HyukjinKwon and HeartSaVioR May 7, 2025 18:25
@gjxdxh gjxdxh force-pushed the lingkai-kong_data/SPARK-51981 branch from 7dba6ae to fe439b8 Compare May 7, 2025 20:38
Contributor

@HeartSaVioR HeartSaVioR left a comment

+1

Looks like @HyukjinKwon 's comments are addressed, but I'll defer to him for the final review.


@classmethod
def fromJObject(cls, jevent: "JavaObject") -> "QueryStartedEvent":
    job_tags = set()
    java_iterator = jevent.jobTags().iterator()
Member

No biggie, but you can call set(jobTags().toList()), which will automatically become a Python list. Having individual Py4J calls is actually pretty expensive. But tags are supposed to be few, so I don't mind it. Leaving it to you, with my approval.

Author

I tried, but it didn't work. See the comment I put above: #50780 (comment). I don't know why, though; this is the test result I got when running locally:

lingkai.kong@K9WHXLR93K spark % python/run-tests --testnames 'pyspark.sql.tests.streaming.test_streaming_listener StreamingListenerTests.test_listener_events'
Running PySpark tests. Output is in /Users/lingkai.kong/spark/python/unit-tests.log
Will test against the following Python executables: ['python3.9']
Will test the following Python tests: ['pyspark.sql.tests.streaming.test_streaming_listener StreamingListenerTests.test_listener_events']
python3.9 python_implementation is CPython
python3.9 version is: Python 3.9.6
Starting test(python3.9): pyspark.sql.tests.streaming.test_streaming_listener StreamingListenerTests.test_listener_events (temp output: /Users/lingkai.kong/spark/python/target/ed1f4b6e-6661-4815-84fd-00bf6cedd0ab/python3.9__pyspark.sql.tests.streaming.test_streaming_listener_StreamingListenerTests.test_listener_events__ck29eqqs.log)
WARNING: Using incubator modules: jdk.incubator.vector
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
test_listener_events (pyspark.sql.tests.streaming.test_streaming_listener.StreamingListenerTests) ... FAIL

======================================================================
FAIL: test_listener_events (pyspark.sql.tests.streaming.test_streaming_listener.StreamingListenerTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/lingkai.kong/spark/python/pyspark/sql/tests/streaming/test_streaming_listener.py", line 413, in test_listener_events
    verify(TestListenerV1())
  File "/Users/lingkai.kong/spark/python/pyspark/sql/tests/streaming/test_streaming_listener.py", line 396, in verify
    self.check_start_event(start_event)
  File "/Users/lingkai.kong/spark/python/pyspark/sql/tests/streaming/test_streaming_listener.py", line 40, in check_start_event
    self.assertTrue(isinstance(event, QueryStartedEvent))
AssertionError: False is not true

----------------------------------------------------------------------
Ran 1 test in 5.903s

FAILED (failures=1)

Had test failures in pyspark.sql.tests.streaming.test_streaming_listener StreamingListenerTests.test_listener_events with python3.9; see logs.

Could this be an issue related to package versions, etc.?

Member

ah okie that's fine

Author

Sg, thanks! Can you help me merge this PR once the tests pass?

@HyukjinKwon
Member

Merged to master.
