
Conversation

Collaborator

@fscnick fscnick commented Oct 30, 2025

Why are these changes needed?

This PR is currently a draft that demonstrates the design. If the approach looks okay, we can continue to polish it.

Today the operator queries the job info with a blocking call, which can hurt reconciliation efficiency.

This PR introduces a background goroutine that fetches the JobInfo and caches it.

Implementation design:

  • When the dashboard client is initialized, it also initializes the worker pool and cache storage singletons, along with a cache cleanup goroutine.

  • When GetJobInfo is called, it returns the cached entry on a hit. Otherwise, it puts a placeholder into the cache and adds a task for the background workers to update the JobInfo periodically (see the sketch after this list).

  • The placeholder is removed by calling StopJob, which can happen when the RayJob is retried or deleted.
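A minimal sketch of the cache-or-enqueue path described above. All type and helper names here (cacheClient, jobCache, taskPool, placeholder, isTerminal) are illustrative stand-ins, not the exact identifiers introduced in this PR:

```go
// Sketch only: cacheClient, jobCache, taskPool, placeholder, and isTerminal
// are illustrative names, not the exact identifiers used in this PR.
package dashboardclient

import "context"

type RayJobInfo struct{ JobStatus string }

type jobCache interface {
	Get(key string) (*RayJobInfo, bool)
	Add(key string, value *RayJobInfo)
}

type taskPool interface {
	// AddTask enqueues a task; a task returning true asks to be requeued.
	AddTask(task func(ctx context.Context) bool) error
}

type cacheClient struct {
	prefix string   // "<namespace>/<raycluster-name>", keeps keys cluster-scoped
	cache  jobCache // shared LRU-style cache
	pool   taskPool // shared background worker pool
	inner  interface { // plain HTTP dashboard client doing the blocking call
		GetJobInfo(ctx context.Context, jobId string) (*RayJobInfo, error)
	}
}

var placeholder = &RayJobInfo{} // marks "fetch in progress"

func isTerminal(status string) bool {
	return status == "SUCCEEDED" || status == "FAILED" || status == "STOPPED"
}

func (c *cacheClient) GetJobInfo(ctx context.Context, jobId string) (*RayJobInfo, error) {
	key := c.prefix + "/" + jobId

	// Fast path: return the cached JobInfo on a hit.
	if cached, ok := c.cache.Get(key); ok {
		return cached, nil
	}

	// Miss: insert a placeholder so concurrent callers do not enqueue duplicate
	// refresh tasks, then let the background workers update the entry periodically.
	c.cache.Add(key, placeholder)
	err := c.pool.AddTask(func(taskCtx context.Context) bool {
		info, fetchErr := c.inner.GetJobInfo(taskCtx, jobId)
		if fetchErr != nil {
			return true // requeue and retry on the next interval
		}
		c.cache.Add(key, info)
		return !isTerminal(info.JobStatus) // terminal jobs drop out of the loop
	})
	if err != nil {
		return nil, err
	}
	// Nothing fresh to return on the first call (the real client may surface the
	// placeholder or an error here); later reconciliations read the cache.
	return nil, nil
}
```

The key detail is that a task for a terminal job returns false and stops being refreshed, which ties into the discussion below about letting stale entries simply age out of the cache.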

Additionally, this PR takes the feedback in #4043 into account.

Related issue number

Closes #4087

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Comment on lines 768 to 770
if err := rayDashboardClient.StopJob(ctx, rayJobInstance.Status.JobId); err != nil {
logger.Error(err, "Failed to stop job for RayJob")
}
Collaborator Author

Is it okay to call StopJob here to remove the cache placeholder before deleting the RayCluster, given that the retrying status also calls deleteClusterResources?

Collaborator

We probably should not do this. Just let the cache client figure out how to deal with old entries by itself.

Collaborator Author

Fixed at 9f87da6. The JobInfo update task is removed from the updating loop once the job reaches a terminal status. Eventually, the cached JobInfo is evicted from the cache after the expiration time elapses. If this is not acceptable, kindly let me know.

)

type RayDashboardClientInterface interface {
InitClient(client *http.Client, dashboardURL string)
Collaborator Author
@fscnick fscnick Oct 30, 2025

Remove this method from the interface because different implementations might take different input arguments.

keys := cacheStorage.Keys()
expiredThreshold := time.Now().Add(-cacheExpiry)
for _, key := range keys {
if cached, ok := cacheStorage.Peek(key); ok {
Collaborator Author

Peek doesn't update the recency of the cache entry.
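For context, a hedged sketch of the cleanup sweep, assuming a golang-lru style cache where Get refreshes an entry's recency but Peek does not; cacheStorage, cacheExpiry, queryInterval, and the lastUpdated field are placeholders for the names in this PR:

```go
// Sketch only: the cache interface, cacheExpiry, queryInterval, and the
// lastUpdated field are assumed stand-ins for the identifiers in this PR.
package dashboardclient

import (
	"context"
	"time"
)

type cachedJobInfo struct {
	lastUpdated time.Time
}

type expiringCache interface {
	Keys() []string
	Peek(key string) (*cachedJobInfo, bool) // Peek does NOT refresh LRU recency
	Remove(key string)
}

func startCacheCleanup(ctx context.Context, cacheStorage expiringCache, queryInterval, cacheExpiry time.Duration) {
	go func() {
		// Sweep far less often than the per-job query interval.
		ticker := time.NewTicker(queryInterval * 10)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return // shut down with the manager's context
			case <-ticker.C:
				expiredThreshold := time.Now().Add(-cacheExpiry)
				for _, key := range cacheStorage.Keys() {
					// Peek instead of Get so the sweep itself does not
					// promote entries and keep stale ones alive.
					if cached, ok := cacheStorage.Peek(key); ok && cached.lastUpdated.Before(expiredThreshold) {
						cacheStorage.Remove(key)
					}
				}
			}
		}
	}()
}
```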

@Future-Outlier Future-Outlier self-assigned this Oct 31, 2025
Member
@Future-Outlier Future-Outlier left a comment

This is not an easy one, and I am very worried that this could cause issues—for example, goroutine leaks.

cc @seanlaii @win5923 @JiangJiaWei1103 @machichima @CheyuWu to review this

@fscnick fscnick marked this pull request as ready for review December 11, 2025 14:19

type ClientProvider interface {
GetDashboardClient(mgr manager.Manager) func(rayCluster *rayv1.RayCluster, url string) (dashboardclient.RayDashboardClientInterface, error)
GetDashboardClient(ctx context.Context, mgr manager.Manager) func(rayCluster *rayv1.RayCluster, url string) (dashboardclient.RayDashboardClientInterface, error)
Collaborator Author

Add the context to pass in the logger and to support graceful shutdown.

@CheyuWu CheyuWu self-requested a review December 12, 2025 16:45
Contributor
@JiangJiaWei1103 JiangJiaWei1103 left a comment

This argument is used in many places. Rather than marking each one inline, I’d suggest applying the same change consistently across all usages. The marked examples show the intended direction. Thanks!

g.Expect(err).ToNot(HaveOccurred())
url := fmt.Sprintf("127.0.0.1:%d", localPort)
rayDashboardClientFunc := utils.GetRayDashboardClientFunc(nil, false)
rayDashboardClientFunc := utils.GetRayDashboardClientFunc(t.Ctx(), nil, false, false)
Contributor

Just curious. Do we need to test the following case in which non-blocking query is enabled?

rayDashboardClientFunc := utils.GetRayDashboardClientFunc(t.Ctx(), nil, false, true)

Collaborator Author

The test might be added in a follow-up PR.

logger := ctrl.LoggerFrom(ctx).WithName("RayDashboardCacheClient")

cacheLock.RLock()
if cached, ok := cacheStorage.Get(jobId); ok {
Contributor

Would it be reasonable to add error handling for cache operations, including Get(), PeekOrAdd(), and Add(), to avoid panic if cache initialization failed?

Collaborator

I don't think it's needed, as we always init the cache before calling this. If there's an error, we should indeed panic, since everything else would fail to function properly anyway.


// expiry cache cleanup
go func() {
ticker := time.NewTicker(queryInterval * 10)
Collaborator

Curious: why queryInterval * 10 rather than just queryInterval?

Contributor

It seems reasonable to perform cache cleanup less frequently than the main query tasks. But we still need to discuss how to configure these parameters for different use cases.

@Future-Outlier Future-Outlier moved this from can be merged to Todo in @Future-Outlier's kuberay project Jan 7, 2026
return
case w.taskQueue.In <- task:
}
})

Requeue timer may panic sending to closed channel

Medium Severity

The time.AfterFunc callback at line 79 schedules a task requeue after a delay. When this callback executes, if the context has been cancelled and chanx has closed the In channel, the select statement faces a race condition. Both ctx.Done() (closed) and the send to taskQueue.In (closed channel) are ready. If Go's runtime selects the send case, sending to a closed channel causes a panic. This is a distinct shutdown race from the Out channel issue, occurring when delayed requeue attempts happen after cleanup begins.
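A small standalone illustration of the mechanism behind this finding: when several select cases are ready at once, Go picks one at random, so a ctx.Done() case does not reliably shield a send on a channel that has been closed (the channel names here are illustrative, not from the PR):

```go
// Illustration only: demonstrates that select chooses randomly among ready
// cases, so guarding a send with ctx.Done() does not by itself prevent a
// panic if the destination channel has been closed.
package main

import (
	"context"
	"fmt"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	in := make(chan int, 1)

	cancel()  // ctx.Done() is now ready
	close(in) // the send case is also "ready" (and panics if chosen)

	defer func() {
		if r := recover(); r != nil {
			fmt.Println("panicked:", r) // happens roughly half the time
		}
	}()

	select {
	case <-ctx.Done():
		fmt.Println("saw cancellation")
	case in <- 1:
		// unreachable without panicking: sending on a closed channel panics
	}
}
```

As the reply below notes, chanx does not close its In channel, so this particular panic cannot fire in the current code; the illustration only shows why the ctx.Done() guard alone would not be enough if it did.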


Collaborator Author

chanx doesn't close the In channel, and a task that is still waiting to be requeued at shutdown seems disposable anyway.

logger.Info("worker exiting from a closed channel", "workerID", workerID)
return
}
shouldRequeue := task(ctx)

Inverted channel receive condition breaks worker pool

High Severity

The channel receive condition is inverted. In Go, when receiving from a channel with value, ok := <-channel, ok is true when a value is successfully received and false when the channel is closed. The current code checks if ok and returns, meaning workers exit immediately after receiving a valid task. Additionally, when ok is false (channel closed), the code continues and calls task(ctx) where task is nil, which would cause a panic. The condition needs to be if !ok to correctly handle channel closure.
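For reference, a minimal worker-loop sketch with the receive handled correctly; the names (task, tasks, requeue, workerID) are placeholders rather than the PR's actual identifiers:

```go
// Sketch only: on a channel receive, ok == false means the channel was closed
// and drained, so the worker must exit instead of invoking a nil task.
package workerpool

import (
	"context"
	"log"
)

type task func(ctx context.Context) bool

func worker(ctx context.Context, workerID int, tasks <-chan task, requeue func(task)) {
	for {
		select {
		case <-ctx.Done():
			return
		case t, ok := <-tasks:
			if !ok {
				log.Printf("worker %d exiting from a closed channel", workerID)
				return
			}
			if t(ctx) { // true: the task asks to run again later
				requeue(t)
			}
		}
	}
}
```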


Collaborator Author

Fixed at ae8bf77

}
dashboardCachedClient := &dashboardclient.RayDashboardCacheClient{}
dashboardCachedClient.InitClient(ctx, namespacedName, dashboardClient)
return dashboardCachedClient, nil

Missing cluster identifier in cache key causes cross-cluster collisions

Medium Severity

When rayCluster is nil (as consistently happens in the apiserver), the namespacedName remains empty, causing all cache keys to share the same prefix. The apiserver calls GetJobInfo with nil rayCluster for different clusters. If two clusters have jobs with the same submission ID, the cache returns job info from the wrong cluster. The cache key formula namespacedName.String() + "/" + jobId produces identical keys like //job-123 for different clusters, leading to incorrect data being returned when the AsyncJobInfoQuery feature is enabled.
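A tiny illustration of the collision described above, using the key format quoted in the finding (namespacedName.String() + "/" + jobId); the namespaces and names are made up:

```go
// Illustration only: with an empty NamespacedName, jobs from different
// clusters that share a submission ID map to the same cache key.
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/types"
)

func cacheKey(nn types.NamespacedName, jobId string) string {
	return nn.String() + "/" + jobId
}

func main() {
	// In the apiserver path rayCluster is nil, so every cluster falls back to
	// an empty NamespacedName and shares the same key.
	empty := types.NamespacedName{}
	fmt.Println(cacheKey(empty, "job-123")) // "//job-123" for any cluster

	// Including the cluster identity keeps the keys distinct.
	a := types.NamespacedName{Namespace: "ns-a", Name: "cluster-a"}
	b := types.NamespacedName{Namespace: "ns-b", Name: "cluster-b"}
	fmt.Println(cacheKey(a, "job-123")) // "ns-a/cluster-a/job-123"
	fmt.Println(cacheKey(b, "job-123")) // "ns-b/cluster-b/job-123"
}
```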


Collaborator Author

Fixed at 5c7a5bb

case w.taskQueue.In <- task:
return nil
default:
return ErrTaskQueueTemporarilyUnavailable
Collaborator

Will this happen with the UnboundedChan?

Collaborator Author

It might happen at the beginning. #4160 (comment)

Collaborator

Shouldn’t we wait a bit for its internal buffer to be enlarged?

Collaborator Author

Fixed at 7502bc8

)

func (w *workerPool) start(ctx context.Context, numWorkers int, requeueDelay time.Duration) {
logger := ctrl.LoggerFrom(ctx).WithName("RayDashboardCacheClient").WithName("WorkerPool")
Collaborator

Suggested change
logger := ctrl.LoggerFrom(ctx).WithName("RayDashboardCacheClient").WithName("WorkerPool")
logger := ctrl.LoggerFrom(ctx).WithName("WorkerPool")

Collaborator

The worker pool may be shared beyond RayDashboardCacheClients in the future.

Collaborator Author

Fixed at 8964c93

namespacedName := types.NamespacedName{
Name: CheckName(rayCluster.Name),
Namespace: rayCluster.Namespace,
}

Cache key uses truncated name causing potential collisions

High Severity

The cache key for job info is created using CheckName(rayCluster.Name) instead of the raw cluster name. CheckName truncates names longer than 50 characters by removing characters from the beginning, and replaces the first character with 'r' if it's a digit or punctuation. This means different RayCluster names could produce identical cache keys - for example, two clusters with long names that differ only in their first 6+ characters would collide after truncation. This could cause one cluster's job info to be incorrectly returned for a different cluster, leading to wrong status reporting and potentially incorrect job state management.
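A hedged illustration of the collision mechanism. checkNameLike below only imitates the truncation behavior described in this finding (keep the trailing 50 characters); it is not KubeRay's actual CheckName:

```go
// Illustration only: checkNameLike mimics the truncation behavior described
// above (keep the last 50 characters); it is not KubeRay's CheckName.
package main

import "fmt"

func checkNameLike(name string) string {
	const maxLen = 50
	if len(name) > maxLen {
		name = name[len(name)-maxLen:] // drop characters from the beginning
	}
	return name
}

func main() {
	// Two RayClusters whose long names differ only in their leading characters
	// collapse to the same truncated name, and therefore the same cache key.
	a := "team-alpha-very-long-shared-suffix-for-the-ray-cluster-name-0001"
	b := "team-bravo-very-long-shared-suffix-for-the-ray-cluster-name-0001"
	fmt.Println(checkNameLike(a) == checkNameLike(b)) // true: cache key collision
}
```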


Collaborator Author

Fixed at 90d2c30

default:
return ErrTaskQueueTemporarilyUnavailable
}
}

Non-blocking send to unbuffered channel often fails

Medium Severity

The AddTask function uses a non-blocking select with a default case to send tasks. However, the task queue is initialized with chanx.NewUnboundedChanSize(ctx, 0, 0, initBufferSize) where the first parameter (0) makes the In channel unbuffered. A non-blocking send to an unbuffered channel only succeeds if a receiver is actively waiting at that exact moment. Since the chanx internal goroutine may not always be ready to receive, the send will frequently fall through to the default case and return ErrTaskQueueTemporarilyUnavailable, causing unnecessary task drops and repeated retries even when the system is under low load.
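A standalone illustration of the behavior this finding describes, using plain Go channels rather than chanx: a non-blocking send on an unbuffered channel succeeds only if a receiver is already blocked on it at that moment:

```go
// Illustration only: plain channels, not chanx. A non-blocking send to an
// unbuffered channel fails unless a receiver is parked on it at that moment.
package main

import (
	"fmt"
	"time"
)

func trySend(ch chan int, v int) bool {
	select {
	case ch <- v:
		return true
	default:
		return false // "temporarily unavailable"
	}
}

func main() {
	unbuffered := make(chan int)
	fmt.Println(trySend(unbuffered, 1)) // false: nobody is receiving yet

	go func() { <-unbuffered }()
	time.Sleep(10 * time.Millisecond)   // give the receiver time to block
	fmt.Println(trySend(unbuffered, 1)) // true: a receiver is waiting now

	buffered := make(chan int, 1)
	fmt.Println(trySend(buffered, 1)) // true: buffer space absorbs the send
}
```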


Collaborator Author

Fixed at 7502bc8

}
dashboardCachedClient := &dashboardclient.RayDashboardCacheClient{}
dashboardCachedClient.InitClient(ctx, namespacedName, dashboardClient)
return dashboardCachedClient, nil

Cache key mismatch when RayCluster deleted before RayJob

Low Severity

When deleting a RayJob after its RayCluster is already deleted, the deletion path creates a RayDashboardCacheClient with an empty namespacedName. The condition rayCluster != nil evaluates to true for an empty struct pointer (created via &rayv1.RayCluster{}), so the cache client is initialized with empty Name and Namespace fields. When StopJob is called, it attempts to remove a cache key like "/jobId" instead of the original "namespace/clustername/jobId". This causes the original cache entry to remain stale until it expires (10 minutes) rather than being properly cleaned up.


Collaborator Author

Fixed at 339adf3

@rueian rueian merged commit 79b5c30 into ray-project:master Jan 11, 2026
28 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in My Kuberay & Ray Jan 11, 2026
DejusDevspace added a commit to DejusDevspace/kuberay that referenced this pull request Jan 11, 2026
background goroutine get job info (ray-project#4160)
DejusDevspace added a commit to DejusDevspace/kuberay that referenced this pull request Jan 11, 2026


Development

Successfully merging this pull request may close these issues.

[Feature] RayJob Background Goroutine for getting job info from ray dashboard

9 participants