
Conversation

@Future-Outlier (Member) commented Jan 2, 2026

Co-authored-by: @chiayi [email protected]
Co-authored-by: @KunWuLuan [email protected]

Why are these changes needed?

This web server serves the history server's frontend and fetches data from the event server (processor).
As a follow-up, we should enable autoscaling for the web server using Kubernetes HPA.

https://kubernetes.io/docs/concepts/workloads/autoscaling/horizontal-pod-autoscale/

Note: I combined code from branches #4187 and #4253, then fixed many bugs.

Architecture

  ┌─────────────────────────────────────────────────────────────────────────────────┐
  │                              History Server                                      │
  │                                                                                  │
  │  ┌─────────────────────────────────────────────────────────────────────────┐    │
  │  │                         Router (CookieHandler)                           │    │
  │  │                                                                          │    │
  │  │    Check Cookie: session_name                                            │    │
  │  │                     │                                                    │    │
  │  │         ┌───────────┴───────────┐                                        │    │
  │  │         │                       │                                        │    │
  │  │      "live"              Other sessions                                  │    │
  │  │         │                (dead cluster)                                  │    │
  │  │         ▼                       │                                        │    │
  │  └─────────────────────────────────┴────────────────────────────────────────┘    │
  │            │                       │                                             │
  │            ▼                       ▼                                             │
  │  ┌──────────────────┐    ┌─────────────────────────────────────────────┐        │
  │  │  redirectRequest │    │            EventHandler                      │        │
  │  │                  │    │                                              │        │
  │  │  1. Query K8s    │    │  ┌──────────────┐  ┌──────────────────────┐ │        │
  │  │     for Service  │    │  │ClusterTaskMap│  │ ClusterActorMap      │ │        │
  │  │                  │    │  │              │  │                      │ │        │
  │  │  2. Proxy to     │    │  │ clusterA:    │  │ clusterA:            │ │        │
  │  │     {svc}:8265   │    │  │   TaskMap    │  │   ActorMap           │ │        │
  │  │                  │    │  │ clusterB:    │  │ clusterB:            │ │        │
  │  │                  │    │  │   TaskMap    │  │   ActorMap           │ │        │
  │  └────────┬─────────┘    │  └──────────────┘  └──────────────────────┘ │        │
  │           │              └─────────────────────────────────────────────┘        │
  │           │                       ▲                                             │
  │           │                       │ Populated from                              │
  │           │                       │                                             │
  │           │              ┌────────┴────────┐                                    │
  │           │              │  StorageReader  │                                    │
  │           │              │  (S3 Client)    │                                    │
  │           │              └────────┬────────┘                                    │
  └───────────┼───────────────────────┼─────────────────────────────────────────────┘
              │                       │
              ▼                       ▼
  ┌─────────────────────┐    ┌──────────────────┐
  │   Kubernetes        │    │   S3 Storage     │
  │                     │    │                  │
  │  ┌───────────────┐  │    │  job_events/     │
  │  │ RayCluster    │  │    │  node_events/    │
  │  │ Service:8265  │  │    │  logs/           │
  │  │ (Dashboard)   │  │    │                  │
  │  └───────────────┘  │    └──────────────────┘
  └─────────────────────┘
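The routing in the diagram boils down to a small cookie check. Below is a minimal Go sketch of that dispatch, for illustration only: the type names, the hard-coded Service target, and the placeholder events handler are assumptions, not this PR's actual code.

package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
)

// Illustrative sketch only; identifiers below do not match the PR's implementation.
type router struct {
	events http.Handler // serves terminated-cluster data from the in-memory maps
}

func (rt *router) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	c, err := r.Cookie("session_name")
	if err != nil {
		http.Error(w, "missing session_name cookie", http.StatusBadRequest)
		return
	}
	if c.Value == "live" {
		// Live cluster: in the real server the dashboard Service would be looked up
		// in Kubernetes; this hard-coded target is a placeholder.
		target, err := url.Parse("http://raycluster-historyserver-head-svc.default.svc:8265")
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		httputil.NewSingleHostReverseProxy(target).ServeHTTP(w, r)
		return
	}
	// Any other session name: a dead cluster, answered from processed events.
	rt.events.ServeHTTP(w, r)
}

func main() {
	_ = http.ListenAndServe(":8080", &router{events: http.NotFoundHandler()})
}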

How to test and develop in your local env

  1. check out this branch
  2. kind create cluster --image=kindest/node:v1.29.0
  3. build your ray-operator and run it (binary or deployment both work)
  4. kubectl apply -f historyserver/config/minio.yaml
  5. build collector and history server, and load them into your k8s cluster
    1. cd historyserver
    2. make localimage-historyserver;kind load docker-image historyserver:v0.1.0;
    3. make localimage-collector;kind load docker-image collector:v0.1.0;
  6. kubectl apply -f historyserver/config/raycluster.yaml
  7. kubectl apply -f historyserver/config/rayjob.yaml
  8. kubectl delete -f historyserver/config/raycluster.yaml
  9. kubectl apply -f historyserver/config/service_account.yaml
  10. kubectl apply -f config/historyserver.yaml;
  11. hit the historyserver's endpoint
    1. kubectl port-forward svc/historyserver 8080:30080
    2. curl -c cookies.txt "http://localhost:8080/enter_cluster/default/raycluster-historyserver/session_2026-01-06_07-07-00_383444_1"
    3. cat cookies.txt
    4. curl -b cookies.txt http://localhost:8080/api/v0/tasks
    5. note: change the session dir to the correct one; log in to the MinIO console and get the right session
      1. ref: https://github.com/ray-project/kuberay/blob/master/historyserver/docs/set_up_collector.md#deploy-minio-for-log-and-event-storage
  12. (dead cluster) you can test the following endpoints
echo "=== Health Check ==="
curl "http://localhost:8080/readz"
curl "http://localhost:8080/livez"

echo "=== Clusters List ==="
curl "http://localhost:8080/clusters"

SESSION="session_2026-01-11_19-38-40_146706_1"
curl -c ~/cookies.txt "http://localhost:8080/enter_cluster/default/raycluster-historyserver/$SESSION"

echo "=== All Tasks ==="
curl -b ~/cookies.txt "http://localhost:8080/api/v0/tasks"

echo "=== Tasks by job_id ==="
curl -b ~/cookies.txt "http://localhost:8080/api/v0/tasks?filter_keys=job_id&filter_predicates==&filter_values=AgAAAA=="

echo "=== Task by task_id ==="
curl -b ~/cookies.txt "http://localhost:8080/api/v0/tasks?filter_keys=task_id&filter_predicates==&filter_values=Z6Loz6WgbbP///////////////8CAAAA"

echo "=== All Actors ==="
curl -b ~/cookies.txt "http://localhost:8080/logical/actors"


echo "=== Single Actor ==="
curl -b ~/cookies.txt "http://localhost:8080/logical/actors/YOUR_ACTOR_ID"

echo "=== Nodes ==="
curl -b ~/cookies.txt "http://localhost:8080/nodes?view=summary" | jq .

Response

(ray) future@outlier ~ % curl "http://localhost:8080/readz"

ok%                                                                                                                                           
(ray) future@outlier ~ % curl "http://localhost:8080/livez"

ok%                                                                                                                                           
(ray) future@outlier ~ % curl "http://localhost:8080/clusters"

[
 {
  "name": "raycluster-historyserver",
  "namespace": "default",
  "sessionName": "live",
  "createTime": "2026-01-12 03:38:38 +0000 UTC",
  "createTimeStamp": 1768189118
 },
 {
  "name": "raycluster-historyserver",
  "namespace": "default",
  "sessionName": "session_2026-01-11_19-38-40_146706_1",
  "createTime": "2026-01-11T19:38:40Z",
  "createTimeStamp": 1768160320
 }
]%                                                                                                                                            
(ray) future@outlier ~ % SESSION="session_2026-01-11_19-38-40_146706_1"
curl -c ~/cookies.txt "http://localhost:8080/enter_cluster/default/raycluster-historyserver/$SESSION"
{
 "name": "raycluster-historyserver",
 "namespace": "default",
 "result": "success",
 "session": "session_2026-01-11_19-38-40_146706_1"
}%                                                                                                                                            
(ray) future@outlier ~ % curl -b ~/cookies.txt "http://localhost:8080/api/v0/tasks"

{"data":{"result":{"num_after_truncation":6,"num_filtered":6,"partial_failure_warning":"","result":[{"actor_id":"","attempt_number":0,"call_site":"","end_time":1768189145119,"error_message":"","error_type":"","func_or_class_name":"","job_id":"AgAAAA==","language":"PYTHON","name":"","node_id":"","placement_group_id":"////////////////////////","required_resources":{},"start_time":1768189144152,"state":"FINISHED","task_id":"//////////////////////////8CAAAA","type":"DRIVER_TASK","worker_id":""},{"actor_id":"","attempt_number":0,"call_site":"","end_time":1768189144604,"error_message":"","error_type":"","func_or_class_name":"","job_id":"","language":"","name":"","node_id":"JT5ujSf61SpU/VU7sx4xKWyLRlx4UQH4XQFv9Q==","placement_group_id":"","required_resources":null,"state":"FINISHED","task_id":"5cvZC38ft3bLqOkAxJpQJoFq7FUCAAAA","type":"","worker_id":"Ll3gk+SMLl61F9lqxrg8O9O1QgvR8oCZMmRqVw=="},{"actor_id":"","attempt_number":0,"call_site":"","end_time":1768189144604,"error_message":"","error_type":"","func_or_class_name":"","job_id":"","language":"","name":"","node_id":"JT5ujSf61SpU/VU7sx4xKWyLRlx4UQH4XQFv9Q==","placement_group_id":"","required_resources":null,"state":"FINISHED","task_id":"OQiL43NuWQrLqOkAxJpQJoFq7FUCAAAA","type":"","worker_id":"Ll3gk+SMLl61F9lqxrg8O9O1QgvR8oCZMmRqVw=="},{"actor_id":"","attempt_number":0,"call_site":"","error_message":"","error_type":"","func_or_class_name":"","job_id":"AQAAAA==","language":"PYTHON","name":"","node_id":"","placement_group_id":"////////////////////////","required_resources":{},"start_time":1768189143341,"state":"RUNNING","task_id":"//////////////////////////8BAAAA","type":"DRIVER_TASK","worker_id":""},{"actor_id":"","attempt_number":0,"call_site":"","end_time":1768189144526,"error_message":"","error_type":"","func_or_class_name":"","job_id":"AgAAAA==","language":"PYTHON","name":"Counter.__init__","node_id":"","placement_group_id":"////////////////////////","required_resources":{"CPU":0.5},"state":"FINISHED","task_id":"///////////LqOkAxJpQJoFq7FUCAAAA","type":"ACTOR_CREATION_TASK","worker_id":""},{"actor_id":"","attempt_number":0,"call_site":"","end_time":1768189144517,"error_message":"","error_type":"","func_or_class_name":"","job_id":"AgAAAA==","language":"PYTHON","name":"my_task","node_id":"JT5ujSf61SpU/VU7sx4xKWyLRlx4UQH4XQFv9Q==","placement_group_id":"////////////////////////","required_resources":{"CPU":0.5},"state":"FINISHED","task_id":"Z6Loz6WgbbP///////////////8CAAAA","type":"NORMAL_TASK","worker_id":"9twXgPJpU+82/ELG1tXdZwI6nm+PLhR/7U0Erw=="}],"total":6,"warnings":null}},"msg":"Tasks fetched.","result":true}%                                                                                              
(ray) future@outlier ~ % curl -b ~/cookies.txt "http://localhost:8080/api/v0/tasks?filter_keys=job_id&filter_predicates==&filter_values=AgAAAA=="     
{"data":{"result":{"num_after_truncation":3,"num_filtered":3,"partial_failure_warning":"","result":[{"actor_id":"","attempt_number":0,"call_site":"","end_time":1768189144517,"error_message":"","error_type":"","func_or_class_name":"","job_id":"AgAAAA==","language":"PYTHON","name":"my_task","node_id":"JT5ujSf61SpU/VU7sx4xKWyLRlx4UQH4XQFv9Q==","placement_group_id":"////////////////////////","required_resources":{"CPU":0.5},"state":"FINISHED","task_id":"Z6Loz6WgbbP///////////////8CAAAA","type":"NORMAL_TASK","worker_id":"9twXgPJpU+82/ELG1tXdZwI6nm+PLhR/7U0Erw=="},{"actor_id":"","attempt_number":0,"call_site":"","end_time":1768189145119,"error_message":"","error_type":"","func_or_class_name":"","job_id":"AgAAAA==","language":"PYTHON","name":"","node_id":"","placement_group_id":"////////////////////////","required_resources":{},"start_time":1768189144152,"state":"FINISHED","task_id":"//////////////////////////8CAAAA","type":"DRIVER_TASK","worker_id":""},{"actor_id":"","attempt_number":0,"call_site":"","end_time":1768189144526,"error_message":"","error_type":"","func_or_class_name":"","job_id":"AgAAAA==","language":"PYTHON","name":"Counter.__init__","node_id":"","placement_group_id":"////////////////////////","required_resources":{"CPU":0.5},"state":"FINISHED","task_id":"///////////LqOkAxJpQJoFq7FUCAAAA","type":"ACTOR_CREATION_TASK","worker_id":""}],"total":3,"warnings":null}},"msg":"Tasks fetched.","result":true}%                                                                                                                                    
(ray) future@outlier ~ % curl -b ~/cookies.txt "http://localhost:8080/api/v0/tasks?filter_keys=task_id&filter_predicates==&filter_values=Z6Loz6WgbbP///////////////8CAAAA"
{"data":{"result":{"num_after_truncation":1,"num_filtered":1,"partial_failure_warning":"","result":[{"actor_id":"","attempt_number":0,"call_site":"","end_time":1768189144517,"error_message":"","error_type":"","func_or_class_name":"","job_id":"AgAAAA==","language":"PYTHON","name":"my_task","node_id":"JT5ujSf61SpU/VU7sx4xKWyLRlx4UQH4XQFv9Q==","placement_group_id":"////////////////////////","required_resources":{"CPU":0.5},"state":"FINISHED","task_id":"Z6Loz6WgbbP///////////////8CAAAA","type":"NORMAL_TASK","worker_id":"9twXgPJpU+82/ELG1tXdZwI6nm+PLhR/7U0Erw=="}],"total":1,"warnings":null}},"msg":"Tasks fetched.","result":true}%                                                                              
(ray) future@outlier ~ % curl -b ~/cookies.txt "http://localhost:8080/logical/actors"

{"data":{"actors":{"XzQmbISIt3NHWhRCAQAAAA==":{"actor_class":"JobSupervisor","actor_id":"XzQmbISIt3NHWhRCAQAAAA==","address":{"ip_address":"10.244.0.44","node_id":"1JylYGjpDOh926RZufD3VgfpkSaR4kjvHcngDQ==","port":"","worker_id":"Q39Ebsr1+dYbQ93EmRYB994v1fvYVx75jipC+w=="},"call_site":"","end_time":1768189145211,"exit_details":"The actor is dead because its worker process has died. Worker exit type: INTENDED_USER_EXIT Worker exit detail: Worker exits by an user request. Worker exits with an exit code 0. exit_actor() is called.","is_detached":true,"job_id":"AQAAAA==","name":"_ray_internal_job_actor_rayjob-v6ghz","num_restarts":0,"pid":949,"placement_group_id":"","ray_namespace":"SUPERVISOR_ACTOR_RAY_NAMESPACE","repr_name":"","required_resources":{},"start_time":1768189143979,"state":"DEAD"},"y6jpAMSaUCaBauxVAgAAAA==":{"actor_class":"Counter","actor_id":"y6jpAMSaUCaBauxVAgAAAA==","address":{"ip_address":"10.244.0.45","node_id":"JT5ujSf61SpU/VU7sx4xKWyLRlx4UQH4XQFv9Q==","port":"","worker_id":"Ll3gk+SMLl61F9lqxrg8O9O1QgvR8oCZMmRqVw=="},"call_site":"","end_time":1768189145121,"exit_details":"The actor is dead because its owner has died. Owner Id: 02000000ffffffffffffffffffffffffffffffffffffffffffffffff Owner Ip address: 10.244.0.44 Owner worker exit type: INTENDED_USER_EXIT Worker exit detail: Owner's worker process has crashed.","is_detached":false,"job_id":"AgAAAA==","name":"","num_restarts":0,"pid":273,"placement_group_id":"","ray_namespace":"d86c1e05-db84-44a6-827f-2b0d4d42c30c","repr_name":"","required_resources":{"CPU":0.5},"start_time":1768189144526,"state":"DEAD"}}},"msg":"All actors fetched.","result":true}%                                                                   
(ray) future@outlier ~ % curl -b ~/cookies.txt "http://localhost:8080/logical/actors/XzQmbISIt3NHWhRCAQAAAA=="
{
  "result": true,
  "msg": "Actor fetched.",
  "data": {
    "detail": {
      "actor_class": "JobSupervisor",
      "actor_id": "XzQmbISIt3NHWhRCAQAAAA==",
      "address": {
        "ip_address": "10.244.0.44",
        "node_id": "1JylYGjpDOh926RZufD3VgfpkSaR4kjvHcngDQ==",
        "port": "",
        "worker_id": "Q39Ebsr1+dYbQ93EmRYB994v1fvYVx75jipC+w=="
      },
      "call_site": "",
      "end_time": 1768189145211,
      "exit_details": "The actor is dead because its worker process has died. Worker exit type: INTENDED_USER_EXIT Worker exit detail: Worker exits by an user request. Worker exits with an exit code 0. exit_actor() is called.",
      "is_detached": true,
      "job_id": "AQAAAA==",
      "name": "_ray_internal_job_actor_rayjob-v6ghz",
      "num_restarts": 0,
      "pid": 949,
      "placement_group_id": "",
      "ray_namespace": "SUPERVISOR_ACTOR_RAY_NAMESPACE",
      "repr_name": "",
      "required_resources": {},
      "start_time": 1768189143979,
      "state": "DEAD"
    }
  }
}%                                                                                                                                            
(ray) future@outlier ~ % curl -b ~/cookies.txt "http://localhost:8080/nodes?view=summary" | jq .

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   290  100   290    0     0  13920      0 --:--:-- --:--:-- --:--:-- 14500
{
  "data": {
    "summary": [
      {
        "ip": "UNKNOWN",
        "raylet": {
          "nodeId": "253e6e8d27fad52a54fd553bb31e31296c8b465c785101f85d016ff5",
          "state": "ALIVE"
        }
      },
      {
        "ip": "UNKNOWN",
        "raylet": {
          "nodeId": "d49ca56068e90ce87ddba459b9f0f75607e9912691e248ef1dc9e00d",
          "state": "ALIVE"
        }
      }
    ]
  },
  "msg": "Node summary fetched.",
  "result": true
}
  13. (live cluster) switch to the live session first, then test the following endpoints
echo "=== SWITCH to Live Session First ==="

echo "=== [LIVE] All Tasks ==="
curl -b ~/cookies.txt "http://localhost:8080/api/v0/tasks"

echo "=== [LIVE] Tasks by job_id ==="
curl -b ~/cookies.txt "http://localhost:8080/api/v0/tasks?filter_keys=job_id&filter_predicates==&filter_values=04000000"

echo "=== [LIVE] Task Summarize ==="
curl -b ~/cookies.txt "http://localhost:8080/api/v0/tasks/summarize"

echo "=== [LIVE] All Actors ==="
curl -b ~/cookies.txt "http://localhost:8080/logical/actors"

echo "=== [LIVE] Single Actor ==="
curl -b ~/cookies.txt "http://localhost:8080/logical/actors/YOUR_ACTOR_ID"

echo "=== [LIVE] Nodes Summary ==="
curl -b ~/cookies.txt "http://localhost:8080/nodes?view=summary" | jq .

echo "=== [LIVE] Jobs ==="
curl -b ~/cookies.txt "http://localhost:8080/api/jobs/"

echo "=== [LIVE] Cluster Status ==="
curl -b ~/cookies.txt "http://localhost:8080/api/cluster_status"


Related issue number

#3966
#4374

HistoryServer Alpha Milestone Gap Analysis

Summary

Component Status Gap
Logs Collector (Sidecar) ✅ Done 0%
Events Collector (Sidecar) ✅ Done 0%
Storage Reader/Writer (S3 + OSS) ✅ Done 0%
History Server Container ✅ Done 0%
Event Processing (Task/Actor) ✅ Done 0%
Event Processing (Job/Node) ❌ Missing 100%
Live Cluster Redirect ✅ Done 0%
E2E Tests (Collector) ✅ Done 0%
E2E Tests (HistoryServer) ❌ Missing 100%

API Endpoints (Terminated Clusters)

Endpoint Status Notes
/clusters List all clusters
/nodes List nodes
/nodes/{node_id} Not implemented
/events Not implemented
/api/cluster_status Not implemented
/api/grafana_health Not implemented
/api/prometheus_health Not implemented
/api/data/datasets/{job_id} Not implemented
/api/serve/applications/ Not implemented
/api/v0/placement_groups/ Not implemented
/api/v0/tasks With filter support
/api/v0/tasks/summarize By func_name/lineage
/api/v0/logs List log files
/api/v0/logs/file Not implemented
/logical/actors With filter support
/logical/actors/{actor_id} Single actor
/api/jobs Needs Job events
/api/jobs/{job_id} Needs Job events

Remaining Work (Priority)

Priority Task
P0 Implement Job event processing
P0 Implement /api/jobs, /api/jobs/{job_id}
P0 Add HistoryServer E2E tests
P1 Implement /events endpoint
P1 Implement /nodes/{node_id}
P1 Implement /api/v0/logs/file
P2 /api/cluster_status
P2 /api/grafana_health, /api/prometheus_health
P2 /api/serve/applications/, /api/v0/placement_groups/

others:

  1. lineage for the task endpoint
  2. write processed events to the file system after all endpoints are supported.

Overall Progress: ~75%

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@Future-Outlier (Member Author) left a comment:

cc @chiayi @KunWuLuan to help review, thank you!

@Future-Outlier added the "P0 Critical issue that should be fixed ASAP" label Jan 7, 2026
@Future-Outlier changed the title from "[WIP][history server] Web Server" to "[history server] Web Server + Event Processor" Jan 7, 2026
@Future-Outlier marked this pull request as ready for review January 7, 2026 03:40
Comment on lines 10 to 24
const (
NIL TaskStatus = "NIL"
PENDING_ARGS_AVAIL TaskStatus = "PENDING_ARGS_AVAIL"
PENDING_NODE_ASSIGNMENT TaskStatus = "PENDING_NODE_ASSIGNMENT"
PENDING_OBJ_STORE_MEM_AVAIL TaskStatus = "PENDING_OBJ_STORE_MEM_AVAIL"
PENDING_ARGS_FETCH TaskStatus = "PENDING_ARGS_FETCH"
SUBMITTED_TO_WORKER TaskStatus = "SUBMITTED_TO_WORKER"
PENDING_ACTOR_TASK_ARGS_FETCH TaskStatus = "PENDING_ACTOR_TASK_ARGS_FETCH"
PENDING_ACTOR_TASK_ORDERING_OR_CONCURRENCY TaskStatus = "PENDING_ACTOR_TASK_ORDERING_OR_CONCURRENCY"
RUNNING TaskStatus = "RUNNING"
RUNNING_IN_RAY_GET TaskStatus = "RUNNING_IN_RAY_GET"
RUNNING_IN_RAY_WAIT TaskStatus = "RUNNING_IN_RAY_WAIT"
FINISHED TaskStatus = "FINISHED"
FAILED TaskStatus = "FAILED"
)
@win5923 (Collaborator) commented Jan 8, 2026

LGTM! Just a question about something you mentioned:

As a follow-up, we should enable autoscaling for the web server using Kubernetes HPA.

How does the event processor handle data consistency when multiple replicas are deployed? Currently each pod runs its own historyserver with in-memory state. Won't this cause inconsistent responses depending on which pod handles the request?

@Future-Outlier (Member Author) commented Jan 8, 2026

todo:

  1. support live clusters
  2. fix other endpoints like getTaskSummarize
  3. delete dead code
  4. address cursor bug bot's review comments

@Future-Outlier (Member Author) replied:

LGTM! Just a question about something you mentioned:

As a follow-up, we should enable autoscaling for the web server using Kubernetes HPA.

How does the event processor handle data consistency when multiple replicas are deployed? Currently each pod runs its own historyserver with in-memory state. Won't this cause inconsistent responses depending on which pod handles the request?

Yes, it will, and this will be solved in the beta version.
We will need to store processed events in a database.
Good point, thank you!
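For context only, a rough sketch of that direction (nothing below is in this PR; the EventStore name, methods, and TaskRecord type are hypothetical): the processor and the HTTP handlers would share one durable store instead of per-replica maps, so every web-server replica answers from the same data.

package store

import "context"

// Hypothetical sketch, not this PR's code: a shared, durable store used by
// both the event processor (writer) and the HTTP handlers (readers), so that
// responses do not depend on which replica processed the events.
type TaskRecord struct {
	ClusterKey string // e.g. namespace/cluster/session
	TaskID     string
	Payload    []byte // processed event, serialized
}

type EventStore interface {
	PutTask(ctx context.Context, rec TaskRecord) error
	ListTasks(ctx context.Context, clusterKey string) ([]TaskRecord, error)
}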

@justinyeh1995 (Contributor)

Thanks for the comprehensive instructions! Just ran through them and found it might be a good idea to add **/*.txt to the .gitignore file.

I'm not sure about the rules for .gitignore in this project, but right now I store cookie.txt under the root directory. I do something like this:

SESSION="session_2026-01-10_06-52-41_947719_1"
curl -c ~/cookies.txt "http://localhost:8080/enter_cluster/default/raycluster-historyserver/$SESSION"

I see. Thanks for the tips!

@Future-Outlier (Member Author)

cc @chiayi @KunWuLuan to do a final pass, thank you!

@Future-Outlier (Member Author)

cursor review

@Future-Outlier (Member Author)

cursor review

@cursor (bot) left a comment:

✅ Bugbot reviewed your changes and found no bugs!

@chiayi (Contributor) left a comment:

LGTM!

@KunWuLuan (Contributor)

LGTM

/approve

Mu sync.RWMutex
}

func (c *ClusterTaskMap) RLock() {
A Contributor commented:

Why do we need these funcs?

@Future-Outlier (Member Author) replied Jan 13, 2026:

Go maps are not thread-safe. Concurrent reads and writes cause undefined behavior, so we use locks.
https://go.dev/blog/maps#concurrency

┌─────────────────────┐     ┌─────────────────────┐
│   Event Processor   │     │    HTTP Handler     │
│   (goroutine 1..N)  │     │   (goroutine 1..M)  │
└──────────┬──────────┘     └──────────┬──────────┘
           │ WRITE                     │ READ
           ▼                           ▼
    ┌──────────────────────────────────────────┐
    │         ClusterTaskMap (RWMutex)         │
    │  ┌────────────────────────────────────┐  │
    │  │    TaskMap per cluster (Mutex)     │  │
    │  │  ┌──────────────────────────────┐  │  │
    │  │  │   map[taskId] → []Task       │  │  │
    │  │  └──────────────────────────────┘  │  │
    │  └────────────────────────────────────┘  │
    └──────────────────────────────────────────┘
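As a minimal illustration of the locking pattern in the diagram (simplified; the names and fields below are illustrative, not this PR's exact definitions), writer goroutines take the write lock while populating the map and HTTP-handler goroutines take the read lock:

package main

import (
	"fmt"
	"sync"
)

// Simplified sketch of the pattern above; not the PR's actual types.
type Task struct{ ID, State string }

type TaskMap struct {
	mu    sync.RWMutex
	tasks map[string][]Task // task ID -> attempts
}

func NewTaskMap() *TaskMap {
	return &TaskMap{tasks: make(map[string][]Task)}
}

// Add is called by event-processor goroutines (writers).
func (m *TaskMap) Add(t Task) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.tasks[t.ID] = append(m.tasks[t.ID], t)
}

// Get is called by HTTP-handler goroutines (readers).
func (m *TaskMap) Get(taskID string) []Task {
	m.mu.RLock()
	defer m.mu.RUnlock()
	attempts := make([]Task, len(m.tasks[taskID]))
	copy(attempts, m.tasks[taskID]) // return a copy so callers don't race on the slice
	return attempts
}

func main() {
	m := NewTaskMap()
	m.Add(Task{ID: "t1", State: "FINISHED"})
	fmt.Println(m.Get("t1"))
}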

@rueian merged commit a9a4ab0 into ray-project:master Jan 13, 2026
29 checks passed
github-project-automation bot moved this from "can be merged" to "Done" in @Future-Outlier's kuberay project Jan 13, 2026