
Conversation

@Future-Outlier (Member) commented Jan 2, 2026

Co-authored-by: @chiayi [email protected]
Co-authored-by: @KunWuLuan [email protected]

Why are these changes needed?

This web server serves the history server's frontend and fetches data from the event server (processor).
As a follow-up, we should enable autoscaling for the web server using Kubernetes HPA.

https://kubernetes.io/docs/concepts/workloads/autoscaling/horizontal-pod-autoscale/

Note: I combined code from branches #4187 and #4253, then fixed many bugs.

Architecture

  ┌─────────────────────────────────────────────────────────────────────────────────┐
  │                              History Server                                      │
  │                                                                                  │
  │  ┌─────────────────────────────────────────────────────────────────────────┐    │
  │  │                         Router (CookieHandler)                           │    │
  │  │                                                                          │    │
  │  │    Check Cookie: session_name                                            │    │
  │  │                     │                                                    │    │
  │  │         ┌───────────┴───────────┐                                        │    │
  │  │         │                       │                                        │    │
  │  │      "live"              Other sessions                                  │    │
  │  │         │                (dead cluster)                                  │    │
  │  │         ▼                       │                                        │    │
  │  └─────────────────────────────────┴────────────────────────────────────────┘    │
  │            │                       │                                             │
  │            ▼                       ▼                                             │
  │  ┌──────────────────┐    ┌─────────────────────────────────────────────┐        │
  │  │  redirectRequest │    │            EventHandler                      │        │
  │  │                  │    │                                              │        │
  │  │  1. Query K8s    │    │  ┌──────────────┐  ┌──────────────────────┐ │        │
  │  │     for Service  │    │  │ClusterTaskMap│  │ ClusterActorMap      │ │        │
  │  │                  │    │  │              │  │                      │ │        │
  │  │  2. Proxy to     │    │  │ clusterA:    │  │ clusterA:            │ │        │
  │  │     {svc}:8265   │    │  │   TaskMap    │  │   ActorMap           │ │        │
  │  │                  │    │  │ clusterB:    │  │ clusterB:            │ │        │
  │  │                  │    │  │   TaskMap    │  │   ActorMap           │ │        │
  │  └────────┬─────────┘    │  └──────────────┘  └──────────────────────┘ │        │
  │           │              └─────────────────────────────────────────────┘        │
  │           │                       ▲                                             │
  │           │                       │ Populated from                              │
  │           │                       │                                             │
  │           │              ┌────────┴────────┐                                    │
  │           │              │  StorageReader  │                                    │
  │           │              │  (S3 Client)    │                                    │
  │           │              └────────┬────────┘                                    │
  └───────────┼───────────────────────┼─────────────────────────────────────────────┘
              │                       │
              ▼                       ▼
  ┌─────────────────────┐    ┌──────────────────┐
  │   Kubernetes        │    │   S3 Storage     │
  │                     │    │                  │
  │  ┌───────────────┐  │    │  job_events/     │
  │  │ RayCluster    │  │    │  node_events/    │
  │  │ Service:8265  │  │    │  logs/           │
  │  │ (Dashboard)   │  │    │                  │
  │  └───────────────┘  │    └──────────────────┘
  └─────────────────────┘
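The routing in the diagram boils down to a small cookie check. Below is a minimal Go sketch of that dispatch, for illustration only: the type names, the hard-coded Service target, and the placeholder events handler are assumptions, not this PR's actual code.

package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
)

// Illustrative sketch only; identifiers below do not match the PR's implementation.
type router struct {
	events http.Handler // serves terminated-cluster data from the in-memory maps
}

func (rt *router) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	c, err := r.Cookie("session_name")
	if err != nil {
		http.Error(w, "missing session_name cookie", http.StatusBadRequest)
		return
	}
	if c.Value == "live" {
		// Live cluster: in the real server the dashboard Service would be looked up
		// in Kubernetes; this hard-coded target is a placeholder.
		target, err := url.Parse("http://raycluster-historyserver-head-svc.default.svc:8265")
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		httputil.NewSingleHostReverseProxy(target).ServeHTTP(w, r)
		return
	}
	// Any other session name: a dead cluster, answered from processed events.
	rt.events.ServeHTTP(w, r)
}

func main() {
	_ = http.ListenAndServe(":8080", &router{events: http.NotFoundHandler()})
}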

How to test and develop in your local env

  1. check out this branch
  2. kind create cluster --image=kindest/node:v1.29.0
  3. build your ray-operator and run it (binary or deployment both work)
  4. kubectl apply -f historyserver/config/minio.yaml
  5. build collector and history server, and load them into your k8s cluster
    1. cd historyserver
    2. make localimage-historyserver;kind load docker-image historyserver:v0.1.0;
    3. make localimage-collector;kind load docker-image collector:v0.1.0;
  6. kubectl apply -f historyserver/config/raycluster.yaml
  7. kubectl apply -f historyserver/config/rayjob.yaml
  8. kubectl delete -f historyserver/config/raycluster.yaml
  9. kubectl apply -f historyserver/config/service_account.yaml
  10. kubectl apply -f config/historyserver.yaml;
  11. hit the historyserver's endpoint
    1. kubectl port-forward svc/historyserver 8080:30080
    2. curl -c cookies.txt "http://localhost:8080/enter_cluster/default/raycluster-historyserver/session_2026-01-06_07-07-00_383444_1"
    3. cat cookies.txt
    4. curl -b cookies.txt http://localhost:8080/api/v0/tasks
    5. note: change the session dir to the correct one; log in to the MinIO console and get the right session
      1. ref: https://github.com/ray-project/kuberay/blob/master/historyserver/docs/set_up_collector.md#deploy-minio-for-log-and-event-storage
  12. (dead cluster) you can test the following endpoints
echo "=== Health Check ==="
curl "http://localhost:8080/readz"
curl "http://localhost:8080/livez"

echo "=== Clusters List ==="
curl "http://localhost:8080/clusters"

SESSION="session_2026-01-11_19-38-40_146706_1"
curl -c ~/cookies.txt "http://localhost:8080/enter_cluster/default/raycluster-historyserver/$SESSION"

echo "=== All Tasks ==="
curl -b ~/cookies.txt "http://localhost:8080/api/v0/tasks"

echo "=== Tasks by job_id ==="
curl -b ~/cookies.txt "http://localhost:8080/api/v0/tasks?filter_keys=job_id&filter_predicates==&filter_values=AgAAAA=="

echo "=== Task by task_id ==="
curl -b ~/cookies.txt "http://localhost:8080/api/v0/tasks?filter_keys=task_id&filter_predicates==&filter_values=Z6Loz6WgbbP///////////////8CAAAA"

echo "=== All Actors ==="
curl -b ~/cookies.txt "http://localhost:8080/logical/actors"


echo "=== Single Actor ==="
curl -b ~/cookies.txt "http://localhost:8080/logical/actors/YOUR_ACTOR_ID"

echo "=== Nodes ==="
curl -b ~/cookies.txt "http://localhost:8080/nodes?view=summary" | jq .

Response

(ray) future@outlier ~ % curl "http://localhost:8080/readz"

ok%                                                                                                                                           
(ray) future@outlier ~ % curl "http://localhost:8080/livez"

ok%                                                                                                                                           
(ray) future@outlier ~ % curl "http://localhost:8080/clusters"

[
 {
  "name": "raycluster-historyserver",
  "namespace": "default",
  "sessionName": "live",
  "createTime": "2026-01-12 03:38:38 +0000 UTC",
  "createTimeStamp": 1768189118
 },
 {
  "name": "raycluster-historyserver",
  "namespace": "default",
  "sessionName": "session_2026-01-11_19-38-40_146706_1",
  "createTime": "2026-01-11T19:38:40Z",
  "createTimeStamp": 1768160320
 }
]%                                                                                                                                            
(ray) future@outlier ~ % SESSION="session_2026-01-11_19-38-40_146706_1"
curl -c ~/cookies.txt "http://localhost:8080/enter_cluster/default/raycluster-historyserver/$SESSION"
{
 "name": "raycluster-historyserver",
 "namespace": "default",
 "result": "success",
 "session": "session_2026-01-11_19-38-40_146706_1"
}%                                                                                                                                            
(ray) future@outlier ~ % curl -b ~/cookies.txt "http://localhost:8080/api/v0/tasks"

{"data":{"result":{"num_after_truncation":6,"num_filtered":6,"partial_failure_warning":"","result":[{"actor_id":"","attempt_number":0,"call_site":"","end_time":1768189145119,"error_message":"","error_type":"","func_or_class_name":"","job_id":"AgAAAA==","language":"PYTHON","name":"","node_id":"","placement_group_id":"////////////////////////","required_resources":{},"start_time":1768189144152,"state":"FINISHED","task_id":"//////////////////////////8CAAAA","type":"DRIVER_TASK","worker_id":""},{"actor_id":"","attempt_number":0,"call_site":"","end_time":1768189144604,"error_message":"","error_type":"","func_or_class_name":"","job_id":"","language":"","name":"","node_id":"JT5ujSf61SpU/VU7sx4xKWyLRlx4UQH4XQFv9Q==","placement_group_id":"","required_resources":null,"state":"FINISHED","task_id":"5cvZC38ft3bLqOkAxJpQJoFq7FUCAAAA","type":"","worker_id":"Ll3gk+SMLl61F9lqxrg8O9O1QgvR8oCZMmRqVw=="},{"actor_id":"","attempt_number":0,"call_site":"","end_time":1768189144604,"error_message":"","error_type":"","func_or_class_name":"","job_id":"","language":"","name":"","node_id":"JT5ujSf61SpU/VU7sx4xKWyLRlx4UQH4XQFv9Q==","placement_group_id":"","required_resources":null,"state":"FINISHED","task_id":"OQiL43NuWQrLqOkAxJpQJoFq7FUCAAAA","type":"","worker_id":"Ll3gk+SMLl61F9lqxrg8O9O1QgvR8oCZMmRqVw=="},{"actor_id":"","attempt_number":0,"call_site":"","error_message":"","error_type":"","func_or_class_name":"","job_id":"AQAAAA==","language":"PYTHON","name":"","node_id":"","placement_group_id":"////////////////////////","required_resources":{},"start_time":1768189143341,"state":"RUNNING","task_id":"//////////////////////////8BAAAA","type":"DRIVER_TASK","worker_id":""},{"actor_id":"","attempt_number":0,"call_site":"","end_time":1768189144526,"error_message":"","error_type":"","func_or_class_name":"","job_id":"AgAAAA==","language":"PYTHON","name":"Counter.__init__","node_id":"","placement_group_id":"////////////////////////","required_resources":{"CPU":0.5},"state":"FINISHED","task_id":"///////////LqOkAxJpQJoFq7FUCAAAA","type":"ACTOR_CREATION_TASK","worker_id":""},{"actor_id":"","attempt_number":0,"call_site":"","end_time":1768189144517,"error_message":"","error_type":"","func_or_class_name":"","job_id":"AgAAAA==","language":"PYTHON","name":"my_task","node_id":"JT5ujSf61SpU/VU7sx4xKWyLRlx4UQH4XQFv9Q==","placement_group_id":"////////////////////////","required_resources":{"CPU":0.5},"state":"FINISHED","task_id":"Z6Loz6WgbbP///////////////8CAAAA","type":"NORMAL_TASK","worker_id":"9twXgPJpU+82/ELG1tXdZwI6nm+PLhR/7U0Erw=="}],"total":6,"warnings":null}},"msg":"Tasks fetched.","result":true}%                                                                                              
(ray) future@outlier ~ % curl -b ~/cookies.txt "http://localhost:8080/api/v0/tasks?filter_keys=job_id&filter_predicates==&filter_values=AgAAAA=="     
{"data":{"result":{"num_after_truncation":3,"num_filtered":3,"partial_failure_warning":"","result":[{"actor_id":"","attempt_number":0,"call_site":"","end_time":1768189144517,"error_message":"","error_type":"","func_or_class_name":"","job_id":"AgAAAA==","language":"PYTHON","name":"my_task","node_id":"JT5ujSf61SpU/VU7sx4xKWyLRlx4UQH4XQFv9Q==","placement_group_id":"////////////////////////","required_resources":{"CPU":0.5},"state":"FINISHED","task_id":"Z6Loz6WgbbP///////////////8CAAAA","type":"NORMAL_TASK","worker_id":"9twXgPJpU+82/ELG1tXdZwI6nm+PLhR/7U0Erw=="},{"actor_id":"","attempt_number":0,"call_site":"","end_time":1768189145119,"error_message":"","error_type":"","func_or_class_name":"","job_id":"AgAAAA==","language":"PYTHON","name":"","node_id":"","placement_group_id":"////////////////////////","required_resources":{},"start_time":1768189144152,"state":"FINISHED","task_id":"//////////////////////////8CAAAA","type":"DRIVER_TASK","worker_id":""},{"actor_id":"","attempt_number":0,"call_site":"","end_time":1768189144526,"error_message":"","error_type":"","func_or_class_name":"","job_id":"AgAAAA==","language":"PYTHON","name":"Counter.__init__","node_id":"","placement_group_id":"////////////////////////","required_resources":{"CPU":0.5},"state":"FINISHED","task_id":"///////////LqOkAxJpQJoFq7FUCAAAA","type":"ACTOR_CREATION_TASK","worker_id":""}],"total":3,"warnings":null}},"msg":"Tasks fetched.","result":true}%                                                                                                                                    
(ray) future@outlier ~ % curl -b ~/cookies.txt "http://localhost:8080/api/v0/tasks?filter_keys=task_id&filter_predicates==&filter_values=Z6Loz6WgbbP///////////////8CAAAA"
{"data":{"result":{"num_after_truncation":1,"num_filtered":1,"partial_failure_warning":"","result":[{"actor_id":"","attempt_number":0,"call_site":"","end_time":1768189144517,"error_message":"","error_type":"","func_or_class_name":"","job_id":"AgAAAA==","language":"PYTHON","name":"my_task","node_id":"JT5ujSf61SpU/VU7sx4xKWyLRlx4UQH4XQFv9Q==","placement_group_id":"////////////////////////","required_resources":{"CPU":0.5},"state":"FINISHED","task_id":"Z6Loz6WgbbP///////////////8CAAAA","type":"NORMAL_TASK","worker_id":"9twXgPJpU+82/ELG1tXdZwI6nm+PLhR/7U0Erw=="}],"total":1,"warnings":null}},"msg":"Tasks fetched.","result":true}%                                                                              
(ray) future@outlier ~ % curl -b ~/cookies.txt "http://localhost:8080/logical/actors"

{"data":{"actors":{"XzQmbISIt3NHWhRCAQAAAA==":{"actor_class":"JobSupervisor","actor_id":"XzQmbISIt3NHWhRCAQAAAA==","address":{"ip_address":"10.244.0.44","node_id":"1JylYGjpDOh926RZufD3VgfpkSaR4kjvHcngDQ==","port":"","worker_id":"Q39Ebsr1+dYbQ93EmRYB994v1fvYVx75jipC+w=="},"call_site":"","end_time":1768189145211,"exit_details":"The actor is dead because its worker process has died. Worker exit type: INTENDED_USER_EXIT Worker exit detail: Worker exits by an user request. Worker exits with an exit code 0. exit_actor() is called.","is_detached":true,"job_id":"AQAAAA==","name":"_ray_internal_job_actor_rayjob-v6ghz","num_restarts":0,"pid":949,"placement_group_id":"","ray_namespace":"SUPERVISOR_ACTOR_RAY_NAMESPACE","repr_name":"","required_resources":{},"start_time":1768189143979,"state":"DEAD"},"y6jpAMSaUCaBauxVAgAAAA==":{"actor_class":"Counter","actor_id":"y6jpAMSaUCaBauxVAgAAAA==","address":{"ip_address":"10.244.0.45","node_id":"JT5ujSf61SpU/VU7sx4xKWyLRlx4UQH4XQFv9Q==","port":"","worker_id":"Ll3gk+SMLl61F9lqxrg8O9O1QgvR8oCZMmRqVw=="},"call_site":"","end_time":1768189145121,"exit_details":"The actor is dead because its owner has died. Owner Id: 02000000ffffffffffffffffffffffffffffffffffffffffffffffff Owner Ip address: 10.244.0.44 Owner worker exit type: INTENDED_USER_EXIT Worker exit detail: Owner's worker process has crashed.","is_detached":false,"job_id":"AgAAAA==","name":"","num_restarts":0,"pid":273,"placement_group_id":"","ray_namespace":"d86c1e05-db84-44a6-827f-2b0d4d42c30c","repr_name":"","required_resources":{"CPU":0.5},"start_time":1768189144526,"state":"DEAD"}}},"msg":"All actors fetched.","result":true}%                                                                   
(ray) future@outlier ~ % curl -b ~/cookies.txt "http://localhost:8080/logical/actors/XzQmbISIt3NHWhRCAQAAAA=="
{
  "result": true,
  "msg": "Actor fetched.",
  "data": {
    "detail": {
      "actor_class": "JobSupervisor",
      "actor_id": "XzQmbISIt3NHWhRCAQAAAA==",
      "address": {
        "ip_address": "10.244.0.44",
        "node_id": "1JylYGjpDOh926RZufD3VgfpkSaR4kjvHcngDQ==",
        "port": "",
        "worker_id": "Q39Ebsr1+dYbQ93EmRYB994v1fvYVx75jipC+w=="
      },
      "call_site": "",
      "end_time": 1768189145211,
      "exit_details": "The actor is dead because its worker process has died. Worker exit type: INTENDED_USER_EXIT Worker exit detail: Worker exits by an user request. Worker exits with an exit code 0. exit_actor() is called.",
      "is_detached": true,
      "job_id": "AQAAAA==",
      "name": "_ray_internal_job_actor_rayjob-v6ghz",
      "num_restarts": 0,
      "pid": 949,
      "placement_group_id": "",
      "ray_namespace": "SUPERVISOR_ACTOR_RAY_NAMESPACE",
      "repr_name": "",
      "required_resources": {},
      "start_time": 1768189143979,
      "state": "DEAD"
    }
  }
}%                                                                                                                                            
(ray) future@outlier ~ % curl -b ~/cookies.txt "http://localhost:8080/nodes?view=summary" | jq .

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   290  100   290    0     0  13920      0 --:--:-- --:--:-- --:--:-- 14500
{
  "data": {
    "summary": [
      {
        "ip": "UNKNOWN",
        "raylet": {
          "nodeId": "253e6e8d27fad52a54fd553bb31e31296c8b465c785101f85d016ff5",
          "state": "ALIVE"
        }
      },
      {
        "ip": "UNKNOWN",
        "raylet": {
          "nodeId": "d49ca56068e90ce87ddba459b9f0f75607e9912691e248ef1dc9e00d",
          "state": "ALIVE"
        }
      }
    ]
  },
  "msg": "Node summary fetched.",
  "result": true
}
  13. (live cluster) switch to the live session first, then test the following endpoints
echo "=== SWITCH to Live Session First ==="

echo "=== [LIVE] All Tasks ==="
curl -b ~/cookies.txt "http://localhost:8080/api/v0/tasks"

echo "=== [LIVE] Tasks by job_id ==="
curl -b ~/cookies.txt "http://localhost:8080/api/v0/tasks?filter_keys=job_id&filter_predicates==&filter_values=04000000"

echo "=== [LIVE] Task Summarize ==="
curl -b ~/cookies.txt "http://localhost:8080/api/v0/tasks/summarize"

echo "=== [LIVE] All Actors ==="
curl -b ~/cookies.txt "http://localhost:8080/logical/actors"

echo "=== [LIVE] Single Actor ==="
curl -b ~/cookies.txt "http://localhost:8080/logical/actors/YOUR_ACTOR_ID"

echo "=== [LIVE] Nodes Summary ==="
curl -b ~/cookies.txt "http://localhost:8080/nodes?view=summary" | jq .

echo "=== [LIVE] Jobs ==="
curl -b ~/cookies.txt "http://localhost:8080/api/jobs/"

echo "=== [LIVE] Cluster Status ==="
curl -b ~/cookies.txt "http://localhost:8080/api/cluster_status"


Related issue number

#3966
#4374

HistoryServer Alpha Milestone Gap Analysis

Summary

Component Status Gap
Logs Collector (Sidecar) ✅ Done 0%
Events Collector (Sidecar) ✅ Done 0%
Storage Reader/Writer (S3 + OSS) ✅ Done 0%
History Server Container ✅ Done 0%
Event Processing (Task/Actor) ✅ Done 0%
Event Processing (Job/Node) ❌ Missing 100%
Live Cluster Redirect ✅ Done 0%
E2E Tests (Collector) ✅ Done 0%
E2E Tests (HistoryServer) ❌ Missing 100%

API Endpoints (Terminated Clusters)

Endpoint Status Notes
/clusters List all clusters
/nodes List nodes
/nodes/{node_id} Not implemented
/events Not implemented
/api/cluster_status Not implemented
/api/grafana_health Not implemented
/api/prometheus_health Not implemented
/api/data/datasets/{job_id} Not implemented
/api/serve/applications/ Not implemented
/api/v0/placement_groups/ Not implemented
/api/v0/tasks With filter support
/api/v0/tasks/summarize By func_name/lineage
/api/v0/logs List log files
/api/v0/logs/file Not implemented
/logical/actors With filter support
/logical/actors/{actor_id} Single actor
/api/jobs Needs Job events
/api/jobs/{job_id} Needs Job events

Remaining Work (Priority)

Priority Task
P0 Implement Job event processing
P0 Implement /api/jobs, /api/jobs/{job_id}
P0 Add HistoryServer E2E tests
P1 Implement /events endpoint
P1 Implement /nodes/{node_id}
P1 Implement /api/v0/logs/file
P2 /api/cluster_status
P2 /api/grafana_health, /api/prometheus_health
P2 /api/serve/applications/, /api/v0/placement_groups/

others:

  1. lineage for the task endpoint
  2. write processed events to the file system after all endpoints are supported.

Overall Progress: ~75%

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@Future-Outlier (Member Author) left a comment:

cc @chiayi @KunWuLuan to help review, thank you!

@Future-Outlier added the "P0 Critical issue that should be fixed ASAP" label Jan 7, 2026
@Future-Outlier changed the title from "[WIP][history server] Web Server" to "[history server] Web Server + Event Processor" Jan 7, 2026
@Future-Outlier marked this pull request as ready for review January 7, 2026 03:40
Comment on lines 10 to 24
const (
NIL TaskStatus = "NIL"
PENDING_ARGS_AVAIL TaskStatus = "PENDING_ARGS_AVAIL"
PENDING_NODE_ASSIGNMENT TaskStatus = "PENDING_NODE_ASSIGNMENT"
PENDING_OBJ_STORE_MEM_AVAIL TaskStatus = "PENDING_OBJ_STORE_MEM_AVAIL"
PENDING_ARGS_FETCH TaskStatus = "PENDING_ARGS_FETCH"
SUBMITTED_TO_WORKER TaskStatus = "SUBMITTED_TO_WORKER"
PENDING_ACTOR_TASK_ARGS_FETCH TaskStatus = "PENDING_ACTOR_TASK_ARGS_FETCH"
PENDING_ACTOR_TASK_ORDERING_OR_CONCURRENCY TaskStatus = "PENDING_ACTOR_TASK_ORDERING_OR_CONCURRENCY"
RUNNING TaskStatus = "RUNNING"
RUNNING_IN_RAY_GET TaskStatus = "RUNNING_IN_RAY_GET"
RUNNING_IN_RAY_WAIT TaskStatus = "RUNNING_IN_RAY_WAIT"
FINISHED TaskStatus = "FINISHED"
FAILED TaskStatus = "FAILED"
)
@win5923 (Collaborator) commented Jan 8, 2026

LGTM! Just a question about something you mentioned:

As a follow-up, we should enable autoscaling for the web server using Kubernetes HPA.

How does the event processor handle data consistency when multiple replicas are deployed? Currently each pod runs its own historyserver with in-memory state. Won't this cause inconsistent responses depending on which pod handles the request?

@Future-Outlier (Member Author) commented Jan 8, 2026

todo:

  1. support live clusters
  2. fix other endpoints like getTaskSummarize
  3. delete dead code
  4. address cursor bug bot's review comments

@Future-Outlier (Member Author) replied:

LGTM! Just a question about something you mentioned:

As a follow-up, we should enable autoscaling for the web server using Kubernetes HPA.

How does the event processor handle data consistency when multiple replicas are deployed? Currently each pod runs its own historyserver with in-memory state. Won't this cause inconsistent responses depending on which pod handles the request?

Yes, it will, and this will be solved in the beta version.
We will need to store processed events in a database.
Good point, thank you!
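For context only, a rough sketch of that direction (nothing below is in this PR; the EventStore name, methods, and TaskRecord type are hypothetical): the processor and the HTTP handlers would share one durable store instead of per-replica maps, so every web-server replica answers from the same data.

package store

import "context"

// Hypothetical sketch, not this PR's code: a shared, durable store used by
// both the event processor (writer) and the HTTP handlers (readers), so that
// responses do not depend on which replica processed the events.
type TaskRecord struct {
	ClusterKey string // e.g. namespace/cluster/session
	TaskID     string
	Payload    []byte // processed event, serialized
}

type EventStore interface {
	PutTask(ctx context.Context, rec TaskRecord) error
	ListTasks(ctx context.Context, clusterKey string) ([]TaskRecord, error)
}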

@justinyeh1995 (Contributor)

Thanks for the comprehensive instructions! Just ran through them and found it might be a good idea to add **/*.txt to the .gitignore file.

I'm not sure about the rules for .gitignore in this project, but right now I store cookie.txt under the root directory. I do something like this:

SESSION="session_2026-01-10_06-52-41_947719_1"
curl -c ~/cookies.txt "http://localhost:8080/enter_cluster/default/raycluster-historyserver/$SESSION"

I see. Thanks for the tips!

@Future-Outlier (Member Author)

cc @chiayi @KunWuLuan to do a final pass, thank you!

@Future-Outlier (Member Author)

cursor review

@Future-Outlier (Member Author)

cursor review

@cursor (bot) left a comment:

✅ Bugbot reviewed your changes and found no bugs!

@chiayi (Contributor) left a comment:

LGTM!

@KunWuLuan (Contributor)

LGTM

/approve

Mu sync.RWMutex
}

func (c *ClusterTaskMap) RLock() {
A Contributor commented:

Why do we need these funcs?

@Future-Outlier (Member Author) replied Jan 13, 2026:

Go maps are not thread-safe. Concurrent reads and writes cause undefined behavior, so we use locks.
https://go.dev/blog/maps#concurrency

┌─────────────────────┐     ┌─────────────────────┐
│   Event Processor   │     │    HTTP Handler     │
│   (goroutine 1..N)  │     │   (goroutine 1..M)  │
└──────────┬──────────┘     └──────────┬──────────┘
           │ WRITE                     │ READ
           ▼                           ▼
    ┌──────────────────────────────────────────┐
    │         ClusterTaskMap (RWMutex)         │
    │  ┌────────────────────────────────────┐  │
    │  │    TaskMap per cluster (Mutex)     │  │
    │  │  ┌──────────────────────────────┐  │  │
    │  │  │   map[taskId] → []Task       │  │  │
    │  │  └──────────────────────────────┘  │  │
    │  └────────────────────────────────────┘  │
    └──────────────────────────────────────────┘
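As a minimal illustration of the locking pattern in the diagram (simplified; the names and fields below are illustrative, not this PR's exact definitions), writer goroutines take the write lock while populating the map and HTTP-handler goroutines take the read lock:

package main

import (
	"fmt"
	"sync"
)

// Simplified sketch of the pattern above; not the PR's actual types.
type Task struct{ ID, State string }

type TaskMap struct {
	mu    sync.RWMutex
	tasks map[string][]Task // task ID -> attempts
}

func NewTaskMap() *TaskMap {
	return &TaskMap{tasks: make(map[string][]Task)}
}

// Add is called by event-processor goroutines (writers).
func (m *TaskMap) Add(t Task) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.tasks[t.ID] = append(m.tasks[t.ID], t)
}

// Get is called by HTTP-handler goroutines (readers).
func (m *TaskMap) Get(taskID string) []Task {
	m.mu.RLock()
	defer m.mu.RUnlock()
	attempts := make([]Task, len(m.tasks[taskID]))
	copy(attempts, m.tasks[taskID]) // return a copy so callers don't race on the slice
	return attempts
}

func main() {
	m := NewTaskMap()
	m.Add(Task{ID: "t1", State: "FINISHED"})
	fmt.Println(m.Get("t1"))
}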

@rueian merged commit a9a4ab0 into ray-project:master Jan 13, 2026
29 checks passed
github-project-automation bot moved this from "can be merged" to "Done" in @Future-Outlier's kuberay project Jan 13, 2026