Conversation

Member

@Future-Outlier Future-Outlier commented Jan 5, 2026

Summary:
When developing the history server, please:

  1. Use this branch to rebuild your collector image.
  2. Use this branch's YAML files for the RayCluster and RayJob.

Why are these changes needed?

Before

Job events were not collected because getJobID only checked top-level jobId.
Ray's export API returns job-related events with jobId nested inside event type objects.

After

Fix the getJobID function so that it extracts jobId from the nested event structures, which enables job event collection in the history server.

What did I do?

Changes

  1. code: Check jobId in nested event types: driverJobDefinitionEvent, driverJobLifecycleEvent, taskDefinitionEvent, taskLifecycleEvent, actorTaskDefinitionEvent, actorDefinitionEvent, taskProfileEvents [1]
  2. raycluster CR: Add RAY_enable_ray_event=true in the example YAML to enable all events [2]
  3. raycluster CR: Add RAY_DASHBOARD_AGGREGATOR_AGENT_EXPOSABLE_EVENT_TYPES in the example YAML to enable all events [2]
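
As a rough illustration of change 1, the lookup could be sketched as below. This is not the merged code: the function signature, the empty-string return when nothing is found, and the order of the keys are assumptions; only the nested key names come from the list above.

```go
package main

import "fmt"

// eventTypesWithJobID lists the nested keys named in change 1.
var eventTypesWithJobID = []string{
	"driverJobDefinitionEvent",
	"driverJobLifecycleEvent",
	"taskDefinitionEvent",
	"taskLifecycleEvent",
	"actorTaskDefinitionEvent",
	"actorDefinitionEvent",
	"taskProfileEvents",
}

// getJobID returns the jobId found under the first listed nested key.
// Returning "" when no jobId is found is an assumption of this sketch.
func getJobID(eventData map[string]interface{}) string {
	for _, key := range eventTypesWithJobID {
		nested, ok := eventData[key].(map[string]interface{})
		if !ok {
			continue // key absent, or not a JSON object
		}
		if jobID, hasJob := nested["jobId"]; hasJob && jobID != "" {
			return fmt.Sprintf("%v", jobID)
		}
	}
	return ""
}

func main() {
	event := map[string]interface{}{
		"eventType":           "TASK_DEFINITION_EVENT",
		"taskDefinitionEvent": map[string]interface{}{"jobId": "AgAAAA=="},
	}
	fmt.Println(getJobID(event))
}
```

A top-level jobId is deliberately ignored here, matching the observation that exported events only carry it inside the nested object.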

How I tested it

  1. Deploy history server collector with a RayCluster
  2. Submit a Ray job to the cluster (including task and actor)
  3. Verify job events are collected and stored in the storage backend (S3)
  4. Download all job_events and node_events files from S3 and verify that all 10 event types are present [3]
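
The check in step 4 could be scripted roughly as follows. This is a minimal sketch that assumes each downloaded file holds a JSON array of event objects with an `eventType` field; the actual on-disk layout of the collector output may differ, and `countEventTypes` is a hypothetical helper name.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// countEventTypes decodes a JSON array of events and counts them by eventType.
// The JSON-array layout is an assumption, not the confirmed file format.
func countEventTypes(raw []byte) (map[string]int, error) {
	var events []map[string]interface{}
	if err := json.Unmarshal(raw, &events); err != nil {
		return nil, err
	}
	counts := make(map[string]int)
	for _, ev := range events {
		if t, ok := ev["eventType"].(string); ok {
			counts[t]++
		}
	}
	return counts, nil
}

func main() {
	sample := []byte(`[
	  {"eventType": "TASK_DEFINITION_EVENT"},
	  {"eventType": "TASK_LIFECYCLE_EVENT"},
	  {"eventType": "TASK_DEFINITION_EVENT"}
	]`)
	counts, err := countEventTypes(sample)
	if err != nil {
		panic(err)
	}
	// Sort the type names so the report is stable across runs.
	types := make([]string, 0, len(counts))
	for t := range counts {
		types = append(types, t)
	}
	sort.Strings(types)
	for _, t := range types {
		fmt.Printf("%s: %d\n", t, counts[t])
	}
}
```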

References

[1] Ray event proto definitions:

[2]

  1. Ray config to enable export API: ray_config_def.h#L542

  2. 2026.01.06 dialogue from @sampan-s-nayak

RAY_DASHBOARD_AGGREGATOR_AGENT_EXPOSABLE_EVENT_TYPES is used in older releases (up to 2.52).
RAY_DASHBOARD_AGGREGATOR_AGENT_PUBLISHER_HTTP_ENDPOINT_EXPOSABLE_EVENT_TYPES will be used going forward (need to check whether this is part of 2.53).

[3]
those files are here:
session_2026-01-05_19-49-34_480793_1.zip

| # | eventType | nestedKey | Has jobId? | AQAAAA== | AgAAAA== | node_events | Total |
|---|---|---|---|---|---|---|---|
| 1 | DRIVER_JOB_DEFINITION_EVENT | driverJobDefinitionEvent | ✅ Yes | 0 | 1 | 0 | 1 |
| 2 | DRIVER_JOB_LIFECYCLE_EVENT | driverJobLifecycleEvent | ✅ Yes | 0 | 2 | 0 | 2 |
| 3 | TASK_DEFINITION_EVENT | taskDefinitionEvent | ✅ Yes | 1 | 3 | 0 | 4 |
| 4 | TASK_LIFECYCLE_EVENT | taskLifecycleEvent | ⚠️ Sometimes empty | 0 | 4 | 2 | 6 |
| 5 | TASK_PROFILE_EVENT | taskProfileEvents | ✅ Yes | 3 | 1 | 0 | 4 |
| 6 | ACTOR_DEFINITION_EVENT | actorDefinitionEvent | ✅ Yes | 1 | 1 | 0 | 2 |
| 7 | ACTOR_TASK_DEFINITION_EVENT | actorTaskDefinitionEvent | ✅ Yes | 0 | 2 | 0 | 2 |
| 8 | NODE_DEFINITION_EVENT | nodeDefinitionEvent | ❌ No | 0 | 0 | 2 | 2 |
| 9 | NODE_LIFECYCLE_EVENT | nodeLifecycleEvent | ❌ No | 0 | 0 | 3 | 3 |
| 10 | ACTOR_LIFECYCLE_EVENT | actorLifecycleEvent | ❌ No | 0 | 0 | 4 | 4 |
| Total | | | | 5 | 14 | 11 | 30 |

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@Future-Outlier Future-Outlier added the P0 Critical issue that should be fixed ASAP label Jan 5, 2026
Signed-off-by: Future-Outlier <[email protected]>
Co-authored-by: Jia-Wei Jiang <[email protected]>
Member Author

@Future-Outlier Future-Outlier left a comment

This is super urgent: we need to merge this ASAP to unblock others who are developing the history server. It is also related to @chiayi's event processor.

cc @rueian @andrewsykim plz merge, thank you!

Signed-off-by: Future-Outlier <[email protected]>
Comment on lines -411 to -412
if jobID, hasJob := eventData["jobId"]; hasJob && jobID != "" {
return fmt.Sprintf("%v", jobID)
Member Author

Ray event structure never contains jobId at the top level of the event data.

Looking at actual Ray events, the structure is:

{
  "eventId": "...",
  "eventType": "TASK_DEFINITION_EVENT",
  "message": "",
  "sessionName": "...",
  "severity": "INFO",
  "sourceType": "CORE_WORKER",
  "timestamp": "...",
  "taskDefinitionEvent": {       // <-- Nested event object
    "jobId": "AgAAAA==",         // <-- jobId is ALWAYS here (nested)
    "language": "PYTHON",
    ...
  }
}

The jobId field is always nested inside the specific event type object (e.g., taskDefinitionEvent, actorDefinitionEvent, driverJobDefinitionEvent), never at the top level of eventData.
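
The jobId values seen here (AgAAAA==, AQAAAA==) look like base64-encoded bytes, which is how protobuf's JSON mapping renders `bytes` fields. Assuming that is the case for jobId, the sketch below decodes such a value into a hex string for easier reading; `jobIDHex` is a hypothetical helper, not part of the collector.

```go
package main

import (
	"encoding/base64"
	"encoding/hex"
	"fmt"
)

// jobIDHex decodes a base64 jobId string and returns its bytes as hex.
// It assumes the exported jobId uses standard base64 with padding.
func jobIDHex(encoded string) (string, error) {
	raw, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return "", err
	}
	return hex.EncodeToString(raw), nil
}

func main() {
	for _, id := range []string{"AQAAAA==", "AgAAAA=="} {
		h, err := jobIDHex(id)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s -> %s\n", id, h)
	}
	// Prints:
	// AQAAAA== -> 01000000
	// AgAAAA== -> 02000000
}
```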

Comment on lines +23 to +24
- name: RAY_enable_ray_event
value: "true"
Member Author

To enable all event types, we should always set this.

Member Author

This enables GCS-level events.

I think this should also be "1" instead of a bool string.

}

for _, eventType := range eventTypesWithJobID {
if nestedEvent, ok := eventData[eventType].(map[string]interface{}); ok {
Collaborator

Does the eventData come from the oneof protobuf? Can we just iterate it without using eventTypesWithJobID?

for _, nestedEvent := range eventData {
    ....
}

Member Author

@Future-Outlier Future-Outlier Jan 5, 2026

  1. it doesn't come from the oneof protobuf
    https://github.com/ray-project/ray/blob/188d08743fff3baaf7e1baf076dc41e273f8635b/src/ray/protobuf/public/events_base_event.proto#L101
  2. I updated the code to
	for _, value := range eventData {
		if nestedEvent, ok := value.(map[string]interface{}); ok {
			if jobID, hasJob := nestedEvent["jobId"]; hasJob && jobID != "" {
				return fmt.Sprintf("%v", jobID)
			}
		}
	}

since this can support other cases, for example if Ray's proto changes in the future.

Collaborator

Hmm, if it doesn't come from the oneof, then this approach is probably not a good idea. We'd better define eventTypesWithJobID globally.
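
To make the trade-off in this thread concrete, here is a small sketch comparing the two lookups. Both function names and the `someFutureEvent` key are made up for illustration. The generic scan accepts a jobId from any nested object, including ones nobody reviewed, and Go map iteration order is unspecified, so two nested objects with different jobIds could yield different results between runs. The allowlist version only reads the keys it is given.

```go
package main

import "fmt"

// jobIDFromAnyNested scans every nested object, whatever its key.
func jobIDFromAnyNested(eventData map[string]interface{}) string {
	for _, value := range eventData {
		if nested, ok := value.(map[string]interface{}); ok {
			if jobID, hasJob := nested["jobId"]; hasJob && jobID != "" {
				return fmt.Sprintf("%v", jobID)
			}
		}
	}
	return ""
}

// jobIDFromAllowlist only looks under the keys it is given, in order.
func jobIDFromAllowlist(eventData map[string]interface{}, keys []string) string {
	for _, key := range keys {
		if nested, ok := eventData[key].(map[string]interface{}); ok {
			if jobID, hasJob := nested["jobId"]; hasJob && jobID != "" {
				return fmt.Sprintf("%v", jobID)
			}
		}
	}
	return ""
}

func main() {
	// "someFutureEvent" is a hypothetical key that is not in the allowlist.
	event := map[string]interface{}{
		"eventType":       "SOME_FUTURE_EVENT",
		"someFutureEvent": map[string]interface{}{"jobId": "AQAAAA=="},
	}
	allow := []string{"taskDefinitionEvent", "driverJobDefinitionEvent"}

	fmt.Println(jobIDFromAnyNested(event) == "AQAAAA==") // true: picked up from an unlisted key
	fmt.Println(jobIDFromAllowlist(event, allow) == "")  // true: unlisted key is ignored
}
```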

Member Author

ah ok, will revert it later.

Member Author

@Future-Outlier Future-Outlier Jan 5, 2026

fixed here
0177160
420ce9a

Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Comment on lines 29 to 32
- name: RAY_DASHBOARD_AGGREGATOR_AGENT_PUBLISHER_HTTP_ENDPOINT_EXPOSABLE_EVENT_TYPES
value: "TASK_DEFINITION_EVENT,TASK_LIFECYCLE_EVENT,ACTOR_TASK_DEFINITION_EVENT,
TASK_PROFILE_EVENT,DRIVER_JOB_DEFINITION_EVENT,DRIVER_JOB_LIFECYCLE_EVENT,
ACTOR_DEFINITION_EVENT,ACTOR_LIFECYCLE_EVENT,NODE_DEFINITION_EVENT,NODE_LIFECYCLE_EVENT"
Member Author

I will sync 1:1 with @sampan-s-nayak to figure out:

  1. why I can't get task profile events now
  2. why task profile events are disabled by default

Member Author

@Future-Outlier Future-Outlier Jan 6, 2026

  1. this is fixed 8f87572

  2. I updated the PR description above with proof.

cc @rueian to merge, thank you!

Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
@Future-Outlier Future-Outlier requested a review from rueian January 6, 2026 04:02
- env:
- name: RAY_enable_ray_event
value: "true"
- name: RAY_enable_core_worker_ray_event_to_aggregator

This should be "1"; I don't think it accepts bool strings.

Also note that even when this config is enabled, events will still go to GCS. To disable that, we need to set RAY_enable_core_worker_task_event_to_gcs="0", but this will break the CLI and the state API.

oh TIL!

Member Author

@Future-Outlier Future-Outlier Jan 6, 2026

Thank you, will keep that in mind. Currently we might need to keep it enabled until we can fully support removing the GCS path, since some data is not provided by the base events.

for example:

/api/data/datasets/{job_id}
/api/serve/applications/

@sampan-s-nayak sampan-s-nayak left a comment

LGTM from ray One-Event perspective

@rueian rueian merged commit 4540f18 into ray-project:master Jan 6, 2026
28 checks passed
@github-project-automation github-project-automation bot moved this from can be merged to Done in @Future-Outlier's kuberay project Jan 6, 2026
