-
Notifications
You must be signed in to change notification settings - Fork 687
[history server][collector] Fix getJobID for job event collection #4342
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[history server][collector] Fix getJobID for job event collection #4342
Conversation
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]> Co-authored-by: Jia-Wei Jiang <[email protected]>
Future-Outlier
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is super urgent, we need to merge this ASAP to unblock others to develop history server, and this is related to @chiayi 's event processor.
cc @rueian @andrewsykim plz merge, thank you!
Signed-off-by: Future-Outlier <[email protected]>
| if jobID, hasJob := eventData["jobId"]; hasJob && jobID != "" { | ||
| return fmt.Sprintf("%v", jobID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ray event structure never contains jobId at the top level of the event data.
Looking at actual Ray events, the structure is:
{
"eventId": "...",
"eventType": "TASK_DEFINITION_EVENT",
"message": "",
"sessionName": "...",
"severity": "INFO",
"sourceType": "CORE_WORKER",
"timestamp": "...",
"taskDefinitionEvent": { // <-- Nested event object
"jobId": "AgAAAA==", // <-- jobId is ALWAYS here (nested)
"language": "PYTHON",
...
}
}The jobId field is always nested inside the specific event type object (e.g., taskDefinitionEvent, actorDefinitionEvent, driverJobDefinitionEvent), never at the top level of eventData.
| - name: RAY_enable_ray_event | ||
| value: "true" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to enable all types of event, we should always enable it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is for enabling gcs level event
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should also be "1" instead of bool string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nevermind, was not aware of https://github.com/ray-project/kuberay/pull/4342/changes#r2663725854
| } | ||
|
|
||
| for _, eventType := range eventTypesWithJobID { | ||
| if nestedEvent, ok := eventData[eventType].(map[string]interface{}); ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the eventData come from the oneof protobuf? Can we just iterate it without using eventTypesWithJobID?
for _, nestedEvent := range eventData {
....
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- it doesn't come from the oneof protobuf
https://github.com/ray-project/ray/blob/188d08743fff3baaf7e1baf076dc41e273f8635b/src/ray/protobuf/public/events_base_event.proto#L101 - I updated the code to
for _, value := range eventData {
if nestedEvent, ok := value.(map[string]interface{}); ok {
if jobID, hasJob := nestedEvent["jobId"]; hasJob && jobID != "" {
return fmt.Sprintf("%v", jobID)
}
}
}since this can support other case, for example in the future ray's proto change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, if it doesn't come from the oneof, then using this approach is probably not a good idea. we better define eventTypesWithJobID globally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah ok, will revert it later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
historyserver/config/raycluster.yaml
Outdated
| - name: RAY_DASHBOARD_AGGREGATOR_AGENT_PUBLISHER_HTTP_ENDPOINT_EXPOSABLE_EVENT_TYPES | ||
| value: "TASK_DEFINITION_EVENT,TASK_LIFECYCLE_EVENT,ACTOR_TASK_DEFINITION_EVENT, | ||
| TASK_PROFILE_EVENT,DRIVER_JOB_DEFINITION_EVENT,DRIVER_JOB_LIFECYCLE_EVENT, | ||
| ACTOR_DEFINITION_EVENT,ACTOR_LIFECYCLE_EVENT,NODE_DEFINITION_EVENT,NODE_LIFECYCLE_EVENT" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will 1 on 1 with @sampan-s-nayak to figure out
- why I can't get task profile event now
- why default disable the task profile event
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
| - env: | ||
| - name: RAY_enable_ray_event | ||
| value: "true" | ||
| - name: RAY_enable_core_worker_ray_event_to_aggregator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be "1", I dont think it accepts bool strings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it actually accepts bool strings
https://github.com/ray-project/ray/blob/9d0ad57ceee03fd133fafdd0c52199085027e9a2/src/ray/common/ray_config.h#L45-L48
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also note that even when this config is enabled, events will still go to GCS. to disable that we need to set RAY_enable_core_worker_task_event_to_gcs="0". but this will break cli and state API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh TIL!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you, will keep in mind, curretly we might need to enable it until we can fully support removed the GCS, since there are some data not provided from base event.
for example:
/api/data/datasets/{job_id}
/api/serve/applications/
sampan-s-nayak
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM from ray One-Event perspective
Summary:
when developing historyserver, please
Why are these changes needed?
Before
Job events were not collected because
getJobIDonly checked top-leveljobId.Ray's export API returns job-related events with
jobIdnested inside event type objects.After
Fix
getJobIDfunction to correctly extractjobIdfrom nested event structures, enabling job event collection in history server.What did I do?
Changes
jobIdin nested event types:driverJobDefinitionEvent,driverJobLifecycleEvent,taskDefinitionEvent,taskLifecycleEvent,actorTaskDefinitionEvent,actorDefinitionEvent,taskProfileEvents[1]RAY_enable_ray_event=truein the example YAML to enable all events [2]RAY_DASHBOARD_AGGREGATOR_AGENT_EXPOSABLE_EVENT_TYPESin the example YAML to enable all events [2]How I tested it
job_eventsandnode_eventsfiles from S3 and verify all 9 event types are present [3]References
[1] Ray event proto definitions:
[2]
Ray config to enable export API: ray_config_def.h#L542
2026.01.06 dialogue from @sampan-s-nayak
[3]
those files are here:
session_2026-01-05_19-49-34_480793_1.zip
driverJobDefinitionEventdriverJobLifecycleEventtaskDefinitionEventtaskLifecycleEventtaskProfileEventsactorDefinitionEventactorTaskDefinitionEventnodeDefinitionEventnodeLifecycleEventactorLifecycleEventRelated issue number
Checks