-
Notifications
You must be signed in to change notification settings - Fork 688
[history server][collector] Fix getJobID for job event collection #4342
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
b0172bc
eeb7a7f
0ff5bcf
5aea305
5f547a2
0177160
420ce9a
1b1d004
8f87572
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -20,10 +20,18 @@ spec: | |
| affinity: | ||
| containers: | ||
| - env: | ||
| - name: RAY_enable_ray_event | ||
| value: "true" | ||
| - name: RAY_enable_core_worker_ray_event_to_aggregator | ||
| value: "1" | ||
| value: "true" | ||
| - name: RAY_DASHBOARD_AGGREGATOR_AGENT_EVENTS_EXPORT_ADDR | ||
| value: "http://localhost:8084/v1/events" | ||
| # in ray 2.52.0, we need to set RAY_DASHBOARD_AGGREGATOR_AGENT_EXPOSABLE_EVENT_TYPES | ||
| # in ray 2.53.0 (noy yet done). we need to set RAY_DASHBOARD_AGGREGATOR_AGENT_PUBLISHER_HTTP_ENDPOINT_EXPOSABLE_EVENT_TYPES | ||
| - name: RAY_DASHBOARD_AGGREGATOR_AGENT_EXPOSABLE_EVENT_TYPES | ||
| value: "TASK_DEFINITION_EVENT,TASK_LIFECYCLE_EVENT,ACTOR_TASK_DEFINITION_EVENT, | ||
| TASK_PROFILE_EVENT,DRIVER_JOB_DEFINITION_EVENT,DRIVER_JOB_LIFECYCLE_EVENT, | ||
| ACTOR_DEFINITION_EVENT,ACTOR_LIFECYCLE_EVENT,NODE_DEFINITION_EVENT,NODE_LIFECYCLE_EVENT" | ||
| image: rayproject/ray:2.52.0 | ||
| imagePullPolicy: IfNotPresent | ||
| command: | ||
|
|
@@ -105,10 +113,10 @@ spec: | |
| workerGroupSpecs: | ||
| - groupName: cpu | ||
| maxReplicas: 1000 | ||
| minReplicas: 0 | ||
| minReplicas: 1 | ||
| numOfHosts: 1 | ||
| rayStartParams: {} | ||
| replicas: 0 | ||
| replicas: 1 | ||
| template: | ||
| metadata: | ||
| labels: | ||
|
|
@@ -117,10 +125,18 @@ spec: | |
| imagePullSecrets: | ||
| containers: | ||
| - env: | ||
| - name: RAY_enable_ray_event | ||
| value: "true" | ||
| - name: RAY_enable_core_worker_ray_event_to_aggregator | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this should be "1", I dont think it accepts bool strings
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. it actually accepts bool strings There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. also note that even when this config is enabled, events will still go to GCS. to disable that we need to set RAY_enable_core_worker_task_event_to_gcs="0". but this will break cli and state API. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. oh TIL!
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. thank you, will keep in mind, curretly we might need to enable it until we can fully support removed the GCS, since there are some data not provided from base event. for example: |
||
| value: "1" | ||
| value: "true" | ||
| - name: RAY_DASHBOARD_AGGREGATOR_AGENT_EVENTS_EXPORT_ADDR | ||
| value: "http://localhost:8084/v1/events" | ||
| # in ray 2.52.0, we need to set RAY_DASHBOARD_AGGREGATOR_AGENT_EXPOSABLE_EVENT_TYPES | ||
| # in ray 2.53.0 (not yet done). we need to set RAY_DASHBOARD_AGGREGATOR_AGENT_PUBLISHER_HTTP_ENDPOINT_EXPOSABLE_EVENT_TYPES | ||
| - name: RAY_DASHBOARD_AGGREGATOR_AGENT_EXPOSABLE_EVENT_TYPES | ||
| value: "TASK_DEFINITION_EVENT,TASK_LIFECYCLE_EVENT,ACTOR_TASK_DEFINITION_EVENT, | ||
| TASK_PROFILE_EVENT,DRIVER_JOB_DEFINITION_EVENT,DRIVER_JOB_LIFECYCLE_EVENT, | ||
| ACTOR_DEFINITION_EVENT,ACTOR_LIFECYCLE_EVENT,NODE_DEFINITION_EVENT,NODE_LIFECYCLE_EVENT" | ||
| image: rayproject/ray:2.52.0 | ||
| command: | ||
| - 'echo "=========================================="; [ -d "/tmp/ray/session_latest" ] && dest="/tmp/ray/prev-logs/$(basename $(readlink /tmp/ray/session_latest))/$(cat /tmp/ray/raylet_node_id)" && echo "dst is $dest" && mkdir -p "$dest" && mv /tmp/ray/session_latest/logs "$dest/logs"; echo "========================================="' | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -45,6 +45,21 @@ type EventServer struct { | |
| mutex sync.Mutex | ||
| } | ||
|
|
||
| var eventTypesWithJobID = []string{ | ||
| // Job Events (Driver Job) | ||
| "driverJobDefinitionEvent", | ||
| "driverJobLifecycleEvent", | ||
|
|
||
| // Task Events (Normal Task) | ||
| "taskDefinitionEvent", | ||
| "taskLifecycleEvent", | ||
| "taskProfileEvents", | ||
|
|
||
| // Actor Events (Actor Task + Actor Definition) | ||
| "actorTaskDefinitionEvent", | ||
| "actorDefinitionEvent", | ||
| } | ||
|
|
||
| func NewEventServer(writer storage.StorageWriter, rootDir, sessionDir, nodeID, clusterName, clusterID, sessionName string) *EventServer { | ||
| server := &EventServer{ | ||
| events: make([]Event, 0), | ||
|
|
@@ -408,9 +423,14 @@ func (es *EventServer) isNodeEvent(eventData map[string]interface{}) bool { | |
|
|
||
| // getJobID gets jobID associated with event | ||
| func (es *EventServer) getJobID(eventData map[string]interface{}) string { | ||
| if jobID, hasJob := eventData["jobId"]; hasJob && jobID != "" { | ||
| return fmt.Sprintf("%v", jobID) | ||
|
Comment on lines
-411
to
-412
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ray event structure never contains Looking at actual Ray events, the structure is: {
"eventId": "...",
"eventType": "TASK_DEFINITION_EVENT",
"message": "",
"sessionName": "...",
"severity": "INFO",
"sourceType": "CORE_WORKER",
"timestamp": "...",
"taskDefinitionEvent": { // <-- Nested event object
"jobId": "AgAAAA==", // <-- jobId is ALWAYS here (nested)
"language": "PYTHON",
...
}
}The |
||
| for _, eventType := range eventTypesWithJobID { | ||
| if nestedEvent, ok := eventData[eventType].(map[string]interface{}); ok { | ||
| if jobID, hasJob := nestedEvent["jobId"]; hasJob && jobID != "" { | ||
| return fmt.Sprintf("%v", jobID) | ||
| } | ||
| } | ||
| } | ||
|
|
||
| return "" | ||
| } | ||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to enable all types of event, we should always enable it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is for enabling gcs level event
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should also be "1" instead of bool string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nevermind, was not aware of https://github.com/ray-project/kuberay/pull/4342/changes#r2663725854