[history server] Web Server + Event Processor #4329

Conversation
Future-Outlier left a comment

cc @chiayi @KunWuLuan to help review, thank you!
const (
	NIL                                        TaskStatus = "NIL"
	PENDING_ARGS_AVAIL                         TaskStatus = "PENDING_ARGS_AVAIL"
	PENDING_NODE_ASSIGNMENT                    TaskStatus = "PENDING_NODE_ASSIGNMENT"
	PENDING_OBJ_STORE_MEM_AVAIL                TaskStatus = "PENDING_OBJ_STORE_MEM_AVAIL"
	PENDING_ARGS_FETCH                         TaskStatus = "PENDING_ARGS_FETCH"
	SUBMITTED_TO_WORKER                        TaskStatus = "SUBMITTED_TO_WORKER"
	PENDING_ACTOR_TASK_ARGS_FETCH              TaskStatus = "PENDING_ACTOR_TASK_ARGS_FETCH"
	PENDING_ACTOR_TASK_ORDERING_OR_CONCURRENCY TaskStatus = "PENDING_ACTOR_TASK_ORDERING_OR_CONCURRENCY"
	RUNNING                                    TaskStatus = "RUNNING"
	RUNNING_IN_RAY_GET                         TaskStatus = "RUNNING_IN_RAY_GET"
	RUNNING_IN_RAY_WAIT                        TaskStatus = "RUNNING_IN_RAY_WAIT"
	FINISHED                                   TaskStatus = "FINISHED"
	FAILED                                     TaskStatus = "FAILED"
)
LGTM! Just a question about something you mentioned:

How does the event processor handle data consistency when multiple replicas are deployed? Currently each pod runs its own historyserver with in-memory state. Won't this cause inconsistent responses depending on which pod handles the request?
todo:

Yes, it will. This will be solved in the beta version.
I see. Thanks for the tips!
cc @chiayi @KunWuLuan to do a final pass, thank you!
cursor review
cursor review
✅ Bugbot reviewed your changes and found no bugs!
chiayi left a comment

LGTM!
LGTM /approve
	Mu sync.RWMutex
}

func (c *ClusterTaskMap) RLock() {
Why do we need these funcs?
Go maps are not thread-safe: concurrent reads and writes cause undefined behavior, so we guard the maps with locks (see the sketch after the diagram below).
https://go.dev/blog/maps#concurrency
┌─────────────────────┐ ┌─────────────────────┐
│ Event Processor │ │ HTTP Handler │
│ (goroutine 1..N) │ │ (goroutine 1..M) │
└──────────┬──────────┘ └──────────┬──────────┘
│ WRITE │ READ
▼ ▼
┌──────────────────────────────────────────┐
│ ClusterTaskMap (RWMutex) │
│ ┌────────────────────────────────────┐ │
│ │ TaskMap per cluster (Mutex) │ │
│ │ ┌──────────────────────────────┐ │ │
│ │ │ map[taskId] → []Task │ │ │
│ │ └──────────────────────────────┘ │ │
│ └────────────────────────────────────┘ │
└──────────────────────────────────────────┘
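A minimal sketch of this locking pattern, assuming simplified Task and ClusterTaskMap types; field names here are illustrative and not necessarily the PR's exact structs:

package main

import "sync"

// Task is a simplified stand-in for the per-attempt task record.
type Task struct {
	TaskID string
	Status string
}

// ClusterTaskMap guards a task map with an RWMutex so event-processor
// goroutines can write while HTTP-handler goroutines read concurrently.
type ClusterTaskMap struct {
	Mu    sync.RWMutex
	Tasks map[string][]Task // taskId -> task attempts
}

// AddTask is the writer path (event processor).
func (c *ClusterTaskMap) AddTask(t Task) {
	c.Mu.Lock()
	defer c.Mu.Unlock()
	c.Tasks[t.TaskID] = append(c.Tasks[t.TaskID], t)
}

// GetTasks is the reader path (HTTP handler). It returns a copy so callers
// never touch the map's backing slice without holding the lock.
func (c *ClusterTaskMap) GetTasks(taskID string) []Task {
	c.Mu.RLock()
	defer c.Mu.RUnlock()
	out := make([]Task, len(c.Tasks[taskID]))
	copy(out, c.Tasks[taskID])
	return out
}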
Co-authored-by: @chiayi [email protected]
Co-authored-by: @KunWuLuan [email protected]
Why are these changes needed?
This web server serves the history server's frontend and fetches data from the event server (processor).
As a follow-up, we should enable autoscaling for the web server using Kubernetes HPA.
https://kubernetes.io/docs/concepts/workloads/autoscaling/horizontal-pod-autoscale/
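As a rough illustration of this split, here is a minimal sketch of a web server that serves frontend assets and forwards API calls to the event processor. The address, port, and ./frontend directory are assumptions for illustration, not the PR's actual configuration:

package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Assumed event-processor address; the real deployment wires this up differently.
	eventServerURL, err := url.Parse("http://historyserver-event-processor:8080")
	if err != nil {
		log.Fatal(err)
	}

	mux := http.NewServeMux()

	// Forward data requests to the event processor.
	mux.Handle("/api/", httputil.NewSingleHostReverseProxy(eventServerURL))

	// Serve the history server's frontend assets (directory name is an assumption).
	mux.Handle("/", http.FileServer(http.Dir("./frontend")))

	log.Fatal(http.ListenAndServe(":8081", mux))
}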
Note: I combined code from branch #4187 and branch #4253, then fixed a number of bugs.
Architecture
How to test and develop in your local env
response
Related issue number
#3966
#4374
HistoryServer Alpha Milestone Gap Analysis
Summary
API Endpoints (Terminated Clusters)
/clusters
/nodes
/nodes/{node_id}
/events
/api/cluster_status
/api/grafana_health
/api/prometheus_health
/api/data/datasets/{job_id}
/api/serve/applications/
/api/v0/placement_groups/
/api/v0/tasks
/api/v0/tasks/summarize
/api/v0/logs
/api/v0/logs/file
/logical/actors
/logical/actors/{actor_id}
/api/jobs
/api/jobs/{job_id}

Remaining Work (Priority)
/api/jobs, /api/jobs/{job_id}
/events endpoint
/nodes/{node_id}
/api/v0/logs/file
/api/cluster_status
/api/grafana_health, /api/prometheus_health
/api/serve/applications/, /api/v0/placement_groups/

others:
Overall Progress: ~75%