[Umbrella] Autoscaler improvements #2600
Comments
/assign @ryanaoleary
TODO: I'll leave a comment outlining the V2 beta scope and the remaining issues to solve for V1 & V2.
@ryanaoleary thanks! You can compile a list of issues, and we can schedule a meeting to go through them one by one.
The issues that I think we ought to complete before considering Autoscaler v2 in beta can be broken down into observability improvements and reliability bug fixes. Observability:
Reliability:
Several of the completed issues mentioned in the issue description fix the main reliability problems found in the v1 autoscaler, and in my manual testing with multiple CPU and GPU worker groups I've seen consistent behavior. Additionally, new features in the v2 autoscaler, such as configuring idle node timeouts by node type (see the sketch below), give users more fine-grained control over their workloads and reduce the autoscaling errors we were previously seeing. We should also treat reliable testing for CPUs, GPUs, and custom accelerators as a requirement before considering v2 beta.
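For concreteness, here is a minimal sketch of the per-group idle timeout in a RayCluster spec. The `idleTimeoutSeconds` field is the autoscaler v2 feature I'm referring to; the group name and replica counts are illustrative, so verify the field against the CRD of the KubeRay version you run:

```yaml
# Sketch only, not an official example: a workerGroupSpecs fragment using
# the autoscaler v2 per-group idle timeout. Group name and replica counts
# are illustrative.
workerGroupSpecs:
  - groupName: gpu-workers
    minReplicas: 0
    maxReplicas: 4
    # Scale idle nodes of this group down after 5 minutes,
    # independent of the cluster-wide default.
    idleTimeoutSeconds: 300
```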
Tracking issue for the e2e upgrade tests: #2561
Add one more issue for autoscaler v2: ray-project/ray#51321.
Hey folks @kevin85421, @rueian, and @ryanaoleary, great work here. I know I'm late to the party, but I'm very interested in joining the team to support this effort. Let me know if there's any task I can pick up; I have experience in k8s and observability. Thanks @nadongjun for pointing me to this. If you have any document/RFC/design I can read, that would be great.
@bhks Thank you for reaching out! You can check the user guide and the design doc for more details. PRs are welcome. I suggest starting with Autoscaler V2 on KubeRay first. This is a high priority for me at the moment, so related PRs will be reviewed faster. In addition, starting with small PRs makes it easier for them to be merged. |
Thank you @kevin85421. Do you have any task in mind that I can start with?
Add a new one: ray-project/ray#52361. I will open a PR soon.
Search before asking
Description
TODO:
This umbrella issue covers two topics:
Reliability
Top priority:
Usability
`min_nodes` during interactive sessions ray#47248
Testing
Observability / Debuggability
[core][autoscaler] Health check logs are not visible in the autoscaler container's stdout ray#48905
De-noise autoscaler logs. Currently, the autoscaler loops with `Fetched pod data`, outputting the state of the RayCluster even when the requested or allocated resources haven't changed, which makes autoscaler logs fairly difficult to debug. It would be useful to provide an option to emit these logs only on autoscaler updates. (issue to be created; see the sketch after this list)
[core][autoscaler] Better observability for request resources ray#37959
[autoscaler] Refactor ray status output code ray#37856 (@ryanaoleary)
[core][autoscaler] Add Pod names to the output of `ray status -v` ray#51192
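To make the de-noising item above concrete, here is a minimal sketch, assuming nothing about the actual autoscaler internals: it hashes the fetched cluster snapshot and emits the `Fetched pod data` line only when the snapshot differs from the previous monitor-loop iteration. The function and variable names are hypothetical.

```python
# Sketch of the proposed log de-noising, not the real autoscaler code:
# emit the "Fetched pod data" line only when the snapshot has changed.
import hashlib
import json
import logging

logger = logging.getLogger("autoscaler")

_last_digest = None  # digest of the previously logged snapshot

def log_pod_data_if_changed(snapshot: dict) -> None:
    """Log `snapshot` only when its content differs from the last call."""
    global _last_digest
    digest = hashlib.sha256(
        json.dumps(snapshot, sort_keys=True).encode()
    ).hexdigest()
    if digest == _last_digest:
        return  # state unchanged since the last loop; stay quiet
    _last_digest = digest
    logger.info("Fetched pod data: %s", snapshot)
```

A real implementation would probably also keep a low-frequency heartbeat log so operators can confirm the loop is still alive even when nothing changes.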
Refactor
`ray.io/group` label ray#48840
`provider_exists` and `allow_multiple` ray#49812
Backlogs
`install_ray` is only used when `disable_node_updaters` is false, so KubeRay doesn't use this.
Use case
No response
Related issues
No response
Are you willing to submit a PR?