[Umbrella] Autoscaler improvements #2600
Comments
/assign @ryanaoleary
TODO: I'll leave a comment outlining the V2 beta scope and the remaining issues to solve for V1 & V2.
@ryanaoleary thanks! You can compile a list of issues, and we can schedule a meeting to go through them one by one.
The issues that I think we ought to complete before considering Autoscaler v2 in beta can be broken down into observability improvements and reliability bug fixes. Observability:
Reliability:
Several of the completed issues mentioned in the issue description fix the main reliability problems found in the v1 autoscaler, and in my manual testing with multiple CPU and GPU worker groups I've seen consistent behavior. Additionally, new features in the v2 autoscaler, such as configuring idle node timeouts by node type (see the sketch below), give users more fine-grained control over their workloads and reduce the autoscaling errors we were previously seeing. We should also treat reliable testing for CPUs, GPUs, and custom accelerators as a requirement before considering v2 beta.
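For concreteness, here is a minimal sketch of the per-group idle timeout in a RayCluster spec. The `idleTimeoutSeconds` field is the autoscaler v2 feature I'm referring to; the group name and replica counts are illustrative, so verify the field against the CRD of the KubeRay version you run:

```yaml
# Sketch only, not an official example: a workerGroupSpecs fragment using
# the autoscaler v2 per-group idle timeout. Group name and replica counts
# are illustrative.
workerGroupSpecs:
  - groupName: gpu-workers
    minReplicas: 0
    maxReplicas: 4
    # Scale idle nodes of this group down after 5 minutes,
    # independent of the cluster-wide default.
    idleTimeoutSeconds: 300
```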
Tracking issue for the e2e upgrade tests: #2561
Add one more issue for autoscaler v2: ray-project/ray#51321.
Hey folks @kevin85421, @rueian, and @ryanaoleary, great work here. I know I'm late to the party, but I'm very interested in joining the team to support this effort. Let me know if there's any task I can pick up; I have experience in k8s and observability. Thanks @nadongjun for pointing me to this. If you have any document/RFC/design I can read, that would be great.
@bhks Thank you for reaching out! You can check the user guide and the design doc for more details. PRs are welcome. I suggest starting with Autoscaler V2 on KubeRay first. This is a high priority for me at the moment, so related PRs will be reviewed faster. In addition, starting with small PRs makes it easier for them to be merged. |
Thank you @kevin85421. Do you have any task in mind that I can start with?
Add a new one: ray-project/ray#52361. I will open a PR soon.
Search before asking
Description
TODO:
This umbrella issue covers two topics:
Reliability
Top priority:
Usability
`min_nodes` during interactive sessions ray#47248
Testing
Observability / Debuggability
[core][autoscaler] Health check logs are not visible in the autoscaler container's stdout ray#48905
De-noise autoscaler logs. Currently, the autoscaler loops with `Fetched pod data`, outputting the state of the RayCluster even when the requested or allocated resources haven't changed, which makes autoscaler logs fairly difficult to debug. It would be useful to provide an option to emit these logs only on autoscaler updates. (issue to be created; see the sketch after this list)
[core][autoscaler] Better observability for request resources ray#37959
[autoscaler] Refactor ray status output code ray#37856 (@ryanaoleary)
[core][autoscaler] Add Pod names to the output of `ray status -v` ray#51192
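To make the de-noising item above concrete, here is a minimal sketch, assuming nothing about the actual autoscaler internals: it hashes the fetched cluster snapshot and emits the `Fetched pod data` line only when the snapshot differs from the previous monitor-loop iteration. The function and variable names are hypothetical.

```python
# Sketch of the proposed log de-noising, not the real autoscaler code:
# emit the "Fetched pod data" line only when the snapshot has changed.
import hashlib
import json
import logging

logger = logging.getLogger("autoscaler")

_last_digest = None  # digest of the previously logged snapshot

def log_pod_data_if_changed(snapshot: dict) -> None:
    """Log `snapshot` only when its content differs from the last call."""
    global _last_digest
    digest = hashlib.sha256(
        json.dumps(snapshot, sort_keys=True).encode()
    ).hexdigest()
    if digest == _last_digest:
        return  # state unchanged since the last loop; stay quiet
    _last_digest = digest
    logger.info("Fetched pod data: %s", snapshot)
```

A real implementation would probably also keep a low-frequency heartbeat log so operators can confirm the loop is still alive even when nothing changes.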
Refactor
`ray.io/group` label ray#48840
`provider_exists` and `allow_multiple` ray#49812
Backlogs
`install_ray` is only used when `disable_node_updaters` is false, so KubeRay doesn't use this.
Use case
No response
Related issues
No response
Are you willing to submit a PR?