[Umbrella] Autoscaler improvements #2600

Open
26 of 48 tasks
kevin85421 opened this issue Dec 4, 2024 · 13 comments

@kevin85421
Member

kevin85421 commented Dec 4, 2024

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

TODO:

  • Define the scope of Autoscaler V2 beta
  • List the important issues that need to be solved (V1 & V2).

This umbrella issue covers two topics:

  • Autoscaler V2 towards beta
  • Autoscaler stability improvements (V1 + V2)

Reliability

Top priority:

  • Autoscaler should not terminate worker Pods that have running Actors or Tasks
  • Autoscaler should not crash because of CR spec
  • A Job should be able to finish.

Usability

Testing

Observability / Debuggability

Refactor

Backlogs

Use case

No response

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@kevin85421 kevin85421 added the enhancement (New feature or request) and triage labels Dec 4, 2024
@kevin85421 kevin85421 changed the title [Umbrella] Autoscaler V2 towards beta [WIP][Umbrella] Autoscaler V2 towards beta Dec 4, 2024
@kevin85421 kevin85421 changed the title [WIP][Umbrella] Autoscaler V2 towards beta [WIP][Umbrella] Autoscaler improvements Dec 4, 2024
@kevin85421 kevin85421 self-assigned this Dec 4, 2024
@kevin85421
Member Author

/assign @ryanaoleary

@ryanaoleary
Contributor

TODO: I'll leave a comment outlining the V2 beta scope and remaining issues to solve for V1 & V2.

@kevin85421
Member Author

@ryanaoleary thanks! You can compile a list of issues, and we can schedule a meeting to go through them one by one.

@ryanaoleary
Contributor

ryanaoleary commented Dec 10, 2024

The issues that I think we ought to complete before considering Autoscaler v2 in Beta can be broken down into observability improvements and reliability bug-fixes.

Observability:

Reliability:

Several of the completed issues mentioned in the issue description fix the main reliability issues found within the v1 Autoscaler, and from my manual testing with multiple CPU and GPU worker-groups I've seen consistent behavior. Additionally, new features in the v2 autoscaler, such as configuring idle node timeouts by node type, will give users more fine-grained control over their workloads and reduce the number of autoscaling errors we were previously seeing. We should also consider it a requirement to ensure reliable testing for CPUs, GPUs, and custom accelerators before considering v2 beta.
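
For readers unfamiliar with the idea, here is a minimal sketch of what a per-node-type idle timeout combined with the "do not terminate busy workers" rule could look like. It is an illustration only, not the Ray autoscaler's code: the `Node` class, its fields, the default timeout, and `scale_down_candidates` are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical view of a worker node; not a Ray or KubeRay type."""
    node_id: str
    node_type: str        # e.g. "cpu" or "gpu" (assumed worker-group name)
    idle_seconds: float   # how long the node has had nothing to run
    running_work: int     # count of running tasks and actors (assumed field)


# Assumed fallback for node types without their own timeout.
DEFAULT_IDLE_TIMEOUT_S = 60.0


def scale_down_candidates(
    nodes: List[Node],
    idle_timeout_by_type: Dict[str, float],
    default_timeout: float = DEFAULT_IDLE_TIMEOUT_S,
) -> List[str]:
    """Return IDs of nodes that are safe to terminate under this sketch."""
    candidates = []
    for node in nodes:
        # Never pick a node that still has running tasks or actors.
        if node.running_work > 0:
            continue
        # Look up the timeout for this node type, falling back to the default.
        timeout = idle_timeout_by_type.get(node.node_type, default_timeout)
        if node.idle_seconds >= timeout:
            candidates.append(node.node_id)
    return candidates


nodes = [
    Node("cpu-1", "cpu", idle_seconds=120, running_work=0),
    Node("gpu-1", "gpu", idle_seconds=120, running_work=0),
    Node("cpu-2", "cpu", idle_seconds=500, running_work=1),
]
print(scale_down_candidates(nodes, {"cpu": 60, "gpu": 300}))
```

In this sketch the GPU node is kept because its type has a longer timeout, and `cpu-2` is kept because it still has running work, regardless of how long it has been counted as idle.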

@kevin85421
Member Author

@ryanaoleary
Contributor

Tracking issue for the e2e upgrade tests: #2561

@rueian
Contributor

rueian commented Mar 13, 2025

Adding one more issue for autoscaler v2: ray-project/ray#51321.

@bhks

bhks commented Apr 11, 2025

Hey folks @kevin85421, @rueian, and @ryanaoleary, great work here. I know I am late to the party, but I am very interested in joining the team to help. Let me know if there is any task I can pick up. I have experience with k8s and observability.

Thanks @nadongjun for pointing me to this.

If you have any document, RFC, or design doc I can read, that would be great.

@kevin85421
Member Author

@bhks Thank you for reaching out! You can check the user guide and the design doc for more details. PRs are welcome. I suggest starting with Autoscaler V2 on KubeRay first. This is a high priority for me at the moment, so related PRs will be reviewed faster. In addition, starting with small PRs makes it easier for them to be merged.

@bhks

bhks commented Apr 11, 2025

Thank you @kevin85421. Do you have any task in mind that I can start with?

@rueian
Contributor

rueian commented Apr 16, 2025

Adding a new one: ray-project/ray#52361. I will open a PR soon.


4 participants