[Autoscaler v1] AutoscalerSummary Active node check ignores raylet termination #52198
Comments
I am happy to take this one on. Since I am new to the community, is there anyone who can help me understand a little about the components and systems?
@bhks Thank you, but I’m currently working on this task. If you have any questions related to the structure of the work, feel free to email me and I’ll do my best to help with what I know. Thanks!
NP, I will just watch for now then. I'm trying to get involved with the Ray community.
@nadongjun On the problem side, are you thinking of using the timestamp or heartbeat time reported for that node/IP? Or are there other mechanisms that could be used?
@bhks Yes, exactly. Right now, the Autoscaler reads the list of non-terminated nodes from the Provider and just checks whether there’s a value in [...]. I’m thinking about two possible directions. One is to use LoadMetrics to determine whether a node has hit the idle or heartbeat timeout and should be removed. The other is to clear the last_heartbeat_time_by_ip entry when the worker with that IP gets terminated, so the state stays accurate. If you’re interested in the Autoscaler side, I’d also recommend checking out Autoscaler v2 / KubeRay; it’s actively being developed and worth following.
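The second direction mentioned above (clearing the heartbeat entry when a worker is terminated) could look roughly like the sketch below. `HeartbeatTracker` and `on_node_terminated` are hypothetical names used only for illustration and are not Ray's actual LoadMetrics API. Only `last_heartbeat_time_by_ip` comes from the thread, and the presence-only `is_active` check is an assumption that mirrors the behaviour described in the comment.

```python
import time
from typing import Dict, Optional


class HeartbeatTracker:
    """Hypothetical stand-in for the heartbeat bookkeeping discussed above."""

    def __init__(self) -> None:
        # Maps a node IP to the time its last heartbeat was recorded.
        self.last_heartbeat_time_by_ip: Dict[str, float] = {}

    def mark_active(self, ip: str, now: Optional[float] = None) -> None:
        # Record a heartbeat; `now` can be injected to make tests deterministic.
        self.last_heartbeat_time_by_ip[ip] = time.time() if now is None else now

    def is_active(self, ip: str) -> bool:
        # Presence-only check (assumed): a stale entry still counts as active.
        return ip in self.last_heartbeat_time_by_ip

    def on_node_terminated(self, ip: str) -> None:
        # Hypothetical hook: drop the entry as soon as the node is terminated.
        self.last_heartbeat_time_by_ip.pop(ip, None)


tracker = HeartbeatTracker()
tracker.mark_active("192.168.1.40")

# Without cleanup, the terminated node would keep being reported as active.
active_before_cleanup = tracker.is_active("192.168.1.40")

tracker.on_node_terminated("192.168.1.40")
active_after_cleanup = tracker.is_active("192.168.1.40")
```

Using `pop(ip, None)` keeps the hook safe to call more than once for the same IP, which matters if termination can be reported through several paths.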
Thank you @nadongjun for sharing that issue with me; I will follow up there.
cc @rueian, can you pick this up?
Sure, I will start working on this.
Hi @rueian,
def is_active(self, ip):
    last_heartbeat = self.last_heartbeat_time_by_ip.get(ip)
    if last_heartbeat is None:
        return False
    return (time.time() - last_heartbeat) < AUTOSCALER_HEARTBEAT_TIMEOUT_S
Hi @nadongjun, we just need to update LoadMetrics after we terminate those nodes, including idle nodes and dead nodes. See #52409.
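As a rough illustration of updating the metrics after a termination pass, the sketch below prunes a heartbeat map down to the IPs that survived. `prune_terminated` is a hypothetical helper written for this example and is not taken from #52409 or from Ray's code. The IP `192.168.1.10` is also made up for illustration.

```python
from typing import Dict, Iterable, List


def prune_terminated(
    last_heartbeat_time_by_ip: Dict[str, float],
    surviving_ips: Iterable[str],
) -> List[str]:
    """Remove entries whose IP is not among the surviving nodes.

    Mutates the mapping in place and returns the removed IPs, sorted.
    """
    keep = set(surviving_ips)
    removed = sorted(ip for ip in last_heartbeat_time_by_ip if ip not in keep)
    for ip in removed:
        del last_heartbeat_time_by_ip[ip]
    return removed


# Two tracked nodes; the autoscaler has just terminated 192.168.1.40.
heartbeats = {"192.168.1.10": 100.0, "192.168.1.40": 90.0}
removed = prune_terminated(heartbeats, surviving_ips=["192.168.1.10"])
```

Pruning against the surviving set handles idle and dead nodes in a single pass, since it does not matter why a node was terminated.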
What happened + What you expected to happen
Description
Versions / Dependencies
2.44.1
Reproduction script
Reproduction
Log
Even after Draining 1 raylet (ray.worker.gpu/192.168.1.40) was logged, the corresponding node’s IP remained in LoadMetrics, indicating that the node was still being tracked despite the raylet having been terminated.
Issue Severity
Low: It annoys or frustrates me.