[Autoscaler v1] AutoscalerSummary Active node check ignores raylet termination #52198

Closed
nadongjun opened this issue Apr 10, 2025 · 10 comments · Fixed by #52409
Labels
bug, core, core-autoscaler, P1

Comments

@nadongjun (Contributor)

What happened + What you expected to happen

Description

  • The Ray Autoscaler (v1) AutoscalerSummary currently uses LoadMetrics.is_active(ip) to determine whether a node is active. However, this check does not account for whether the raylet on that node is still running.
  • In particular, if a node’s raylet has already exited (e.g., due to idle timeout), but the node is still returned by the NodeProvider as part of the non_terminated_nodes list, the autoscaler will incorrectly consider the node as active. This leads to inconsistencies in the summary() output.
  • Although this situation may not occur frequently, it highlights the need to revise the logic for determining active nodes. The current implementation results in inaccurate cluster state reporting in edge cases like this.
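
For context, here is a minimal sketch of the classification described above. It is based only on the behavior reported in this issue; the class, function, and attribute names are simplified paraphrases, not the exact Ray v1 source.

# Sketch only: illustrates the reported behavior, not the actual Ray v1 code.
class LoadMetricsSketch:
    def __init__(self):
        # ip -> timestamp of the last raylet heartbeat seen for that node
        self.last_heartbeat_time_by_ip = {}

    def is_active(self, ip):
        # Membership check only: stays True even after the raylet on `ip`
        # has been drained and has exited.
        return ip in self.last_heartbeat_time_by_ip

def summarize_active(non_terminated_ips, load_metrics):
    # The summary marks every non-terminated node whose IP passes is_active()
    # as "Active", so a node whose raylet already exited but that the
    # NodeProvider still returns is reported as Active.
    return [ip for ip in non_terminated_ips if load_metrics.is_active(ip)]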

Versions / Dependencies

2.44.1

Reproduction script

Reproduction

  1. A Ray worker (ray.worker.gpu / 192.168.1.40) is launched.
  2. The worker becomes idle and its raylet exits after the idle timeout.
  3. The NodeProvider still includes the worker in the non_terminated_nodes() response.
  4. As a result, the worker is still marked as active in the autoscaler summary and ray status, even though its raylet is no longer running.

Log

Even after "Draining 1 raylet(s)." was logged for the worker (ray.worker.gpu / 192.168.1.40), the node’s IP remained in LoadMetrics, indicating that the node was still being tracked even though its raylet had been terminated.

======== Autoscaler status: 2025-04-10 01:30:28.225299 ========
Node status
---------------------------------------------------------------
Active:
 1 ray.head.default
 1 ray.worker.gpu
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/32.0 CPU
 0B/19.82GiB memory
 0B/9.15GiB object_store_memory

Demands:
 (no resource demands)
2025-04-10 01:30:28,225 INFO autoscaler.py:589 -- StandardAutoscaler: Terminating the node with id 100d04f27901636e710d8dab49eab126b2c0e4588579544beb8f3d72 and ip 192.168.1.40. (idle)
2025-04-10 01:30:28,225 INFO autoscaler.py:543 -- Node last used: Thu Apr 10 01:22:35 2025.
2025-04-10 01:30:28,225 INFO autoscaler.py:675 -- Draining 1 raylet(s).
2025-04-10 01:30:28,226 INFO node_provider.py:173 -- NodeProvider: 100d04f27901636e710d8dab49eab126b2c0e4588579544beb8f3d72: Terminating node
2025-04-10 01:30:28,226 INFO node_provider.py:176 -- submit_scale_request 
2025-04-10 01:30:28,226 INFO node_provider.py:199 -- {'desired_num_workers': {'ray.worker.gpu': 0}, 'workers_to_delete': ['100d04f27901636e710d8dab49eab126b2c0e4588579544beb8f3d72']}
2025-04-10 01:30:28,226 INFO node_provider.py:236 -- _patch
2025-04-10 01:30:28,227 DEBUG connectionpool.py:241 -- Starting new HTTP connection (1): 192.168.1.30:50000
2025-04-10 01:30:28,228 DEBUG connectionpool.py:544 -- http://192.168.1.30:50000 "PATCH /nodes HTTP/1.1" 200 37
2025-04-10 01:30:28,229 INFO autoscaler.py:461 -- The autoscaler took 0.006 seconds to complete the update iteration.
2025-04-10 01:30:28,229 INFO monitor.py:433 -- :event_summary:Removing 1 nodes of type ray.worker.gpu (idle).
2025-04-10 01:30:33,252 INFO node_provider.py:231 -- _get
2025-04-10 01:30:33,253 DEBUG connectionpool.py:241 -- Starting new HTTP connection (1): 192.168.1.30:50000
2025-04-10 01:30:33,255 DEBUG connectionpool.py:544 -- http://192.168.1.30:50000 "GET /nodes HTTP/1.1" 200 377
2025-04-10 01:30:33,256 INFO node_provider.py:172 -- get_node_data{'dd9f073845c670f20633936b798c3a26bb746a575d583580a490d8e0': NodeData(kind='head', type='ray.head.default', ip='192.168.1.10', status='up-to-date', replica_index=None), '100d04f27901636e710d8dab49eab126b2c0e4588579544beb8f3d72': NodeData(kind='worker', type='ray.worker.gpu', ip='192.168.1.40', status='up-to-date', replica_index=None)}
2025-04-10 01:30:33,256 INFO autoscaler.py:146 -- The autoscaler took 0.004 seconds to fetch the list of non-terminated nodes.
2025-04-10 01:30:33,256 INFO node_provider.py:203 -- safe_to_scale
2025-04-10 01:30:33,257 INFO custom_load_metrics.py:8 -- [CustomLoadMetrics] prune_active_ips called
2025-04-10 01:30:33,257 INFO custom_load_metrics.py:25 -- [CustomLoadMetrics] ray_nodes_last_used_time_by_ip: {'192.168.1.10': 1744248131.4891038, '192.168.1.40': 1744248155.2064462}
2025-04-10 01:30:33,257 INFO custom_load_metrics.py:27 -- [CustomLoadMetrics] static_resources_by_ip: {'192.168.1.10': {'memory': 9765379278.0, 'CPU': 16.0, 'object_store_memory': 4882689638.0, 'node:192.168.1.10': 1.0, 'node:__internal_head__': 1.0}, '192.168.1.40': {'memory': 11521410253.0, 'node:192.168.1.40': 1.0, 'CPU': 16.0, 'object_store_memory': 4937747251.0}}
2025-04-10 01:30:33,257 INFO custom_load_metrics.py:29 -- [CustomLoadMetrics] raylet_id_by_ip: {'192.168.1.10': b'\\\xf7\xc5^\xde\xf5t\x85j1]\x8f\xf2D6X\xcb\x11\xe5\xa5\xc2\xfc.E\x7f\xb0\xa3\xcf', '192.168.1.40': b'\xd5\x0b\x94~O\x13.\x89r\x84\x91w4\xac\xa0N9_g\x1d\x1e\x83Dc<\xf4~\x9f'}
2025-04-10 01:30:33,257 INFO custom_load_metrics.py:31 -- [CustomLoadMetrics] dynamic_resources_by_ip: {'192.168.1.10': {'memory': 9765379278.0, 'CPU': 16.0, 'object_store_memory': 4882689638.0, 'node:__internal_head__': 1.0, 'node:192.168.1.10': 1.0}, '192.168.1.40': {'node:192.168.1.40': 1.0, 'memory': 11521410253.0, 'CPU': 16.0, 'object_store_memory': 4937747251.0}}
2025-04-10 01:30:33,257 INFO custom_load_metrics.py:33 -- [CustomLoadMetrics] last_heartbeat_time_by_ip: {'192.168.1.10': 1744248633.2381039, '192.168.1.40': 1744248628.1984463}
2025-04-10 01:30:33,258 INFO autoscaler.py:418 -- 
======== Autoscaler status: 2025-04-10 01:30:33.258024 ========
Node status
---------------------------------------------------------------
Active:
 1 ray.head.default
 1 ray.worker.gpu
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/32.0 CPU
 0B/19.82GiB memory
 0B/9.15GiB object_store_memory

Issue Severity

Low: It annoys or frustrates me.

nadongjun added the bug and triage labels (Apr 10, 2025)
masoudcharkhabi added the community-contribution, core, docs, go, and P3 labels and removed the community-contribution and P3 labels (Apr 10, 2025)
@bhks commented Apr 10, 2025

I am happy to take this one up. Since I am new to the community, is there anyone who can help me understand a little bit about the components and systems?

@nadongjun (Contributor, Author)

@bhks Thank you, but I’m currently working on this task. If you have any questions related to the structure of the work, feel free to email me — I’ll do my best to help with what I know. Thanks!

@bhks commented Apr 10, 2025

No problem, I will just watch for now then. I’m trying to get started working with the Ray community.

@bhks commented Apr 10, 2025

@nadongjun On the problem side, are you thinking of using the timestamp or heartbeat time reported for that node/IP? Or are there other mechanisms that could be used?

@nadongjun (Contributor, Author)

@bhks Yes, exactly. Right now, the Autoscaler reads the list of non-terminated nodes from the Provider and just checks whether there’s a value in last_heartbeat_time_by_ip. Even if a node has already passed the idle timeout or has been terminated, it won’t be reflected if the Provider still considers it active.

I’m thinking about two possible directions: one is to use LoadMetrics to determine whether a node has hit the idle or heartbeat timeout and should be removed, and the other is to clear the last_heartbeat_time_by_ip entry when the worker with that IP gets terminated, so the state stays accurate.
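
As a rough sketch of the second direction (the dict names are taken from the log output above; the exact set of attributes and the call site in the autoscaler are assumptions, not the actual Ray code):

def prune_terminated_ip(self, ip):
    # Hypothetical LoadMetrics helper: drop all per-IP bookkeeping for a node
    # whose raylet/worker has just been terminated, so that is_active(ip) and
    # the autoscaler summary stop reporting it as active.
    for table in (
        self.last_heartbeat_time_by_ip,
        self.ray_nodes_last_used_time_by_ip,
        self.static_resources_by_ip,
        self.dynamic_resources_by_ip,
        self.raylet_id_by_ip,
    ):
        table.pop(ip, None)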

If you’re interested in the Autoscaler side, I’d also recommend checking out Autoscaler v2 / KubeRay — it’s actively being developed and worth following:

ray-project/kuberay#2600

@bhks commented Apr 11, 2025

Thank you @nadongjun for sharing that issue with me. I will follow up there.

jcotant1 removed the go label (Apr 11, 2025)
@dayshah (Contributor) commented Apr 15, 2025

cc @rueian, can you pick this up? Thanks so much!

dayshah added the P1 and core-autoscaler labels and removed the P3, triage, and docs labels (Apr 15, 2025)
@rueian (Contributor) commented Apr 15, 2025

Sure, I will start working on this.

@nadongjun (Contributor, Author)

Hi @rueian,
Would a simple heartbeat check like the one below be a reasonable approach here? What do you think?

def is_active(self, ip):
    # Treat a node as active only if its raylet heartbeat is recent enough,
    # instead of only checking that a heartbeat was ever recorded for the IP.
    last_heartbeat = self.last_heartbeat_time_by_ip.get(ip)
    if last_heartbeat is None:
        return False
    return (time.time() - last_heartbeat) < AUTOSCALER_HEARTBEAT_TIMEOUT_S

@rueian (Contributor) commented Apr 17, 2025

Hi @nadongjun,

We just need to update the LoadMetrics after we terminate those nodes, including idle nodes and dead nodes:

See #52409
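
A minimal sketch of that idea is below; the function name and the way LoadMetrics is reached are illustrative assumptions, and the actual change is in the linked PR.

def terminate_and_update_metrics(autoscaler, node_ids):
    # Resolve IPs before terminating, terminate via the NodeProvider, then
    # drop the terminated IPs from LoadMetrics so the next summary() no longer
    # counts them as active, even if non_terminated_nodes() is briefly stale.
    ips = [autoscaler.provider.internal_ip(node_id) for node_id in node_ids]
    autoscaler.provider.terminate_nodes(node_ids)
    for ip in ips:
        autoscaler.load_metrics.last_heartbeat_time_by_ip.pop(ip, None)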
