
[Autoscaler v1] AutoscalerSummary Active node check ignores raylet termination #52198

What happened + What you expected to happen

Description

  • The Ray Autoscaler (v1) AutoscalerSummary currently uses LoadMetrics.is_active(ip) to determine whether a node is active. However, this check does not account for whether the raylet on that node is still running.
  • In particular, if a node's raylet has already exited (e.g., due to idle timeout) but the node is still returned by the NodeProvider in the non_terminated_nodes list, the autoscaler incorrectly considers the node active. This makes the summary() output inconsistent with the actual cluster state.
  • Although this situation may not occur frequently, it shows that the logic for determining active nodes should be revised; the current implementation reports an inaccurate cluster state in edge cases like this (see the sketch after this list).
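
As a sketch of the direction proposed above: count a node as active only when the NodeProvider still reports it, LoadMetrics has a heartbeat for it, and the GCS still considers its raylet alive. The helper below is illustrative, not existing Ray code; the heartbeat membership check mirrors what LoadMetrics.is_active does today, and gcs_alive_ips is a hypothetical input.

```python
from typing import Dict, Set


def classify_active_nodes(
    non_terminated_ips: Set[str],
    last_heartbeat_time_by_ip: Dict[str, float],
    gcs_alive_ips: Set[str],
) -> Set[str]:
    """Count a node as active only if all three sources agree.

    - non_terminated_ips: IPs the NodeProvider still reports.
    - last_heartbeat_time_by_ip: the LoadMetrics heartbeat table
      (LoadMetrics.is_active(ip) is a membership check against it).
    - gcs_alive_ips: IPs whose raylet the GCS reports as alive; in the
      scenario described above, 192.168.1.40 would be missing from it.
    """
    return {
        ip
        for ip in non_terminated_ips
        if ip in last_heartbeat_time_by_ip and ip in gcs_alive_ips
    }
```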

Versions / Dependencies

Ray 2.44.1

Reproduction script

  1. A Ray worker (ray.worker.gpu / 192.168.1.40) is launched.
  2. The worker becomes idle and its raylet exits after the idle timeout.
  3. The NodeProvider still includes the worker in the non_terminated_nodes() response.
  4. As a result, the worker is still marked as active in the autoscaler summary and in ray status, even though its raylet is no longer running (the sketch below shows one way to observe this).
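
One way to observe the mismatch in step 4 is to ask the GCS directly which raylets are alive. ray.nodes() is a public Ray API whose entries include Alive and NodeManagerAddress fields, so a drained raylet shows up with Alive=False even while ray status still counts the node as active:

```python
import ray

# Connect to the running cluster (e.g., from the head node).
ray.init(address="auto")

for node in ray.nodes():
    # Each entry reflects the GCS node table; "Alive" becomes False once
    # the raylet has exited, regardless of what the NodeProvider reports.
    print(node["NodeManagerAddress"], "raylet alive:", node["Alive"])

# Expected after step 2: 192.168.1.40 prints "raylet alive: False" while
# it is still listed under "Active" in the autoscaler summary.
```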

Log

Even after "Draining 1 raylet(s)" was logged for the worker (ray.worker.gpu / 192.168.1.40), the node's IP remained in LoadMetrics, indicating that the node was still being tracked despite the raylet having been terminated.
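
The [CustomLoadMetrics] lines in the log below come from local debugging instrumentation. A minimal reconstruction is sketched here; it assumes a LoadMetrics subclass wired into the monitor, and the attribute names follow what the log prints. Treat it as an illustration, not the exact file used for this report.

```python
# custom_load_metrics.py (sketch): log LoadMetrics state on every prune.
import logging

from ray.autoscaler._private.load_metrics import LoadMetrics

logger = logging.getLogger(__name__)


class CustomLoadMetrics(LoadMetrics):
    def prune_active_ips(self, active_ips):
        logger.info("[CustomLoadMetrics] prune_active_ips called")
        super().prune_active_ips(active_ips)
        # After pruning, 192.168.1.40 is still present in these tables
        # even though its raylet has already been drained.
        logger.info(
            "[CustomLoadMetrics] last_heartbeat_time_by_ip: %s",
            self.last_heartbeat_time_by_ip,
        )
        logger.info(
            "[CustomLoadMetrics] raylet_id_by_ip: %s", self.raylet_id_by_ip
        )
```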

======== Autoscaler status: 2025-04-10 01:30:28.225299 ========
Node status
---------------------------------------------------------------
Active:
 1 ray.head.default
 1 ray.worker.gpu
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/32.0 CPU
 0B/19.82GiB memory
 0B/9.15GiB object_store_memory

Demands:
 (no resource demands)
2025-04-10 01:30:28,225 INFO autoscaler.py:589 -- StandardAutoscaler: Terminating the node with id 100d04f27901636e710d8dab49eab126b2c0e4588579544beb8f3d72 and ip 192.168.1.40. (idle)
2025-04-10 01:30:28,225 INFO autoscaler.py:543 -- Node last used: Thu Apr 10 01:22:35 2025.
2025-04-10 01:30:28,225 INFO autoscaler.py:675 -- Draining 1 raylet(s).
2025-04-10 01:30:28,226 INFO node_provider.py:173 -- NodeProvider: 100d04f27901636e710d8dab49eab126b2c0e4588579544beb8f3d72: Terminating node
2025-04-10 01:30:28,226 INFO node_provider.py:176 -- submit_scale_request 
2025-04-10 01:30:28,226 INFO node_provider.py:199 -- {'desired_num_workers': {'ray.worker.gpu': 0}, 'workers_to_delete': ['100d04f27901636e710d8dab49eab126b2c0e4588579544beb8f3d72']}
2025-04-10 01:30:28,226 INFO node_provider.py:236 -- _patch
2025-04-10 01:30:28,227 DEBUG connectionpool.py:241 -- Starting new HTTP connection (1): 192.168.1.30:50000
2025-04-10 01:30:28,228 DEBUG connectionpool.py:544 -- http://192.168.1.30:50000 "PATCH /nodes HTTP/1.1" 200 37
2025-04-10 01:30:28,229 INFO autoscaler.py:461 -- The autoscaler took 0.006 seconds to complete the update iteration.
2025-04-10 01:30:28,229 INFO monitor.py:433 -- :event_summary:Removing 1 nodes of type ray.worker.gpu (idle).
2025-04-10 01:30:33,252 INFO node_provider.py:231 -- _get
2025-04-10 01:30:33,253 DEBUG connectionpool.py:241 -- Starting new HTTP connection (1): 192.168.1.30:50000
2025-04-10 01:30:33,255 DEBUG connectionpool.py:544 -- http://192.168.1.30:50000 "GET /nodes HTTP/1.1" 200 377
2025-04-10 01:30:33,256 INFO node_provider.py:172 -- get_node_data{'dd9f073845c670f20633936b798c3a26bb746a575d583580a490d8e0': NodeData(kind='head', type='ray.head.default', ip='192.168.1.10', status='up-to-date', replica_index=None), '100d04f27901636e710d8dab49eab126b2c0e4588579544beb8f3d72': NodeData(kind='worker', type='ray.worker.gpu', ip='192.168.1.40', status='up-to-date', replica_index=None)}
2025-04-10 01:30:33,256 INFO autoscaler.py:146 -- The autoscaler took 0.004 seconds to fetch the list of non-terminated nodes.
2025-04-10 01:30:33,256 INFO node_provider.py:203 -- safe_to_scale
2025-04-10 01:30:33,257 INFO custom_load_metrics.py:8 -- [CustomLoadMetrics] prune_active_ips called
2025-04-10 01:30:33,257 INFO custom_load_metrics.py:25 -- [CustomLoadMetrics] ray_nodes_last_used_time_by_ip: {'192.168.1.10': 1744248131.4891038, '192.168.1.40': 1744248155.2064462}
2025-04-10 01:30:33,257 INFO custom_load_metrics.py:27 -- [CustomLoadMetrics] static_resources_by_ip: {'192.168.1.10': {'memory': 9765379278.0, 'CPU': 16.0, 'object_store_memory': 4882689638.0, 'node:192.168.1.10': 1.0, 'node:__internal_head__': 1.0}, '192.168.1.40': {'memory': 11521410253.0, 'node:192.168.1.40': 1.0, 'CPU': 16.0, 'object_store_memory': 4937747251.0}}
2025-04-10 01:30:33,257 INFO custom_load_metrics.py:29 -- [CustomLoadMetrics] raylet_id_by_ip: {'192.168.1.10': b'\\\xf7\xc5^\xde\xf5t\x85j1]\x8f\xf2D6X\xcb\x11\xe5\xa5\xc2\xfc.E\x7f\xb0\xa3\xcf', '192.168.1.40': b'\xd5\x0b\x94~O\x13.\x89r\x84\x91w4\xac\xa0N9_g\x1d\x1e\x83Dc<\xf4~\x9f'}
2025-04-10 01:30:33,257 INFO custom_load_metrics.py:31 -- [CustomLoadMetrics] dynamic_resources_by_ip: {'192.168.1.10': {'memory': 9765379278.0, 'CPU': 16.0, 'object_store_memory': 4882689638.0, 'node:__internal_head__': 1.0, 'node:192.168.1.10': 1.0}, '192.168.1.40': {'node:192.168.1.40': 1.0, 'memory': 11521410253.0, 'CPU': 16.0, 'object_store_memory': 4937747251.0}}
2025-04-10 01:30:33,257 INFO custom_load_metrics.py:33 -- [CustomLoadMetrics] last_heartbeat_time_by_ip: {'192.168.1.10': 1744248633.2381039, '192.168.1.40': 1744248628.1984463}
2025-04-10 01:30:33,258 INFO autoscaler.py:418 -- 
======== Autoscaler status: 2025-04-10 01:30:33.258024 ========
Node status
---------------------------------------------------------------
Active:
 1 ray.head.default
 1 ray.worker.gpu
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/32.0 CPU
 0B/19.82GiB memory
 0B/9.15GiB object_store_memory

Issue Severity

Low: It annoys or frustrates me.

Labels

  • P1: Issue that should be fixed within a few weeks
  • bug: Something that is supposed to be working; but isn't
  • community-backlog
  • core: Issues that should be addressed in Ray Core
  • core-autoscaler: autoscaler related issues