[core][autoscaler][v1] prune IPs from the LoadMetrics after terminating nodes #52409

Merged

rueian
Contributor

@rueian rueian commented Apr 17, 2025

Why are these changes needed?

Currently, the autoscaler removes unwanted IPs from the LoadMetrics before terminating idle and dead nodes, so those nodes' IPs remain in the LoadMetrics until the next autoscaling iteration.

This PR moves the IP removal to after the termination of idle and dead nodes, so that those nodes no longer appear in the summary report.
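
The ordering change described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyLoadMetrics`, `ToyProvider`, and both `update_*` functions are hypothetical stand-ins written for this example, not the autoscaler's actual classes or code paths.

```python
from typing import Callable, Dict, List, Set


class ToyLoadMetrics:
    """Hypothetical stand-in: tracks which IPs are considered active."""

    def __init__(self) -> None:
        self.active_ips: Set[str] = set()

    def mark_active(self, ip: str) -> None:
        self.active_ips.add(ip)

    def prune(self, known_ips: List[str]) -> None:
        # Keep only the IPs that the provider still knows about.
        self.active_ips &= set(known_ips)


class ToyProvider:
    """Hypothetical stand-in for a node provider (node_id -> ip)."""

    def __init__(self, nodes: Dict[str, str]) -> None:
        self.nodes = dict(nodes)

    def ips(self) -> List[str]:
        return list(self.nodes.values())

    def terminate(self, node_id: str) -> None:
        self.nodes.pop(node_id, None)


def update_prune_first(lm: ToyLoadMetrics, provider: ToyProvider, idle: List[str]) -> None:
    # Old ordering: prune, then terminate. The terminated IPs survive this pass.
    lm.prune(provider.ips())
    for node_id in idle:
        provider.terminate(node_id)


def update_terminate_first(lm: ToyLoadMetrics, provider: ToyProvider, idle: List[str]) -> None:
    # New ordering: terminate, then prune against the remaining nodes.
    for node_id in idle:
        provider.terminate(node_id)
    lm.prune(provider.ips())


def summary_after(update_fn: Callable[[ToyLoadMetrics, ToyProvider, List[str]], None]) -> List[str]:
    """Run one update iteration and return the IPs a summary would list."""
    provider = ToyProvider({"head": "10.0.0.1", "worker": "10.0.0.2"})
    lm = ToyLoadMetrics()
    for ip in provider.ips():
        lm.mark_active(ip)
    update_fn(lm, provider, ["worker"])
    return sorted(lm.active_ips)


print(summary_after(update_prune_first))
print(summary_after(update_terminate_first))
```

In this toy model, pruning first leaves the terminated worker's IP in the tracked set for one more iteration, while terminating first removes it within the same iteration.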

Related issue number

Closes #52198 (comment)

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rueian rueian force-pushed the update-loadmetrics-after-node-terminations branch from 710b437 to 436418e Compare April 17, 2025 19:46
@rueian rueian force-pushed the update-loadmetrics-after-node-terminations branch 2 times, most recently from 07e32e6 to d89718c Compare April 17, 2025 20:33
@rueian rueian marked this pull request as ready for review April 18, 2025 00:01
@rueian rueian requested a review from a team as a code owner April 18, 2025 00:01
@rueian
Contributor Author

rueian commented Apr 18, 2025

cc @dayshah for review. Thanks!

Comment on lines +3623 to +3625
assert lm.is_active(worker_ip)
autoscaler.update()
assert not lm.is_active(worker_ip)
Contributor Author

lm.is_active(worker_ip) should still be True before autoscaler.update().
lm.is_active(worker_ip) should be False after autoscaler.update() because the idle worker_ip should be pruned by the change in this PR.
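
The before/after assertion pattern in this comment can be reproduced in isolation. The following is a minimal sketch, assuming a hypothetical `ToyAutoscaler` that tracks active IPs itself; in the real test the autoscaler and the `lm` object are separate, and the IP addresses here are made up.

```python
from typing import Dict, Set


class ToyAutoscaler:
    """Hypothetical autoscaler: terminates idle workers, then prunes their IPs."""

    def __init__(self, workers: Dict[str, bool]) -> None:
        # Maps ip -> True if the worker is idle and should be terminated.
        self.workers = dict(workers)
        self.active_ips: Set[str] = set(workers)

    def is_active(self, ip: str) -> bool:
        return ip in self.active_ips

    def update(self) -> None:
        # Step 1: terminate idle workers.
        self.workers = {ip: idle for ip, idle in self.workers.items() if not idle}
        # Step 2: prune tracked IPs against the surviving workers.
        self.active_ips &= set(self.workers)


def test_idle_worker_pruned_in_same_update() -> None:
    worker_ip = "172.0.0.2"
    autoscaler = ToyAutoscaler({"172.0.0.1": False, worker_ip: True})
    # Before update(): the idle worker is still tracked as active.
    assert autoscaler.is_active(worker_ip)
    autoscaler.update()
    # After update(): the idle worker was terminated and pruned in one pass.
    assert not autoscaler.is_active(worker_ip)
    # The non-idle node is untouched.
    assert autoscaler.is_active("172.0.0.1")


test_idle_worker_pruned_in_same_update()
```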

@rueian rueian force-pushed the update-loadmetrics-after-node-terminations branch from d89718c to f46a3e4 Compare April 19, 2025 03:00
@rueian rueian force-pushed the update-loadmetrics-after-node-terminations branch from f46a3e4 to caa1c00 Compare April 19, 2025 03:01
@rueian
Contributor Author

rueian commented Apr 21, 2025

Gentle ping @kevin85421 and @jjyao for reviews.

@jjyao jjyao added the go add ONLY when ready to merge, run all tests label Apr 22, 2025
@kevin85421
Member

I will discuss this PR with @rueian offline.

Member

@kevin85421 kevin85421 left a comment

Overall LGTM. Left a nit comment.

@@ -437,6 +425,18 @@ def _update(self):
self.attempt_to_recover_unhealthy_nodes(now)
self.set_prometheus_updater_data()

# Update running nodes gauge
Member

Note: L406 - L426 update the list of alive nodes (i.e. non_terminated_nodes). This PR uses the updated list to update load metrics.

num_workers = len(self.non_terminated_nodes.worker_ids)
self.prom_metrics.running_workers.set(num_workers)

# Remove from LoadMetrics the ips unknown to the NodeProvider.
Member

This comment is not easy to understand. Would you mind updating the comment?

Contributor Author

Updated.
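
The reviewer's note in this thread, that the list of alive nodes is refreshed before it is used to update the load metrics, can be sketched as follows. This is an illustration only: `CachingNodeView` and `prune` are hypothetical helpers invented for the example, and the IP addresses are made up.

```python
from typing import Dict, Iterable, List, Set


class CachingNodeView:
    """Hypothetical cached view of the provider's non-terminated nodes."""

    def __init__(self, provider_nodes: Dict[str, str]) -> None:
        # node_id -> ip; this dict is owned and mutated by the "provider".
        self._provider_nodes = provider_nodes
        self.all_ips: List[str] = []
        self.refresh()

    def refresh(self) -> None:
        # Re-read the provider state into the cache.
        self.all_ips = sorted(self._provider_nodes.values())


def prune(active_ips: Iterable[str], known_ips: Iterable[str]) -> Set[str]:
    """Return only the active IPs that are still known."""
    known = set(known_ips)
    return {ip for ip in active_ips if ip in known}


provider_nodes = {"head": "10.0.0.1", "worker": "10.0.0.2"}
view = CachingNodeView(provider_nodes)
active = set(view.all_ips)

# The worker is terminated on the provider side.
del provider_nodes["worker"]

# Pruning against the stale cache keeps the terminated worker's IP.
stale_result = prune(active, view.all_ips)

# Refreshing the cache first lets the prune drop it.
view.refresh()
fresh_result = prune(active, view.all_ips)
```

The point of the sketch is that moving the prune after termination only helps if the prune reads the refreshed node list rather than one captured before the terminations.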

@kevin85421
Member

cc @jjyao for merge or review

@jjyao jjyao merged commit 873e693 into ray-project:master Apr 23, 2025
5 checks passed
ktyxx pushed a commit to ktyxx/ray that referenced this pull request Apr 29, 2025
Labels
go add ONLY when ready to merge, run all tests
Development

Successfully merging this pull request may close these issues.

[Autoscaler v1] AutoscalerSummary Active node check ignores raylet termination
3 participants