-
Notifications
You must be signed in to change notification settings - Fork 6.3k
[core][autoscaler][v1] prune IPs from the LoadMetrics after terminating nodes #52409
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core][autoscaler][v1] prune IPs from the LoadMetrics after terminating nodes #52409
Conversation
710b437
to
436418e
Compare
07e32e6
to
d89718c
Compare
cc @dayshah for review. Thanks! |
assert lm.is_active(worker_ip) | ||
autoscaler.update() | ||
assert not lm.is_active(worker_ip) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lm.is_active(worker_ip)
should still be True
before autoscaler.update()
.
lm.is_active(worker_ip)
should be False
after autoscaler.update()
because the idle worker_ip
should be pruned by the change in this PR.
d89718c
to
f46a3e4
Compare
…ng nodes Co-authored-by: Dongjun Na <[email protected]> Signed-off-by: Rueian <[email protected]>
f46a3e4
to
caa1c00
Compare
Gently ping @kevin85421 and @jjyao for reviews. |
I will discuss this PR with @rueian offline. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM. Left a nit comment.
@@ -437,6 +425,18 @@ def _update(self): | |||
self.attempt_to_recover_unhealthy_nodes(now) | |||
self.set_prometheus_updater_data() | |||
|
|||
# Update running nodes gauge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: L406 - L426 update the list of alive nodes (i.e. non_terminated_nodes
). This PR uses the updated list to update load metrics.
num_workers = len(self.non_terminated_nodes.worker_ids) | ||
self.prom_metrics.running_workers.set(num_workers) | ||
|
||
# Remove from LoadMetrics the ips unknown to the NodeProvider. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is not easy to understand. Would you mind updating the comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
…ng nodes Signed-off-by: Rueian <[email protected]>
cc @jjyao for merge or review |
…ng nodes (ray-project#52409) Signed-off-by: Rueian <[email protected]>
Why are these changes needed?
Currently, the autoscaler remove unwanted IPs from the LoadMetrics before terminating idle and dead nodes, therefore those nodes' IPs will be left in the LoadMetrics until the next autoscaling iteration.
This PR shifts the removal of IPs to occur after terminating idle and dead nodes, so that we won't have those nodes shown in the summary report.
Related issue number
Closes #52198 (comment)
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.