[autoscaler] Improve experience when EC2 does not have capacity for worker nodes #10

jennakwon06 · 2021-02-10T05:40:09Z

Hello -

After I spun up the cluster with ray up my_cluster.yaml, the Ray cluster wasn't handling my workload well. I ran ray monitor my_cluster.yaml and found the logs flooded with the messages below:

==> /tmp/ray/session_latest/logs/monitor.err <==
ssh: connect to host 10.0.80.64 port 22: Connection timed out

==> /tmp/ray/session_latest/logs/monitor.log <==
2021-02-10 05:24:28,068 INFO node_launcher.py:78 -- NodeLauncher1: Got 5 nodes to launch.
2021-02-10 05:24:28,186 INFO node_launcher.py:78 -- NodeLauncher1: Launching 5 nodes, type ray-legacy-worker-node-type.

==> /tmp/ray/session_latest/logs/monitor.out <==
2021-02-10 05:24:27,659 INFO node_provider.py:408 -- create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient r5n.24xlarge capacity in the Availability Zone you requested (us-west-2b). Our system will be working on provisioning additional capacity. You can currently get r5n.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-west-2a, us-west-2c., retrying.
2021-02-10 05:24:27,683 INFO node_provider.py:408 -- create_instances: Attempt failed with An error occurred (RequestLimitExceeded) when calling the RunInstances operation (reached max retries: 0): Request limit exceeded., retrying.
2021-02-10 05:24:27,715 INFO node_provider.py:378 -- Launched 5 nodes [subnet_id=subnet-0180e9267b994bf97]
2021-02-10 05:24:27,715 INFO node_provider.py:397 -- Launched instance i-03fb4297fc5f3f1cd [state=pending, info=pending]
2021-02-10 05:24:27,789 INFO updater.py:273 -- SSH still not available (SSH command failed.), retrying in 5 seconds.

==> /tmp/ray/session_latest/logs/monitor.log <==
2021-02-10 05:24:28,538 ERROR node_launcher.py:72 -- Launch failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 370, in _create_node
    created = self.ec2_fail_fast.create_instances(**conf)
  File "/usr/local/lib/python3.7/site-packages/boto3/resources/factory.py", line 520, in do_action
    response = action(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/boto3/resources/action.py", line 83, in __call__
    response = getattr(parent.meta.client, operation_name)(*args, **params)
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 676, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (Unsupported) when calling the RunInstances operation: Your requested instance type (r5n.24xlarge) is not supported in your requested Availability Zone (us-west-2d). Please retry your request by not specifying an Availability Zone or choosing us-west-2a, us-west-2b, us-west-2c.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ray/autoscaler/_private/node_launcher.py", line 70, in run
    self._launch_node(config, count, node_type)
  File "/usr/local/lib/python3.7/site-packages/ray/autoscaler/_private/node_launcher.py", line 60, in _launch_node
    self.provider.create_node(node_config, node_tags, count)
  File "/usr/local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 311, in create_node
    created_nodes_dict = self._create_node(node_config, tags, count)
  File "/usr/local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 403, in _create_node
    "Failed to launch instances. Max attempts exceeded.")
  File "/usr/local/lib/python3.7/site-packages/ray/autoscaler/_private/cli_logger.py", line 585, in abort
    raise exc_cls("Exiting due to cli_logger.abort()")
click.exceptions.ClickException: Exiting due to cli_logger.abort()
2021-02-10 05:24:28,538 INFO node_launcher.py:78 -- NodeLauncher0: Got 2 nodes to launch.
2021-02-10 05:24:28,774 INFO node_launcher.py:78 -- NodeLauncher0: Launching 2 nodes, type ray-legacy-worker-node-type.

==> /tmp/ray/session_latest/logs/monitor.out <==
2021-02-10 05:24:28,538 PANIC node_provider.py:403 -- Failed to launch instances. Max attempts exceeded.
2021-02-10 05:24:29,000 INFO node_provider.py:408 -- create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient r5n.24xlarge capacity in the Availability Zone you requested (us-west-2a). Our system will be working on provisioning additional capacity. You can currently get r5n.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-west-2b, us-west-2c., retrying.
2021-02-10 05:24:29,023 INFO node_provider.py:408 -- create_instances: Attempt failed with An error occurred (RequestLimitExceeded) when calling the RunInstances operation (reached max retries: 0): Request limit exceeded., retrying.

So it looks like the requested instance type isn't currently available from EC2, and Ray isn't able to spin up the desired worker nodes. This means I need to run ray down, modify my_cluster.yaml, then retry with a different instance type.

I was wondering if we can improve this experience. Perhaps check EC2 instance capacity for at least the minimum number of workers before telling the user that the cluster is launched? Or perhaps let the user specify a list of instance types that they're OK with?
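
A pre-flight check along those lines might look like the sketch below. It assumes a boto3 EC2 client and its describe_instance_type_offerings call; the helper name zones_offering is hypothetical and not part of Ray. Note the limitation: this only reports which Availability Zones offer the instance type at all, which would have caught the Unsupported error for us-west-2d. As far as I know, EC2 has no API for querying live on-demand capacity, so InsufficientInstanceCapacity could still occur at launch time.

```python
def zones_offering(ec2_client, instance_type):
    """Return the sorted Availability Zones that offer instance_type.

    ec2_client is assumed to be a boto3 EC2 client, for example
    boto3.client("ec2", region_name="us-west-2").
    """
    zones = []
    kwargs = {
        "LocationType": "availability-zone",
        "Filters": [{"Name": "instance-type", "Values": [instance_type]}],
    }
    while True:
        page = ec2_client.describe_instance_type_offerings(**kwargs)
        zones.extend(o["Location"] for o in page["InstanceTypeOfferings"])
        # Follow pagination until the service stops returning a token.
        token = page.get("NextToken")
        if not token:
            return sorted(zones)
        kwargs["NextToken"] = token
```

Something like zones_offering(client, "r5n.24xlarge") could run before ray up reports success, so that subnets in zones missing from the result are skipped rather than retried.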


pdames · 2021-02-10T07:41:50Z

An example showing how to specify multiple candidate instance types can be seen at:
https://github.com/amzn/amazon-ray/blob/main/python/ray/autoscaler/aws/example-multi-node-type.yaml

Out of curiosity, did you specify any quantity for min_workers in your autoscaler config? I think the existing behavior is probably OK in the event that Ray is unable to spin your cluster up to the desired max_workers capacity, but I would agree that it's misleading to tell a user that their cluster has been successfully launched after running ray up if we haven't yet started the provisioning process for the requested count of min_workers.
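
For illustration only, the idea of falling back across several candidate instance types could be sketched as below. This is not how the Ray autoscaler is implemented: launch_with_fallback, error_code, and the launch callable are hypothetical names, and the only details taken from the logs above are the EC2 error codes InsufficientInstanceCapacity and Unsupported. The sketch assumes failures surface as botocore-style exceptions carrying a response dict.

```python
# Error codes seen in the logs above that mean "try a different instance type".
CAPACITY_ERROR_CODES = {"InsufficientInstanceCapacity", "Unsupported"}


def error_code(exc):
    """Extract the AWS error code from a botocore-style ClientError, if any."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code")


def launch_with_fallback(launch, instance_types, count):
    """Try each candidate instance type in order until one launches.

    launch is a hypothetical callable (instance_type, count) -> result that
    raises a botocore-style ClientError on failure.
    """
    failures = {}
    for instance_type in instance_types:
        try:
            return instance_type, launch(instance_type, count)
        except Exception as exc:
            code = error_code(exc)
            if code not in CAPACITY_ERROR_CODES:
                raise  # unrelated failure: do not mask it
            failures[instance_type] = code
    raise RuntimeError(f"No candidate instance type could be launched: {failures}")
```

Re-raising anything that is not a capacity or availability error keeps problems such as bad credentials or RequestLimitExceeded visible instead of silently moving on to the next type.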

jennakwon06 · 2021-02-11T21:12:57Z

Yes, my min_workers was 34 and max_workers was also 34 (I had set them that way for debugging purposes). I was perplexed when I couldn't see any workers in the EC2 console!

jennakwon06 added the enhancement label Feb 10, 2021

pdames self-assigned this Feb 10, 2021
