
[autoscaler] while running ray up, client cannot connect to head node when client node/head node are in the same private subnet. #1

Open
jennakwon06 opened this issue Jan 27, 2021 · 2 comments

@jennakwon06
Contributor

What is the problem?

In my setting, the client machine is in the same VPC as the instances requested for the cluster.

When the client machine runs ray up, the head node apparently must be in a VPC subnet with "Auto-assign public IPv4 address" enabled. When the subnet does not enable this (i.e., it is a private subnet), the client machine cannot connect to the head node; below is the error message.

<1/1> Setting up head node
  Prepared bootstrap config
  New status: waiting-for-ssh
  [1/7] Waiting for SSH to become available
    Running `uptime` as a test.
    Waiting for IP
      Not yet available, retrying in 10 seconds
      Not yet available, retrying in 10 seconds
      Not yet available, retrying in 10 seconds
      Not yet available, retrying in 10 seconds
      Not yet available, retrying in 10 seconds
      Not yet available, retrying in 10 seconds
      Not yet available, retrying in 10 seconds
      Not yet available, retrying in 10 seconds
      Not yet available, retrying in 10 seconds
      Not yet available, retrying in 10 seconds
      Not yet available, retrying in 10 seconds
(ends with never getting the IP address).
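
For context, here is a minimal boto3 sketch (not the autoscaler's actual code; the instance ID is a placeholder) of the kind of public-IP polling that can never succeed here: in a subnet without auto-assigned public IPv4 addresses, EC2 never reports a PublicIpAddress, so a wait loop like this retries forever.

import time

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

def wait_for_public_ip(instance_id, delay=10):
    # Poll until the instance reports a public IPv4 address.
    # In a subnet without "Auto-assign public IPv4 address",
    # PublicIpAddress is never populated, so this never returns.
    while True:
        reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
        instance = reservations[0]["Instances"][0]
        ip = instance.get("PublicIpAddress")
        if ip:
            return ip
        print("Not yet available, retrying in %d seconds" % delay)
        time.sleep(delay)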

Ray version and other system information (Python version, TensorFlow version, OS):
Ray: 1.1.0, Python: 3.7.7
Output of cat /etc/os-release on the client machine:

NAME="Amazon Linux AMI"
VERSION="2018.03"
ID="amzn"
ID_LIKE="rhel fedora"
VERSION_ID="2018.03"
PRETTY_NAME="Amazon Linux AMI 2018.03"
ANSI_COLOR="0;33"
CPE_NAME="cpe:/o:amazon:linux:2018.03:ga"
HOME_URL="http://aws.amazon.com/amazon-linux-ami/"

Reproduction (REQUIRED)

Note that the client is in the SAME VPC as the 4 subnets requested for head_node and worker_nodes.

Below is my cluster_config.yaml; the only important part, though, is the 4 SubnetIds.

I confirmed that when I enable "Auto-assign public IPv4 address" (via the AWS VPC Console) on the subnets specified in head_node's SubnetIds, the client is able to find the head node and the cluster launch succeeds.


cluster_name: jkkwon_ray_test

min_workers: 5

max_workers: 10

upscaling_speed: 1.0

docker:
    image: "rayproject/ray-ml:latest-gpu"
    container_name: "ray_container"
    pull_before_run: True
    run_options: []


idle_timeout_minutes: 5

provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2a,us-west-2b,us-west-2c,us-west-2d


auth:
    ssh_user: ubuntu

head_node:
    InstanceType: r5.12xlarge
    ImageId: ami-0a2363a9cff180a64 # Deep Learning AMI (Ubuntu) Version 30
    SecurityGroupIds:
        - "sg-08ed97f6d08d451f6"
    SubnetIds: [
        "subnet-0180e9267b994bf97",  # us-west-2a, 8187 IP addresses. 10.0.32.0/19
        "subnet-073e6e0338bf209cb",  # us-west-2b, 8187 IP addresses. 10.0.64.0/19
        "subnet-03caa10b59288efae",  # us-west-2c, 8187 IP addresses. 10.0.96.0/19
        "subnet-06dd6dbb8caf5c310",  # us-west-2d, 8187 IP addresses. 10.0.128.0/19
    ]
    # You can provision additional disk space with a conf as follows
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 100

worker_nodes:
    InstanceType: r5.12xlarge
    ImageId: ami-0a2363a9cff180a64 # Deep Learning AMI (Ubuntu) Version 30
    SecurityGroupIds:
        - "sg-08ed97f6d08d451f6"
    SubnetIds: [
        "subnet-0180e9267b994bf97",  # us-west-2a, 8187 IP addresses. 10.0.32.0/19
        "subnet-073e6e0338bf209cb",  # us-west-2b, 8187 IP addresses. 10.0.64.0/19
        "subnet-03caa10b59288efae",  # us-west-2c, 8187 IP addresses. 10.0.96.0/19
        "subnet-06dd6dbb8caf5c310",  # us-west-2d, 8187 IP addresses. 10.0.128.0/19
    ]
    InstanceMarketOptions:
        MarketType: spot


file_mounts_sync_continuously: False

rsync_exclude:
    - "**/.git"
    - "**/.git/**"

rsync_filter:
    - ".gitignore"

initialization_commands: []

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

Unfortunately, I am working in a private AWS account, so to reproduce, substitute SubnetIds that match your own AWS account.
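
For reference, a minimal boto3 sketch (the subnet ID below is just the first one from the config above; swap in your own) that checks whether a subnet auto-assigns public IPv4 addresses:

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")
resp = ec2.describe_subnets(SubnetIds=["subnet-0180e9267b994bf97"])
for subnet in resp["Subnets"]:
    # MapPublicIpOnLaunch corresponds to the "Auto-assign public IPv4 address" setting.
    print(subnet["SubnetId"], subnet["MapPublicIpOnLaunch"])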

  • [ no ] I have verified my script runs in a clean environment and reproduces the issue.
  • [ yes ] I have verified the issue also occurs with the latest wheels.
@jennakwon06 jennakwon06 added the bug Something isn't working label Jan 27, 2021
@pdames pdames self-assigned this Jan 27, 2021
pdames pushed a commit that referenced this issue Jan 30, 2021
pdames pushed a commit that referenced this issue Jan 30, 2021
@pdames
Member

pdames commented Feb 4, 2021

@jennakwon06 - could you try including use_internal_ips: True in the provider section of your autoscaler config and ensure that it allows you to run ray up successfully without assigning a public IPv4 address to the head node? For example:

provider:
  type: aws
  region: us-west-2
  availability_zone: us-west-2a,us-west-2b,us-west-2c,us-west-2d
  use_internal_ips: True

Assuming this works, we may just want to consider this an opportunity to improve existing documentation and examples in this area.

@jennakwon06
Contributor Author

Hi Patrick!

I verified that use_internal_ips: True works for a subnet with auto-assign public IPv4 address == False, so I believe that solves the original issue. My suggestion, though, is to name it use_private_ips; I think "internal" and "private" are synonymous, but "private" seems more widely used (https://docs.aws.amazon.com/vpc/latest/userguide/vpc-ip-addressing.html#vpc-private-ipv4-addresses).

One remaining issue is the use case of a custom Docker image. I have the initialization_commands and docker sections below for my custom Docker image.

initialization_commands:
    - aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 048211272910.dkr.ecr.us-west-2.amazonaws.com;

docker:
    image: "048211272910.dkr.ecr.us-west-2.amazonaws.com/jkkwon-batscli:zarr"
    container_name: "miamiml_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True

That fails with the error below when auto-assign public IPv4 address is False.

Warning: Permanently added '10.0.0.227' (ECDSA) to the list of known hosts.

Connect timeout on endpoint URL: "https://api.ecr.us-west-2.amazonaws.com/"
Error: Cannot perform an interactive login from a non TTY device
Connection to 10.0.0.227 closed.
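
As a minimal boto3 sketch of one thing worth checking (this is an assumption about the likely cause, not something verified above): the connect timeout suggests the private subnet has no route to the ECR API, so aws ecr get-login-password produces no output and docker login falls back to an interactive prompt. The snippet below lists ECR API interface endpoints for the VPC; the vpc-id is a placeholder.

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")
resp = ec2.describe_vpc_endpoints(
    Filters=[
        {"Name": "vpc-id", "Values": ["vpc-xxxxxxxx"]},  # placeholder VPC ID
        {"Name": "service-name", "Values": ["com.amazonaws.us-west-2.ecr.api"]},
    ]
)
# An empty list here (and no NAT gateway) would be consistent with the connect timeout to ECR.
print(resp["VpcEndpoints"])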

Summary of remaining issue, when a custom Docker image is in use (with use_internal_ips: True):

  • auto-assign public IPv4 address is True: ray up works
  • auto-assign public IPv4 address is False: initialization_commands fails with 'Cannot perform an interactive login'
