
[autoscaler] while running ray up, client cannot connect to head node when client node/head node are in the same private subnet. #1

Open
jennakwon06 opened this issue Jan 27, 2021 · 2 comments

@jennakwon06
Contributor

What is the problem?

In my setting, the client machine is in the same VPC as the instances requested for the cluster.

When the client machine runs ray up, the head node apparently must be in a VPC subnet with "Auto-assign public IPv4 address" enabled. When the subnet does not enable this (i.e., it is a private subnet), the client machine cannot connect to the head node; below is the error message.

<1/1> Setting up head node
  Prepared bootstrap config
  New status: waiting-for-ssh
  [1/7] Waiting for SSH to become available
    Running `uptime` as a test.
    Waiting for IP
      Not yet available, retrying in 10 seconds
      Not yet available, retrying in 10 seconds
      Not yet available, retrying in 10 seconds
      Not yet available, retrying in 10 seconds
      Not yet available, retrying in 10 seconds
      Not yet available, retrying in 10 seconds
      Not yet available, retrying in 10 seconds
      Not yet available, retrying in 10 seconds
      Not yet available, retrying in 10 seconds
      Not yet available, retrying in 10 seconds
      Not yet available, retrying in 10 seconds
(ends with never getting the IP address).
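
For context, here is a minimal boto3 sketch (not the autoscaler's actual code; the instance ID is a placeholder) of the kind of public-IP polling that can never succeed here: in a subnet without auto-assigned public IPv4 addresses, EC2 never reports a PublicIpAddress, so a wait loop like this retries forever.

import time

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

def wait_for_public_ip(instance_id, delay=10):
    # Poll until the instance reports a public IPv4 address.
    # In a subnet without "Auto-assign public IPv4 address",
    # PublicIpAddress is never populated, so this never returns.
    while True:
        reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
        instance = reservations[0]["Instances"][0]
        ip = instance.get("PublicIpAddress")
        if ip:
            return ip
        print("Not yet available, retrying in %d seconds" % delay)
        time.sleep(delay)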

Ray version and other system information (Python version, TensorFlow version, OS):
Ray: 1.1.0, Python: 3.7.7
Output of cat /etc/os-release on the client machine:

NAME="Amazon Linux AMI"
VERSION="2018.03"
ID="amzn"
ID_LIKE="rhel fedora"
VERSION_ID="2018.03"
PRETTY_NAME="Amazon Linux AMI 2018.03"
ANSI_COLOR="0;33"
CPE_NAME="cpe:/o:amazon:linux:2018.03:ga"
HOME_URL="http://aws.amazon.com/amazon-linux-ami/"

Reproduction (REQUIRED)

Note that the client is in the SAME VPC as the 4 subnets requested for head_node and worker_nodes.

Below is my cluster_config.yaml; the only important part, though, is the 4 SubnetIds.

I confirmed that when I enable "Auto-assign public IPv4 address" (via the AWS VPC Console) on the subnets specified in head_node's SubnetIds, the client is able to find the head node and the cluster launch succeeds.


cluster_name: jkkwon_ray_test

min_workers: 5

max_workers: 10

upscaling_speed: 1.0

docker:
    image: "rayproject/ray-ml:latest-gpu"
    container_name: "ray_container"
    pull_before_run: True
    run_options: []


idle_timeout_minutes: 5

provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2a,us-west-2b,us-west-2c,us-west-2d


auth:
    ssh_user: ubuntu

head_node:
    InstanceType: r5.12xlarge
    ImageId: ami-0a2363a9cff180a64 # Deep Learning AMI (Ubuntu) Version 30
    SecurityGroupIds:
        - "sg-08ed97f6d08d451f6"
    SubnetIds: [
        "subnet-0180e9267b994bf97",  # us-west-2a, 8187 IP addresses. 10.0.32.0/19
        "subnet-073e6e0338bf209cb",  # us-west-2b, 8187 IP addresses. 10.0.64.0/19
        "subnet-03caa10b59288efae",  # us-west-2c, 8187 IP addresses. 10.0.96.0/19
        "subnet-06dd6dbb8caf5c310",  # us-west-2d, 8187 IP addresses. 10.0.128.0/19
    ]
    # You can provision additional disk space with a conf as follows
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 100

worker_nodes:
    InstanceType: r5.12xlarge
    ImageId: ami-0a2363a9cff180a64 # Deep Learning AMI (Ubuntu) Version 30
    SecurityGroupIds:
        - "sg-08ed97f6d08d451f6"
    SubnetIds: [
        "subnet-0180e9267b994bf97",  # us-west-2a, 8187 IP addresses. 10.0.32.0/19
        "subnet-073e6e0338bf209cb",  # us-west-2b, 8187 IP addresses. 10.0.64.0/19
        "subnet-03caa10b59288efae",  # us-west-2c, 8187 IP addresses. 10.0.96.0/19
        "subnet-06dd6dbb8caf5c310",  # us-west-2d, 8187 IP addresses. 10.0.128.0/19
    ]
    InstanceMarketOptions:
        MarketType: spot


file_mounts_sync_continuously: False

rsync_exclude:
    - "**/.git"
    - "**/.git/**"

rsync_filter:
    - ".gitignore"

initialization_commands: []

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

Unfortunately, I am working in a private AWS account, so to reproduce, substitute SubnetIds that match your own AWS account.
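
For reference, a minimal boto3 sketch (the subnet ID below is just the first one from the config above; swap in your own) that checks whether a subnet auto-assigns public IPv4 addresses:

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")
resp = ec2.describe_subnets(SubnetIds=["subnet-0180e9267b994bf97"])
for subnet in resp["Subnets"]:
    # MapPublicIpOnLaunch corresponds to the "Auto-assign public IPv4 address" setting.
    print(subnet["SubnetId"], subnet["MapPublicIpOnLaunch"])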

  • [ no ] I have verified my script runs in a clean environment and reproduces the issue.
  • [ yes ] I have verified the issue also occurs with the latest wheels.
@jennakwon06 jennakwon06 added the bug Something isn't working label Jan 27, 2021
@pdames pdames self-assigned this Jan 27, 2021
pdames pushed a commit that referenced this issue Jan 30, 2021
pdames pushed a commit that referenced this issue Jan 30, 2021
@pdames
Member

pdames commented Feb 4, 2021

@jennakwon06 - could you try including use_internal_ips: True in the provider section of your autoscaler config and ensure that it allows you to run ray up successfully without assigning a public IPv4 address to the head node? For example:

provider:
  type: aws
  region: us-west-2
  availability_zone: us-west-2a,us-west-2b,us-west-2c,us-west-2d
  use_internal_ips: True

Assuming this works, we may just want to consider this an opportunity to improve existing documentation and examples in this area.

@jennakwon06
Contributor Author

Hi Patrick!

I verified that use_internal_ips: True works for a subnet with auto-assign public IPv4 address == False, so I believe that solves the original issue. My suggestion, though, is to name it use_private_ips; I think "internal" and "private" are synonymous, but "private" seems more widely used (https://docs.aws.amazon.com/vpc/latest/userguide/vpc-ip-addressing.html#vpc-private-ipv4-addresses).

One remaining issue is the use case of a custom Docker image. I have the initialization_commands and docker sections below for my custom Docker image.

initialization_commands:
    - aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 048211272910.dkr.ecr.us-west-2.amazonaws.com;

docker:
    image: "048211272910.dkr.ecr.us-west-2.amazonaws.com/jkkwon-batscli:zarr"
    container_name: "miamiml_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True

That fails with the error below when auto-assign public IPv4 address is False.

Warning: Permanently added '10.0.0.227' (ECDSA) to the list of known hosts.

Connect timeout on endpoint URL: "https://api.ecr.us-west-2.amazonaws.com/"
Error: Cannot perform an interactive login from a non TTY device
Connection to 10.0.0.227 closed.
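
As a minimal boto3 sketch of one thing worth checking (this is an assumption about the likely cause, not something verified above): the connect timeout suggests the private subnet has no route to the ECR API, so aws ecr get-login-password produces no output and docker login falls back to an interactive prompt. The snippet below lists ECR API interface endpoints for the VPC; the vpc-id is a placeholder.

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")
resp = ec2.describe_vpc_endpoints(
    Filters=[
        {"Name": "vpc-id", "Values": ["vpc-xxxxxxxx"]},  # placeholder VPC ID
        {"Name": "service-name", "Values": ["com.amazonaws.us-west-2.ecr.api"]},
    ]
)
# An empty list here (and no NAT gateway) would be consistent with the connect timeout to ECR.
print(resp["VpcEndpoints"])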

Summary of remaining issue, when a custom Docker image is in use (with use_internal_ips: True):

  • auto-assign public IPv4 address is True: ray up works
  • auto-assign public IPv4 address is False: initialization_commands fails with 'Cannot perform an interactive login'
