Issues launching AWS clusters #3781

Closed
jpuerto-psc opened this issue Sep 14, 2021 · 6 comments

@jpuerto-psc commented Sep 14, 2021

Good Afternoon Team,

Hope all is well!

I wanted to reach out because I have lately been running into issues launching Toil clusters on AWS:

toil launch-cluster xxx --leaderNodeType t2.2xlarge -z us-east-2a --keyPairName xxx --leaderStorage 100
[2021-09-14T11:39:41-0400] [MainThread] [I] [toil] Using default docker registry of quay.io/ucsc_cgl as TOIL_DOCKER_REGISTRY is not set.
[2021-09-14T11:39:41-0400] [MainThread] [I] [toil] Using default docker name of toil as TOIL_DOCKER_NAME is not set.
[2021-09-14T11:39:41-0400] [MainThread] [I] [toil] Using default docker appliance of quay.io/ucsc_cgl/toil:5.4.0-87293d63fa6c76f03bed3adf93414ffee67bf9a7-py3.6 as TOIL_APPLIANCE_SELF is not set.
[2021-09-14T11:39:41-0400] [MainThread] [I] [toil.utils.toilLaunchCluster] Creating cluster xxx...
[2021-09-14T11:39:44-0400] [MainThread] [I] [toil] Using default user-defined custom docker init command of  as TOIL_CUSTOM_DOCKER_INIT_COMMAND is not set.
[2021-09-14T11:39:44-0400] [MainThread] [I] [toil] Using default user-defined custom init command of  as TOIL_CUSTOM_INIT_COMMAND is not set.
[2021-09-14T11:39:44-0400] [MainThread] [I] [toil] Using default docker registry of quay.io/ucsc_cgl as TOIL_DOCKER_REGISTRY is not set.
[2021-09-14T11:39:44-0400] [MainThread] [I] [toil] Using default docker name of toil as TOIL_DOCKER_NAME is not set.
[2021-09-14T11:39:44-0400] [MainThread] [I] [toil] Using default docker appliance of quay.io/ucsc_cgl/toil:5.4.0-87293d63fa6c76f03bed3adf93414ffee67bf9a7-py3.6 as TOIL_APPLIANCE_SELF is not set.
[2021-09-14T11:39:45-0400] [MainThread] [I] [toil.lib.ec2] Selected Flatcar AMI: ami-07e82385de8861b75
[2021-09-14T11:39:45-0400] [MainThread] [I] [toil.lib.ec2] Creating t2.2xlarge instance(s) ...
[2021-09-14T11:39:51-0400] [MainThread] [I] [toil.lib.ec2] Creating t2.2xlarge instance(s) ...
[2021-09-14T11:40:26-0400] [MainThread] [I] [toil.provisioners.node] Attempting to establish SSH connection...
[2021-09-14T11:40:27-0400] [MainThread] [I] [toil.provisioners.node] ...SSH connection established.
[2021-09-14T11:40:27-0400] [MainThread] [I] [toil.provisioners.node] Waiting for docker on xxx to start...
[2021-09-14T11:40:28-0400] [MainThread] [I] [toil.provisioners.node] Docker daemon running
[2021-09-14T11:40:28-0400] [MainThread] [I] [toil.provisioners.node] Waiting for toil_leader Toil appliance to start...
[2021-09-14T11:40:28-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:40:48-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:41:09-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:41:30-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:41:50-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:42:11-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:42:31-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:42:52-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:43:12-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:43:33-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:43:53-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:44:13-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:44:34-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:44:54-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:45:15-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:45:35-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:45:56-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:46:16-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:46:37-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:46:57-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:47:18-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
Traceback (most recent call last):
  File "/Users/jpuerto/toil-test/venv/bin/toil", line 8, in <module>
    sys.exit(main())
  File "/Users/jpuerto/toil-test/venv/lib/python3.7/site-packages/toil/utils/toilMain.py", line 31, in main
    get_or_die(module, 'main')()
  File "/Users/jpuerto/toil-test/venv/lib/python3.7/site-packages/toil/utils/toilLaunchCluster.py", line 168, in main
    awsEc2ExtraSecurityGroupIds=options.awsEc2ExtraSecurityGroupIds)
  File "/Users/jpuerto/toil-test/venv/lib/python3.7/site-packages/toil/provisioners/aws/awsProvisioner.py", line 298, in launchCluster
    leaderNode.waitForNode('toil_leader')
  File "/Users/jpuerto/toil-test/venv/lib/python3.7/site-packages/toil/provisioners/node.py", line 75, in waitForNode
    self._waitForAppliance(role=role, keyName=keyName)
  File "/Users/jpuerto/toil-test/venv/lib/python3.7/site-packages/toil/provisioners/node.py", line 171, in _waitForAppliance
    "\nCheck if TOIL_APPLIANCE_SELF is set correctly and the container exists.")
RuntimeError: Appliance failed to start on machine with IP: xxxx
Check if TOIL_APPLIANCE_SELF is set correctly and the container exists.

Any ideas on what might be going on here? I just updated to the latest Toil this morning. Please let me know if there is any additional information that might help with debugging.

Best regards,

Juan


@jpuerto-psc (Author)

This seems to be an issue with the Docker image being chosen. By default, Toil uses the 5.4.0 image (matching the Toil version I am running). However, when I set it to use quay.io/ucsc_cgl/toil:latest, Toil can create the cluster without issue:
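For reference, the override is just the environment variable, exported before launching; the "Overriding docker appliance" log line below confirms it was picked up:

export TOIL_APPLIANCE_SELF=quay.io/ucsc_cgl/toil:latest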

toil launch-cluster xxx --leaderNodeType t2.2xlarge -z us-east-2a --keyPairName xxx --leaderStorage 100
[2021-09-20T10:34:10-0400] [MainThread] [I] [toil] Using default docker registry of quay.io/ucsc_cgl as TOIL_DOCKER_REGISTRY is not set.
[2021-09-20T10:34:10-0400] [MainThread] [I] [toil] Using default docker name of toil as TOIL_DOCKER_NAME is not set.
[2021-09-20T10:34:10-0400] [MainThread] [I] [toil] Overriding docker appliance of quay.io/ucsc_cgl/toil:5.4.0-87293d63fa6c76f03bed3adf93414ffee67bf9a7-py3.6 with quay.io/ucsc_cgl/toil:latest from TOIL_APPLIANCE_SELF.
[2021-09-20T10:34:11-0400] [MainThread] [I] [toil.utils.toilLaunchCluster] Creating cluster xxx...
[2021-09-20T10:34:13-0400] [MainThread] [I] [toil] Using default user-defined custom docker init command of  as TOIL_CUSTOM_DOCKER_INIT_COMMAND is not set.
[2021-09-20T10:34:13-0400] [MainThread] [I] [toil] Using default user-defined custom init command of  as TOIL_CUSTOM_INIT_COMMAND is not set.
[2021-09-20T10:34:13-0400] [MainThread] [I] [toil] Using default docker registry of quay.io/ucsc_cgl as TOIL_DOCKER_REGISTRY is not set.
[2021-09-20T10:34:13-0400] [MainThread] [I] [toil] Using default docker name of toil as TOIL_DOCKER_NAME is not set.
[2021-09-20T10:34:13-0400] [MainThread] [I] [toil] Overriding docker appliance of quay.io/ucsc_cgl/toil:5.4.0-87293d63fa6c76f03bed3adf93414ffee67bf9a7-py3.6 with quay.io/ucsc_cgl/toil:latest from TOIL_APPLIANCE_SELF.
[2021-09-20T10:34:14-0400] [MainThread] [I] [toil.lib.ec2] Selected Flatcar AMI: ami-07e82385de8861b75
[2021-09-20T10:34:14-0400] [MainThread] [I] [toil.lib.ec2] Creating t2.2xlarge instance(s) ...
[2021-09-20T10:34:20-0400] [MainThread] [I] [toil.lib.ec2] Creating t2.2xlarge instance(s) ...
[2021-09-20T10:34:58-0400] [MainThread] [I] [toil.provisioners.node] Attempting to establish SSH connection...
[2021-09-20T10:35:00-0400] [MainThread] [I] [toil.provisioners.node] ...SSH connection established.
[2021-09-20T10:35:00-0400] [MainThread] [I] [toil.provisioners.node] Waiting for docker on xxx to start...
[2021-09-20T10:35:00-0400] [MainThread] [I] [toil.provisioners.node] Docker daemon running
[2021-09-20T10:35:00-0400] [MainThread] [I] [toil.provisioners.node] Waiting for toil_leader Toil appliance to start...
[2021-09-20T10:35:00-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-20T10:35:21-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-20T10:35:41-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-20T10:36:02-0400] [MainThread] [I] [toil.provisioners.node] ...Toil appliance started
Warning: Permanently added 'xxx' (ECDSA) to the list of known hosts.
total: matches=0  hash_hits=0  false_alarms=0 data=429
[2021-09-20T10:36:03-0400] [MainThread] [I] [toil.utils.toilLaunchCluster] Cluster created successfully.

@jpuerto-psc (Author)

Follow-up: when using the latest image, I am having issues with toil-cwl-runner:

[2021-09-20T15:10:33+0000] [MainThread] [I] [toil.worker] Loaded body Job('CWLJob' /opt/fastqc_wrapper.py 88970311-6ded-4a4e-82e8-35c2b71c45cb) from description 'CWLJob' /opt/fastqc_wrapper.py 88970311-6ded-4a4e-82e8-35c2b71c45cb
	Traceback (most recent call last):
	  File "/usr/local/lib/python3.8/dist-packages/toil/cwl/cwltoil.py", line 621, in visit
	    deref = downloadHttpFile(path)
	  File "/usr/local/lib/python3.8/dist-packages/cwltool/utils.py", line 441, in downloadHttpFile
	    r = cache_session.get(httpurl, stream=True)
	  File "/usr/local/lib/python3.8/dist-packages/requests/sessions.py", line 555, in get
	    return self.request('GET', url, **kwargs)
	  File "/usr/local/lib/python3.8/dist-packages/requests/sessions.py", line 542, in request
	    resp = self.send(prep, **send_kwargs)
	  File "/usr/local/lib/python3.8/dist-packages/requests/sessions.py", line 649, in send
	    adapter = self.get_adapter(url=request.url)
	  File "/usr/local/lib/python3.8/dist-packages/requests/sessions.py", line 742, in get_adapter
	    raise InvalidSchema("No connection adapters were found for {!r}".format(url))
	requests.exceptions.InvalidSchema: No connection adapters were found for 'file:///tmp/hca-immune-scrnaseq/MantonCB7_HiSeq_6_S22_L006_R2_001.fastq.gz'

This issue was being tracked in #3667, and I had thought that fix was included in the latest release?
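For anyone hitting the same traceback: the failure is that a file:// URI is handed to downloadHttpFile, and requests only ships connection adapters for http and https. A minimal sketch of the kind of scheme guard a fix needs (deref_location is a hypothetical name, not Toil's actual function):

from urllib.parse import urlparse

from cwltool.utils import downloadHttpFile  # the helper shown in the traceback

def deref_location(path: str) -> str:
    # Only hand real http(s) URLs to an HTTP session; file:// URIs
    # are already local and must not go through requests.
    if urlparse(path).scheme in ("http", "https"):
        return downloadHttpFile(path)
    return path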

@jonathanxu18 (Contributor)

Hi Juan,

Thanks for reaching out!

It looks like quay.io/ucsc_cgl/toil:5.4.0-87293d63fa6c76f03bed3adf93414ffee67bf9a7-py3.6 should refer to the releases/5.4.0 tag (judging from 87293d6), so Toil should know to look for an appliance of itself, which would be quay.io/ucsc_cgl/toil:5.4.0-py3.6. But this may not be working right now: Toil releases point to themselves using the commit-hash tag rather than the :5.4.0-py3.6 style tag, and we accidentally deleted the commit-hash tags while cleaning up quay.io in #3772. We're trying to see if we can re-create the commit tags for the release versions.
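For reference, the default appliance name breaks down like this (the environment variables are the ones named in the log output above):

<TOIL_DOCKER_REGISTRY>/<TOIL_DOCKER_NAME>:<Toil version>-<commit hash>-py<Python version>
quay.io/ucsc_cgl/toil:5.4.0-87293d63fa6c76f03bed3adf93414ffee67bf9a7-py3.6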

For now, you could run make push_docker to build a Docker appliance for the commit and push it up to your own repository, e.g. quay.io/jpuerto-psc/toil. Then set TOIL_APPLIANCE_SELF=quay.io/jpuerto-psc/toil, and Toil would look for quay.io/jpuerto-psc/toil:5.4.0-87293d63fa6c76f03bed3adf93414ffee67bf9a7-py3.6, which would exist.
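A minimal sketch of that workaround (the repository name is an example, and it assumes the Makefile reads TOIL_DOCKER_REGISTRY, the variable named in the log output, to pick the push target):

# build the appliance for the current commit and push it to your own registry
export TOIL_DOCKER_REGISTRY=quay.io/jpuerto-psc
make push_docker
# point Toil at your copy of the appliance
export TOIL_APPLIANCE_SELF=quay.io/jpuerto-psc/toil
toil launch-cluster xxx --leaderNodeType t2.2xlarge -z us-east-2a --keyPairName xxx --leaderStorage 100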

For the issues with toil-cwl-runner, the :latest tag may just be a 5.4.0 image, so it wouldn't have the changes tracked in #3667. To see what the :latest image is, you can go to https://quay.io/repository/ucsc_cgl/toil?tab=tags, filter to latest, and click on the hash on the right side of the table; it should show that it was created with a 5.4.0 image, which is the latest release.

@jpuerto-psc (Author) commented Sep 22, 2021

@jonathanxu18 thanks for all of this! I had to use Docker Hub since I don't have quay.io access, but this seemed to work fine. The tag was a bit odd: 5.4.0-87293d63fa6c76f03bed3adf93414ffee67bf9a7-dirty-py3.7. Also, when I launched the cluster, I got this as the MOTD:

This is the Toil appliance. You can run your Toil script directly on the appliance.
Run toil <workflow>.py --help to see all options for running your workflow.
For more information see http://toil.readthedocs.io/en/latest/

Copyright (C) 2015-2020 Regents of the University of California

Version: quay.io/ucsc_cgl/toil:5.5.0a1-32dcbf27cf66b6a7015d381abf2414d9dbf41570-py3.7

I did build this from the releases/5.4.0 tag, so I'm not sure why it's showing both your repository and the wrong tag in the message.

The only other thing is the active issue in #3667: is this not part of the 5.4.0 release then? If so, would it be okay for me to run make push_docker from the latest master? That should solve the issue, since those changes are merged into master, right?

@jonathanxu18 (Contributor) commented Sep 28, 2021

@jpuerto-psc The tag has 'dirty' in it because you may have had uncommitted changes in git when you created the Docker image. You can take a look here to see when it decides to include it.
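Illustratively, that suffix follows the usual git describe convention, where -dirty is appended whenever the working tree has uncommitted changes (this is the general git behavior, not necessarily Toil's exact mechanism):

git describe --tags --dirty
# prints e.g. releases/5.4.0-dirty if anything is modified but not committed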

The MOTD uses whatever TOIL_APPLIANCE_SELF is set to when you run make push_docker, so it might show the wrong tag if you set it after the image was created. This shows what TOIL_APPLIANCE_SELF defaults to.
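Assuming the behavior described above, avoiding the mismatch is a matter of ordering (the image name here is a made-up example):

# set the appliance name before building, so the MOTD baked into the image
# matches the repository you actually push to
export TOIL_APPLIANCE_SELF=docker.io/jpuerto/toil:5.4.0-py3.7
make push_docker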

Yep, those changes should be merged into master, so you could run make push_docker from there. The 5.5.0 release also just got pushed this past weekend, so you can use that image too.

@adamnovak (Member)

We're going to open some new issue(s) about making our reported versions and release tag names make a bit more sense. It sounds like you managed to get a working setup, @jpuerto-psc, so I am going to close this out.
