Configuring an AWS ParallelCluster in a CloudShell
A virtual HPC cluster can be built on a cloud platform by creating a set of instances (a head node and several worker nodes) and then configuring them to provide job control functionality with SLURM.
Amazon Web Services (AWS) provides assistance with this task through their ParallelCluster management tool. They also offer a ParallelCluster UI that provides a user-friendly interface for cluster creation and maintenance; the UI must be deployed in your AWS account before it can be used.
This page describes how to:
- Prepare an instance for using ParallelCluster.
- Configure and create a virtual cluster using the ParallelCluster command line interface.
These instructions assume that you have created and logged in to an AWS instance. The easiest way to do this is through the AWS Console's CloudShell feature. The steps in this section only need to be performed once, after which any number of ParallelClusters can be created. The two preparation tasks are:
- Create and import an SSH key
- Install the AWS ParallelCluster software
An SSH key pair must be stored with your AWS account and included in the ParallelCluster configuration in order for you to log in to the cluster's head node. Initially, logging in with a password is not allowed. The steps below will create a new key pair in AWS EC2 through the web console, but alternative creation methods will also work. For information about AWS EC2 key pairs, see their documentation.
- Visit the console's EC2 dashboard and click on "Key pairs" in the Resources panel at the top or under Network & Security in the panel on the left.
- Click the "Create key pair" button in the upper right (note that existing key pairs can be imported in the Actions dropdown).
- When the "Create key pair" view appears, enter a suitable name for the key (avoiding spaces in the name), select ".pem" as the download format, and click "Create key pair".
- The private key will automatically be downloaded to your browser's default download location in a file having the name you provided.
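As an alternative to the console steps above, a key pair can also be created directly from CloudShell with the AWS CLI. The command below is a minimal sketch assuming the example key name "MyKey"; if you create the key this way, the .pem file is written straight into your CloudShell home directory and the upload step below can be skipped (restricting the permissions with chmod is still required):
aws ec2 create-key-pair --key-name MyKey --query 'KeyMaterial' --output text > MyKey.pem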
The key file must now be imported into your CloudShell and prepared for use:
- In CloudShell, open the Actions menu (upper right) and select "Upload file", then browse to your private key file. The uploaded file will be placed in the home directory.
- For OpenSSH to accept the file when logging in to the head node, you must restrict its permissions so that only the owner can read or write it. In the CloudShell, issue a command like:
chmod 600 keyfile.pem
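To confirm that the permissions were applied, list the file; the permission string should read -rw------- (owner read/write only):
ls -l keyfile.pem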
Before creating a virtual parallel cluster from this instance, you must install the AWS ParallelCluster Python package. It is recommended that this be done in a virtual environment. The steps below create a virtual environment called "test", install ParallelCluster and also install Node.js, which is required by the underlying AWS Cloud Development Kit (AWS CDK):
- From the home directory, create and activate a new virtual environment named "test" (or you can install in a different location or with a different name):
python3 -m virtualenv test
source test/bin/activate
- Install ParallelCluster:
pip3 install aws-parallelcluster
- Install the Node Version Manager (NVM) and use it to install a version of Node.js that is known to work in this environment. As of this writing, the latest releases produce an error, so the commands below pin NVM to v0.38.0 and Node.js to 16.8.0, the most recent versions known to work:
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.38.0/install.sh | bash
chmod ug+x ~/.nvm/nvm.sh
source ~/.nvm/nvm.sh
nvm install 16.8.0
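Before continuing, you can verify that the tools were installed into the active environment (the exact version numbers reported may differ):
pcluster version
node --version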
AWS provides a command that will assist you in creating a configuration file for your parallel cluster. The command prompts you to enter values (or accept default values) for a number of options. A description of this process may help you determine the correct answers to provide. In the steps below, default values are used for any items that are not listed:
- Run the configuration generator to create a hello-world.yaml configuration file:
pcluster configure --config hello-world.yaml
- Enter the number of the desired region or accept the default.
- Choose the number of the SSH key pair that you created earlier.
- Choose slurm (#1) for the scheduler.
- Use alinux2 (#1) for the OS.
- Select 1 queue with a default name.
- Select 1 for the number of compute resources for the queue (this is the number of resource types).
- Select the desired maximum number of worker instances. These will not be allocated unless needed.
- Wait half a minute while some cluster components are created, after which the command prompt returns.
The resulting YAML file should look something like this:
Region: us-east-1
Image:
  Os: alinux2
HeadNode:
  InstanceType: t2.micro
  Networking:
    SubnetId: subnet-0bba3aba970c8d14b
  Ssh:
    KeyName: MyKey
Scheduling:
  Scheduler: slurm
  SlurmQueues:
  - Name: test
    ComputeResources:
    - Name: t2micro
      Instances:
      - InstanceType: t2.micro
      MinCount: 0
      MaxCount: 4
    Networking:
      SubnetIds:
      - subnet-0bba3aba970c8d14b
The configuration tool only sets a few of the many options that can be specified in the configuration file. You may want to edit the file by hand after its creation to add additional options. In particular, for working with I-WRF you will want larger volumes on both the head and worker nodes in order to hold the large Docker images and the Singularity sandbox. You may also want to allocate a shared volume and mount it in each instance. The example below augments the configuration above in three places to specify 50 GB root volumes for both types of instances and to allocate and mount a 100 GB shared volume named "ebs".
Region: us-east-1
Image:
  Os: alinux2
HeadNode:
  InstanceType: t2.micro
  LocalStorage:
    RootVolume:
      Size: 50
  Networking:
    SubnetId: subnet-0bba3aba970c8d14b
  Ssh:
    KeyName: MyKey
Scheduling:
  Scheduler: slurm
  SlurmQueues:
  - Name: test
    ComputeResources:
    - Name: t2micro
      Instances:
      - InstanceType: t2.micro
      MinCount: 0
      MaxCount: 4
    ComputeSettings:
      LocalStorage:
        RootVolume:
          Size: 50
    Networking:
      SubnetIds:
      - subnet-0bba3aba970c8d14b
SharedStorage:
- MountDir: /share
  Name: ebs
  StorageType: Ebs
  EbsSettings:
    VolumeType: gp3
    Size: 100
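With the configuration file complete, the cluster can be created, monitored, and logged in to using the ParallelCluster CLI. The commands below are a sketch assuming the cluster name "hello-world", the hello-world.yaml file created above, and the key file uploaded earlier; cluster creation typically takes several minutes:
# Create the cluster described by the configuration file
pcluster create-cluster --cluster-name hello-world --cluster-configuration hello-world.yaml
# Check the creation status until it reports CREATE_COMPLETE
pcluster describe-cluster --cluster-name hello-world
# Log in to the head node using your private key
pcluster ssh --cluster-name hello-world -i ~/keyfile.pem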