Deploying Virtual Clusters in the Cloud

Ben Trumbore edited this page Dec 1, 2023 · 12 revisions

Introduction

These pages describe the tasks required to create and use a virtual compute cluster on a cloud platform. They do not cover running a particular application (such as I-WRF) in the cluster. Linked examples that perform these tasks on a specific cloud platform include the platform name in the title: for example, AWS for Amazon Web Services and GCP for Google Cloud Platform. For some tasks, parallel examples may be shown for different providers or techniques. For others, the reader will need to adapt an example written for one cloud provider so that it works with a different provider.

Overview

The work of creating and using a virtual compute cluster on a cloud platform is broken down here into these sections:

  • Logging in to a shell on the cloud provider
  • Configuring the virtual cluster
  • Starting and testing the virtual cluster
  • Configuring the nodes of the virtual cluster
  • Running a job on the virtual cluster
  • Cleaning up after running a job on the virtual cluster

Logging in to a shell on the cloud provider

The steps to create a virtual HPC cluster on a cloud platform must be performed while logged in to a shell on that cloud platform. The traditional way to do this is to create a stand-alone instance for this purpose and then log in to that instance using the ssh command or a tool like PuTTY. Some cloud providers simplify this task by creating an instance for you and opening a shell on that instance within a web browser. AWS calls this functionality CloudShell.

Logging in to an AWS instance with CloudShell

Configuring the virtual cluster

This task involves specifying the characteristics of the virtual cluster you want. These may include the number of worker nodes in the cluster, the number of CPUs on each node, the operating system of the nodes, and their disk volumes. Once you have a configuration that works for your application, you can reuse it to recreate the cluster as many times as needed.
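
As a minimal sketch of a reusable configuration, the characteristics listed above can be captured in a small data structure and saved as text. The `ClusterSpec` class and its field names are hypothetical and provider-neutral; they are not the configuration format of AWS ParallelCluster or any other tool, and the example values are made up.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ClusterSpec:
    """Hypothetical, provider-neutral description of a virtual cluster."""

    name: str
    worker_nodes: int
    cpus_per_node: int
    operating_system: str
    disk_size_gb: int

    def total_cpus(self) -> int:
        """Total CPU count across all worker nodes."""
        return self.worker_nodes * self.cpus_per_node

    def to_json(self) -> str:
        """Serialize the spec so it can be stored and reused later."""
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text: str) -> "ClusterSpec":
        """Rebuild a spec from text produced by to_json()."""
        return cls(**json.loads(text))


# Example values are illustrative only.
spec = ClusterSpec(
    name="demo-cluster",
    worker_nodes=4,
    cpus_per_node=8,
    operating_system="ubuntu",
    disk_size_gb=100,
)
saved = spec.to_json()
restored = ClusterSpec.from_json(saved)
```

Keeping the specification as plain text means it can be stored alongside the application and used to recreate an equivalent cluster later.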

Configuring an AWS ParallelCluster in a CloudShell

Starting and testing the virtual cluster

A new cluster can be created from its configuration information by issuing commands in a shell on the cloud provider. Once the cluster is running, you will want to log in to it to test or use it. A cluster can be tested by running a simple job that asks each worker node to report back to the manager.
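
The check described above can be sketched as follows, assuming the test job prints one hostname per worker node (with Slurm, for example, a command such as `srun --nodes=3 hostname` produces output of this shape). The function name, node names, and sample output are hypothetical.

```python
def find_missing_nodes(expected_nodes, job_output):
    """Return the expected nodes that did not report back, sorted by name.

    job_output is the text printed by the test job, one hostname per line.
    """
    reported = {line.strip() for line in job_output.splitlines() if line.strip()}
    return sorted(set(expected_nodes) - reported)


# Hypothetical node names and job output, for illustration only.
expected = ["worker-1", "worker-2", "worker-3"]
sample_output = "worker-2\nworker-1\n"

missing = find_missing_nodes(expected, sample_output)
if missing:
    print("Nodes that did not report:", ", ".join(missing))
else:
    print("All nodes reported.")
```

An empty result means every worker node responded, which is a reasonable first signal that the cluster is usable.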

Starting and testing a ParallelCluster in CloudShell

Configuring the nodes of the virtual cluster

Configuring the nodes of your cluster involves preparing the raw virtual computers to run your particular job. This may involve installing software, copying source code to the head node and compiling it, deploying containers, copying data files, or configuring system settings on the nodes. If a virtual cluster is destroyed and then recreated, you will need to perform this configuration again on the new cluster, so it is beneficial to use tools that automate the configuration process.
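
One simple way to automate this, shown here only as a sketch, is to keep the configuration steps in an ordered list and run them with a small script that stops at the first failure. The `run_steps` helper and the two placeholder steps are hypothetical; real steps would install software, copy files, and so on, and dedicated configuration tools may be a better fit for larger setups.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each (description, argv) step in order and stop at the first failure.

    Returns the descriptions of the steps that completed successfully.
    """
    completed = []
    for description, argv in steps:
        result = subprocess.run(argv, capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError(f"step failed: {description}\n{result.stderr}")
        completed.append(description)
    return completed


# Placeholder steps that only invoke the Python interpreter; replace them
# with the commands your application actually needs.
steps = [
    ("check interpreter", [sys.executable, "--version"]),
    ("placeholder install step", [sys.executable, "-c", "print('installing')"]),
]
done = run_steps(steps)
```

Because the steps live in one list, the same script can be rerun unchanged whenever the cluster is recreated.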

Configuring a Cluster on AWS

Running a job on the virtual cluster

Once your cluster has been provisioned and configured, you can issue commands to start and manage a job on it. Management may include monitoring or cancelling the job. Slurm is a commonly used tool for starting and managing jobs on clusters.
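
As an illustration of starting a job with Slurm, the sketch below renders a small batch script. The `render_sbatch_script` helper and the example values are hypothetical; the `#SBATCH` options used (`--job-name`, `--nodes`, `--ntasks-per-node`, `--time`, `--output`) are standard Slurm options, but the right values depend on your cluster and application.

```python
def render_sbatch_script(job_name, nodes, tasks_per_node, time_limit, command):
    """Return the text of a Slurm batch script that runs command with srun."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={tasks_per_node}",
        f"#SBATCH --time={time_limit}",
        # %x expands to the job name and %j to the job ID.
        "#SBATCH --output=%x-%j.out",
        "",
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


# Illustrative values only: two nodes, one task each, a ten-minute limit.
script = render_sbatch_script(
    job_name="hello",
    nodes=2,
    tasks_per_node=1,
    time_limit="00:10:00",
    command="hostname",
)
print(script)
```

A script like this would typically be saved to a file and submitted with `sbatch`; `squeue` then shows the state of queued and running jobs, and `scancel` followed by the job ID cancels one.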

Running a Job on a Cluster

Cleaning up after running a job on the virtual cluster

When your job is finished and you want to minimize your ongoing cloud platform expenses, you can either "turn off" your virtual cluster or destroy it completely. Turning it off avoids the cost of keeping it running, but you will still incur charges for storage and for platform "objects" like instances and networks. To avoid all ongoing charges, it is important that you take the cluster down correctly so that it is thoroughly destroyed.
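
The difference between the two options can be made concrete with some rough arithmetic. The sketch below is a simplification that models only compute and storage charges; the hourly and per-gigabyte rates are invented for illustration and are not actual prices from any provider.

```python
def monthly_cost(hours_running, compute_rate_per_hour, storage_gb, storage_rate_per_gb_month):
    """Estimate one month's cost as compute charges plus storage charges."""
    compute = hours_running * compute_rate_per_hour
    storage = storage_gb * storage_rate_per_gb_month
    return compute + storage


# Hypothetical rates: 1.50 per cluster-hour and 0.10 per GB-month of storage.
HOURS_IN_MONTH = 720
running = monthly_cost(HOURS_IN_MONTH, 1.50, 500, 0.10)  # left running all month
stopped = monthly_cost(0, 1.50, 500, 0.10)               # turned off, storage kept
destroyed = 0.0                                          # nothing left to bill

print(f"running:   {running:.2f}")
print(f"stopped:   {stopped:.2f}")
print(f"destroyed: {destroyed:.2f}")
```

Under these assumed rates, a stopped cluster costs far less than a running one but still costs something, which is why a complete teardown matters when the cluster is no longer needed.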

Cleaning Up a Cluster on AWS