This reference architecture provides a set of templates for deploying Open OnDemand (OOD) with AWS CloudFormation and integration points for both AWS ParallelCluster and AWS Parallel Computing Service (AWS PCS).
The primary components of the solution are:
- An Application Load Balancer (ALB) as the entry point to your OOD portal
- An Auto Scaling group for the OOD portal
- An AWS Managed Microsoft AD directory
- A Network Load Balancer (NLB) to provide a single point of connectivity to the Managed Microsoft AD directory
- An Amazon Elastic File System (EFS) file system for user home directories
- An Aurora MySQL database to store Slurm accounting data
- Automation via Amazon EventBridge to automatically register and deregister ParallelCluster HPC clusters with OOD
This solution is compatible with the following AWS HPC services and software versions:
- AWS ParallelCluster v3.13.0
- AWS Parallel Computing Service (AWS PCS)
- Open OnDemand v4.0
The deployment process involves several key steps to set up Open OnDemand with AWS ParallelCluster or AWS PCS integration. Follow these steps carefully to ensure a successful deployment. Before you begin, make sure you have the following prerequisites in place:
- AWS CLI v2 installed and configured with appropriate credentials
- Domain name and hosted zone in Route 53 (required for custom domain setup)
- Basic understanding of AWS ParallelCluster or AWS PCS if planning to integrate HPC clusters
Two deployment options are available:
- All-in-one deployment (recommended for first-time users or sandbox environments)
- Modular deployment (for advanced users)
All-in-one deployment, including infrastructure and Open OnDemand:

1. Deploy CloudFormation assets to S3:

   ```bash
   ./deploy-assets.sh
   ```

   Note: This script uploads all required CloudFormation templates and assets to an S3 bucket in your AWS account.

2. Deploy the all-in-one stack ood_full.yml via CloudFormation with the following parameters:
Parameter | Description | Type | Default |
---|---|---|---|
DomainName | Domain name not including the top level domain | String | hpclab |
TopLevelDomain | TLD for your domain (e.g. local, com, etc.) | String | local |
WebsiteDomainName | Domain name for world facing website | String | - |
HostedZoneId | Hosted Zone Id for Route53 Domain | String | - |
PortalAllowedIPCIDR | IP CIDR for access to the Portal | String | - |
Branch | Branch of the code to deploy. Only use this when testing changes to the solution | String | main |
SlurmVersion | Version of slurm to install | String | 24.05.7 |
DeploymentAssetBucketName | Deployment Asset Bucket Name | String | ood-assets |
Note: DeploymentAssetBucketName is the output from step 1 (deploy assets).
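If you prefer the command line to the console, the all-in-one stack can also be deployed with the AWS CLI. The following is a minimal sketch only: it assumes the template lives at assets/cloudformation/ood_full.yml in your clone of the repository, and every parameter value shown is a placeholder to replace with your own.

```bash
# Sketch only: deploy the all-in-one stack with the AWS CLI.
# Template path and all parameter values are assumptions/placeholders.
aws cloudformation deploy \
  --stack-name ood \
  --template-file assets/cloudformation/ood_full.yml \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM \
  --parameter-overrides \
    DomainName=hpclab \
    TopLevelDomain=local \
    WebsiteDomainName=ood.example.com \
    HostedZoneId=Z0123456789EXAMPLE \
    PortalAllowedIPCIDR=203.0.113.0/24 \
    DeploymentAssetBucketName=<asset-bucket-from-deploy-assets>
```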
Deploy stacks individually:
- Deploy infrastructure (networking and Managed Microsoft AD): infra.yml
- Deploy Slurm Accounting Database (only if integrating with ParallelCluster): slurm_accounting_db.yml
- Deploy Open OnDemand: ood.yml
Parameter | Description | Default |
---|---|---|
DomainName | Domain name not including the top level domain | hpclab |
TopLevelDomain | TLD for your domain (e.g. local, com, etc.) | local |
WebsiteDomainName | Domain name for world facing website | - |
HostedZoneId | Hosted Zone Id for Route53 Domain | - |
PortalAllowedIPCIDR | IP CIDR for access to the Portal | - |
Branch | Branch of the code to deploy. Only use this when testing changes to the solution | main |
DeploymentAssetBucketName | Deployment Asset Bucket Name | - |
VPC | VPC for OOD deployment | - |
PrivateSubnet | Private subnet for OOD deployment | - |
PublicSubnet | Public subnet for OOD deployment | - |
BindDN | Bind DN for the directory | CN=Admin,OU=Users,OU=hpclab,DC=hpclab,DC=local |
LDAPSearchBase | LDAP Search Base | DC=hpclab,DC=local |
LDAPUri | LDAP URI for Managed AD | - |
BindPasswordSecretArn | BIND Password Secret ARN for Admin user in Managed AD | - |
ClusterConfigBucket | S3 Bucket where Cluster Configuration items are stored | - |
NodeArchitecture | Processor architecture for the login and compute node instances | x86 |
SlurmVersion | Version of Slurm to install. Select 24.11.5 or greater if using AWS PCS | 24.05.7 |
AccountingPolicyEnforcement | Specify which Slurm accounting policies to enforce | none |
Once deployed, navigate to the URL found in the Open OnDemand stack CloudFormation outputs. A default admin user is created as part of the deployment and can be used to validate that login works correctly.
Username | Password |
---|---|
Admin | Retrieve the secret value from Secrets Manager. The secret ARN is in the CloudFormation outputs under ADAdministratorSecretArn. |
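To retrieve that password from the command line, a short sketch (assuming your Open OnDemand stack is named ood):

```bash
# Look up the ADAdministratorSecretArn output, then read the secret value.
SECRET_ARN=$(aws cloudformation describe-stacks --stack-name ood \
  --query "Stacks[0].Outputs[?OutputKey=='ADAdministratorSecretArn'].OutputValue" \
  --output text)
aws secretsmanager get-secret-value --secret-id "$SECRET_ARN" \
  --query SecretString --output text
```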
The OOD solution is built to integrate with AWS ParallelCluster; HPC clusters can be created and automatically registered with the portal.
To deploy a ParallelCluster cluster, refer to Setting up AWS ParallelCluster to get started. This includes (but is not limited to):
- Install AWS ParallelCluster CLI
- Create ParallelCluster configuration file
To integrate a ParallelCluster HPC cluster with Open OnDemand, you need to create a ParallelCluster configuration file that defines the head node, compute queues, and login node settings required for integration.
The configuration file serves as the blueprint for your HPC cluster and ensures proper integration with the Open OnDemand portal for job submission and management.
You can either:
- Use the provided script to automatically generate a configuration file (recommended)
- Manually create a configuration file following the guidelines below
You can automatically generate a ParallelCluster configuration file using the provided scripts/create_sample_pcluster_config.sh script. This script will create a properly configured pcluster configuration file with all the necessary settings for Open OnDemand integration.
Example to create a pcluster-config.yml file:

```bash
./create_sample_pcluster_config.sh ood
```
Usage:

```
Usage: ./create_sample_pcluster_config.sh <stack-name> [region] [domain1] [domain2]
  stack-name: The name of the stack you deployed
  region: The region of the stack you deployed
  ad_domain: The LDAP DN (e.g. DC=hpclab,DC=local)
```
To create the ParallelCluster configuration file, refer to the following information (a combined sketch is shown after this list).

For the head node (HeadNode section):
- SubnetId: PrivateSubnets from the OOD stack outputs
- AdditionalSecurityGroups: HeadNodeSecurityGroup from the CloudFormation outputs
- AdditionalIamPolicies: HeadNodeIAMPolicyArn from the CloudFormation outputs
- OnNodeConfigured:
  - Script: the ClusterConfigBucket CloudFormation output, in the format s3://$ClusterConfigBucket/pcluster_head_node.sh
  - Args: the Open OnDemand CloudFormation stack name

For the compute nodes (SlurmQueues section):
- SubnetId: PrivateSubnets from the OOD stack outputs
- AdditionalSecurityGroups: ComputeNodeSecurityGroup from the CloudFormation outputs
- AdditionalIamPolicies: ComputeNodeIAMPolicyArn from the CloudFormation outputs
- OnNodeConfigured:
  - Script: the ClusterConfigBucket CloudFormation output, in the format s3://$ClusterConfigBucket/pcluster_worker_node.sh
  - Args: the Open OnDemand CloudFormation stack name

For the login nodes (LoginNodes section):
- OnNodeConfigured:
  - Script: the ClusterConfigBucket CloudFormation output, in the format s3://$ClusterConfigBucket/configure_login_nodes.sh
  - Args: the Open OnDemand CloudFormation stack name
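As a rough illustration only, these settings might appear in a ParallelCluster configuration like the sketch below. All instance types, OS, subnet IDs, security group IDs, policy ARNs, the bucket name, and the stack name are placeholders; the pcluster-config.yml generated by the script above is the authoritative reference.

```yaml
# Sketch only -- every ID, ARN, name, OS, and instance type is a placeholder.
Image:
  Os: alinux2
HeadNode:
  InstanceType: c5.large
  Networking:
    SubnetId: subnet-0123456789abcdef0            # PrivateSubnets output
    AdditionalSecurityGroups:
      - sg-0aaaaaaaaaaaaaaaa                       # HeadNodeSecurityGroup output
  Iam:
    AdditionalIamPolicies:
      - Policy: arn:aws:iam::111122223333:policy/HeadNodePolicy   # HeadNodeIAMPolicyArn output
  CustomActions:
    OnNodeConfigured:
      Script: s3://<ClusterConfigBucket>/pcluster_head_node.sh
      Args:
        - ood                                      # Open OnDemand stack name
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: general
      ComputeResources:
        - Name: cr-general
          InstanceType: c5.large
          MinCount: 0
          MaxCount: 4
      Networking:
        SubnetIds:
          - subnet-0123456789abcdef0               # PrivateSubnets output
        AdditionalSecurityGroups:
          - sg-0bbbbbbbbbbbbbbbb                    # ComputeNodeSecurityGroup output
      Iam:
        AdditionalIamPolicies:
          - Policy: arn:aws:iam::111122223333:policy/ComputeNodePolicy  # ComputeNodeIAMPolicyArn output
      CustomActions:
        OnNodeConfigured:
          Script: s3://<ClusterConfigBucket>/pcluster_worker_node.sh
          Args:
            - ood
LoginNodes:
  Pools:
    - Name: login
      InstanceType: c5.large
      Count: 1
      Networking:
        SubnetIds:
          - subnet-0123456789abcdef0
      CustomActions:
        OnNodeConfigured:
          Script: s3://<ClusterConfigBucket>/configure_login_nodes.sh
          Args:
            - ood
```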
The pam_slurm_adopt module can be enabled on compute nodes in ParallelCluster to prevent users from SSHing to nodes on which they do not have a running job.
In your ParallelCluster config, make the following changes (a combined sketch is shown after this list):
- Allow Slurm to check whether any steps have been launched: add the CustomSlurmSetting PrologFlags: "contain" in the Scheduling section. Refer to the Slurm configuration documentation for more details on this setting.
Example:
```yaml
SlurmSettings:
  CustomSlurmSettings:
    - PrologFlags: "contain"
```
- Ensure compute nodes are exclusively allocated to users: add the CustomSlurmSetting ExclusiveUser: "YES" in the SlurmQueues section. Refer to the Slurm partition configuration documentation for more details.
Example:
```yaml
CustomSlurmSettings:
  ExclusiveUser: "YES"
```
- Add configure_pam_slurm_adopt.sh to OnNodeConfigured in the CustomActions section.
Example:
```yaml
CustomActions:
  OnNodeConfigured:
    Sequence:
      - Script: s3://$ClusterConfigBucket/pcluster_worker_node.sh
        Args:
          - Open OnDemand CloudFormation stack name
      - Script: s3://$ClusterConfigBucket/configure_pam_slurm_adopt.sh
```
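Taken together, the three changes above sit in the Scheduling section roughly as in this sketch; the queue name, bucket, and stack name are placeholders, and other required queue settings are omitted for brevity.

```yaml
# Sketch: placement of the pam_slurm_adopt-related settings (placeholders throughout).
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    CustomSlurmSettings:
      - PrologFlags: "contain"          # step-launch tracking
  SlurmQueues:
    - Name: compute                     # placeholder queue name
      CustomSlurmSettings:
        ExclusiveUser: "YES"            # exclusive node allocation per user
      CustomActions:
        OnNodeConfigured:
          Sequence:                     # worker setup, then pam_slurm_adopt setup
            - Script: s3://<ClusterConfigBucket>/pcluster_worker_node.sh
              Args:
                - <Open OnDemand stack name>
            - Script: s3://<ClusterConfigBucket>/configure_pam_slurm_adopt.sh
```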
Note: If you automatically created your ParallelCluster configuration using the recommended approach, a desktop queue was created for you.
You can enable interactive desktops on the portal server by creating a queue in ParallelCluster to be used for desktop sessions.
Note: This requires you to have a compute queue with pcluster_worker_node_desktop.sh as your OnNodeConfigured script.
Snippet from ParallelCluster config:

```yaml
CustomActions:
  OnNodeConfigured:
    Script: >-
      s3://{{ClusterConfigBucket}}/pcluster_worker_node_desktop.sh
    Args:
      - {{OOD_STACK_NAME}}
```
- OOD_STACK_NAME is the name of your Open OnDemand CloudFormation stack (e.g. ood)
- ClusterConfigBucket is the ClusterConfigBucket output from CloudFormation
Slurm configuration can be maintained outside of the Open OnDemand deployment.
The ClusterConfigBucket S3 bucket (available in the CloudFormation outputs) stores Slurm configurations under the /slurm prefix. Configuration files that would normally reside in /etc/slurm can be uploaded to this prefix, and an EventBridge rule will automatically sync them to the Open OnDemand server. A set of Slurm configuration files is stored under this prefix by default.
To update the Slurm configuration on the Open OnDemand server, copy any configuration file(s) to the ClusterConfigBucket S3 bucket under the /slurm prefix. The configuration will be automatically synced to the Open OnDemand server.
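For example, pushing an updated slurm.conf from a machine with the AWS CLI configured might look like the following sketch; the stack name ood is an example, and the ClusterConfigBucket output key comes from the CloudFormation outputs.

```bash
# Sketch: copy a Slurm configuration file into the /slurm prefix so the
# EventBridge rule syncs it to the Open OnDemand server.
CLUSTER_CONFIG_BUCKET=$(aws cloudformation describe-stacks --stack-name ood \
  --query "Stacks[0].Outputs[?OutputKey=='ClusterConfigBucket'].OutputValue" \
  --output text)
aws s3 cp slurm.conf "s3://${CLUSTER_CONFIG_BUCKET}/slurm/slurm.conf"
```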
This solution is now compatible with AWS Parallel Computing Service (AWS PCS), which provides a fully managed HPC service that simplifies cluster deployment and management. AWS PCS offers several key benefits:
- Managed Infrastructure: AWS handles the underlying infrastructure, including compute, networking, and storage, reducing operational overhead
- Built-in Security: Integrated with AWS security services and best practices for HPC workloads
- Cost Optimization: Pay only for the compute resources you use with flexible scaling options
- Simplified Management: Automated cluster lifecycle management and monitoring through the AWS Management Console
- Native AWS Integration: Seamless integration with other AWS services like Amazon FSx for Lustre and Amazon EFS
Refer to the following getting started guides:
- Get started with AWS PCS - https://docs.aws.amazon.com/pcs/latest/userguide/getting-started.html
- Get started with AWS CloudFormation and AWS PCS - https://docs.aws.amazon.com/pcs/latest/userguide/get-started-cfn.html
- HPC Recipe for getting started with AWS PCS - aws-hpc-recipes/recipes/pcs/getting_started
- This guide includes many helpful CloudFormation templates to get started
Option 1: Use the scripts/deploy_pcs.sh script to deploy assets/cloudformation/pcs-starter.yml. This script is a helper utility that deploys the pcs-starter template by pulling its parameters from CloudFormation outputs.
- Open AWS CloudShell
- Download deploy_pcs.sh:

  ```bash
  curl -o deploy_pcs.sh https://raw.githubusercontent.com/aws-samples/open-on-demand-on-aws/refs/heads/main/scripts/deploy_pcs.sh
  ```

- Make the script executable:

  ```bash
  chmod +x deploy_pcs.sh
  ```

- Run the deployment script:

  ```bash
  ./deploy_pcs.sh --infra-stack <infra-stack-name> --ood-stack <ood-stack-name>
  ```
Script help:

```
Usage: ./scripts/deploy_pcs.sh [options]
Options:
  --infra-stack NAME         Name of the infra CloudFormation stack (required)
  --ood-stack NAME           Name of the ood CloudFormation stack (required)
  --region REGION            AWS region to deploy to (optional, defaults to AWS CLI configured region)
  --cluster-name NAME        Name of the PCS cluster (optional, defaults to pcs-starter)
  --node-architecture ARCH   Processor architecture for nodes (optional, defaults to x86)
                             Allowed values: x86, Graviton
  --host-mount-point PATH    Mount path on the host (optional, defaults to /shared)
  --branch BRANCH            Branch of the Open On Demand on AWS repository to use (optional, defaults to main)
  --help                     Display this help message
```
Example:

```bash
./scripts/deploy_pcs.sh --infra-stack infra-stack --ood-stack ood --cluster-name my-pcs-cluster --node-architecture x86 --region us-east-1
```
Option 2: Manually deploy assets/cloudformation/pcs-starter.yml via CloudFormation
To deploy the CloudFormation template manually:
- Open the AWS CloudFormation console
- Click "Create stack" and select "With new resources (standard)"
- Under "Template source", select "Upload a template file" and upload pcs-starter.yml
- Click "Next"
- Enter a stack name (e.g., pcs-starter)
- Configure the following parameters:
Parameter | Description | Default |
---|---|---|
VPC | VPC for PCS cluster | - |
PrivateSubnet | Private subnet | - |
PublicSubnet | Public subnet | - |
ClusterName | Name of the PCS cluster | - |
HPCClusterSecurityGroupId | Security group for PCS cluster controller and nodes | - |
EFSFileSystemId | EFS file system ID | - |
EfsFilesystemSecurityGroupId | Security group for EFS filesystem | - |
NodeArchitecture | Processor architecture for the login and compute node instances | x86 |
SlurmVersion | Version of Slurm to use | 24.11 |
DomainName | Domain name | hpclab |
TopLevelDomain | Top level domain | local |
ADAdministratorSecret | AD Administrator Secret | - |
BindDN | Bind DN for the directory | CN=Admin,OU=Users,OU=hpclab,DC=hpclab,DC=local |
LDAPSearchBase | LDAP Search Base | DC=hpclab,DC=local |
HostMountPoint | Mount path on the host | /shared |
ClusterConfigBucket | S3 Bucket where Cluster Configuration items are stored | - |
LDAPUri | LDAP URI for Managed AD | - |
BindPasswordSecretArn | BIND Password Secret ARN for Admin user in Managed AD | - |
AccountingPolicyEnforcement | Specify which Slurm accounting policies to enforce | none |
- Click "Next" to configure stack options
- Click "Next" to review
- Check the acknowledgment box and click "Create stack"
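If you would rather script this step than click through the console, a hedged CLI sketch follows; all values shown are placeholders, and the remaining parameters from the table above must be supplied the same way.

```bash
# Sketch only: deploy pcs-starter.yml with the AWS CLI.
# All values are placeholders; add the remaining parameters from the table above as needed.
aws cloudformation deploy \
  --stack-name pcs-starter \
  --template-file assets/cloudformation/pcs-starter.yml \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM \
  --parameter-overrides \
    VPC=vpc-0123456789abcdef0 \
    PrivateSubnet=subnet-0123456789abcdef0 \
    PublicSubnet=subnet-0fedcba9876543210 \
    ClusterName=pcs-starter \
    EFSFileSystemId=fs-0123456789abcdef0 \
    ClusterConfigBucket=<ClusterConfigBucket output> \
    LDAPUri=<LDAPUri output> \
    BindPasswordSecretArn=<BindPasswordSecretArn output>
```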
The scripts/s3_script_runner.sh script uses AWS Systems Manager (SSM) to send a command to the Open OnDemand EC2 instance. This command executes a configuration script that sets up both the sackd and slurm services to work with your PCS cluster.
- Open AWS CloudShell
- Download s3_script_runner.sh:

  ```bash
  curl -o s3_script_runner.sh https://raw.githubusercontent.com/aws-samples/open-on-demand-on-aws/refs/heads/main/scripts/s3_script_runner.sh
  ```

- Make the script executable:

  ```bash
  chmod +x s3_script_runner.sh
  ```

- Configure the required parameters for s3_script_runner.sh. Set up the parameters needed for the SSM send-command call that configures the Open OnDemand instance with PCS cluster settings. These parameters include:
  - CLUSTER_ID: The ID of the PCS cluster
  - CLUSTER_CONFIG_BUCKET: The S3 bucket storing cluster configurations
  - INSTANCE_ID: The EC2 instance ID of the Open OnDemand web portal
  - OOD_STACK: The name of the Open OnDemand CloudFormation stack
  - PCS_CLUSTER_STACK: The name of the PCS Getting Started CloudFormation stack
```bash
OOD_STACK="{OOD_STACK_NAME}"
PCS_CLUSTER_STACK="{PCS_CLUSTER_STACK}"
CLUSTER_ID=$(aws cloudformation describe-stacks --stack-name $PCS_CLUSTER_STACK --query "Stacks[0].Outputs[?OutputKey=='ClusterId'].OutputValue" --output text)
CLUSTER_CONFIG_BUCKET=$(aws cloudformation describe-stacks --stack-name $OOD_STACK --query "Stacks[0].Outputs[?OutputKey=='ClusterConfigBucket'].OutputValue" --output text)
INSTANCE_ID=$(aws ec2 describe-instances --filters \
  "Name=tag:ood,Values=webportal-ood" \
  "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].InstanceId" \
  --output text)
```
- Replace {OOD_STACK_NAME} with the name of your Open OnDemand stack (e.g. ood)
- Replace {PCS_CLUSTER_STACK} with the name of your PCS Getting Started stack (e.g. pcs-starter)
To validate that you have all of the parameters set, run the following command:

```bash
for var in OOD_STACK PCS_CLUSTER_STACK CLUSTER_ID CLUSTER_CONFIG_BUCKET INSTANCE_ID; do
  echo "$var: ${!var:-<null>}"
done
```
The output should be similar to the following:
```
OOD_STACK: ood
PCS_CLUSTER_STACK: pcs-starter
CLUSTER_ID: pcs_6x5nsf236m
CLUSTER_CONFIG_BUCKET: ood-clusterconfigbucket-nrl56bjgpwru
INSTANCE_ID: i-05a3ce6b349b232f3
```
- Run s3_script_runner.sh:

  ```bash
  CLUSTER_NAME="{CLUSTER_NAME}"
  COMMAND_ID=$(./s3_script_runner.sh \
    --instance-id "$INSTANCE_ID" \
    --document-name "${PCS_CLUSTER_STACK}-S3ScriptRunner" \
    --bucket-name "$CLUSTER_CONFIG_BUCKET" \
    --script-key "configure_ood_for_pcs.sh" \
    --script-args "--ood-stack $OOD_STACK --cluster-name $CLUSTER_NAME --cluster-id $CLUSTER_ID --region $AWS_REGION")
  ```
- Replace {CLUSTER_NAME} with the name of your PCS cluster (e.g. pcs-starter)
  - Note: This is the ClusterName parameter supplied when you deployed the PCS Getting Started CloudFormation stack

This will output the CommandId of the command being run (example below):

```
ccc5375a-e192-4d36-af57-5dd7a7740f0d
```
- Inspect the SSM results using the following command to verify the configuration was successful. This command will show the detailed output of the script execution, including:
- Command execution status
- Standard output showing the configuration steps
- Any errors that may have occurred
- Execution timing information
```bash
aws ssm get-command-invocation \
  --command-id $COMMAND_ID \
  --instance-id $INSTANCE_ID
```
Reviewing the result, you should see Status == Success.
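If you only want the final status, the same call can be narrowed with a JMESPath query; this is simply a convenience variant of the command above using the same variables.

```bash
# Print only the execution status (e.g. Success, InProgress, Failed).
aws ssm get-command-invocation \
  --command-id "$COMMAND_ID" \
  --instance-id "$INSTANCE_ID" \
  --query Status \
  --output text
```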
There can be errors submitting jobs after integrating OOD with ParallelCluster if the cluster has not yet been registered in the Slurm accounting database. Review the logs in /var/log/sbatch.log and check for errors related to available clusters.
Sample log entry:

```
sbatch: error: No cluster 'sandbox-cluster' known by database.
sbatch: error: 'sandbox-cluster' can't be reached now, or it is an invalid entry for --cluster. Use 'sacctmgr list clusters' to see available clusters.
```
If this occurs, restart both the slurmctld and slurmdbd services:

```bash
systemctl restart slurmctld
systemctl restart slurmdbd
```
Once restarted, check the available clusters to verify the cluster is listed:

```bash
sacctmgr list clusters
```
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file for details.