This repository provides reference architectures and deployment templates for setting up distributed training clusters using AWS Parallel Computing Service (PCS). AWS PCS is a fully managed service that makes it easy to run and scale HPC workloads using the Slurm scheduler. These architectures are optimized for machine learning workloads and include configurations for high-performance computing instances (P and Trn EC2 families) with shared filesystems (FSx for Lustre and OpenZFS).
Upstream Repository: These templates are based on aws-samples/aws-hpc-recipes, customized for ML workloads: container support (Enroot/Pyxis) installable at first boot without an AMI build, built-in monitoring, updated Slurm versions (25.05/25.11), and dedicated P5/P6 multi-NIC EFA templates. The templates in this repository are maintained independently and may diverge from the upstream recipes.
- One click to an ML-training-ready cluster: a single CloudFormation stack gives you a complete, ready-to-train environment — Slurm scheduler, GPU compute with EFA, shared FSx storage, the Enroot/Pyxis container runtime, and monitoring — with only the Availability Zone to choose. Submit distributed training jobs minutes after launch.
- Container runtime included: Enroot/Pyxis is set up automatically, so srun --container-image=... works out of the box for containerized training.
- Monitoring built in: Prometheus + Grafana + GPU (DCGM) dashboards deploy automatically on the login node (DeployMonitoring=true, on by default); reach it privately via SSM port-forward, or open it to a trusted CIDR with GrafanaPublicAccessCidr.
- GPU-ready, multi-NIC EFA: dedicated launch templates for the P5 and P6 families, selected automatically by instance type, for high-bandwidth multi-node training.
- Broad capacity-purchase support: covers the full range of EC2 capacity options out of the box — On-Demand, On-Demand Capacity Reservations (ODCR), and Capacity Blocks for ML — selected per node group.
- High-performance storage: FSx for Lustre (shared scratch, /fsx) and FSx for OpenZFS (home directories, /home).
- Modular components: compose individual stacks (network/storage prerequisites, cluster scheduler, per-family compute node groups) instead of the all-in-one nested stack when you want to reuse infrastructure across clusters or iterate on one piece at a time.
Built on the AWS-managed PCS-Ready DLAMI (NVIDIA driver, CUDA, PCS agent, and Slurm 25.05/25.11 pre-installed), so no custom AMI build is required by default: the cluster comes up without an Image Builder step. For frequent scaling, you can pre-bake Enroot/Pyxis into a custom DLAMI with the standalone pcs-ready-dlami-with-enroot-pyxis.yaml template and pass the result as AmiId.
A default deployment (pcs-ml-cluster-deploy-all.yaml) provisions:
- VPC with public/private subnets, NAT gateway, and S3 endpoint
- FSx for Lustre (/fsx, high-performance shared scratch) and FSx for OpenZFS (/home)
- PCS cluster with the Slurm scheduler (25.05 or 25.11), on the PCS-Ready DLAMI
- Login node group (public subnet) with the monitoring stack (Prometheus/Grafana/DCGM)
- CPU compute node group (private subnet); optional GPU (P5/P6) node group with EFA
- Enroot/Pyxis container runtime installed at first boot via PostInstallScriptUrl (or pre-baked into a custom AMI you build separately and pass as AmiId)
Deploy a complete cluster with one nested CloudFormation stack:
The only decision you must make is which Availability Zone to deploy into
(PrimarySubnetAZ) — everything else has a sensible default. The minimal CLI
equivalent (set your AZ in the first line):
AZ_ID=us-east-1a   # <-- the one required choice: your target Availability Zone
aws cloudformation create-stack \
  --stack-name pcs-ml-cluster \
  --template-url https://awsome-distributed-ai.s3.amazonaws.com/templates/pcs-ml-cluster-deploy-all.yaml \
  --parameters ParameterKey=PrimarySubnetAZ,ParameterValue=${AZ_ID} \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM

This brings up (≈25–30 min, mostly VPC/FSx): 1 login node (m6i.4xlarge) with monitoring,
a cpu1 queue (c6i.4xlarge, 0–4 nodes, dynamic scaling), and Enroot/Pyxis on every node.
Add a GPU queue and tune storage/monitoring via the parameters below.
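Because creation takes a while, it can help to block until CloudFormation reports completion and then print the stack outputs. A minimal sketch, assuming the AWS CLI is configured for the target Region; the function name wait_for_stack is illustrative and not part of this repository:

```shell
# Wait for a CloudFormation stack to finish creating, then list its outputs.
# Usage: wait_for_stack pcs-ml-cluster
wait_for_stack() {
  local stack_name="${1:?usage: wait_for_stack <stack-name>}"

  # Returns non-zero if creation fails, rolls back, or the waiter times out.
  aws cloudformation wait stack-create-complete --stack-name "$stack_name" || return 1

  # Print each output as "<OutputKey><TAB><OutputValue>".
  aws cloudformation describe-stacks --stack-name "$stack_name" \
    --query 'Stacks[0].Outputs[].[OutputKey,OutputValue]' --output text
}
```

The ClusterId output printed here is the value used later to look up the Grafana admin password.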
Once it's up:
- Connect to the login node via SSM Session Manager — see Accessing the Cluster.
- Open the Grafana dashboards (deployed by default) via SSM port forwarding — see Accessing Grafana.
- Want to reach Grafana directly in a browser (no port forwarding)? Set
GrafanaPublicAccessCidrto a trusted CIDR at deploy time — see Option B — Direct public access.
Prefer step-by-step instructions? See the AI/ML for AWS PCS Workshop.
Clean up. When you're done, delete the stack — either from the CloudFormation Management Console (select the stack → Delete) or via the CLI:
aws cloudformation delete-stack --stack-name pcs-ml-cluster

Nested stacks are deleted automatically. Back up any FSx data first: the filesystems are deleted with the stack.
Defaults give the most common production setup — Enroot/Pyxis installed at first
boot via PostInstallScriptUrl + DeployMonitoring=true — so a default deploy
only needs the Availability Zone (PrimarySubnetAZ). The most-used parameters:
| Parameter | Default | Purpose |
|---|---|---|
| PrimarySubnetAZ | (required) | Availability Zone to deploy into (the one required parameter) |
| SlurmVersion | 25.11 | Slurm version (25.05 or 25.11); 25.11 is needed for the Slurm OpenMetrics dashboards. Drives Pyxis build version too. See OPERATIONS.md §1 |
| AmiId | (empty → SSM auto) | Empty auto-resolves to the latest PCS-Ready Deep Learning AMI (Ubuntu 24.04) from SSM. Pin to a specific ami-xxx for production, or pass an AMI built by pcs-ready-dlami-with-enroot-pyxis.yaml |
| DeployMonitoring | true | Deploy Prometheus/Grafana/DCGM on the login node |
| DeployOnDemandCNG | true | Deploy the cpu1 CPU queue (OnDemandInstanceType, default c6i.4xlarge) |
| DeployPseriesCNG | false | Deploy a GPU (P5/P6) queue; see GPU compute |
| PseriesInstanceType | p5.48xlarge | GPU instance type; auto-selects the multi-NIC template + EFA count |
| CapacityReservationId | (empty) | Capacity Block ID for the GPU queue; empty for On-Demand/ODCR |
See PARAMETERS.md for the complete parameter reference (all 7 console parameter groups, with every default). The concept guides below cover the choices that need the most thought.
By default, every node runs scripts/install-enroot-pyxis.sh
at first boot via PostInstallScriptUrl — no AMI build, ~8–12 min node boot. The
PCS-Ready DLAMI base already includes the PCS agent, Slurm 25.05 & 25.11
(/opt/aws/pcs/scheduler/slurm-*), NVIDIA driver + CUDA, and SSM agent, so the script
only adds Enroot 3.5.0 + Pyxis 0.20.0 on top.
For frequent scaling in production, you can instead pre-bake Enroot/Pyxis into a
custom DLAMI and set AmiId to that ami-xxx. Boot time drops to ~3 min, and the AMI
gives you deterministic state across all nodes. See
Pre-baking Enroot/Pyxis into a custom AMI (optional)
below — this is a separate template you run once, then pass its output as AmiId to
the cluster.
The post-install hook is idempotent: if you supply a custom AMI that already has
Enroot/Pyxis baked in, the default PostInstallScriptUrl becomes a fast no-op. For
the cleanest boot, set PostInstallScriptUrl="" when using a pre-baked AMI.
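Idempotency of this kind usually comes down to a guard at the top of the installer. The sketch below shows the general pattern only: the function names and the PYXIS_PLUGIN path are assumptions made for illustration and are not taken from scripts/install-enroot-pyxis.sh.

```shell
# Assumed plugin location; the real script may check a different path.
PYXIS_PLUGIN="${PYXIS_PLUGIN:-/usr/local/lib/slurm/spank_pyxis.so}"

# True when both the enroot binary and the Pyxis SPANK plugin are present.
enroot_pyxis_present() {
  command -v enroot >/dev/null 2>&1 && [ -f "$PYXIS_PLUGIN" ]
}

# Skip all work on nodes (or AMIs) that already have the runtime installed.
install_enroot_pyxis() {
  if enroot_pyxis_present; then
    echo "Enroot/Pyxis already installed; nothing to do"
    return 0
  fi
  echo "installing Enroot/Pyxis"
  # ... the actual installation steps would run here ...
}
```

With a guard like this, running the hook on a pre-baked AMI costs only the two checks, which is why clearing PostInstallScriptUrl saves little beyond the script download.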
AmiId is shared by every node group (login + compute). The default (empty)
auto-resolves to the latest PCS-Ready Deep Learning AMI (Ubuntu 24.04, x86_64) from
the SSM public parameter
/aws/service/pcs/ami/dlami-base-ubuntu2404/x86_64/latest/ami-id — so a fresh deploy
needs no AMI choice at all. To use a custom AMI (e.g. one with Enroot/Pyxis pre-baked,
see Pre-baking Enroot/Pyxis into a custom AMI (optional)),
pass its ami-xxx here.
Production tip: pin the AMI. CloudFormation re-resolves SSM /latest/ parameters on every stack update, so a later scale-out can drift onto a newer AMI than the original nodes. For production, resolve the parameter once and pass the resulting ami-xxx as AmiId. Details: OPERATIONS.md §4.
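Resolving the parameter once could look like the sketch below. It assumes the AWS CLI is configured for the deployment Region; the function names are illustrative, while the SSM parameter path is the one quoted above.

```shell
PCS_DLAMI_PARAM=/aws/service/pcs/ami/dlami-base-ubuntu2404/x86_64/latest/ami-id

# Print the AMI ID that the public SSM parameter currently points at.
resolve_pcs_ami() {
  aws ssm get-parameter --name "$PCS_DLAMI_PARAM" \
    --query 'Parameter.Value' --output text
}

# Print a ready-to-paste --parameters entry that pins AmiId to that AMI.
pinned_ami_param() {
  local ami_id
  ami_id=$(resolve_pcs_ami) || return 1
  case "$ami_id" in
    ami-*) printf 'ParameterKey=AmiId,ParameterValue=%s\n' "$ami_id" ;;
    *)     echo "unexpected SSM value: $ami_id" >&2; return 1 ;;
  esac
}
```

Recording the printed value alongside the stack (for example in a parameters file) keeps later stack updates on the same image.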
Different P-series instances expose different numbers of EFA interfaces, so each family
has its own launch template with the right interface layout. With deploy-all you just
set PseriesInstanceType and the matching template (and interface count) is selected
automatically.
| Instance type | GPUs | EFA interfaces | Template |
|---|---|---|---|
| p5.48xlarge | 8× H100 | 32 | add-cng-p5.yaml |
| p5e.48xlarge | 8× H200 | 32 | add-cng-p5.yaml |
| p5en.48xlarge | 8× H200 | 16 | add-cng-p5.yaml |
| p6-b200.48xlarge | 8× B200 | 8 | add-cng-p6-b200.yaml |
| p6-b300.48xlarge | 8× B300 | 16 (of 17 interfaces; the primary is ENA-only) | add-cng-p6-b300.yaml |
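The table can be expressed as a small lookup, which is handy when scripting the modular templates by hand. This is an illustrative mirror of the table only; the function name is hypothetical, and the real selection happens inside the deploy-all CloudFormation template.

```shell
# Print "<template> <EFA interface count>" for a P-series instance type.
pseries_layout() {
  case "$1" in
    p5.48xlarge|p5e.48xlarge) echo "add-cng-p5.yaml 32" ;;
    p5en.48xlarge)            echo "add-cng-p5.yaml 16" ;;
    p6-b200.48xlarge)         echo "add-cng-p6-b200.yaml 8" ;;
    p6-b300.48xlarge)         echo "add-cng-p6-b300.yaml 16" ;;
    *) echo "unsupported P-series instance type: $1" >&2; return 1 ;;
  esac
}
```

For example, pseries_layout p5en.48xlarge prints add-cng-p5.yaml 16.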
Capacity options:
- On-Demand: leave CapacityReservationId empty.
- On-Demand Capacity Reservation (ODCR): also leave CapacityReservationId empty. Create the ODCR with "open" instance matching and it is consumed automatically by the node group's On-Demand launches. (Do not put the ODCR ID in CapacityReservationId; that parameter forces Capacity-Block mode.)
- Capacity Blocks for ML: set CapacityReservationId to the Capacity Block ID. The template then launches with MarketType=capacity-block against it.

Capacity Block billing: a block bills for its whole reserved window once it starts and cannot be stopped early. When the block is active, run the GPU node group at PseriesMinCount = PseriesMaxCount = <reserved count> so the reserved nodes launch immediately, rather than scaling from 0.
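The three capacity options map onto stack parameters as sketched below. The helper name and its argument order are hypothetical; the parameter names are the ones documented above.

```shell
# Print the --parameters entries for a capacity option.
# Usage: capacity_params on-demand | odcr | capacity-block <block-id> <reserved-count>
capacity_params() {
  local mode="$1" block_id="${2:-}" count="${3:-}"
  case "$mode" in
    on-demand|odcr)
      # Both leave the ID empty; an "open" ODCR is matched automatically.
      echo "ParameterKey=CapacityReservationId,ParameterValue="
      ;;
    capacity-block)
      if [ -z "$block_id" ] || [ -z "$count" ]; then
        echo "capacity-block needs <block-id> <reserved-count>" >&2
        return 1
      fi
      echo "ParameterKey=CapacityReservationId,ParameterValue=${block_id}"
      # Min = Max so the reserved nodes start as soon as the block is active.
      echo "ParameterKey=PseriesMinCount,ParameterValue=${count}"
      echo "ParameterKey=PseriesMaxCount,ParameterValue=${count}"
      ;;
    *)
      echo "unknown capacity option: $mode" >&2
      return 1
      ;;
  esac
}
```

Note that the odcr branch deliberately ignores any ID passed to it, matching the warning above about not placing an ODCR ID in CapacityReservationId.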
FSx deployment types are not available in every Region. Defaults match the most capable type; switch to a more widely available one if your Region needs it.
| Filesystem | Parameter | Default | Other values | Notes |
|---|---|---|---|---|
| Lustre (/fsx) | LustreDeploymentType | PERSISTENT_2 | PERSISTENT_1 | PERSISTENT_2 (throughput 125/250/500/1000, metadata config) isn't in every Region; PERSISTENT_1 (50/100/200) is in more Regions |
| Lustre (/fsx) | PerUnitStorageThroughput | 250 | any valid number | Throughput (MB/s per TiB). Must be valid for the type: P2 = 125/250/500/1000, P1 = 50/100/200 |
| OpenZFS (/home) | OpenZFSDeploymentType | SINGLE_AZ_HA_2 | SINGLE_AZ_HA_1, SINGLE_AZ_2, SINGLE_AZ_1 | SINGLE_AZ_1 is in all Regions; HA/2 variants vary. MULTI_AZ excluded (needs a second subnet) |
| OpenZFS (/home) | HomeThroughput | 320 | any valid number | Throughput (MB/s). Valid values depend on the deployment type: SINGLE_AZ_2/SINGLE_AZ_HA_2 = 160/320/640/1280/2560/3840/5120/7680/10240; SINGLE_AZ_HA_1 = 128/256/512/1024/2048/3072/4096; SINGLE_AZ_1 = 64/128/256/512/1024/2048/3072/4096 |
Check support before deploying: Lustre Regions · OpenZFS Regions. If a deploy fails at the FSx resource with an "unsupported deployment type" error, switch these parameters to a type your Region supports.
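A mismatched throughput value only surfaces when the FSx resource fails, well into the deploy, so a local pre-check can save a rollback. The sketch below encodes the value lists from the table; the function names are illustrative, and the lists may lag behind what FSx currently accepts.

```shell
# Print the throughput values the table lists for a deployment type.
valid_throughputs() {
  case "$1" in
    PERSISTENT_2)               echo "125 250 500 1000" ;;
    PERSISTENT_1)               echo "50 100 200" ;;
    SINGLE_AZ_2|SINGLE_AZ_HA_2) echo "160 320 640 1280 2560 3840 5120 7680 10240" ;;
    SINGLE_AZ_HA_1)             echo "128 256 512 1024 2048 3072 4096" ;;
    SINGLE_AZ_1)                echo "64 128 256 512 1024 2048 3072 4096" ;;
    *) return 1 ;;
  esac
}

# Return 0 when <value> is listed for <deployment-type>, 1 when it is not,
# and 2 when the deployment type itself is unknown.
check_throughput() {
  local type="$1" value="$2" allowed v
  allowed=$(valid_throughputs "$type") || {
    echo "unknown deployment type: $type" >&2
    return 2
  }
  for v in $allowed; do
    if [ "$v" = "$value" ]; then
      return 0
    fi
  done
  echo "$value is not valid for $type (allowed: $allowed)" >&2
  return 1
}
```

For instance, switching LustreDeploymentType to PERSISTENT_1 while leaving PerUnitStorageThroughput at its default of 250 is rejected by check_throughput PERSISTENT_1 250.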
All examples start by setting AZ_ID — the one required choice.
AZ_ID=us-east-1a   # your target Availability Zone
aws cloudformation create-stack \
  --stack-name cpu-cluster \
  --template-url https://awsome-distributed-ai.s3.amazonaws.com/templates/pcs-ml-cluster-deploy-all.yaml \
  --parameters ParameterKey=PrimarySubnetAZ,ParameterValue=${AZ_ID} \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM

1 login node + cpu1 queue (c6i.4xlarge, 0–4 nodes, dynamic scaling).
AZ_ID=us-east-1a   # your target Availability Zone
aws cloudformation create-stack \
  --stack-name gpu-cluster \
  --template-url https://awsome-distributed-ai.s3.amazonaws.com/templates/pcs-ml-cluster-deploy-all.yaml \
  --parameters \
    ParameterKey=PrimarySubnetAZ,ParameterValue=${AZ_ID} \
    ParameterKey=OnDemandCngName,ParameterValue=gpu-g6 \
    ParameterKey=OnDemandQueueName,ParameterValue=gpu-g6 \
    ParameterKey=OnDemandInstanceType,ParameterValue=g6.12xlarge \
    ParameterKey=OnDemandMaxCount,ParameterValue=8 \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM

Replaces the default cpu1 queue with a gpu-g6 queue of g6.12xlarge instances.
AZ_ID=us-west-2b
CAPACITY_RESERVATION_ID="cr-0a1b2c3d4e5f6g7h8"   # placeholder: replace with your Capacity Block ID
aws cloudformation create-stack \
  --stack-name p6-b300-cb-cluster \
  --template-url https://awsome-distributed-ai.s3.amazonaws.com/templates/pcs-ml-cluster-deploy-all.yaml \
  --parameters \
    ParameterKey=PrimarySubnetAZ,ParameterValue=${AZ_ID} \
    ParameterKey=DeployPseriesCNG,ParameterValue=true \
    ParameterKey=PseriesCngName,ParameterValue=gpu-p6b300 \
    ParameterKey=PseriesQueueName,ParameterValue=gpu-p6b300 \
    ParameterKey=PseriesInstanceType,ParameterValue=p6-b300.48xlarge \
    ParameterKey=PseriesMinCount,ParameterValue=2 \
    ParameterKey=PseriesMaxCount,ParameterValue=2 \
    ParameterKey=CapacityReservationId,ParameterValue=${CAPACITY_RESERVATION_ID} \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM

The add-cng-p6-b300.yaml template is selected automatically from PseriesInstanceType,
and the EFA interface count is derived from the instance type — no interface-count
parameter to set. For p6-b200.48xlarge or any P5 type, just change
PseriesInstanceType. CapacityReservationId here is the Capacity Block ID; for
On-Demand or an "open" ODCR, leave it empty (see GPU compute).
Connect to the login node with AWS Systems Manager Session Manager (no public SSH needed).
Console: EC2 Console → filter
by tag aws:pcs:compute-node-group-name = login → select the instance → Connect →
Session Manager → Connect.
CLI (needs ec2:DescribeInstances + ssm:StartSession; AWS CloudShell has these):
INSTANCE_ID=$(aws ec2 describe-instances \
  --filters "Name=tag:aws:pcs:compute-node-group-name,Values=login" \
            "Name=instance-state-name,Values=running" \
  --query 'Reservations[0].Instances[0].InstanceId' --output text)
aws ssm start-session --target "$INSTANCE_ID"

Then switch to the default user and check the cluster:

sudo su - ubuntu
sinfo                 # partitions and nodes
squeue                # job queue
scontrol show nodes   # node detail

See Connect to Cluster in the workshop for more.
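To confirm that Slurm and the container runtime work together, a one-node batch job is usually enough. The sketch below writes such a job; the function name, the file name, and the ubuntu:24.04 image are illustrative choices rather than part of this repository, and cpu1 is the default CPU queue described earlier.

```shell
# Write a minimal Pyxis smoke-test job and print its path.
#   $1 = partition   (default: cpu1)
#   $2 = output path (default: container-smoke.sbatch)
write_container_smoke_test() {
  local partition="${1:-cpu1}" path="${2:-container-smoke.sbatch}"
  cat > "$path" <<EOF
#!/bin/bash
#SBATCH --job-name=container-smoke
#SBATCH --partition=${partition}
#SBATCH --nodes=1
#SBATCH --output=container-smoke_%j.out

# Pyxis imports the image through Enroot, then runs the command inside it.
srun --container-image=ubuntu:24.04 grep PRETTY_NAME /etc/os-release
EOF
  echo "$path"
}
```

Submitting the generated file with sbatch from the login node should leave the container's OS release string in the job's output file once a node has scaled up and pulled the image.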
The repo's canonical NCCL launcher
micro-benchmarks/nccl-tests/slurm/nccl-tests-container.sbatch
runs an all_reduce_perf benchmark across 2 nodes and is the quickest way to confirm
the GPU queue, Pyxis containers, and EFA work end-to-end. Two PCS-specific deltas are
all you need to add:
1. Import the image on the login node. enroot import builds its overlayfs on the
node-local root disk (the login node has 300 GiB), and FSx for Lustre can't host that
overlay; only the resulting .sqsh lands on shared /fsx. Pin a specific image tag
for reproducible numbers (don't use latest):

TAG=cuda12.8.1-efa1.43.2-ofiv1.16.3-ncclv2.27.7-1-testsv2.16.9
enroot import -o /fsx/nccl-tests.sqsh "docker://public.ecr.aws#hpc-cloud/nccl-tests:${TAG}"

2. Submit on your GPU partition. The canonical sbatch reads $IMAGE
(/fsx/nccl-tests.sqsh by default) and defaults to 2 nodes / 8 tasks per node:

cd /fsx && git clone --depth 1 https://github.com/awslabs/awsome-distributed-ai.git
sbatch --partition=gpu-p6b300 \
  /fsx/awsome-distributed-ai/micro-benchmarks/nccl-tests/slurm/nccl-tests-container.sbatch

3. Check the result (nccl-all_reduce_perf_<jobid>.out). EFA is in use when you see
NET/OFI Selected provider is efa ... (found N nics), and a healthy run ends with
# Out of bounds values : 0 OK plus a busbw column that scales up with message size
(e.g. ~751 GB/s at 64 GiB on 2× p6-b300; raise -e past the default 16 GiB to saturate
B300's 16 EFA cards).
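The two textual checks in step 3 can be automated with grep. A minimal sketch, assuming the job output contains the marker strings quoted above; the function name is hypothetical, and the busbw trend still needs a human look.

```shell
# Scan an NCCL test output file for the EFA and correctness markers.
# Usage: check_nccl_output nccl-all_reduce_perf_<jobid>.out
check_nccl_output() {
  local out_file="${1:?usage: check_nccl_output <output-file>}"
  local status=0

  if grep -q 'Selected provider is efa' "$out_file"; then
    echo "EFA: in use"
  else
    echo "EFA: provider line not found"
    status=1
  fi

  # Tolerate variable spacing around the colon in the summary line.
  if grep -Eq '# Out of bounds values[[:space:]]*:[[:space:]]*0 OK' "$out_file"; then
    echo "Correctness: OK"
  else
    echo "Correctness: summary line missing or non-zero"
    status=1
  fi

  return "$status"
}
```

A non-zero return means at least one marker was missing, which is typically the cue to inspect the output file by hand.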
For a full training example, see the PyTorch FSDP test case. For the full validation matrix (monitoring, containers, CPU/GPU, NCCL, FSDP) and the PCS deltas worth knowing, see the Test & Validation Guide.
With DeployMonitoring=true (default), an integrated monitoring stack based on
aws-parallelcluster-monitoring
is installed automatically:
- Login node: Prometheus, Grafana, Nginx (reverse proxy), Node Exporter, Pushgateway, CloudWatch Exporter
- Compute nodes: Node Exporter, plus DCGM Exporter on GPU nodes
- Slurm: native OpenMetrics on the controller (jobs/nodes/partitions/scheduler)
Metrics cover Slurm jobs, GPU (utilization/memory/temperature/power/ECC/NVLink via DCGM),
node CPU/memory/disk/network, and CloudWatch (EC2/FSx/PCS). The stack installs on
node-local /opt (not the shared /home). Pre-built Grafana dashboards (Cluster Summary,
Slurm Detail, GPU Node List, GPU Health, Cluster Costs, Storage) are provisioned
automatically — see the screenshot below.
GPU metrics work out of the box across the supported GPU range (Hopper / B200 / B300).
DcgmExporterImage defaults to a DCGM 4.5.2 build pinned by digest, validated on 2× p6-b300 and on B200. The monitoring stack's own default (DCGM 4.2.0) tops out at B200 and can't pull newer NVCR tags on Docker 29.x; overriding via digest at the deploy-all level is what bridges that. Override DcgmExporterImage only if you need to pin to a different build; details: OPERATIONS.md §3.1.
Prefer AWS-managed Prometheus/Grafana? If you'd rather use Amazon Managed Service for Prometheus + Amazon Managed Grafana instead of the self-hosted stack on the login node, see
4.validation_and_observability/4.prometheus-grafana.
Monitoring-related parameters:
- DeployMonitoring (default true)
- MonitoringVersion: aws-parallelcluster-monitoring git ref (release tag, branch, or latest; default v2.9.1). Pinned to a tag so upstream changes can't break deployments unexpectedly. v2.9.1 adds the DCGM_EXPORTER_IMAGE override (lets DcgmExporterImage enable B300 GPU metrics); v2.6.4 and v2.6.5 carry the PCS fixes (node-local /opt install and Docker-29.x DCGM tag, respectively).
- MonitoringRepo: owner/repo to fetch from (default aws-samples/aws-parallelcluster-monitoring). Point at a fork + a branch in MonitoringVersion to test unreleased changes.
- DcgmExporterImage: dcgm-exporter image used on GPU nodes; defaults to a DCGM 4.5.2 build pinned by digest (covers Hopper/B200/B300). Override only if you need to pin to a different build (e.g. the older monitoring-default DCGM 4.2.0).
Node type is identified by the monitoring-role tag (login/compute), not the EC2 Name tag. The Name tag defaults to PCS-<cngname> and is free for you to retag.
Log in to Grafana as admin; the password is generated per cluster and stored in
SSM Parameter Store. Retrieve it (with CLUSTER_ID from the stack's ClusterId output):
aws ssm get-parameter --name "/pcs/${CLUSTER_ID}/grafana/admin-password" \
  --with-decryption --query 'Parameter.Value' --output text

There are two ways to reach the UI.
Option A (SSM port forwarding): no public access required; this works even when the login node has no inbound rules.
# Login node instance ID
INSTANCE_ID=$(aws ec2 describe-instances \
  --filters "Name=tag:aws:pcs:compute-node-group-name,Values=login" \
            "Name=instance-state-name,Values=running" \
  --query 'Reservations[0].Instances[0].InstanceId' --output text)

# Port-forward remote 443 -> local 8443 (needs the Session Manager plugin)
aws ssm start-session --target "$INSTANCE_ID" \
  --document-name AWS-StartPortForwardingSession \
  --parameters '{"portNumber":["443"],"localPortNumber":["8443"]}'

Then open https://localhost:8443/grafana/.
Option B (direct public access): to browse Grafana directly without port forwarding, set GrafanaPublicAccessCidr at
deploy time to a CIDR you trust (e.g. your office IP 203.0.113.4/32). deploy-all then
creates a login-only security group that opens HTTPS/443 to that CIDR and
attaches it to the login node, so you can open:
https://<login-node-public-ip>/grafana/
Get the login node's public IP from the EC2 console, or:
aws ec2 describe-instances \
  --filters "Name=tag:aws:pcs:compute-node-group-name,Values=login" \
            "Name=instance-state-name,Values=running" \
  --query 'Reservations[0].Instances[0].PublicIpAddress' --output text

Security notes:
- The security group is attached only to the login node — compute nodes and FSx (which share the cluster security group) are not exposed.
- Opening 443 exposes more than Grafana. The login node's nginx also reverse-proxies /prometheus/, /pushgateway/, and /slurmexporter/, and those endpoints are unauthenticated. Anyone connecting from the allowed CIDR can read all cluster metrics (and push to Pushgateway) without credentials; only the /grafana/ path is password-gated. Treat this as exposing the whole monitoring stack, not just the Grafana login.
- Prefer a tight CIDR (a /32 host or your VPN range). 0.0.0.0/0 is accepted, and it can be convenient for a short-lived PoC or workshop where granting each user local SSM permissions is impractical, but it exposes the unauthenticated endpoints above to the whole internet. If you use it, narrow it to a real CIDR or clear it (Option A) as soon as you are done.
- The certificate is self-signed, so browsers show a warning. Proceed past it, or put an ALB + ACM certificate in front for a trusted cert.
- Leaving GrafanaPublicAccessCidr empty (the default) keeps monitoring private; use Option A.
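A deploy script can surface these trade-offs before the stack is created. The following sketch classifies a candidate GrafanaPublicAccessCidr value by shape only; the function name and messages are hypothetical, and it does not validate the address octets.

```shell
# Describe the exposure implied by a GrafanaPublicAccessCidr value.
classify_grafana_cidr() {
  local cidr="$1"
  case "$cidr" in
    "")        echo "private: reach Grafana via SSM port forwarding" ;;
    0.0.0.0/0) echo "internet-wide: unauthenticated endpoints are exposed" ;;
    */32)      echo "single host" ;;
    */*)       echo "range: confirm every address in it is trusted" ;;
    *)         echo "not a CIDR (expected a.b.c.d/nn): $cidr" >&2; return 1 ;;
  esac
}
```

For example, classify_grafana_cidr 203.0.113.4/32 prints single host, while a bare IP address without a prefix length is rejected.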
Once logged in, use the dashboard nav bar to switch between Cluster Summary, Slurm Detail, Compute Node List, GPU Node List, GPU Health, Cluster Costs, and Storage. For example, the GPU Node List shows each GPU node's model, instance type, utilization, temperature, power, and memory.
For detailed validation steps and the full test matrix (monitoring, containers, CPU/GPU, NCCL, FSDP), see the Test & Validation Guide.
Use v2.9.1 or newer for PCS. It carries the PCS /opt install fix (v2.6.4), Docker-29.x DCGM tag (v2.6.5), Grafana 13 (v2.9), and the DCGM_EXPORTER_IMAGE override needed by DcgmExporterImage for B300 (v2.9.1). Migration notes: OPERATIONS.md §3.
The all-in-one template installs Enroot/Pyxis at first boot via
PostInstallScriptUrl, which is fast to deploy and avoids an Image Builder step.
For frequent scaling in production, pre-baking Enroot/Pyxis into a custom AMI
drops node boot time from ~8–12 min to ~3 min and pins every node to a deterministic
state. This is a separate, standalone path: build the AMI once with
pcs-ready-dlami-with-enroot-pyxis.yaml,
then pass the resulting ami-xxx as AmiId to the cluster.
Step 1: Build the AMI (~30 min one-time, separate stack)
aws cloudformation create-stack \
  --stack-name pcs-dlami \
  --template-url https://awsome-distributed-ai.s3.amazonaws.com/templates/pcs-ready-dlami-with-enroot-pyxis.yaml \
  --parameters ParameterKey=SlurmVersion,ParameterValue=25.11 \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM

The AMI is single-Slurm-version by design: Pyxis is a SPANK plugin whose ABI is
locked to its compile-time Slurm version, so pass the same SlurmVersion you'll use
on the cluster.
Step 2: Read the resulting AMI ID from the stack output DLAMIforPCSAmiId:
AMI_ID=$(aws cloudformation describe-stacks \
  --stack-name pcs-dlami \
  --query 'Stacks[0].Outputs[?OutputKey==`DLAMIforPCSAmiId`].OutputValue' \
  --output text)
echo "$AMI_ID"   # ami-0xxxxxxxxxxxxxxxx

Step 3: Pass it to the cluster as AmiId and clear PostInstallScriptUrl for
the cleanest boot:
aws cloudformation create-stack \
  --stack-name pcs-ml-cluster \
  --template-url https://awsome-distributed-ai.s3.amazonaws.com/templates/pcs-ml-cluster-deploy-all.yaml \
  --parameters \
    ParameterKey=PrimarySubnetAZ,ParameterValue=us-east-1a \
    ParameterKey=AmiId,ParameterValue=$AMI_ID \
    ParameterKey=PostInstallScriptUrl,ParameterValue= \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM

The post-install hook is idempotent, so leaving PostInstallScriptUrl at its default
also works: the installer detects Enroot/Pyxis is already present and is a fast
no-op. Setting it empty just shaves a few seconds off boot.
Optional features of pcs-ready-dlami-with-enroot-pyxis.yaml (defaults are off):
- BuildSchedule=Weekly/Monthly for scheduled rebuilds against a moving base AMI
- EnableLifecyclePolicy=true to deprecate older AMIs after LifecycleDeprecateAfterWeeks
- PublishToSsm=true to publish the latest AMI ID to an SSM parameter for downstream stacks
For production deploys that pin the AMI explicitly per cluster, none of these are needed.
All templates live in assets/. pcs-ml-cluster-deploy-all.yaml nests
the others; you can also deploy each individually for more control (e.g. reuse a VPC/FSx
across clusters). Click Deploy to 1-click-launch a single template. For every
parameter and default, see PARAMETERS.md.
| Template | Purpose | Deploy |
|---|---|---|
| pcs-ml-cluster-deploy-all.yaml | All-in-one: Prerequisites + (optional AMI) + Cluster + login/CPU/GPU CNGs | 🚀 |
| ml-cluster-prerequisites.yaml | VPC, subnets, security groups, FSx for Lustre + OpenZFS | 🚀 |
| cluster.yaml | PCS cluster core (Slurm scheduler only, no nodes) | 🚀 |
| add-cng.yaml | Compute node group, single NIC: login nodes, CPU/single-NIC-GPU queues (C6i, G5, G6) | 🚀 |
| add-cng-p5.yaml | P5/P5e/P5en nodes (16/32 EFA interfaces, by type) | 🚀 |
| add-cng-p6-b200.yaml | P6-B200 nodes (8 EFA interfaces) | 🚀 |
| add-cng-p6-b300.yaml | P6-B300 nodes (16 EFA interfaces) | 🚀 |
| pcs-ready-dlami-with-enroot-pyxis.yaml | EC2 Image Builder: bake Enroot 3.5.0 + Pyxis 0.20.0 into the PCS-Ready DLAMI | 🚀 |
add-cng* templates create a Slurm queue only when QueueName is set (leave it empty
for login nodes). The P-series templates need a CapacityReservationId when using a
Capacity Block.
Validated configurations:
- Infrastructure (ml-cluster-prerequisites.yaml, cluster.yaml): multiple Regions (us-east-1/us-west-2/us-east-2), Slurm 25.05 & 25.11.
- CPU / single-NIC GPU (add-cng.yaml): login (m6i.4xlarge), CPU (c6i.4xlarge), GPU (g6.xlarge/g6.12xlarge).
- P5 (add-cng-p5.yaml): p5.48xlarge / p5en.48xlarge with ODCR and Capacity Blocks for ML.
- P6-B300 (add-cng-p6-b300.yaml): validated on real p6-b300.48xlarge (Capacity Blocks, us-west-2): 17 network cards, EFA active (NCCL found 16 nics), 2-node all_reduce ~761 GB/s peak, FSDP Llama-2 7B multi-node.
- P6-B200 (add-cng-p6-b200.yaml): 8-network-card template (same EFA layout family as B300).
- All-in-one (pcs-ml-cluster-deploy-all.yaml): selects the P5/P6-B200/P6-B300 CNG template automatically from PseriesInstanceType; tested end-to-end with monitoring, container jobs, NCCL, and FSDP on p6-b300.
- AMI builder (pcs-ready-dlami-with-enroot-pyxis.yaml): Ubuntu 24.04 x86_64 AMIs with Enroot 3.5.0 + Pyxis 0.20.0, validated with PyTorch/CUDA containers.
- Operations guide — version trade-offs, AMI single-version rule, monitoring/B300 dcgm setup, AMI pinning, FSx coupling, recommended production settings
- Roadmap / TODO — implementation items under consideration
- Parameter reference — every deploy-all parameter and default
- AWS Parallel Computing Service Documentation
- AI/ML for AWS PCS Workshop
- Slurm Documentation
- Enroot · Pyxis
- Capacity Blocks for ML
- aws-parallelcluster-monitoring (upstream monitoring)
- Prometheus & Grafana Setup (alternative: AWS-managed Prometheus/Grafana)
- LDAP Server Setup Guide — OpenLDAP for cluster-wide user management

