Prerequisites
Environment
- AWS Service: SageMaker HyperPod Slurm
- Instance type: N/A
- Number of nodes: N/A
- OS / AMI: HyperPod DLAMI
- Training framework: N/A
- CUDA / Driver version: N/A
- NCCL version: N/A
- EFA installer version: N/A
- Container image (if applicable): N/A
- Scheduler (if applicable): N/A
Bug Description
In the Terraform scripts for HyperPod Slurm (https://github.com/awslabs/awsome-distributed-ai/tree/main/1.architectures/5.sagemaker-hyperpod/terraform-modules/hyperpod-slurm-tf), the default example configuration, terraform.tfvars.example, assigns a /24 CIDR block to the private subnet used by the HyperPod nodes. That block is too small and can be a pitfall for customers.
In main.tf, a secondary CIDR block is added to the VPC:
resource "aws_vpc_ipv4_cidr_block_association" "secondary" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.1.0.0/16"
}
The default value of private_subnet_cidr uses that secondary range:
variable "private_subnet_cidr" {
  description = "The IP range (CIDR notation) for the private subnet"
  type        = string
  default     = "10.1.0.0/16"
}
However, the default terraform.tfvars.example file overrides it with a much smaller range:
private_subnet_cidr = "10.0.2.0/24"
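To make the size difference concrete, here is a minimal Terraform sketch that estimates how many addresses a given private_subnet_cidr leaves available. The locals and the output name are hypothetical and are not part of the module. The calculation assumes only that AWS reserves 5 addresses in every subnet.

```hcl
# Sketch only: estimates the usable IP addresses in the private subnet.
# The local and output names below are hypothetical, not part of the module.

variable "private_subnet_cidr" {
  description = "The IP range (CIDR notation) for the private subnet"
  type        = string
  default     = "10.0.2.0/24" # value set in terraform.tfvars.example
}

locals {
  # Prefix length is the part after the slash, e.g. 24 for "10.0.2.0/24".
  private_subnet_prefix_length = tonumber(split("/", var.private_subnet_cidr)[1])

  # Total addresses in the block, minus the 5 that AWS reserves per subnet.
  private_subnet_usable_ips = pow(2, 32 - local.private_subnet_prefix_length) - 5
}

output "private_subnet_usable_ips" {
  value = local.private_subnet_usable_ips
}
```

With "10.0.2.0/24" this evaluates to 251 usable addresses, while "10.1.0.0/16" gives 65531. How many nodes fit in practice depends on how many IP addresses each instance consumes, which this sketch does not model.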
Steps to Reproduce
- Deploy a HyperPod Slurm cluster with Terraform, following the instructions as written.
- Check whether the private subnet uses 10.1.0.0/16 by default.
Relevant Log Output