[Bug]: Terraform for HyperPod Slurm uses too small CIDR block for private subnet by default #1122

@shimomut

Prerequisites

  • I have searched existing issues to make sure this bug has not already been reported.
  • I have verified the issue against the latest main branch.

Environment

  • AWS Service: SageMaker HyperPod Slurm
  • Instance type: N/A
  • Number of nodes: N/A
  • OS / AMI: HyperPod DLAMI
  • Training framework: N/A
  • CUDA / Driver version: N/A
  • NCCL version: N/A
  • EFA installer version: N/A
  • Container image (if applicable): N/A
  • Scheduler (if applicable): N/A

Bug Description


In the Terraform scripts for HyperPod Slurm (https://github.com/awslabs/awsome-distributed-ai/tree/main/1.architectures/5.sagemaker-hyperpod/terraform-modules/hyperpod-slurm-tf), the default example configuration, terraform.tfvars.example, assigns a /24 CIDR block to the private subnet used by the HyperPod nodes. That is too small for a default and can be a pitfall for customers.

In main.tf, a secondary CIDR block is associated with the VPC:

resource "aws_vpc_ipv4_cidr_block_association" "secondary" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.1.0.0/16"
}

The default value of private_subnet_cidr uses that entire range:

variable "private_subnet_cidr" {
  description = "The IP range (CIDR notation) for the private subnet"
  type        = string
  default     = "10.1.0.0/16"
}

However, the default terraform.tfvars.example file overrides it with a much smaller range:

private_subnet_cidr = "10.0.2.0/24"
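
To put numbers on the difference, the following is a minimal standalone sketch (not part of the module) that computes the usable address count for each range. It assumes the standard AWS behavior of reserving 5 addresses in every subnet; the local and output names are hypothetical and chosen only for illustration.

```hcl
locals {
  # AWS reserves 5 addresses in every subnet (the first four and the last one).
  aws_reserved_per_subnet = 5

  example_cidr = "10.0.2.0/24" # value set in terraform.tfvars.example
  default_cidr = "10.1.0.0/16" # default of var.private_subnet_cidr

  # Extract the prefix length (the part after the slash) as a number.
  example_prefix = tonumber(split("/", local.example_cidr)[1])
  default_prefix = tonumber(split("/", local.default_cidr)[1])
}

# 2^(32-24) - 5 = 251 usable addresses
output "example_usable_ips" {
  value = pow(2, 32 - local.example_prefix) - local.aws_reserved_per_subnet
}

# 2^(32-16) - 5 = 65531 usable addresses
output "default_usable_ips" {
  value = pow(2, 32 - local.default_prefix) - local.aws_reserved_per_subnet
}
```

With the example file, the private subnet has 251 usable addresses instead of 65531, so a cluster that follows the defaults can run out of IPs far earlier than the variable default suggests.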

Steps to Reproduce

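  1. Deploy a HyperPod Slurm cluster with Terraform, following the instructions as written (copying terraform.tfvars.example).
  2. Check whether the private subnet uses 10.1.0.0/16. With the example file it uses 10.0.2.0/24 instead.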
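
One possible fix, sketched here as a suggestion rather than a reviewed change, is to make the example file match the variable default. Removing the override line entirely so that the default applies would have the same effect.

```hcl
# terraform.tfvars.example (proposed sketch, not the current file contents)
# Use the full secondary CIDR that main.tf associates with the VPC,
# which is also the default of var.private_subnet_cidr.
private_subnet_cidr = "10.1.0.0/16"
```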

Relevant Log Output
