[Docs]: HyperPod slurmdbd.conf snippet missing TLS configuration for require_secure_transport: ON #1095

@KeitaW

Documentation Issue

The Aurora cluster parameter group in 1.architectures/8.accounting-database/cf_database-accounting.yaml enforces TLS:

# cf_database-accounting.yaml L89–94
AccountingClusterParameterGroup:
  Type: 'AWS::RDS::DBClusterParameterGroup'
  Properties:
    ...
    Parameters:
      require_secure_transport: 'ON'

…but the "Amazon SageMaker HyperPod Orchestrated by Slurm" section of README.md (L78–95) builds slurmdbd.conf without any TLS configuration:

cat > /opt/slurm/etc/slurmdbd.conf << EOF
AuthType=auth/munge
DbdHost=$(hostname)
DbdPort=6819
SlurmUser=slurm
LogFile=/var/log/slurmdbd.log
StorageType=accounting_storage/mysql
StorageUser=${DATABASE_ADMIN}
StoragePass=$(aws secretsmanager get-secret-value --secret-id ${DATABASE_SECRET_ARN} --query SecretString --output text)
StorageHost=${DATABASE_URI}
StoragePort=3306
EOF

slurmdbd connects to the database through libmysqlclient. With require_secure_transport: ON, the server requires the client to negotiate TLS, and the RDS public CA is not in the default trust store on Amazon Linux or Ubuntu base images. As a result, following the README verbatim on a HyperPod controller leaves slurmdbd unable to connect; the SSL handshake error is visible in /var/log/slurmdbd.log.
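
To tell whether an existing controller is affected, one option is to check whether its slurmdbd.conf already passes a CA bundle through StorageParameters. The sketch below assumes the configuration path from the README; `has_ssl_ca` is a hypothetical helper name, not part of Slurm.

```shell
#!/usr/bin/env bash
# has_ssl_ca is a hypothetical helper (not a Slurm tool): it succeeds when the
# given slurmdbd.conf has an uncommented StorageParameters line that sets SSL_CA.
has_ssl_ca() {
  local conf="$1"
  # Anchor at the start of the line so commented-out entries are ignored;
  # SSL_CA may appear anywhere in the comma-separated list.
  grep -Eq '^[[:space:]]*StorageParameters=([^#]*,)?SSL_CA=' "$conf"
}

# Path taken from the README snippet; adjust if Slurm is installed elsewhere.
conf=/opt/slurm/etc/slurmdbd.conf
if [ -r "$conf" ]; then
  if has_ssl_ca "$conf"; then
    echo "$conf: SSL_CA is configured"
  else
    echo "$conf: no SSL_CA in StorageParameters"
  fi
fi
```

A configuration built from the current README reports no SSL_CA, which matches the failure described above.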

Scope note: This is HyperPod-only. The ParallelCluster section above it (L60–73) defers connection setup to PC's SlurmSettings.Database, which plumbs the RDS CA automatically — verified on a live PC 3.15 cluster against this template.

Suggested Fix

Add a step to download the RDS global CA bundle before building slurmdbd.conf, and add a StorageParameters line:

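# Download the global RDS public CA bundle
# (-f makes curl fail on an HTTP error instead of saving the error page)
sudo curl -fsSL -o /etc/ssl/certs/rds-global-bundle.pem \
  https://truststore.pki.rds.amazonaws.com/global/global-bundle.pem
sudo chmod 644 /etc/ssl/certs/rds-global-bundle.pem

# Build slurmdbd.conf, passing the CA bundle through StorageParameters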
cat > /opt/slurm/etc/slurmdbd.conf << EOF
AuthType=auth/munge
DbdHost=$(hostname)
DbdPort=6819
SlurmUser=slurm
LogFile=/var/log/slurmdbd.log
StorageType=accounting_storage/mysql
StorageUser=${DATABASE_ADMIN}
StoragePass=$(aws secretsmanager get-secret-value --secret-id ${DATABASE_SECRET_ARN} --query SecretString --output text)
StorageHost=${DATABASE_URI}
StoragePort=3306
StorageParameters=SSL_CA=/etc/ssl/certs/rds-global-bundle.pem
EOF
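
Before starting slurmdbd, it may be worth confirming that the download produced a certificate bundle and not an empty file or an error page. This is a minimal sketch, assuming the bundle path used above; `count_pem_certs` is a hypothetical helper, and it only counts PEM headers and does not validate the certificates.

```shell
#!/usr/bin/env bash
# count_pem_certs is a hypothetical helper: it prints how many PEM certificate
# headers a file contains. It does not validate the certificates themselves.
count_pem_certs() {
  # grep -c prints 0 and exits non-zero when nothing matches; "|| true"
  # keeps the function's exit status clean for callers using "set -e".
  grep -c -- '-----BEGIN CERTIFICATE-----' "$1" 2>/dev/null || true
}

bundle=/etc/ssl/certs/rds-global-bundle.pem   # path from the suggested fix
if [ -s "$bundle" ]; then
  echo "$bundle contains $(count_pem_certs "$bundle") certificate(s)"
else
  echo "$bundle is missing or empty" >&2
fi
```

A count of zero suggests the download did not return a PEM bundle. A stricter check would be to connect with a MySQL-compatible client using the same CA bundle, if one is installed on the controller.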

Reference: Slurm slurmdbd.conf docs — StorageParameters, Using SSL/TLS with Aurora.

Happy to submit a PR.

Labels: documentation

Status: In progress