SageMaker HyperPod Monitoring

This repository provides a comprehensive guide for deploying an observability stack tailored to enhance monitoring capabilities for your SageMaker HyperPod cluster. It demonstrates how to export both cluster metrics (SLURM-exporter) and node metrics (DCGM-exporter, EFA-node-exporter) to a Prometheus/Grafana monitoring stack. This setup enables your administrators, ML-ops teams, and model developers to access real-time metrics, offering valuable insights into your cluster's performance.

To get started, you will initiate the provisioning of an Amazon CloudFormation Stack within your AWS Account. You can find the complete stack template in cluster-observability.yaml. This CloudFormation stack will orchestrate the deployment of the following resources dedicated to cluster monitoring in your AWS environment:

observability_architecture

The solution uses SageMaker HyperPod Lifecycle Scripts to bootstrap your cluster with the following open-source exporter services:

Name | Script Deployment Target | Metrics Description
0. Prometheus Slurm Exporter | controller-node | SLURM accounting metrics (sinfo, sacct)
1. EFA-Node-Exporter | cluster-nodes | Fork of Node Exporter that includes metrics emitted from EFA
2. NVIDIA-DCGM-Exporter | cluster-nodes | NVIDIA DCGM metrics about NVIDIA-enabled GPUs

Prerequisites

Important

To enable these exporter services, uncomment lines 154-165 of the lifecycle_script.py file used when deploying your cluster. Uncommenting these lines will install and configure the exporter services needed to export cluster metrics to the Amazon Managed Prometheus workspace. Save this file and upload it to the S3 bucket path referenced in your cluster-config.json file.
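For example, if your cluster-config.json points its lifecycle scripts at an S3 prefix (bucket and prefix names below are placeholders), the upload might look like:

# Bucket and prefix are placeholders; use the S3 path referenced in your cluster-config.json
aws s3 cp lifecycle_script.py s3://<YOUR_BUCKET>/<LIFECYCLE_PREFIX>/lifecycle_script.py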

Important

Before proceeding, you will need to add the following AWS Managed IAM Policies to your AmazonSagemakerClusterExecutionRole:

  • AmazonPrometheusRemoteWriteAccess: gives the controller node access to write cluster metrics to the Amazon Managed Prometheus workspace you will create.
  • AWSCloudFormationReadOnlyAccess: gives the install_prometheus.sh script permission to read stack outputs (remote write URL, Region) from your CloudFormation stack. A CLI sketch for attaching both policies follows below.
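If you prefer the CLI over the console, attaching these managed policies could look like the following sketch (the role name is a placeholder; use the execution role actually attached to your cluster):

# Role name is a placeholder for your cluster's execution role
aws iam attach-role-policy --role-name AmazonSagemakerClusterExecutionRole \
    --policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess
aws iam attach-role-policy --role-name AmazonSagemakerClusterExecutionRole \
    --policy-arn arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess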

Deploy the CloudFormation Stack


 1-Click Deploy 🚀 

Important

It is strongly recommended that you deploy this stack into the same Region and same account as your SageMaker HyperPod cluster. This ensures successful execution of the Lifecycle Scripts, specifically install_prometheus.sh, which relies on AWS CLI commands that assume the same account and Region.
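The 1-Click Deploy link above is the recommended path; if you would rather deploy from the CLI, a rough equivalent is sketched below (the stack name and capabilities flag are assumptions and may need adjusting for the template):

# Stack name is a placeholder; the Region should match your HyperPod cluster
aws cloudformation create-stack \
    --stack-name cluster-observability \
    --template-body file://cluster-observability.yaml \
    --capabilities CAPABILITY_NAMED_IAM \
    --region <YOUR_HYPERPOD_REGION>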

Connect to the cluster

Connect to the controller node of your cluster via SSM:

Note

You can find the ClusterID, WorkerGroup, and Instance ID of your controller node in the SageMaker Console or via the AWS CLI
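For example, the following standard SageMaker CLI calls list your clusters and their nodes so you can read off the cluster ID and the controller node's instance group and instance ID (<CLUSTER_NAME> is a placeholder):

# List HyperPod clusters in the account (shows the cluster ARN/ID)
aws sagemaker list-clusters
# List the nodes of a specific cluster (shows instance group names and instance IDs)
aws sagemaker list-cluster-nodes --cluster-name <CLUSTER_NAME>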

aws ssm start-session --target sagemaker-cluster:<CLUSTER_ID>_<WORKER_GROUP>-<INSTANCE_ID>

Verify that the Prometheus service and configuration created by install_prometheus.sh are running on the controller node:

sudo systemctl status prometheus

The output should show active (running):

prometheus_running
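If the service is not active, standard systemd tooling can help you diagnose it, for example:

# Follow the Prometheus service logs to inspect startup failures
sudo journalctl -u prometheus -f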

You can validate the Prometheus configuration file with:

cat /etc/prometheus/prometheus.yml

Your file should look similar to the following:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 15s

scrape_configs:
  - job_name: 'slurm_exporter'
    static_configs:
      - targets:
          - 'localhost:8080'
  - job_name: 'dcgm_exporter'
    static_configs:
      - targets:
          - '<ComputeNodeIP>:9400'
          - '<ComputeNodeIP>:9400'
  - job_name: 'efa_node_exporter'
    static_configs:
      - targets:
          - '<ComputeNodeIP>:9100'
          - '<ComputeNodeIP>:9100'

remote_write:
  - url: <AMPRemoteWriteURL>
    queue_config:
      max_samples_per_send: 1000
      max_shards: 200
      capacity: 2500
    sigv4:
      region: <Region>
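The <AMPRemoteWriteURL> and <Region> values are filled in by install_prometheus.sh from your CloudFormation stack outputs; if you want to inspect those outputs yourself, one way (stack name is a placeholder) is:

# Show the observability stack outputs, including the AMP remote write URL
aws cloudformation describe-stacks --stack-name <STACK_NAME> \
    --query 'Stacks[0].Outputs' --output table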

You can curl for relevant Prometheus metrics on the controller node using:

curl -s http://localhost:9090/metrics | grep -E 'slurm|dcgm|efa'
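You can also hit the individual exporter endpoints directly; the ports below match the scrape_configs shown above (replace <ComputeNodeIP> with one of your compute nodes):

# Slurm exporter runs on the controller node
curl -s http://localhost:8080/metrics | head
# DCGM exporter and EFA node exporter run on each compute node
curl -s http://<ComputeNodeIP>:9400/metrics | head
curl -s http://<ComputeNodeIP>:9100/metrics | head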

With node and cluster metrics now being exported to the Amazon Managed Prometheus workspace via Prometheus remote write from the controller node, you can next set up the Amazon Managed Grafana workspace.

Setup the Grafana Workspace

Important

Before proceeding, ensure your AWS account has been set up with AWS IAM Identity Center. It will be used to authenticate to the Amazon Managed Grafana workspace in the final steps.
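A quick way to confirm IAM Identity Center is enabled in the account (standard Identity Center tooling, not specific to this solution):

# Returns an Identity Center instance ARN if Identity Center is enabled
aws sso-admin list-instances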

Navigate to Amazon Managed Grafana in the AWS Management Console

In the Authentication Tab, configure Authentication using AWS IAM Identity Center:

Note

Configure your AWS IAM Identity Center User as User type: Admin.

grafana users admin

Within the Data sources tab of your Grafana workspace, click the “Configure in Grafana” link to configure Prometheus as a data source.

grafana datasources

You will be prompted to authenticate to the Grafana workspace with the IAM Identity Center username and password. This is the user you set up for the workspace.

Note

If you have forgotten your username or password, you can find and reset them within IAM Identity Center.

grafana datasources

Once you are in the Amazon Managed Grafana workspace "datasources" page, select the AWS Region and Workspace ID of your Amazon Managed Prometheus workspace.

grafana datasource configure
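If you need to look up the Workspace ID, you can list your Amazon Managed Prometheus workspaces from the CLI:

# Shows workspace IDs and aliases for Amazon Managed Prometheus
aws amp list-workspaces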

Build Grafana Dashboards

Finally, with authentication and data sources set up, go to your Grafana workspace and select Dashboards > New > Import.

To display metrics for the exporter services, you can start by configuring and customizing the following open-source Grafana dashboards, copying and pasting the links below:

Slurm Exporter Dashboard:

https://grafana.com/grafana/dashboards/4323-slurm-dashboard/

slurm dashboard

Node Exporter Dashboard:

https://grafana.com/grafana/dashboards/1860-node-exporter-full/

EFA Node dashboard

DCGM Exporter Dashboard:

https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard/

DCGM Dashboard

GPU Health (Xid) Dashboard:

https://grafana.com/grafana/dashboards/21645-gpu-health-cluster/

GPUHealth Dashboard

GPU Health (Xid) by Node Dashboard:

https://grafana.com/grafana/dashboards/21646-gpu-health-filter-by-host/

GPUHealthNode Dashboard

Congratulations, you can now view real-time metrics about your SageMaker HyperPod cluster and compute nodes in Grafana!