
Commit 97061c8

Author: Gang Fu (committed)
add Hyperpod lab for NxD llama3 model training
1 parent 06dfd04 commit 97061c8

File tree

4 files changed, +131 -0 lines changed


labs/Hyperpod/.DS_Store

6 KB
Binary file not shown.

labs/Hyperpod/README.md

Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
# Build on Trainium Start Guide Using SageMaker HyperPod

In this tutorial, we will use the NeuronX Distributed (NxD) library to train a Llama 3 model, following [this workshop](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/03-trainium-nxd).

If you want to use a SageMaker AI Studio space to run this workshop and you are working in a new account, or an account without a VPC or SageMaker domain yet, follow [the CloudFormation deployment here](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/00-setup/env-setup/01-env-sm-code-editor) to create the SageMaker AI domain and the VS Code editor space. The SageMaker domain is created in the default VPC. Once deployed, open SageMaker AI Studio and run the default Code Editor space.
### Step 1: Build the Container

Similar to [this workshop](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/03-trainium-nxd/01-setup), we first need to build the container image used to run the training job, starting from the latest Neuron SDK base container:

```bash
region=us-east-2
dlc_account_id=763104351884
aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $dlc_account_id.dkr.ecr.$region.amazonaws.com

docker pull 763104351884.dkr.ecr.us-east-2.amazonaws.com/pytorch-training-neuronx:2.7.0-neuronx-py310-sdk2.24.1-ubuntu22.04
```
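You can confirm that the base image is available locally before building on top of it:

```bash
docker images | grep pytorch-training-neuronx
```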
Clone the repository and change into the folder:

```bash
cd ~
git clone https://github.com/aws-samples/awsome-distributed-training/
cd awsome-distributed-training/3.test_cases/pytorch/neuronx-distributed/llama3/kubernetes
```
We will build the Docker image using the Dockerfile in this directory:

```bash
export AWS_REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]')
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
export IMAGE=llama3_trn
export TAG=:latest
# DOCKER_NETWORK is optional; leave it unset unless you need a specific Docker network
docker build $DOCKER_NETWORK -t ${REGISTRY}${IMAGE}${TAG} .
```
Then push the image to the ECR private registry:

```bash
# Create the repository if needed
export REGISTRY_COUNT=$(aws ecr describe-repositories | grep \"${IMAGE}\" | wc -l)
if [ "${REGISTRY_COUNT//[!0-9]/}" == "0" ]; then
    echo "Creating repository ${REGISTRY}${IMAGE} ..."
    aws ecr create-repository --repository-name ${IMAGE}
else
    echo "Repository ${REGISTRY}${IMAGE} already exists"
fi

# Login to the registry
echo "Logging in to $REGISTRY ..."
aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY

# Push the image to the registry
docker image push ${REGISTRY}${IMAGE}${TAG}
```
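Optionally, verify that the image is now in ECR before moving on:

```bash
aws ecr describe-images --repository-name ${IMAGE} --query 'imageDetails[].imageTags'
```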
### Step 2: Create the HyperPod Cluster

You can use [the CloudFormation deployment here](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/00-setup/00-workshop-infra-cfn) to create the HyperPod cluster with EKS.

Here are the parameters to change in order to use the ml.trn1.32xlarge instance type in us-west-2 (a CLI sketch follows the list):
1. Set AvailabilityZoneId to usw2-az4 to improve your chances of getting on-demand capacity.
2. Set UsingSMCodeEditor to True if you want to access the cluster from the VS Code editor in the SageMaker AI domain.
3. Set AcceleratedInstanceType to ml.trn1.32xlarge.
4. Set the Kubernetes version to 1.32.
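If you prefer to deploy from the CLI, the same overrides can be passed as stack parameters. This is only a rough sketch: the stack name, the template file, and the KubernetesVersion parameter key are assumptions, so check them against the workshop's actual CloudFormation template.

```bash
# Hypothetical example: replace the template path and confirm parameter key names first
aws cloudformation create-stack \
  --stack-name hyperpod-eks-workshop \
  --template-body file://hyperpod-eks-workshop.yaml \
  --capabilities CAPABILITY_NAMED_IAM \
  --region us-west-2 \
  --parameters \
    ParameterKey=AvailabilityZoneId,ParameterValue=usw2-az4 \
    ParameterKey=UsingSMCodeEditor,ParameterValue=true \
    ParameterKey=AcceleratedInstanceType,ParameterValue=ml.trn1.32xlarge \
    ParameterKey=KubernetesVersion,ParameterValue=1.32
```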
Once the CloudFormation deployment has finished successfully, you can manually verify that the VPC, subnet, and security group match the CloudFormation outputs, and then execute the [same commands in this workshop](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/00-setup/00-workshop-infra-cfn#environment-variables) to set up the environment variables.
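A quick way to inspect those outputs from the CLI (using the same placeholder stack name as above):

```bash
# Lists the stack outputs (VPC ID, subnets, security group, ...)
aws cloudformation describe-stacks \
  --stack-name hyperpod-eks-workshop \
  --query 'Stacks[0].Outputs' \
  --output table
```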
You will also need to set up an FSx for Lustre file system through [Dynamic Provisioning in this workshop](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/01-cluster/06-fsx-for-lustre). Note that this PVC lives in the default namespace, and your training job pods will need to run in the same namespace.
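Before moving on, you can confirm that the dynamically provisioned claim is bound in the default namespace:

```bash
# The PVC should show STATUS "Bound" once FSx for Lustre provisioning completes
kubectl get storageclass
kubectl get pvc -n default
```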
### Step 3: Start the Training Job

Now let us launch a training job to train the Llama 3.1 8B model. First, update HF_ACCESS_TOKEN in the generate-jobspec.sh file, then execute it:

```bash
./generate-jobspec.sh
```
The script creates two YAML files: tokenize_data.yaml and llama3_train.yaml.
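Optionally, you can sanity-check that the generated job specs reference the image you pushed in Step 1 (assuming the specs use the standard Kubernetes `image:` field):

```bash
grep "image:" tokenize_data.yaml llama3_train.yaml
```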
Next, download the dataset from the Hugging Face Hub and tokenize it using the tokenize_data.yaml job. The job stores the tokenized dataset on FSx for Lustre for the model training that follows.

```bash
kubectl apply -f ./tokenize_data.yaml
```
To list all pods across namespaces:

```bash
kubectl get pods --all-namespaces
```
The tokenize-data pod should run in the default namespace. To describe the pod:

```bash
kubectl describe pod tokenize-data
```
To check logs:

```bash
kubectl logs -f tokenize-data
```
Once the tokenize-data pod has completed, you can use the llama3_train.yaml job spec file to train the Llama 3.1 8B model with the tokenized data from the previous step.

```bash
kubectl apply -f ./llama3_train.yaml
```
You should see two pods (etcd and trn1-llama3-worker-0) running. Similarly, to check the logs:

```bash
kubectl logs -f trn1-llama3-worker-0
```
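To watch the pod status while the job starts up:

```bash
kubectl get pods -w
```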
If the pod is not in the Running state, you can delete the job:

```bash
kubectl delete -f ./llama3_train.yaml
```
Once the job is running successfully, you can run command-line tools inside the container:

```bash
kubectl exec -it trn1-llama3-worker-0 -- neuron-top
```
You may see something similar to this:

<img src="figures/neuron-top.png" width="888">

Press Ctrl+C to exit the visualization.
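For a quick, non-interactive check of the Neuron devices visible to the worker, you can also run neuron-ls, which ships with the Neuron tools alongside neuron-top in the base image:

```bash
kubectl exec -it trn1-llama3-worker-0 -- neuron-ls
```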
You can also check the running job status in HyperPod Task Governance:

<img src="figures/taskgovernance.png" width="888">
To clean up, delete all of the pods:

```bash
kubectl delete -f ./llama3_train.yaml
kubectl delete -f ./tokenize_data.yaml
```
