From 8ca187a3c5ec2183b4d29e27fbd2ea2d61e5ed31 Mon Sep 17 00:00:00 2001 From: Xibin Liu Date: Tue, 7 Oct 2025 19:06:21 +0000 Subject: [PATCH] Add new training recipes for v5p and update README links --- .../trillium/collectives/README.md | 2 +- training/{trillium => }/MAXTEXT_README.md | 0 training/{trillium => }/XPK_README.md | 0 .../trillium/GPT3-175B-MaxText/bf16/README.md | 4 +- .../trillium/GPT3-175B-MaxText/fp8/README.md | 4 +- .../trillium/Llama2-70B-MaxText/README.md | 4 +- .../trillium/Llama3.1-405B-MaxText/README.md | 4 +- .../Llama3.1-70B-MaxText/v6e-128/README.md | 4 +- .../Llama3.1-70B-MaxText/v6e-256/README.md | 4 +- .../Llama3.1-70B-MaxText/v6e-32/README.md | 4 +- .../Llama3.1-70B-MaxText/v6e-64/README.md | 4 +- .../Llama3.1-8B-MaxText/v6e-128/README.md | 4 +- .../Llama3.1-8B-MaxText/v6e-16/README.md | 4 +- .../Llama3.1-8B-MaxText/v6e-256/README.md | 4 +- .../Llama3.1-8B-MaxText/v6e-32/README.md | 4 +- .../Llama3.1-8B-MaxText/v6e-64/README.md | 4 +- .../Llama3.1-8B-MaxText/v6e-8/README.md | 4 +- .../trillium/Mistral-7B-MaxText/README.md | 4 +- .../trillium/Mixtral-8x22B-MaxText/README.md | 4 +- .../trillium/Mixtral-8x7B-MaxText/README.md | 4 +- training/v5p/DeepSeek3-671B-MaxText/README.md | 61 +++++++++++++ .../v5p/Diffusion-2-MaxDiffusion/README.md | 2 +- training/v5p/GPT3-175B-MaxText/README.md | 2 +- training/v5p/Llama2-7B-Maxtext/README.md | 2 +- training/v5p/Llama3.1-405B-MaxText/README.md | 63 ++++++++++++++ .../README.md | 4 +- .../Llama4-Scout-17B-16E-Maxtext/README.md | 87 +++++++------------ training/v5p/Mixtral-8X7B-Maxtext/README.md | 2 +- training/v5p/SDXL-MaxDiffusion/README.md | 2 +- training/v5p/XPK_README.md | 82 ----------------- 30 files changed, 199 insertions(+), 178 deletions(-) rename training/{trillium => }/MAXTEXT_README.md (100%) rename training/{trillium => }/XPK_README.md (100%) create mode 100644 training/v5p/DeepSeek3-671B-MaxText/README.md create mode 100644 training/v5p/Llama3.1-405B-MaxText/README.md delete mode 
100644 training/v5p/XPK_README.md diff --git a/microbenchmarks/trillium/collectives/README.md b/microbenchmarks/trillium/collectives/README.md index 2710a3a..3b1089e 100644 --- a/microbenchmarks/trillium/collectives/README.md +++ b/microbenchmarks/trillium/collectives/README.md @@ -1,7 +1,7 @@ # Instructions for running Collectives Benchmark on TPU trillium (v6e-256) ## XPK setup -Please follow this [link](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/XPK_README.md) to create your GKE cluster with XPK +Please follow this [link](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/XPK_README.md) to create your GKE cluster with XPK ## Run Collectives on v6e-256 diff --git a/training/trillium/MAXTEXT_README.md b/training/MAXTEXT_README.md similarity index 100% rename from training/trillium/MAXTEXT_README.md rename to training/MAXTEXT_README.md diff --git a/training/trillium/XPK_README.md b/training/XPK_README.md similarity index 100% rename from training/trillium/XPK_README.md rename to training/XPK_README.md diff --git a/training/trillium/GPT3-175B-MaxText/bf16/README.md b/training/trillium/GPT3-175B-MaxText/bf16/README.md index e5b43be..9fe5487 100644 --- a/training/trillium/GPT3-175B-MaxText/bf16/README.md +++ b/training/trillium/GPT3-175B-MaxText/bf16/README.md @@ -1,12 +1,12 @@ # Instructions for training GPT3-175B-Maxtext on TPU trillium ## XPK setup -Please follow the [XPK_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/XPK_README.md) to create your GKE cluster with XPK +Please follow the [XPK_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/XPK_README.md) to create your GKE cluster with XPK ## Prep for Maxtext ### Install MaxText and Build Docker Image -Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/MAXTEXT_README.md) to install maxtext and build the docker image. 
The following variables should be set: +Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/MAXTEXT_README.md) to install maxtext and build the docker image. The following variables should be set: In step 1, use the MaxText [tpu-recipes-v0.1.2](https://github.com/AI-Hypercomputer/maxtext/releases/tag/tpu-recipes-v0.1.2) tag to run this recipe: ``` diff --git a/training/trillium/GPT3-175B-MaxText/fp8/README.md b/training/trillium/GPT3-175B-MaxText/fp8/README.md index b85638e..16ca82d 100644 --- a/training/trillium/GPT3-175B-MaxText/fp8/README.md +++ b/training/trillium/GPT3-175B-MaxText/fp8/README.md @@ -1,10 +1,10 @@ # Instructions for training GPT3-175B-Maxtext on TPU trillium ## XPK setup -Please follow this [link](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/XPK_README.md) to create your GKE cluster with XPK +Please follow this [link](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/XPK_README.md) to create your GKE cluster with XPK ## Prep for Maxtext -Please follow this [link](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/MAXTEXT_README.md) to install maxtext and build docker image +Please follow this [link](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/MAXTEXT_README.md) to install maxtext and build docker image ## Run Maxtext GPT3-175B workloads on GKE diff --git a/training/trillium/Llama2-70B-MaxText/README.md b/training/trillium/Llama2-70B-MaxText/README.md index f3d23d4..91745de 100644 --- a/training/trillium/Llama2-70B-MaxText/README.md +++ b/training/trillium/Llama2-70B-MaxText/README.md @@ -1,12 +1,12 @@ # Instructions for training Llama2-70B-Maxtext on TPU trillium ## XPK setup -Please follow the [XPK_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/XPK_README.md) to create your GKE cluster with XPK +Please follow the 
[XPK_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/XPK_README.md) to create your GKE cluster with XPK ## Prep for Maxtext ### Install MaxText and Build Docker Image -Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/MAXTEXT_README.md) to install maxtext and build the docker image. The following variables should be set: +Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/MAXTEXT_README.md) to install maxtext and build the docker image. The following variables should be set: In step 1, use the MaxText [tpu-recipes-v0.1.2](https://github.com/AI-Hypercomputer/maxtext/releases/tag/tpu-recipes-v0.1.2) tag to run this recipe: ``` diff --git a/training/trillium/Llama3.1-405B-MaxText/README.md b/training/trillium/Llama3.1-405B-MaxText/README.md index 6c57452..c80ea1c 100644 --- a/training/trillium/Llama3.1-405B-MaxText/README.md +++ b/training/trillium/Llama3.1-405B-MaxText/README.md @@ -1,12 +1,12 @@ # Instructions for training Llama3.1-405B-MaxText on TPU trillium ## XPK setup -Please follow the [XPK_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/XPK_README.md) to create your GKE cluster with XPK +Please follow the [XPK_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/XPK_README.md) to create your GKE cluster with XPK ## Prep for Maxtext ### Install MaxText and Build Docker Image -Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/MAXTEXT_README.md) to install maxtext and build the docker image. The following variables should be set: +Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/MAXTEXT_README.md) to install maxtext and build the docker image. 
The following variables should be set: In step 1, use the MaxText [tpu-recipes-v0.1.2](https://github.com/AI-Hypercomputer/maxtext/releases/tag/tpu-recipes-v0.1.2) tag to run this recipe: ``` diff --git a/training/trillium/Llama3.1-70B-MaxText/v6e-128/README.md b/training/trillium/Llama3.1-70B-MaxText/v6e-128/README.md index da7d2bd..b4eadf2 100644 --- a/training/trillium/Llama3.1-70B-MaxText/v6e-128/README.md +++ b/training/trillium/Llama3.1-70B-MaxText/v6e-128/README.md @@ -1,12 +1,12 @@ # Instructions for training Llama3.1-70B-MaxText on TPU trillium (v6e-128) ## XPK setup -Please follow the [XPK_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/XPK_README.md) to create your GKE cluster with XPK +Please follow the [XPK_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/XPK_README.md) to create your GKE cluster with XPK ## Prep for Maxtext ### Install MaxText and Build Docker Image -Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/MAXTEXT_README.md) to install maxtext and build the docker image. The following variables should be set: +Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/MAXTEXT_README.md) to install maxtext and build the docker image. 
The following variables should be set: In step 1, use the MaxText [tpu-recipes-v0.1.4](https://github.com/AI-Hypercomputer/maxtext/releases/tag/tpu-recipes-v0.1.4) tag to run this recipe: ``` diff --git a/training/trillium/Llama3.1-70B-MaxText/v6e-256/README.md b/training/trillium/Llama3.1-70B-MaxText/v6e-256/README.md index 425fe67..b895a13 100644 --- a/training/trillium/Llama3.1-70B-MaxText/v6e-256/README.md +++ b/training/trillium/Llama3.1-70B-MaxText/v6e-256/README.md @@ -1,12 +1,12 @@ # Instructions for training Llama3.1-70B-MaxText on TPU trillium (v6e-256) ## XPK setup -Please follow the [XPK_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/XPK_README.md) to create your GKE cluster with XPK +Please follow the [XPK_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/XPK_README.md) to create your GKE cluster with XPK ## Prep for Maxtext ### Install MaxText and Build Docker Image -Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/MAXTEXT_README.md) to install maxtext and build the docker image. The following variables should be set: +Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/MAXTEXT_README.md) to install maxtext and build the docker image. 
The following variables should be set: In step 1, use the MaxText [tpu-recipes-v0.1.4](https://github.com/AI-Hypercomputer/maxtext/releases/tag/tpu-recipes-v0.1.4) tag to run this recipe: ``` diff --git a/training/trillium/Llama3.1-70B-MaxText/v6e-32/README.md b/training/trillium/Llama3.1-70B-MaxText/v6e-32/README.md index fbfe1bb..a2db4ac 100644 --- a/training/trillium/Llama3.1-70B-MaxText/v6e-32/README.md +++ b/training/trillium/Llama3.1-70B-MaxText/v6e-32/README.md @@ -1,12 +1,12 @@ # Instructions for training Llama3.1-70B-MaxText on TPU trillium (v6e-32) ## XPK setup -Please follow the [XPK_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/XPK_README.md) to create your GKE cluster with XPK +Please follow the [XPK_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/XPK_README.md) to create your GKE cluster with XPK ## Prep for Maxtext ### Install MaxText and Build Docker Image -Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/MAXTEXT_README.md) to install maxtext and build the docker image. The following variables should be set: +Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/MAXTEXT_README.md) to install maxtext and build the docker image. 
The following variables should be set: In step 1, use the MaxText [tpu-recipes-v0.1.4](https://github.com/AI-Hypercomputer/maxtext/releases/tag/tpu-recipes-v0.1.4) tag to run this recipe: ``` diff --git a/training/trillium/Llama3.1-70B-MaxText/v6e-64/README.md b/training/trillium/Llama3.1-70B-MaxText/v6e-64/README.md index 5789f0a..e3cee83 100644 --- a/training/trillium/Llama3.1-70B-MaxText/v6e-64/README.md +++ b/training/trillium/Llama3.1-70B-MaxText/v6e-64/README.md @@ -1,12 +1,12 @@ # Instructions for training Llama3.1-70B-MaxText on TPU trillium (v6e-64) ## XPK setup -Please follow the [XPK_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/XPK_README.md) to create your GKE cluster with XPK +Please follow the [XPK_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/XPK_README.md) to create your GKE cluster with XPK ## Prep for Maxtext ### Install MaxText and Build Docker Image -Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/MAXTEXT_README.md) to install maxtext and build the docker image. The following variables should be set: +Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/MAXTEXT_README.md) to install maxtext and build the docker image. 
The following variables should be set: In step 1, use the MaxText [tpu-recipes-v0.1.4](https://github.com/AI-Hypercomputer/maxtext/releases/tag/tpu-recipes-v0.1.4) tag to run this recipe: ``` diff --git a/training/trillium/Llama3.1-8B-MaxText/v6e-128/README.md b/training/trillium/Llama3.1-8B-MaxText/v6e-128/README.md index fb1580e..e1257f1 100644 --- a/training/trillium/Llama3.1-8B-MaxText/v6e-128/README.md +++ b/training/trillium/Llama3.1-8B-MaxText/v6e-128/README.md @@ -1,12 +1,12 @@ # Instructions for training Llama3.1-8B-MaxText on TPU trillium (v6e-128) ## XPK setup -Please follow the [XPK_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/XPK_README.md) to create your GKE cluster with XPK +Please follow the [XPK_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/XPK_README.md) to create your GKE cluster with XPK ## Prep for Maxtext ### Install MaxText and Build Docker Image -Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/MAXTEXT_README.md) to install maxtext and build the docker image. The following variables should be set: +Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/MAXTEXT_README.md) to install maxtext and build the docker image. 
The following variables should be set: In step 1, use the MaxText [tpu-recipes-v0.1.4](https://github.com/AI-Hypercomputer/maxtext/releases/tag/tpu-recipes-v0.1.4) tag to run this recipe: ``` diff --git a/training/trillium/Llama3.1-8B-MaxText/v6e-16/README.md b/training/trillium/Llama3.1-8B-MaxText/v6e-16/README.md index 92c6d9e..841416b 100644 --- a/training/trillium/Llama3.1-8B-MaxText/v6e-16/README.md +++ b/training/trillium/Llama3.1-8B-MaxText/v6e-16/README.md @@ -1,12 +1,12 @@ # Instructions for training Llama3.1-8B-MaxText on TPU trillium (v6e-16) ## XPK setup -Please follow the [XPK_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/XPK_README.md) to create your GKE cluster with XPK +Please follow the [XPK_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/XPK_README.md) to create your GKE cluster with XPK ## Prep for Maxtext ### Install MaxText and Build Docker Image -Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/MAXTEXT_README.md) to install maxtext and build the docker image. The following variables should be set: +Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/MAXTEXT_README.md) to install maxtext and build the docker image. 
The following variables should be set: In step 1, use the MaxText [tpu-recipes-v0.1.4](https://github.com/AI-Hypercomputer/maxtext/releases/tag/tpu-recipes-v0.1.4) tag to run this recipe: ``` diff --git a/training/trillium/Llama3.1-8B-MaxText/v6e-256/README.md b/training/trillium/Llama3.1-8B-MaxText/v6e-256/README.md index c0f1f6f..d8e41a4 100644 --- a/training/trillium/Llama3.1-8B-MaxText/v6e-256/README.md +++ b/training/trillium/Llama3.1-8B-MaxText/v6e-256/README.md @@ -1,12 +1,12 @@ # Instructions for training Llama3.1-8B-MaxText on TPU trillium (v6e-256) ## XPK setup -Please follow the [XPK_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/XPK_README.md) to create your GKE cluster with XPK +Please follow the [XPK_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/XPK_README.md) to create your GKE cluster with XPK ## Prep for Maxtext ### Install MaxText and Build Docker Image -Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/MAXTEXT_README.md) to install maxtext and build the docker image. The following variables should be set: +Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/MAXTEXT_README.md) to install maxtext and build the docker image. 
The following variables should be set: In step 1, use the MaxText [tpu-recipes-v0.1.4](https://github.com/AI-Hypercomputer/maxtext/releases/tag/tpu-recipes-v0.1.4) tag to run this recipe: ``` diff --git a/training/trillium/Llama3.1-8B-MaxText/v6e-32/README.md b/training/trillium/Llama3.1-8B-MaxText/v6e-32/README.md index 5ac541a..6308b1e 100644 --- a/training/trillium/Llama3.1-8B-MaxText/v6e-32/README.md +++ b/training/trillium/Llama3.1-8B-MaxText/v6e-32/README.md @@ -1,12 +1,12 @@ # Instructions for training Llama3.1-8B-MaxText on TPU trillium (v6e-32) ## XPK setup -Please follow the [XPK_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/XPK_README.md) to create your GKE cluster with XPK +Please follow the [XPK_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/XPK_README.md) to create your GKE cluster with XPK ## Prep for Maxtext ### Install MaxText and Build Docker Image -Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/MAXTEXT_README.md) to install maxtext and build the docker image. The following variables should be set: +Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/MAXTEXT_README.md) to install maxtext and build the docker image. 
The following variables should be set: In step 1, use the MaxText [tpu-recipes-v0.1.4](https://github.com/AI-Hypercomputer/maxtext/releases/tag/tpu-recipes-v0.1.4) tag to run this recipe: ``` diff --git a/training/trillium/Llama3.1-8B-MaxText/v6e-64/README.md b/training/trillium/Llama3.1-8B-MaxText/v6e-64/README.md index 7d4bd2c..780046d 100644 --- a/training/trillium/Llama3.1-8B-MaxText/v6e-64/README.md +++ b/training/trillium/Llama3.1-8B-MaxText/v6e-64/README.md @@ -1,12 +1,12 @@ # Instructions for training Llama3.1-8B-MaxText on TPU trillium (v6e-64) ## XPK setup -Please follow the [XPK_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/XPK_README.md) to create your GKE cluster with XPK +Please follow the [XPK_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/XPK_README.md) to create your GKE cluster with XPK ## Prep for Maxtext ### Install MaxText and Build Docker Image -Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/MAXTEXT_README.md) to install maxtext and build the docker image. The following variables should be set: +Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/MAXTEXT_README.md) to install maxtext and build the docker image. 
The following variables should be set: In step 1, use the MaxText [tpu-recipes-v0.1.4](https://github.com/AI-Hypercomputer/maxtext/releases/tag/tpu-recipes-v0.1.4) tag to run this recipe: ``` diff --git a/training/trillium/Llama3.1-8B-MaxText/v6e-8/README.md b/training/trillium/Llama3.1-8B-MaxText/v6e-8/README.md index 55aa73a..3ad66cd 100644 --- a/training/trillium/Llama3.1-8B-MaxText/v6e-8/README.md +++ b/training/trillium/Llama3.1-8B-MaxText/v6e-8/README.md @@ -1,12 +1,12 @@ # Instructions for training Llama3.1-8B-MaxText on TPU trillium (v6e-8) ## XPK setup -Please follow the [XPK_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/XPK_README.md) to create your GKE cluster with XPK +Please follow the [XPK_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/XPK_README.md) to create your GKE cluster with XPK ## Prep for Maxtext ### Install MaxText and Build Docker Image -Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/MAXTEXT_README.md) to install maxtext and build the docker image. The following variables should be set: +Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/MAXTEXT_README.md) to install maxtext and build the docker image. 
The following variables should be set: In step 1, use the MaxText [tpu-recipes-v0.1.4](https://github.com/AI-Hypercomputer/maxtext/releases/tag/tpu-recipes-v0.1.4) tag to run this recipe: ``` diff --git a/training/trillium/Mistral-7B-MaxText/README.md b/training/trillium/Mistral-7B-MaxText/README.md index 9a12f5b..100020e 100644 --- a/training/trillium/Mistral-7B-MaxText/README.md +++ b/training/trillium/Mistral-7B-MaxText/README.md @@ -1,12 +1,12 @@ # Instructions for training Mistral-7B-MaxText on TPU trillium (v6e-8) ## XPK setup -Please follow the [XPK_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/XPK_README.md) to create your GKE cluster with XPK +Please follow the [XPK_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/XPK_README.md) to create your GKE cluster with XPK ## Prep for Maxtext ### Install MaxText and Build Docker Image -Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/MAXTEXT_README.md) to install maxtext and build the docker image. The following variables should be set: +Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/MAXTEXT_README.md) to install maxtext and build the docker image. 
The following variables should be set: In step 1, use the MaxText [tpu-recipes-v0.1.2](https://github.com/AI-Hypercomputer/maxtext/releases/tag/tpu-recipes-v0.1.2) tag to run this recipe: ``` diff --git a/training/trillium/Mixtral-8x22B-MaxText/README.md b/training/trillium/Mixtral-8x22B-MaxText/README.md index e1e4159..14d5afa 100644 --- a/training/trillium/Mixtral-8x22B-MaxText/README.md +++ b/training/trillium/Mixtral-8x22B-MaxText/README.md @@ -1,10 +1,10 @@ # Instructions for training Mixtral-8x22B-MaxText on TPU trillium ## XPK setup -Please follow this [link](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/XPK_README.md) to create your GKE cluster with XPK +Please follow this [link](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/XPK_README.md) to create your GKE cluster with XPK ## Prep for Maxtext -Please follow this [link](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/MAXTEXT_README.md) to install maxtext and build docker image +Please follow this [link](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/MAXTEXT_README.md) to install maxtext and build docker image ## Run Maxtext Mixtral-8x7B workloads on GKE diff --git a/training/trillium/Mixtral-8x7B-MaxText/README.md b/training/trillium/Mixtral-8x7B-MaxText/README.md index 92ef4b2..50293de 100644 --- a/training/trillium/Mixtral-8x7B-MaxText/README.md +++ b/training/trillium/Mixtral-8x7B-MaxText/README.md @@ -1,12 +1,12 @@ # Instructions for training Mixtral-8x7B-MaxText on TPU trillium ## XPK setup -Please follow the [XPK_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/XPK_README.md) to create your GKE cluster with XPK +Please follow the [XPK_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/XPK_README.md) to create your GKE cluster with XPK ## Prep for Maxtext ### Install MaxText and Build Docker Image -Please follow the 
[MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/MAXTEXT_README.md) to install maxtext and build the docker image. The following variables should be set:
+Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/MAXTEXT_README.md) to install maxtext and build the docker image. The following variables should be set:
 
 In step 1, use the MaxText [tpu-recipes-v0.1.2](https://github.com/AI-Hypercomputer/maxtext/releases/tag/tpu-recipes-v0.1.2) tag to run this recipe:
 ```
diff --git a/training/v5p/DeepSeek3-671B-MaxText/README.md b/training/v5p/DeepSeek3-671B-MaxText/README.md
new file mode 100644
index 0000000..f179a4f
--- /dev/null
+++ b/training/v5p/DeepSeek3-671B-MaxText/README.md
@@ -0,0 +1,61 @@
+# Instructions for training DeepSeek3-671B-MaxText on TPU v5p-1024
+
+## XPK setup
+Please follow this [link](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/XPK_README.md) to create your GKE cluster with XPK
+
+## Prep for Maxtext
+
+1. Clone [Maxtext](https://github.com/AI-Hypercomputer/maxtext) repo
+```
+git clone https://github.com/AI-Hypercomputer/maxtext.git
+cd maxtext
+git checkout 3eb77db3c94580f56f1b738f8d254b03bd205e35
+```
+
+2. Run the following commands to build the docker image
+```
+bash docker_build_dependency_image.sh DEVICE=tpu MODE=stable JAX_VERSION=0.7.0
+```
+
+3. Create your new GCS bucket
+
+This is the GCS folder for storing test results. You can re-use any of your existing GCS buckets. To create a new bucket:
+```
+GCS_PATH=gs://v5p-demo #
+gcloud storage buckets create ${GCS_PATH} --project ${PROJECT}
+```
+
+4. Specify your workload environment variables
+```
+export PROJECT=#
+export ZONE=#
+export CLUSTER_NAME=#
+export OUTPUT_DIR=gs://v5p-demo/ #
+export DEVICE_TYPE=${DEVICE_TYPE} # v5p-1024 for 512 v5p chips
+```
+
+## Run workloads
+
+5. From the MaxText root directory, start your workload:
+```
+python3 -m benchmarks.benchmark_runner xpk \
+--project=$PROJECT \
+--zone=$ZONE \
+--device_type=${DEVICE_TYPE} \
+--num_slices=1 \
+--cluster_name=${CLUSTER_NAME} \
+--base_output_directory=${OUTPUT_DIR} \
+--model_name="deepseek3_671b_v5p_1024" \
+--base_docker_image=maxtext_base_image
+```
+
+6. Check the training log
+
+From your workload logs, you should see step time logs like the following, as training progresses:
+```
+completed step: 11, seconds: 90.668, TFLOP/s/device: 152.415, Tokens/s/device: 542.108, total_weights: 25165824, loss: 10.989
+```
+
+7. Workload configuration
+
+Workload configuration details can be found [here](https://github.com/AI-Hypercomputer/maxtext/blob/3eb77db3c94580f56f1b738f8d254b03bd205e35/benchmarks/maxtext_v5p_model_configs.py) in the MaxText GitHub repo. Look for the configuration `deepseek3_671b_v5p_1024`.
diff --git a/training/v5p/Diffusion-2-MaxDiffusion/README.md b/training/v5p/Diffusion-2-MaxDiffusion/README.md
index 0e6169e..6a9d326 100644
--- a/training/v5p/Diffusion-2-MaxDiffusion/README.md
+++ b/training/v5p/Diffusion-2-MaxDiffusion/README.md
@@ -2,7 +2,7 @@
 This documents present steps to run StableDiffusion [MaxDiffusion](https://github.com/google/maxdiffusion/tree/main/src/maxdiffusion) workload through [XPK](https://github.com/google/xpk/blob/main/README.md) tool.
 
-Setup XPK and create cluster [XPK Userguide](Training/TPU-v5p/XPK_README.md)
+Setup XPK and create cluster [XPK Userguide](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/XPK_README.md)
 
 Build a local docker image.
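Both new v5p recipes in this patch tell the reader to watch for `completed step:` lines to verify training progress. As an illustrative aside (this helper is not part of MaxText, XPK, or this patch; the regex is an assumption modeled on the sample log line quoted in the DeepSeek recipe above, not a documented MaxText interface):

```python
import re

# Assumed format, copied from the sample line in the recipe above; MaxText
# does not document this log line as a stable interface.
STEP_LINE = re.compile(
    r"completed step: (?P<step>\d+), "
    r"seconds: (?P<seconds>[\d.]+), "
    r"TFLOP/s/device: (?P<tflops>[\d.]+), "
    r"Tokens/s/device: (?P<tokens>[\d.]+), "
    r"total_weights: (?P<weights>\d+), "
    r"loss: (?P<loss>[\d.]+)"
)

def parse_step_line(line: str) -> dict:
    """Extract per-step metrics from one MaxText step-time log line."""
    match = STEP_LINE.search(line)
    if match is None:
        raise ValueError(f"not a step-time log line: {line!r}")
    # Integer-valued fields stay ints; the rest become floats.
    return {key: int(val) if val.isdigit() else float(val)
            for key, val in match.groupdict().items()}
```

Applied to the DeepSeek sample line, this yields `step == 11` and `tflops == 152.415`, which makes it easy to average step times once a run has warmed up.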
diff --git a/training/v5p/GPT3-175B-MaxText/README.md b/training/v5p/GPT3-175B-MaxText/README.md
index ad081d7..3d63491 100644
--- a/training/v5p/GPT3-175B-MaxText/README.md
+++ b/training/v5p/GPT3-175B-MaxText/README.md
@@ -1,7 +1,7 @@
 # Instructions for training GPT3-175B-Maxtext on TPU v5p
 
 ## XPK setup
-Please follow this [link](https://github.com/gclouduniverse/reproducibility/tree/main/Training/TPU-v5p/XPK_README.md) to create your GKE cluster with XPK
+Please follow this [link](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/XPK_README.md) to create your GKE cluster with XPK
 
 ## Prep for Maxtext GPT3-175B workloads on GKE
 1. Clone [Maxtext](https://github.com/AI-Hypercomputer/maxtext) repo and move to its directory
diff --git a/training/v5p/Llama2-7B-Maxtext/README.md b/training/v5p/Llama2-7B-Maxtext/README.md
index 0961274..432f05c 100644
--- a/training/v5p/Llama2-7B-Maxtext/README.md
+++ b/training/v5p/Llama2-7B-Maxtext/README.md
@@ -1,7 +1,7 @@
 # Instructions for training Llama2-7B-Maxtext on TPU v5p
 
 ## XPK setup
-Please follow this [link](https://github.com/gclouduniverse/reproducibility/tree/main/Training/TPU-v5p/XPK_README.md) to create your GKE cluster with XPK
+Please follow this [link](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/XPK_README.md) to create your GKE cluster with XPK
 
 ## Run Maxtext Llama2-7B workloads on GKE
 1. Clone [Maxtext](https://github.com/AI-Hypercomputer/maxtext) repo
diff --git a/training/v5p/Llama3.1-405B-MaxText/README.md b/training/v5p/Llama3.1-405B-MaxText/README.md
new file mode 100644
index 0000000..f53ae0f
--- /dev/null
+++ b/training/v5p/Llama3.1-405B-MaxText/README.md
@@ -0,0 +1,63 @@
+# Instructions for training Llama3.1-405B-MaxText on TPU v5p-1024
+
+## XPK setup
+Please follow this [link](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/XPK_README.md) to create your GKE cluster with XPK
+
+## Prep for Maxtext
+
+1. Clone [Maxtext](https://github.com/AI-Hypercomputer/maxtext) repo
+```
+git clone https://github.com/AI-Hypercomputer/maxtext.git
+cd maxtext
+git checkout 3eb77db3c94580f56f1b738f8d254b03bd205e35
+```
+
+2. Run the following commands to build the docker image
+```
+bash docker_build_dependency_image.sh DEVICE=tpu MODE=stable JAX_VERSION=0.7.0
+```
+
+3. Create your new GCS bucket
+
+This is the GCS folder for storing test results. You can re-use any of your existing GCS buckets. To create a new bucket:
+```
+GCS_PATH=gs://v5p-demo #
+gcloud storage buckets create ${GCS_PATH} --project ${PROJECT}
+```
+
+4. Specify your workload environment variables
+```
+export PROJECT=#
+export ZONE=#
+export CLUSTER_NAME=#
+export OUTPUT_DIR=gs://v5p-demo/ #
+export DEVICE_TYPE=${DEVICE_TYPE} # v5p-1024 for 512 v5p chips
+```
+
+## Run workloads
+
+5. From the MaxText root directory, start your workload:
+```
+python3 -m benchmarks.benchmark_runner xpk \
+--project=$PROJECT \
+--zone=$ZONE \
+--device_type=${DEVICE_TYPE} \
+--num_slices=1 \
+--cluster_name=${CLUSTER_NAME} \
+--base_output_directory=${OUTPUT_DIR} \
+--model_name="llama3_1_405b_8192_v5p_1024" \
+--base_docker_image=maxtext_base_image
+```
+
+6. Check the training log
+
+From your workload logs, you should see step time logs like the following, as training progresses:
+```
+completed step: 10, seconds: 131.474, TFLOP/s/device: 314.530, Tokens/s/device: 124.618, total_weights: 8388608, loss: 4.453
+```
+
+7. Workload configuration
+
+Workload configuration details can be found [here](https://github.com/AI-Hypercomputer/maxtext/blob/3eb77db3c94580f56f1b738f8d254b03bd205e35/benchmarks/maxtext_v5p_model_configs.py) in the MaxText GitHub repo. Look for the configuration `llama3_1_405b_8192_v5p_1024`.
+
+Please note that this configuration is appropriate for v5p-256, v5p-512, and v5p-1024.
\ No newline at end of file diff --git a/training/v5p/Llama4-Maverick-17B-128E-Maxtext/README.md b/training/v5p/Llama4-Maverick-17B-128E-Maxtext/README.md index 81d69b1..3d68cc5 100644 --- a/training/v5p/Llama4-Maverick-17B-128E-Maxtext/README.md +++ b/training/v5p/Llama4-Maverick-17B-128E-Maxtext/README.md @@ -4,11 +4,11 @@ This documents present steps to run Llama4-Maverick-17B-128E [MaxText](https://g ## XPK setup -Please follow this [link](https://github.com/gclouduniverse/reproducibility/tree/main/Training/TPU-v5p/XPK_README.md) to create your GKE cluster with XPK. +Please follow this [link](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/XPK_README.md) to create your GKE cluster with XPK. ## Prep for Maxtext -Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/MAXTEXT_README.md) to install maxtext and build the docker image. The following variables should be set: +Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/MAXTEXT_README.md) to install maxtext and build the docker image. The following variables should be set: In step 1, Use the MaxText [tpu-recipes-v0.1.3](https://github.com/AI-Hypercomputer/maxtext/releases/tag/tpu-recipes-v0.1.3) tag to run this recipe: ``` diff --git a/training/v5p/Llama4-Scout-17B-16E-Maxtext/README.md b/training/v5p/Llama4-Scout-17B-16E-Maxtext/README.md index 37abae9..6c63dea 100644 --- a/training/v5p/Llama4-Scout-17B-16E-Maxtext/README.md +++ b/training/v5p/Llama4-Scout-17B-16E-Maxtext/README.md @@ -1,35 +1,51 @@ -# Instructions for training Llama4-Scout-17B-16E Maxtext on TPU v5p-256 +# Instructions for training Llama4-Scout-17B-16E Maxtext on TPU v5p-256, v5p-512, and v5p-1024 This documents present steps to run Llama4-Scout-17B-16E [MaxText](https://github.com/google/maxtext) workload through [XPK](https://github.com/google/xpk/blob/main/README.md) tool. 
## XPK setup
-Please follow this [link](https://github.com/gclouduniverse/reproducibility/tree/main/Training/TPU-v5p/XPK_README.md) to create your GKE cluster with XPK.
+Please follow this [link](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/XPK_README.md) to create your GKE cluster with XPK.
## Prep for Maxtext
-Please follow the [MAXTEXT_README](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/MAXTEXT_README.md) to install maxtext and build the docker image. The following variables should be set:
+1. Clone [Maxtext](https://github.com/AI-Hypercomputer/maxtext) repo
+```
+git clone https://github.com/AI-Hypercomputer/maxtext.git
+cd maxtext
+git checkout 3eb77db3c94580f56f1b738f8d254b03bd205e35
+```
-In step 1, Use the MaxText [tpu-recipes-v0.1.3](https://github.com/AI-Hypercomputer/maxtext/releases/tag/tpu-recipes-v0.1.3) tag to run this recipe:
+2. Run the following commands to build the docker image
```
-git checkout tpu-recipes-v0.1.3
+bash docker_build_dependency_image.sh DEVICE=tpu MODE=stable JAX_VERSION=0.7.0
```
-In step 3, use the jax-stable-stack image containing JAX 0.5.2:
+3. Create your new GCS bucket
+
+This is the GCS folder for storing test results. You can re-use any of your existing GCS buckets. To create a new bucket:
```
-BASE_IMAGE=us-docker.pkg.dev/cloud-tpu-images/jax-stable-stack/tpu:jax0.5.2-rev1
-bash docker_build_dependency_image.sh DEVICE=tpu MODE=stable_stack BASEIMAGE=${BASE_IMAGE}
+GCS_PATH=gs://v5p-demo #
+gcloud storage buckets create ${GCS_PATH} --project ${PROJECT}
```
-## Run workloads
+4. Specify your workload environment variables
-From the MaxText root directory, start your workload
+```
+export PROJECT=#
+export ZONE=#
+export CLUSTER_NAME=#
+export OUTPUT_DIR=gs://v5p-demo/ #
+export DEVICE_TYPE=${DEVICE_TYPE} # v5p-256, v5p-512, or v5p-1024
+```
+## Run workloads
+
+5.
From the MaxText root directory, start your workload:
```
python3 -m benchmarks.benchmark_runner xpk \
--project=$PROJECT \
--zone=$ZONE \
- --device_type=v5p-256 \
+ --device_type=${DEVICE_TYPE} \
--num_slices=1 \
--cluster_name=${CLUSTER_NAME} \
--base_output_directory=${OUTPUT_DIR} \
@@ -37,51 +53,14 @@ python3 -m benchmarks.benchmark_runner xpk \
--base_docker_image=maxtext_base_image
```
-From your workload logs, you should start seeing step time logs like the following:
+6. Check the training log
+From your workload logs, you should see step time logs like the following, as training progresses:
```
-completed step: 12, seconds: 31.494, TFLOP/s/device: 251.760, Tokens/s/device: 2080.892, total_weights: 8388608, loss: 10.929
+completed step: 11, seconds: 31.652, TFLOP/s/device: 225.491, Tokens/s/device: 2070.487, total_weights: 33554432, loss: 11.825
```
-Workload details can be found in `MaxText@tpu-recipes-v0.1.3` [here](https://github.com/AI-Hypercomputer/maxtext/blob/9ca35d7e60b71303b9f6fa885447d32e8a612c47/benchmarks/maxtext_v5p_model_configs.py#L109-L149):
+7. Workload configuration
+
+Workload configuration details can be found [here](https://github.com/AI-Hypercomputer/maxtext/blob/3eb77db3c94580f56f1b738f8d254b03bd205e35/benchmarks/maxtext_v5p_model_configs.py) in the MaxText GitHub repo. Look for the configuration `llama4_scout_dropless_v5p_256`.
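The logged Tokens/s/device can be sanity-checked against the configuration: it should be close to `per_device_batch_size * max_target_length / step_seconds`. A quick check, assuming the config's `per_device_batch_size` of 8 and `max_target_length` of 8192, with the step time from the sample log line above:

```shell
# Sanity check: Tokens/s/device ≈ per_device_batch_size * max_target_length / step_seconds.
# 8 and 8192 come from the workload configuration; 31.652 is "seconds" in the sample log.
awk 'BEGIN {
  tokens_per_device_per_step = 8 * 8192   # per_device_batch_size * max_target_length
  step_seconds = 31.652                   # from the sample log line
  printf "%.1f\n", tokens_per_device_per_step / step_seconds
}'
# → 2070.5, close to the logged Tokens/s/device of 2070.487
```

The small residual is expected, since the logged step time is itself rounded.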
-```
-    MaxTextModel(
-        model_name="llama4_scout_dropless_v5p_256",
-        model_type="llama4-17b-16e",
-        tuning_params={
-            "per_device_batch_size": 8,
-            "max_target_length": 8192,
-            "ici_fsdp_parallelism": -1,
-            "enable_checkpointing": False,
-            "dtype": "bfloat16",
-            "weight_dtype": "float32",
-            "megablox": True,
-            "sparse_matmul": True,
-            "dataset_type": "synthetic",
-            "opt_type": "adamw",
-            "skip_first_n_steps_for_profiler": 5,
-            "profiler_steps": 3,
-            "profiler": "xplane",
-            "remat_policy": "custom",
-            "decoder_layer_input": "offload",
-            "reuse_example_batch": 1,
-            "sa_block_q": 2048,
-            "sa_block_kv": 2048,
-            "sa_block_kv_compute": 2048,
-            "sa_block_q_dkv": 2048,
-            "sa_block_kv_dkv": 2048,
-            "sa_block_kv_dkv_compute": 2048,
-            "sa_block_q_dq": 2048,
-            "sa_block_kv_dq": 2048,
-            "tokenizer_path": "meta-llama/Llama-4-Scout-17B-16E",
-        },
-        xla_flags=(
-            xla_flags_library.MOE_VMEM_LIMIT_FLAG
-            + xla_flags_library.CF_FOR_ALL_GATHER
-            + xla_flags_library.DATA_PARALLEL_OVERLAP
-            + xla_flags_library.LAYOUT_FOR_ALL_REDUCE_SCATTER
-            + xla_flags_library.HOST_OFFLOAD_FLAGS
-        ),
-    )
-```
diff --git a/training/v5p/Mixtral-8X7B-Maxtext/README.md b/training/v5p/Mixtral-8X7B-Maxtext/README.md
index f84ca5a..8423080 100644
--- a/training/v5p/Mixtral-8X7B-Maxtext/README.md
+++ b/training/v5p/Mixtral-8X7B-Maxtext/README.md
@@ -4,7 +4,7 @@ This documents present steps to run Mixtral-8x7B [MaxText](https://github.com/go
## XPK setup
-Please follow this [link](https://github.com/gclouduniverse/reproducibility/tree/main/Training/TPU-v5p/XPK_README.md) to create your GKE cluster with XPK.
+Please follow this [link](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/XPK_README.md) to create your GKE cluster with XPK.
## Run script
diff --git a/training/v5p/SDXL-MaxDiffusion/README.md b/training/v5p/SDXL-MaxDiffusion/README.md
index 5a7c5ff..de32a80 100644
--- a/training/v5p/SDXL-MaxDiffusion/README.md
+++ b/training/v5p/SDXL-MaxDiffusion/README.md
@@ -2,7 +2,7 @@
This documents present steps to run StableDiffusion [MaxDiffusion](https://github.com/google/maxdiffusion/tree/main/src/maxdiffusion) workload through [XPK](https://github.com/google/xpk/blob/main/README.md) tool.
-Setup XPK and create cluster [XPK Userguide](../../../Training/TPU-v5p/XPK_README.md)
+Setup XPK and create cluster [XPK Userguide](https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/XPK_README.md)
Build a local docker image.
diff --git a/training/v5p/XPK_README.md b/training/v5p/XPK_README.md
deleted file mode 100644
index 4bb23a1..0000000
--- a/training/v5p/XPK_README.md
+++ /dev/null
@@ -1,82 +0,0 @@
-## Initialization
-1. Run the following commands to initialize the project and zone.
-```
-export PROJECT=tpu-prod-env-multipod #
-export ZONE=us-central2-b #
-gcloud config set project $PROJECT
-gcloud config set compute/zone $ZONE
-```
-2. Install XPK by following the [prerequisites](https://github.com/AI-Hypercomputer/xpk?tab=readme-ov-file#prerequisites) and [installation](https://github.com/AI-Hypercomputer/xpk?tab=readme-ov-file#installation)
-instructions. Also ensure you have the proper [GCP permissions](https://github.com/AI-Hypercomputer/xpk?tab=readme-ov-file#installation).
-
-* In order to run the tpu-recipes as-is, run the `git clone` command from your home directory:
-```
-git clone https://github.com/google/xpk.git
-```
-
-3. Run the rest of these commands from the cloned XPK directory:
-
-```
-cd xpk # Should be equivalent to cd ~/xpk
-```
-
-
-## GKE Cluster Creation
-1. Specify your TPU GKE cluster configs.
-```
-export CLUSTER_NAME=v5p-demo #
-export NETWORK_NAME=${CLUSTER_NAME}-only-mtu9k
-export NETWORK_FW_NAME=${NETWORK_NAME}-only-fw
-export CLUSTER_ARGUMENTS="--network=${NETWORK_NAME} --subnetwork=${NETWORK_NAME}"
-export TPU_TYPE=v5p-512 #
-export NUM_SLICES=1 #
-```
-
-2. Create the network and firewall for this cluster if it doesn’t exist yet.
-```
-gcloud compute networks create ${NETWORK_NAME} --mtu=8896 --project=${PROJECT} --subnet-mode=auto --bgp-routing-mode=regional
-gcloud compute firewall-rules create ${NETWORK_FW_NAME} --network ${NETWORK_NAME} --allow tcp,icmp,udp --project=${PROJECT}
-```
-
-3. Create GKE cluster with TPU node-pools
-```
-python3 xpk.py cluster create \
---default-pool-cpu-machine-type=n1-standard-32 \
---cluster ${CLUSTER_NAME} \
---tpu-type=${TPU_TYPE} \
---num-slices=${NUM_SLICES} \
---custom-cluster-arguments="${CLUSTER_ARGUMENTS}" \
---on-demand
-```
-
- * Noted: TPU has `reserved`, `on-demand`, `spot` quota. This example used the `on-demand` quota. If you have the reserved or spot quota, please refer to this [link](https://github.com/google/xpk?tab=readme-ov-file#cluster-create).
- * If you want to check what quota you have, please refer to this [link](https://cloud.google.com/kubernetes-engine/docs/how-to/tpus#ensure-quota).
- * You should be able to see your GKE cluster similar to this once it is created successfully:![image](https://github.com/user-attachments/assets/60743411-5ee5-4391-bb0e-7ffba4d91c1d)
-
-
-4. Test your GKE cluster to make sure it is usable
-```
-python3 xpk.py workload create \
---cluster ${CLUSTER_NAME} \
---workload hello-world-test \
---tpu-type=${TPU_TYPE} \
---num-slices=${NUM_SLICES} \
---command "echo Hello World"
-```
-* You should be able to to see results like this: ![image](https://github.com/user-attachments/assets/c33010a6-e109-411e-8fb5-afb4edb3fa72)
-
-5.
You can also check your workload status with the following command:
- ```
-python3 xpk.py workload list \
---cluster ${CLUSTER_NAME}
- ```
-6. For more information about XPK, please refer to this [link](https://github.com/google/xpk).
-
-## GKE Cluster Deletion
-You can use the following command to delete GKE cluster:
-```
-export CLUSTER_NAME=v5p-demo #
-
-python3 xpk.py cluster delete \
---cluster $CLUSTER_NAME
-```
\ No newline at end of file
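The cluster and workload commands throughout these recipes assume variables such as `PROJECT`, `ZONE`, and `CLUSTER_NAME` are already exported. A small guard of our own (the helper name `require_vars` is an assumption, not part of XPK) can catch a missing one before a slow `xpk.py` call runs against the wrong target:

```shell
# Verify that the named environment variables are all set and non-empty.
# Uses eval-based indirection so it works in plain POSIX sh as well as bash.
require_vars() {
  missing=0
  for name in "$@"; do
    eval "val=\${$name:-}"
    if [ -z "$val" ]; then
      echo "ERROR: $name is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example, before cluster deletion:
#   require_vars PROJECT ZONE CLUSTER_NAME || exit 1
#   python3 xpk.py cluster delete --cluster "$CLUSTER_NAME"
```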