Add GitHub Actions workflow for Llama3.1 8B training #297

Open

kylasa wants to merge 1 commit into AMD-AGI:feature/mlcommons from kylasa:patch-1

Collaborator

kylasa commented Nov 20, 2025

New workflow to trigger pre-training of the Llama3.1 8B model using the Primus/Megatron framework.


          Add GitHub Actions workflow for Llama3.1 8B training

72196fe

kylasa requested review from Xiaoming-AMD, limou102 and wenxie-amd as code owners

November 20, 2025 00:09

wenxie-amd approved these changes

Contributor

wenxie-amd left a comment

LGTM

wenxie-amd requested a review from Copilot

November 21, 2025 09:40

Copilot AI reviewed

Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces a new GitHub Actions workflow to automate pre-training of the Llama3.1 8B model using the Primus/Megatron framework. The workflow includes Docker image building, environment setup, training execution, and artifact upload capabilities.

Key Changes

Added a workflow with build and train jobs that support multiple GPU runner configurations
Integrated Docker-based training environment with AWS S3 for log storage
Implemented validation checks for required data and model directories

.github/workflows/pretrain_llama31_8b.yaml

		@@ -0,0 +1,156 @@
		name: Primus Llama3.1 8B
		run-name: Primus Llama3.1 8B | ${{ inputs.runner_label | "m13-21" }}

Copilot AI Nov 21, 2025

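Invalid syntax for the default value in the workflow run-name. A single pipe | is not a valid operator in GitHub Actions expressions, and string literals inside ${{ }} must use single quotes. Use the logical OR operator || to supply a fallback instead: ${{ inputs.runner_label || 'm13-21' }}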

Suggested change

-           run-name: Primus Llama3.1 8B | ${{ inputs.runner_label | "m13-21" }}
+           run-name: Primus Llama3.1 8B | ${{ inputs.runner_label || 'm13-21' }}

Copilot uses AI. Check for mistakes.

.github/workflows/pretrain_llama31_8b.yaml

+                  timeout-minutes: 60 # 1 hour
+                  outputs:
+                    image_tag: ${{ steps.base_docker_build.outputs.image_tag }}
+                  if: ${{ inputs.image_tag == '' || inputs.image_tag == null }}

Copilot AI Nov 21, 2025

The workflow references inputs.image_tag but this input is not defined in the workflow_dispatch inputs section (lines 5-15). Either add this input definition or remove this condition.

.github/workflows/pretrain_llama31_8b.yaml

+                    - name: Setup Permissions
+                      run: |
+                        sudo chown -R $USER:$USER /home/$USER/action-runner/_work/mlperf-training
+                        echo "$DOCKER_CREDENTIALS"

Copilot AI Nov 21, 2025

The variable DOCKER_CREDENTIALS is referenced but never defined. This line should either be removed or the variable should be properly defined.

Suggested change

-                       echo "$DOCKER_CREDENTIALS"

.github/workflows/pretrain_llama31_8b.yaml

+                train:
+                  timeout-minutes: 60 # 1 hour
+                  needs: build
+                  if: ${{ always() && (needs.build.result == 'success' || inputs.skip_build) }}

Copilot AI Nov 21, 2025

The workflow references inputs.skip_build but this input is not defined in the workflow_dispatch inputs section. Either add this input definition or remove it from the condition.

.github/workflows/pretrain_llama31_8b.yaml

+                      run: |
+                        UUID=$(uuidgen)
+                        echo "UUID: $UUID"
+                        echo "UUID: $UUID" >> $GITHUB_ENV

Copilot AI Nov 21, 2025

The UUID environment variable is being set incorrectly. Entries written to $GITHUB_ENV must use the KEY=VALUE format, so the correct line is echo "UUID=$UUID" >> $GITHUB_ENV (note the equals sign, not a colon).

Suggested change

-                     echo "UUID: $UUID" >> $GITHUB_ENV
+                     echo "UUID=$UUID" >> $GITHUB_ENV
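
To make the KEY=VALUE format concrete, here is a minimal sketch that can be run outside of Actions. It assumes a temporary file stands in for the file that $GITHUB_ENV points to on a real runner, and it uses a fixed placeholder value where the workflow would call uuidgen.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumption: outside of Actions, a temp file stands in for the file
# that the runner exposes through $GITHUB_ENV.
GITHUB_ENV="${GITHUB_ENV:-$(mktemp)}"

# Placeholder value; the workflow itself uses UUID=$(uuidgen).
UUID="123e4567-e89b-12d3-a456-426614174000"

# One KEY=VALUE pair per line. On a real runner, later steps can then
# read the value as $UUID or ${{ env.UUID }}.
echo "UUID=$UUID" >> "$GITHUB_ENV"

# Show what was recorded.
grep '^UUID=' "$GITHUB_ENV"
```

With the colon form, the line would not match the KEY=VALUE pattern, so later references such as ${{ env.UUID }} would not receive the intended value.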

.github/workflows/pretrain_llama31_8b.yaml

+                        echo "UUID: $UUID" >> $GITHUB_ENV
+                        echo "Launching Batch Run with ID: $UUID"
+                        cd /workspace

Copilot AI Nov 21, 2025

The workflow attempts to change to the /workspace directory, but this step runs on the GitHub Actions runner, not inside a container. It will likely fail because /workspace may not exist on the runner. Consider using $GITHUB_WORKSPACE or a relative path.

Suggested change

-                     cd /workspace
+                     cd $GITHUB_WORKSPACE
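
A slightly more defensive variant of this step could check the directory before changing into it. The sketch below is an illustration only: the WORKDIR name and the fallback to the current directory (for running outside of Actions, where GITHUB_WORKSPACE is unset) are assumptions, not part of the PR.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumption: fall back to the current directory when run outside of
# Actions, where GITHUB_WORKSPACE is not set.
WORKDIR="${GITHUB_WORKSPACE:-$PWD}"

# Fail early with a clear message instead of a bare cd error.
if [ ! -d "$WORKDIR" ]; then
  echo "Workspace directory not found: $WORKDIR" >&2
  exit 1
fi

# Quote the path so directories containing spaces still work.
cd "$WORKDIR"
echo "Working directory: $PWD"
```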

.github/workflows/pretrain_llama31_8b.yaml

Comment on lines +125 to +130

+                        docker exec -it dev_primus bash
+                        cd Primus && pip install -r requirements.txt
+                        # Run Training
+                        EXP=examples/megatron/configs/llama3.1_8B-pretrain-mlperf.yaml bash ./examples/run_pretrain_mlperf.sh

Copilot AI Nov 21, 2025

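Using docker exec -it in a CI environment is a problem for two reasons. No TTY is attached to the step, so the command typically fails with "the input device is not a TTY" or blocks waiting on an interactive shell. In addition, the lines that follow it would run on the runner, not inside the container. Remove the -it flags and pass the command to execute directly, or restructure the step to run commands within the container non-interactively.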

Suggested change

-                     docker exec -it dev_primus bash
-                     cd Primus && pip install -r requirements.txt
-                     # Run Training
-                     EXP=examples/megatron/configs/llama3.1_8B-pretrain-mlperf.yaml bash ./examples/run_pretrain_mlperf.sh
+                     docker exec dev_primus bash -c "cd Primus && pip install -r requirements.txt && EXP=examples/megatron/configs/llama3.1_8B-pretrain-mlperf.yaml bash ./examples/run_pretrain_mlperf.sh"
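
If several steps need to run commands in the container, the non-interactive pattern can be wrapped in a small helper. This is a sketch under assumptions: run_in_container is a hypothetical function name, and the container is assumed to be already running under the name used in the PR (dev_primus).

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper, not part of the workflow: run a command string
# inside a running container without -i/-t, so no TTY is required.
run_in_container() {
  local container="$1"
  shift
  # Everything after the container name is passed to bash -c as one
  # command string, so it executes inside the container.
  docker exec "$container" bash -c "$*"
}

# Usage, with the container name and paths taken from the PR's step:
# run_in_container dev_primus \
#   "cd Primus && pip install -r requirements.txt && \
#    EXP=examples/megatron/configs/llama3.1_8B-pretrain-mlperf.yaml bash ./examples/run_pretrain_mlperf.sh"
```

Chaining with && means a failed pip install stops the run before training starts, and the exit status of docker exec propagates to the step.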

.github/workflows/pretrain_llama31_8b.yaml

+                          ${{ env.LOGDIR }}/${{ env.UUID }}
+                    - name: Upload Batch Training Logs
+                      if: ${{ !inputs.skip_batch_training }}

Copilot AI Nov 21, 2025

The workflow references inputs.skip_batch_training but this input is not defined in the workflow_dispatch inputs section. Either add this input definition or remove this condition.

.github/workflows/pretrain_llama31_8b.yaml

+                        docker-user-name: ${{ secrets.DOCKER_USER_NAME }}
+                        docker-pass-key: ${{ secrets.DOCKER_PASS_KEY }}
+                        dockerfile: Dockerfile
+                        custom-image-tag: ${{ env.BASE_IMAGE_TAG }}

Copilot AI Nov 21, 2025

Trailing whitespace detected at the end of line 59. Remove the extra spaces for consistency.

Suggested change

-                     custom-image-tag: ${{ env.BASE_IMAGE_TAG }}    
+                     custom-image-tag: ${{ env.BASE_IMAGE_TAG }}
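
Trailing whitespace like this can be found and stripped with standard tools before pushing. The sketch below works on a throwaway sample file, which is an assumption made so the example is self-contained; in practice the target would be the workflow file itself.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumption: a temporary sample file stands in for the workflow file.
sample="$(mktemp)"
printf 'custom-image-tag: tag    \ndockerfile: Dockerfile\n' > "$sample"

# List offending lines with their line numbers.
grep -nE '[[:space:]]+$' "$sample" || true

# Strip trailing whitespace. Writing to a new file avoids the
# differences between GNU and BSD sed -i.
sed -E 's/[[:space:]]+$//' "$sample" > "$sample.clean"
```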

.github/workflows/pretrain_llama31_8b.yaml

+                        echo "MODEL: $MODEL"
+                        echo "DATADIR: $DATADIR"
+                        echo "MODEL_NAME: $MODEL_NAME"

Copilot AI Nov 21, 2025

Trailing whitespace detected at the end of line 88. Remove the extra spaces for consistency.

Suggested change

Reviewers

Copilot (code review): left review comments

wenxie-amd: approved these changes

Xiaoming-AMD (code owner): awaiting requested review

limou102 (code owner): awaiting requested review

Labels

None yet