Add GitHub Actions workflow for Llama3.1 8B training #297
base: feature/mlcommons
wenxie-amd
left a comment
LGTM
Pull Request Overview
This PR introduces a new GitHub Actions workflow to automate pre-training of the Llama3.1 8B model using the Primus/Megatron framework. The workflow includes Docker image building, environment setup, training execution, and artifact upload capabilities.
Key Changes
- Added a workflow with build and train jobs that support multiple GPU runner configurations
- Integrated Docker-based training environment with AWS S3 for log storage
- Implemented validation checks for required data and model directories
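The validation checks named in the last bullet do not appear in the excerpts quoted in this review. A minimal sketch of what such a step could look like, assuming `DATADIR` and `MODEL` are environment variables that hold directory paths (the step name and error message are hypothetical, not taken from the PR):

```yaml
# Hypothetical validation step. DATADIR and MODEL are assumed to be set
# earlier in the job (for example via `env:` or $GITHUB_ENV).
- name: Validate data and model directories
  run: |
    for dir in "$DATADIR" "$MODEL"; do
      if [ ! -d "$dir" ]; then
        # ::error:: is a workflow command that surfaces the message as an annotation
        echo "::error::Required directory not found: $dir"
        exit 1
      fi
    done
```

Failing early this way avoids spending GPU runner time on a training job that cannot find its inputs.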
| @@ -0,0 +1,156 @@ | |||
| name: Primus Llama3.1 8B | |||
| run-name: Primus Llama3.1 8B | ${{ inputs.runner_label | "m13-21" }} | |||
Copilot
AI
Nov 21, 2025
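Invalid syntax for the default value in the workflow run-name. A single pipe | is not a valid operator in GitHub Actions expressions, so it cannot supply a default value, and string literals in expressions must use single quotes. Use the logical OR operator || instead: ${{ inputs.runner_label || 'm13-21' }}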
| run-name: Primus Llama3.1 8B | ${{ inputs.runner_label | "m13-21" }} | |
| run-name: Primus Llama3.1 8B | ${{ inputs.runner_label || 'm13-21' }} |
| timeout-minutes: 60 # 1 hour | ||
| outputs: | ||
| image_tag: ${{ steps.base_docker_build.outputs.image_tag }} | ||
| if: ${{ inputs.image_tag == '' || inputs.image_tag == null }} |
Copilot
AI
Nov 21, 2025
The workflow references inputs.image_tag but this input is not defined in the workflow_dispatch inputs section (lines 5-15). Either add this input definition or remove this condition.
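One way to resolve this comment, and the similar ones about `inputs.skip_build` and `inputs.skip_batch_training`, is to declare the inputs under `workflow_dispatch`. The sketch below is an assumption about what the author intends: only the input names come from the workflow, while the types, defaults, and descriptions are illustrative.

```yaml
# Sketch of the missing input definitions. Types, defaults, and descriptions
# are assumptions; only the input names come from the workflow under review.
on:
  workflow_dispatch:
    inputs:
      image_tag:
        description: Existing image tag to reuse (empty means build a new image)
        type: string
        default: ''
      skip_build:
        description: Skip the Docker build job
        type: boolean
        default: false
      skip_batch_training:
        description: Skip the batch training steps
        type: boolean
        default: false
```

Declaring `skip_build` and `skip_batch_training` as `boolean` matters because the `inputs` context preserves Boolean values, so conditions such as `!inputs.skip_batch_training` behave as expected.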
| - name: Setup Permissions | ||
| run: | | ||
| sudo chown -R $USER:$USER /home/$USER/action-runner/_work/mlperf-training | ||
| echo "$DOCKER_CREDENTIALS" |
Copilot
AI
Nov 21, 2025
The variable DOCKER_CREDENTIALS is referenced but never defined. This line should either be removed or the variable should be properly defined.
| echo "$DOCKER_CREDENTIALS" |
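If the intent of that line was to authenticate with a registry rather than print a value, a login step along the following lines may be closer to what is needed. This is a sketch under assumptions: the secret names `DOCKER_USER_NAME` and `DOCKER_PASS_KEY` appear elsewhere in this workflow, but using them for `docker login` here is a guess, and the step name is hypothetical.

```yaml
# Hypothetical login step. The secret names appear elsewhere in this workflow;
# using them for `docker login` here is an assumption.
- name: Docker login
  env:
    DOCKER_USER_NAME: ${{ secrets.DOCKER_USER_NAME }}
    DOCKER_PASS_KEY: ${{ secrets.DOCKER_PASS_KEY }}
  run: |
    # --password-stdin keeps the secret out of the process argument list
    echo "$DOCKER_PASS_KEY" | docker login --username "$DOCKER_USER_NAME" --password-stdin
```

Mapping secrets into `env` at the step level keeps them scoped to the one step that needs them.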
| train: | ||
| timeout-minutes: 60 # 1 hour | ||
| needs: build | ||
| if: ${{ always() && (needs.build.result == 'success' || inputs.skip_build) }} |
Copilot
AI
Nov 21, 2025
The workflow references inputs.skip_build but this input is not defined in the workflow_dispatch inputs section. Either add this input definition or remove it from the condition.
| run: | | ||
| UUID=$(uuidgen) | ||
| echo "UUID: $UUID" | ||
| echo "UUID: $UUID" >> $GITHUB_ENV |
Copilot
AI
Nov 21, 2025
The UUID environment variable is being set incorrectly. The correct format for setting environment variables in GitHub Actions is echo "UUID=$UUID" >> $GITHUB_ENV (note the equals sign, not a colon).
| echo "UUID: $UUID" >> $GITHUB_ENV | |
| echo "UUID=$UUID" >> $GITHUB_ENV |
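To illustrate the corrected format in context, here is a minimal two-step sketch. The step names are hypothetical; the `uuidgen` call and the log message are taken from the workflow.

```yaml
- name: Generate run ID
  run: |
    UUID=$(uuidgen)
    # KEY=VALUE is the line format that $GITHUB_ENV expects
    echo "UUID=$UUID" >> "$GITHUB_ENV"

- name: Use run ID
  run: |
    # Variables written to $GITHUB_ENV become available in later steps,
    # not in the step that writes them
    echo "Launching Batch Run with ID: $UUID"
```

Later steps can also read the value through the expression `${{ env.UUID }}`, which is how the workflow builds its log directory path.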
| echo "UUID: $UUID" >> $GITHUB_ENV | ||
| echo "Launching Batch Run with ID: $UUID" | ||
|
|
||
| cd /workspace |
Copilot
AI
Nov 21, 2025
The workflow attempts to change to /workspace directory, but this is running on a GitHub Actions runner, not inside a container. This will likely fail as /workspace may not exist on the runner. Consider using $GITHUB_WORKSPACE or a relative path.
| cd /workspace | |
| cd $GITHUB_WORKSPACE |
| docker exec -it dev_primus bash | ||
|
|
||
| cd Primus && pip install -r requirements.txt | ||
|
|
||
| # Run Training | ||
| EXP=examples/megatron/configs/llama3.1_8B-pretrain-mlperf.yaml bash ./examples/run_pretrain_mlperf.sh |
Copilot
AI
Nov 21, 2025
Using docker exec -it with interactive terminal flags in a CI environment will cause the workflow to hang indefinitely. Remove the -it flags and provide the command to execute directly, or restructure to run commands within the container non-interactively.
| docker exec -it dev_primus bash | |
| cd Primus && pip install -r requirements.txt | |
| # Run Training | |
| EXP=examples/megatron/configs/llama3.1_8B-pretrain-mlperf.yaml bash ./examples/run_pretrain_mlperf.sh | |
| docker exec dev_primus bash -c "cd Primus && pip install -r requirements.txt && EXP=examples/megatron/configs/llama3.1_8B-pretrain-mlperf.yaml bash ./examples/run_pretrain_mlperf.sh" |
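The one-line suggestion can also be written as a multi-line command, which some may find easier to read and debug. The sketch below assumes the `dev_primus` container is already running and that `Primus` is reachable from its default working directory, as in the suggestion; the step name and the `set -euo pipefail` line are additions for illustration. In the original form, the lines after `docker exec -it dev_primus bash` would run on the runner rather than inside the container, so the commands must be passed to `bash -c` either way.

```yaml
- name: Run training
  run: |
    # No -i/-t flags: CI runners have no terminal attached.
    # set -euo pipefail makes the step fail as soon as any command fails.
    docker exec dev_primus bash -c '
      set -euo pipefail
      cd Primus
      pip install -r requirements.txt
      EXP=examples/megatron/configs/llama3.1_8B-pretrain-mlperf.yaml \
        bash ./examples/run_pretrain_mlperf.sh
    '
```

`docker exec` returns the exit status of the command it ran, so a failed install or training run marks the step as failed.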
| ${{ env.LOGDIR }}/${{ env.UUID }} | ||
|
|
||
| - name: Upload Batch Training Logs | ||
| if: ${{ !inputs.skip_batch_training }} |
Copilot
AI
Nov 21, 2025
The workflow references inputs.skip_batch_training but this input is not defined in the workflow_dispatch inputs section. Either add this input definition or remove this condition.
| docker-user-name: ${{ secrets.DOCKER_USER_NAME }} | ||
| docker-pass-key: ${{ secrets.DOCKER_PASS_KEY }} | ||
| dockerfile: Dockerfile | ||
| custom-image-tag: ${{ env.BASE_IMAGE_TAG }} |
Copilot
AI
Nov 21, 2025
Trailing whitespace detected at the end of line 59. Remove the extra spaces for consistency.
| custom-image-tag: ${{ env.BASE_IMAGE_TAG }} | |
| custom-image-tag: ${{ env.BASE_IMAGE_TAG }} |
| echo "MODEL: $MODEL" | ||
| echo "DATADIR: $DATADIR" | ||
| echo "MODEL_NAME: $MODEL_NAME" | ||
|
|
Copilot
AI
Nov 21, 2025
Trailing whitespace detected at the end of line 88. Remove the extra spaces for consistency.
New workflow to trigger pre-training of the Llama3.1 8B model using the Primus/Megatron framework.