Collaborator

@kylasa kylasa commented Nov 20, 2025

New workflow to trigger pre-training for the Llama 3.1 8B model using the Primus/Megatron framework.

Contributor

@wenxie-amd wenxie-amd left a comment

LGTM

@wenxie-amd wenxie-amd requested a review from Copilot November 21, 2025 09:40
Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces a new GitHub Actions workflow to automate pre-training of the Llama3.1 8B model using the Primus/Megatron framework. The workflow includes Docker image building, environment setup, training execution, and artifact upload capabilities.

Key Changes

  • Added a workflow with build and train jobs that support multiple GPU runner configurations
  • Integrated Docker-based training environment with AWS S3 for log storage
  • Implemented validation checks for required data and model directories

@@ -0,0 +1,156 @@
name: Primus Llama3.1 8B
run-name: Primus Llama3.1 8B | ${{ inputs.runner_label | "m13-21" }}

Copilot AI Nov 21, 2025

Invalid syntax for the default value in the workflow run-name. The pipe operator | is not a valid operator in GitHub Actions expressions, so it cannot supply a default value. Use || instead: ${{ inputs.runner_label || 'm13-21' }}

Suggested change
run-name: Primus Llama3.1 8B | ${{ inputs.runner_label | "m13-21" }}
run-name: Primus Llama3.1 8B | ${{ inputs.runner_label || 'm13-21' }}

Copilot uses AI. Check for mistakes.
timeout-minutes: 60 # 1 hour
outputs:
  image_tag: ${{ steps.base_docker_build.outputs.image_tag }}
if: ${{ inputs.image_tag == '' || inputs.image_tag == null }}

Copilot AI Nov 21, 2025

The workflow references inputs.image_tag but this input is not defined in the workflow_dispatch inputs section (lines 5-15). Either add this input definition or remove this condition.
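
One way to catch this class of problem before review is to list every inputs.<name> reference in the workflow file and compare the result by hand with the declared workflow_dispatch inputs. The sketch below is a minimal helper under that assumption; the function name list_input_refs and the file path in the usage comment are hypothetical and not part of this PR.

```shell
# List every distinct `inputs.<name>` reference in a workflow file, one per line.
# Hypothetical helper: it only finds references, it does not parse the
# `workflow_dispatch.inputs` section, so the comparison is left to the reader.
list_input_refs() {
  grep -oE 'inputs\.[A-Za-z_][A-Za-z0-9_-]*' "$1" \
    | sed 's/^inputs\.//' \
    | sort -u
}

# Usage (placeholder path, substitute the real workflow file):
#   list_input_refs path/to/workflow.yml
```

For the conditions quoted in this review, the output would include image_tag, skip_build, and skip_batch_training, each of which needs a matching declaration.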

- name: Setup Permissions
  run: |
    sudo chown -R $USER:$USER /home/$USER/action-runner/_work/mlperf-training
    echo "$DOCKER_CREDENTIALS"

Copilot AI Nov 21, 2025

The variable DOCKER_CREDENTIALS is referenced but never defined. This line should either be removed or the variable should be properly defined.

Suggested change
echo "$DOCKER_CREDENTIALS"

train:
  timeout-minutes: 60 # 1 hour
  needs: build
  if: ${{ always() && (needs.build.result == 'success' || inputs.skip_build) }}

Copilot AI Nov 21, 2025

The workflow references inputs.skip_build but this input is not defined in the workflow_dispatch inputs section. Either add this input definition or remove it from the condition.

run: |
  UUID=$(uuidgen)
  echo "UUID: $UUID"
  echo "UUID: $UUID" >> $GITHUB_ENV

Copilot AI Nov 21, 2025

The UUID environment variable is being set incorrectly. The correct format for setting environment variables in GitHub Actions is echo "UUID=$UUID" >> $GITHUB_ENV (note the equals sign, not a colon).

Suggested change
echo "UUID: $UUID" >> $GITHUB_ENV
echo "UUID=$UUID" >> $GITHUB_ENV
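
To make the NAME=VALUE format harder to get wrong, the write could be wrapped in a small helper. This is only a sketch: the function name set_env is hypothetical, and it assumes single-line values (multi-line values need the delimiter syntax described in the GitHub Actions documentation).

```shell
# Append one NAME=VALUE entry to the file the runner exposes as $GITHUB_ENV.
# Hypothetical helper; assumes the value contains no newlines.
set_env() {
  local name="$1" value="$2"
  printf '%s=%s\n' "$name" "$value" >> "$GITHUB_ENV"
}

# Usage inside a `run:` step, mirroring the step under review:
#   UUID=$(uuidgen)
#   set_env UUID "$UUID"
```

Later steps in the same job can then read the value as ${{ env.UUID }}.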

echo "UUID: $UUID" >> $GITHUB_ENV
echo "Launching Batch Run with ID: $UUID"

cd /workspace

Copilot AI Nov 21, 2025

The workflow attempts to change to the /workspace directory, but this step runs on the GitHub Actions runner, not inside a container, so it will likely fail because /workspace may not exist there. Consider using $GITHUB_WORKSPACE or a relative path.

Suggested change
cd /workspace
cd $GITHUB_WORKSPACE
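
A slightly more defensive variant quotes the variable and fails loudly if the directory is missing. The sketch below assumes the step runs on the runner host; the function name enter_workspace and the fallback to the current directory are illustrative choices, not something this PR defines.

```shell
# Change into the checkout directory, quoting the path and reporting a clear
# error if it does not exist. Hypothetical helper; falls back to the current
# directory when GITHUB_WORKSPACE is unset (for example, in a local run).
enter_workspace() {
  local dir="${GITHUB_WORKSPACE:-$PWD}"
  cd "$dir" || { echo "cannot enter workspace: $dir" >&2; return 1; }
}
```

Quoting matters here because a runner path containing spaces would otherwise be split into several arguments.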

Comment on lines +125 to +130
docker exec -it dev_primus bash

cd Primus && pip install -r requirements.txt

# Run Training
EXP=examples/megatron/configs/llama3.1_8B-pretrain-mlperf.yaml bash ./examples/run_pretrain_mlperf.sh

Copilot AI Nov 21, 2025

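Using docker exec -it in a CI environment is unreliable: no TTY is attached, so the command typically fails with "the input device is not a TTY" or blocks waiting for input. The lines after it would also run on the runner rather than inside the container. Remove the -it flags and pass the command to execute directly, or restructure the step to run the commands inside the container non-interactively.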

Suggested change
docker exec -it dev_primus bash
cd Primus && pip install -r requirements.txt
# Run Training
EXP=examples/megatron/configs/llama3.1_8B-pretrain-mlperf.yaml bash ./examples/run_pretrain_mlperf.sh
docker exec dev_primus bash -c "cd Primus && pip install -r requirements.txt && EXP=examples/megatron/configs/llama3.1_8B-pretrain-mlperf.yaml bash ./examples/run_pretrain_mlperf.sh"
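
The suggested one-liner can also be written as a small wrapper that turns on strict shell options inside the container, so the first failing command fails the step. This is a sketch under assumptions: run_in_container and the DOCKER override variable are hypothetical names, while the container name dev_primus and the training command come from the workflow under review.

```shell
# Run a script inside a running container without -i/-t, since CI attaches no TTY.
# `set -euo pipefail` makes the script stop at the first failing command.
# Hypothetical helper; DOCKER can be overridden, e.g. to stub docker in tests.
run_in_container() {
  local container="$1" script="$2"
  "${DOCKER:-docker}" exec "$container" bash -c "set -euo pipefail; $script"
}

# Usage with the commands from this workflow:
#   run_in_container dev_primus \
#     'cd Primus && pip install -r requirements.txt && EXP=examples/megatron/configs/llama3.1_8B-pretrain-mlperf.yaml bash ./examples/run_pretrain_mlperf.sh'
```

The exit status of docker exec is the exit status of the command it ran, so a failed training run fails the workflow step.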

${{ env.LOGDIR }}/${{ env.UUID }}

- name: Upload Batch Training Logs
  if: ${{ !inputs.skip_batch_training }}

Copilot AI Nov 21, 2025

The workflow references inputs.skip_batch_training but this input is not defined in the workflow_dispatch inputs section. Either add this input definition or remove this condition.

docker-user-name: ${{ secrets.DOCKER_USER_NAME }}
docker-pass-key: ${{ secrets.DOCKER_PASS_KEY }}
dockerfile: Dockerfile
custom-image-tag: ${{ env.BASE_IMAGE_TAG }}

Copilot AI Nov 21, 2025

Trailing whitespace detected at the end of line 59. Remove the extra spaces for consistency.

Suggested change
custom-image-tag: ${{ env.BASE_IMAGE_TAG }}
custom-image-tag: ${{ env.BASE_IMAGE_TAG }}
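
Trailing whitespace is invisible in most views (the two lines in the suggestion above look identical once rendered), so it can help to check for it mechanically. The helpers below are a minimal sketch; the function names find_trailing_ws and strip_trailing_ws are hypothetical, and the file path in the usage comment is a placeholder.

```shell
# Print "LINE:content" for every line that ends in spaces or tabs.
# Returns a non-zero status when no such line exists (grep's usual behaviour).
find_trailing_ws() {
  grep -nE '[[:blank:]]+$' "$1"
}

# Write the file to stdout with trailing spaces and tabs removed.
# The original file is left untouched.
strip_trailing_ws() {
  sed -E 's/[[:blank:]]+$//' "$1"
}

# Usage (placeholder path, substitute the real workflow file):
#   find_trailing_ws path/to/workflow.yml
```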

echo "MODEL: $MODEL"
echo "DATADIR: $DATADIR"
echo "MODEL_NAME: $MODEL_NAME"


Copilot AI Nov 21, 2025

Trailing whitespace detected at the end of line 88. Remove the extra spaces for consistency.

Suggested change


2 participants