Add optimized megatron version #516

Merged
jgphpc merged 1 commit into eth-cscs:main from aurianer:add_amd_optimized_megatronlm
Feb 9, 2026

Conversation


@aurianer aurianer commented Feb 6, 2026

Add the optimized Megatron-LM version from the hackathon. Also update the HuggingFace cache to point to SCRATCH, and update the maintainers compared to the tischwab0911 version.

To run the tests:

$ reframe -C config/cscs.py \
-c checks/apps/pytorch/pytorch_megatronlm_amd_optimized.py \
--system=beverin:mi300 \
--exec-policy=serial \
-s $SCRATCH/reframe/stage  \
-o $SCRATCH/reframe/output \
--perflogdir=$SCRATCH/reframe/perflogs -r


aurianer commented Feb 6, 2026

cscs-ci run alps-beverin-uenv;MY_UENV=prgenv-gnu/25.07-6.3.3:v12

Copilot AI left a comment

Pull request overview

This PR adds an optimized version of the Megatron-LM benchmark test for AMD MI300 GPUs, based on work from a hackathon. It includes performance optimizations such as NCCL tuning, FP8 support, and various parallelization strategies. The PR also updates the HuggingFace cache configuration to use the SCRATCH directory and updates maintainers information.

Changes:

  • Adds a global /iopsstor mount to the ContainerEngineMixin class affecting all container-based tests
  • Introduces PyTorchMegatronLM_AMD_Optimized test class with support for llama3-8b and llama3-70b models
  • Includes NCCL tuner configuration for optimized network communication

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

File Description
checks/mixins/container_engine.py Adds hardcoded /iopsstor mount to all container tests
checks/apps/pytorch/pytorch_megatronlm_amd_optimized.py New optimized Megatron-LM test with AMD-specific configurations, NCCL tuning, and enhanced parallelization features
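Since the `/iopsstor` mount added in the mixin applies to every container-based test, the shape of that change can be pictured with a minimal plain-Python sketch; the class and attribute names below are illustrative assumptions, not the actual ReFrame mixin API:

```python
# Minimal sketch of a mixin that applies a global mount to all
# container-based tests; names are illustrative, not the real API.
class ContainerEngineMixin:
    # Mounts applied to every test that uses this mixin; adding
    # '/iopsstor' here is what makes the change global.
    global_mounts = ['/iopsstor:/iopsstor']

    def all_mounts(self, test_specific=()):
        # Global mounts come first, then any test-specific ones.
        return [*self.global_mounts, *test_specific]


mounts = ContainerEngineMixin().all_mounts(['/scratch:/scratch'])
print(mounts)  # → ['/iopsstor:/iopsstor', '/scratch:/scratch']
```

This is why the review flags the mount as affecting all container tests, not just the new Megatron-LM one.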


'AITER_JIT_DIR': '$AITER_ROOT_DIR/jit',
'MEGATRON_LM_DIR': '$PWD/Megatron-LM',
'PYTHONPATH': '$MEGATRON_LM_DIR:$PYTHONPATH',
'NCCL_TUNER_CONFIG_FILE': '/iopsstor/scratch/cscs/tschwab/Hackathon/rccl-tuner/nccl_tuner.conf',
Copilot AI Feb 6, 2026

The hardcoded user-specific path '/iopsstor/scratch/cscs/tschwab/Hackathon/rccl-tuner/nccl_tuner.conf' makes this test non-portable and will fail for other users or environments. This should be made configurable via a variable or environment variable, or the file should be placed in a shared location accessible to all users.
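The fix the review suggests (resolving the path from an environment variable with a fallback) can be sketched in plain Python; the default path below is purely a placeholder, not the actual repository layout:

```python
import os

# Sketch: resolve the tuner config path from the environment, falling
# back to a shared default; '/path/to/shared/...' is a placeholder.
def nccl_tuner_config_file(default='/path/to/shared/nccl_tuner.conf'):
    return os.environ.get('NCCL_TUNER_CONFIG_FILE', default)
```

In a real ReFrame test this would more idiomatically be a test `variable`, so users can override it from the command line.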

Comment on lines 165 to 166
'NCCL_TUNER_CONFIG_FILE': '/iopsstor/scratch/cscs/tschwab/Hackathon/rccl-tuner/nccl_tuner.conf',
'NCCL_TUNER_PLUGIN': '/iopsstor/scratch/cscs/tschwab/Hackathon/rccl-tuner/libnccl-tuner.so',
Copilot AI Feb 6, 2026

The hardcoded user-specific path '/iopsstor/scratch/cscs/tschwab/Hackathon/rccl-tuner/libnccl-tuner.so' makes this test non-portable and will fail for other users or environments. This should be made configurable via a variable or environment variable, or the file should be placed in a shared location accessible to all users.

Suggested change
'NCCL_TUNER_CONFIG_FILE': '/iopsstor/scratch/cscs/tschwab/Hackathon/rccl-tuner/nccl_tuner.conf',
'NCCL_TUNER_PLUGIN': '/iopsstor/scratch/cscs/tschwab/Hackathon/rccl-tuner/libnccl-tuner.so',
'NCCL_TUNER_CONFIG_FILE': os.getenv(
'NCCL_TUNER_CONFIG_FILE',
'$SCRATCH/rccl-tuner/nccl_tuner.conf'
),
'NCCL_TUNER_PLUGIN': os.getenv(
'NCCL_TUNER_PLUGIN',
'$SCRATCH/rccl-tuner/libnccl-tuner.so'
),

]

training_cmd = (
f'export NCCL_TUNER_TUNING_FILE=/iopsstor/scratch/cscs/tschwab/Hackathon/logs/tuning/$SLURM_PROCID.csv \n'
Copilot AI Feb 6, 2026

The hardcoded user-specific path '/iopsstor/scratch/cscs/tschwab/Hackathon/logs/tuning/$SLURM_PROCID.csv' makes this test non-portable and will fail for other users or environments. This should be made configurable via a variable or environment variable, or use a location under the user's SCRATCH directory similar to how HF_HOME is handled.

Suggested change
f'export NCCL_TUNER_TUNING_FILE=/iopsstor/scratch/cscs/tschwab/Hackathon/logs/tuning/$SLURM_PROCID.csv \n'
f'export NCCL_TUNER_TUNING_FILE=${{NCCL_TUNER_TUNING_FILE:-$SCRATCH/Hackathon/logs/tuning/$SLURM_PROCID.csv}} \n'

sourcesdir = None
image = variable(
str,
value=('docker://rocm/megatron-lm:v25.5_py312')
Copilot AI Feb 6, 2026

This test uses container image version 'v25.5_py312' while the similar test in pytorch_megatronlm_amd.py uses 'v25.6_py312'. Consider using the newer version for consistency and to ensure you have the latest fixes and features, unless there's a specific reason to use the older version.

Suggested change
value=('docker://rocm/megatron-lm:v25.5_py312')
value=('docker://rocm/megatron-lm:v25.6_py312')

Comment on lines +261 to +262
f'--context-parallel-size ',
f'{model_config["context_parallel_size"]}',
Copilot AI Feb 6, 2026

The context-parallel-size flag and its value are incorrectly split across two separate list items. Line 261 contains '--context-parallel-size ' (with trailing space) and line 262 contains the value. This will result in two separate command-line arguments instead of one flag with its value. These two lines should be combined into a single f-string like the other configuration flags.

Suggested change
f'--context-parallel-size ',
f'{model_config["context_parallel_size"]}',
f'--context-parallel-size {model_config["context_parallel_size"]}',

@aurianer aurianer force-pushed the add_amd_optimized_megatronlm branch from ab5b82e to 86fe793 Compare February 6, 2026 17:44

aurianer commented Feb 6, 2026

Some of the paths for the NCCL tuner config files were hardcoded by Timo. Is there a place where we put such config files for ReFrame tests? I guess it would be cleaner to use that instead.

Also update the maintainers compared to the tischwab0911 version.
@aurianer aurianer force-pushed the add_amd_optimized_megatronlm branch from 86fe793 to 182b111 Compare February 6, 2026 18:10

aurianer commented Feb 6, 2026

As suggested by @jgphpc , I've moved the rccl tuner config files to /capstor/store/cscs/cscs/public/reframe/resources
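With the files moved to the shared location above, the tuner environment could be built as in this plain-Python sketch; the exact file names are taken from the paths discussed in the review, but the helper itself is an illustration, not the test's actual code:

```python
import os

# Shared resources directory the config files were moved to.
RESOURCES_DIR = '/capstor/store/cscs/cscs/public/reframe/resources'

def tuner_env(resources_dir=RESOURCES_DIR):
    # Point both NCCL tuner variables at the shared location instead of
    # a user-specific scratch path.
    return {
        'NCCL_TUNER_CONFIG_FILE': os.path.join(resources_dir, 'nccl_tuner.conf'),
        'NCCL_TUNER_PLUGIN': os.path.join(resources_dir, 'libnccl-tuner.so'),
    }
```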


aurianer commented Feb 6, 2026

cscs-ci run alps-beverin-uenv;MY_UENV=prgenv-gnu/25.07-6.3.3:v12


@jgphpc jgphpc left a comment


I will add some minor changes (in another PR).

@jgphpc jgphpc merged commit 64b42a0 into eth-cscs:main Feb 9, 2026
2 checks passed