Add optimized megatron version #516

Merged
jgphpc merged 1 commit into eth-cscs:main from aurianer:add_amd_optimized_megatronlm
Feb 9, 2026

Conversation


@aurianer aurianer commented Feb 6, 2026

Add the optimized Megatron-LM version from the hackathon. Also update the HuggingFace cache to point to SCRATCH, and update the maintainers compared to the tischwab0911 version.

To run the tests:

$ reframe -C config/cscs.py \
-c checks/apps/pytorch/pytorch_megatronlm_amd_optimized.py \
--system=beverin:mi300 \
--exec-policy=serial \
-s $SCRATCH/reframe/stage  \
-o $SCRATCH/reframe/output \
--perflogdir=$SCRATCH/reframe/perflogs -r


aurianer commented Feb 6, 2026

cscs-ci run alps-beverin-uenv;MY_UENV=prgenv-gnu/25.07-6.3.3:v12

Copilot AI left a comment

Pull request overview

This PR adds an optimized version of the Megatron-LM benchmark test for AMD MI300 GPUs, based on work from a hackathon. It includes performance optimizations such as NCCL tuning, FP8 support, and various parallelization strategies. The PR also updates the HuggingFace cache configuration to use the SCRATCH directory and updates maintainers information.

Changes:

  • Adds a global /iopsstor mount to the ContainerEngineMixin class affecting all container-based tests
  • Introduces PyTorchMegatronLM_AMD_Optimized test class with support for llama3-8b and llama3-70b models
  • Includes NCCL tuner configuration for optimized network communication

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

File Description
checks/mixins/container_engine.py Adds hardcoded /iopsstor mount to all container tests
checks/apps/pytorch/pytorch_megatronlm_amd_optimized.py New optimized Megatron-LM test with AMD-specific configurations, NCCL tuning, and enhanced parallelization features
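Since the `/iopsstor` mount added in the mixin applies to every container-based test, the shape of that change can be pictured with a minimal plain-Python sketch; the class and attribute names below are illustrative assumptions, not the actual ReFrame mixin API:

```python
# Minimal sketch of a mixin that applies a global mount to all
# container-based tests; names are illustrative, not the real API.
class ContainerEngineMixin:
    # Mounts applied to every test that uses this mixin; adding
    # '/iopsstor' here is what makes the change global.
    global_mounts = ['/iopsstor:/iopsstor']

    def all_mounts(self, test_specific=()):
        # Global mounts come first, then any test-specific ones.
        return [*self.global_mounts, *test_specific]


mounts = ContainerEngineMixin().all_mounts(['/scratch:/scratch'])
print(mounts)  # → ['/iopsstor:/iopsstor', '/scratch:/scratch']
```

This is why the review flags the mount as affecting all container tests, not just the new Megatron-LM one.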


'AITER_JIT_DIR': '$AITER_ROOT_DIR/jit',
'MEGATRON_LM_DIR': '$PWD/Megatron-LM',
'PYTHONPATH': '$MEGATRON_LM_DIR:$PYTHONPATH',
'NCCL_TUNER_CONFIG_FILE': '/iopsstor/scratch/cscs/tschwab/Hackathon/rccl-tuner/nccl_tuner.conf',
Copilot AI Feb 6, 2026

The hardcoded user-specific path '/iopsstor/scratch/cscs/tschwab/Hackathon/rccl-tuner/nccl_tuner.conf' makes this test non-portable and will fail for other users or environments. This should be made configurable via a variable or environment variable, or the file should be placed in a shared location accessible to all users.
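The fix the review suggests (resolving the path from an environment variable with a fallback) can be sketched in plain Python; the default path below is purely a placeholder, not the actual repository layout:

```python
import os

# Sketch: resolve the tuner config path from the environment, falling
# back to a shared default; '/path/to/shared/...' is a placeholder.
def nccl_tuner_config_file(default='/path/to/shared/nccl_tuner.conf'):
    return os.environ.get('NCCL_TUNER_CONFIG_FILE', default)
```

In a real ReFrame test this would more idiomatically be a test `variable`, so users can override it from the command line.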

Comment on lines 165 to 166
'NCCL_TUNER_CONFIG_FILE': '/iopsstor/scratch/cscs/tschwab/Hackathon/rccl-tuner/nccl_tuner.conf',
'NCCL_TUNER_PLUGIN': '/iopsstor/scratch/cscs/tschwab/Hackathon/rccl-tuner/libnccl-tuner.so',
Copilot AI Feb 6, 2026

The hardcoded user-specific path '/iopsstor/scratch/cscs/tschwab/Hackathon/rccl-tuner/libnccl-tuner.so' makes this test non-portable and will fail for other users or environments. This should be made configurable via a variable or environment variable, or the file should be placed in a shared location accessible to all users.

Suggested change
'NCCL_TUNER_CONFIG_FILE': '/iopsstor/scratch/cscs/tschwab/Hackathon/rccl-tuner/nccl_tuner.conf',
'NCCL_TUNER_PLUGIN': '/iopsstor/scratch/cscs/tschwab/Hackathon/rccl-tuner/libnccl-tuner.so',
'NCCL_TUNER_CONFIG_FILE': os.getenv(
'NCCL_TUNER_CONFIG_FILE',
'$SCRATCH/rccl-tuner/nccl_tuner.conf'
),
'NCCL_TUNER_PLUGIN': os.getenv(
'NCCL_TUNER_PLUGIN',
'$SCRATCH/rccl-tuner/libnccl-tuner.so'
),

]

training_cmd = (
f'export NCCL_TUNER_TUNING_FILE=/iopsstor/scratch/cscs/tschwab/Hackathon/logs/tuning/$SLURM_PROCID.csv \n'
Copilot AI Feb 6, 2026

The hardcoded user-specific path '/iopsstor/scratch/cscs/tschwab/Hackathon/logs/tuning/$SLURM_PROCID.csv' makes this test non-portable and will fail for other users or environments. This should be made configurable via a variable or environment variable, or use a location under the user's SCRATCH directory similar to how HF_HOME is handled.

Suggested change
f'export NCCL_TUNER_TUNING_FILE=/iopsstor/scratch/cscs/tschwab/Hackathon/logs/tuning/$SLURM_PROCID.csv \n'
f'export NCCL_TUNER_TUNING_FILE=${{NCCL_TUNER_TUNING_FILE:-$SCRATCH/Hackathon/logs/tuning/$SLURM_PROCID.csv}} \n'

sourcesdir = None
image = variable(
str,
value=('docker://rocm/megatron-lm:v25.5_py312')
Copilot AI Feb 6, 2026

This test uses container image version 'v25.5_py312' while the similar test in pytorch_megatronlm_amd.py uses 'v25.6_py312'. Consider using the newer version for consistency and to ensure you have the latest fixes and features, unless there's a specific reason to use the older version.

Suggested change
value=('docker://rocm/megatron-lm:v25.5_py312')
value=('docker://rocm/megatron-lm:v25.6_py312')

Comment on lines +261 to +262
f'--context-parallel-size ',
f'{model_config["context_parallel_size"]}',
Copilot AI Feb 6, 2026

The context-parallel-size flag and its value are incorrectly split across two separate list items. Line 261 contains '--context-parallel-size ' (with trailing space) and line 262 contains the value. This will result in two separate command-line arguments instead of one flag with its value. These two lines should be combined into a single f-string like the other configuration flags.

Suggested change
f'--context-parallel-size ',
f'{model_config["context_parallel_size"]}',
f'--context-parallel-size {model_config["context_parallel_size"]}',

@aurianer aurianer force-pushed the add_amd_optimized_megatronlm branch from ab5b82e to 86fe793 Compare February 6, 2026 17:44

aurianer commented Feb 6, 2026

Some of the paths for the NCCL tuner config files were hardcoded by Timo. Is there a place where we put such config files for ReFrame tests? I guess it would be cleaner to use that instead.

Also update the maintainers compared to the tischwab0911 version.
@aurianer aurianer force-pushed the add_amd_optimized_megatronlm branch from 86fe793 to 182b111 Compare February 6, 2026 18:10

aurianer commented Feb 6, 2026

As suggested by @jgphpc , I've moved the rccl tuner config files to /capstor/store/cscs/cscs/public/reframe/resources
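With the files moved to the shared location above, the tuner environment could be built as in this plain-Python sketch; the exact file names are taken from the paths discussed in the review, but the helper itself is an illustration, not the test's actual code:

```python
import os

# Shared resources directory the config files were moved to.
RESOURCES_DIR = '/capstor/store/cscs/cscs/public/reframe/resources'

def tuner_env(resources_dir=RESOURCES_DIR):
    # Point both NCCL tuner variables at the shared location instead of
    # a user-specific scratch path.
    return {
        'NCCL_TUNER_CONFIG_FILE': os.path.join(resources_dir, 'nccl_tuner.conf'),
        'NCCL_TUNER_PLUGIN': os.path.join(resources_dir, 'libnccl-tuner.so'),
    }
```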


aurianer commented Feb 6, 2026

cscs-ci run alps-beverin-uenv;MY_UENV=prgenv-gnu/25.07-6.3.3:v12


@jgphpc jgphpc left a comment


I will add some minor changes (in another PR).

@jgphpc jgphpc merged commit 64b42a0 into eth-cscs:main Feb 9, 2026
2 checks passed