Conversation
```
cscs-ci run alps-beverin-uenv;MY_UENV=prgenv-gnu/25.07-6.3.3:v12
```
Pull request overview
This PR adds an optimized version of the Megatron-LM benchmark test for AMD MI300 GPUs, based on work from a hackathon. It includes performance optimizations such as NCCL tuning, FP8 support, and various parallelization strategies. The PR also updates the HuggingFace cache configuration to use the SCRATCH directory and updates the maintainer information.
Changes:
- Adds a global `/iopsstor` mount to the `ContainerEngineMixin` class, affecting all container-based tests
- Introduces a `PyTorchMegatronLM_AMD_Optimized` test class with support for llama3-8b and llama3-70b models
- Includes NCCL tuner configuration for optimized network communication
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| checks/mixins/container_engine.py | Adds hardcoded /iopsstor mount to all container tests |
| checks/apps/pytorch/pytorch_megatronlm_amd_optimized.py | New optimized Megatron-LM test with AMD-specific configurations, NCCL tuning, and enhanced parallelization features |
```python
'AITER_JIT_DIR': '$AITER_ROOT_DIR/jit',
'MEGATRON_LM_DIR': '$PWD/Megatron-LM',
'PYTHONPATH': '$MEGATRON_LM_DIR:$PYTHONPATH',
'NCCL_TUNER_CONFIG_FILE': '/iopsstor/scratch/cscs/tschwab/Hackathon/rccl-tuner/nccl_tuner.conf',
```
The hardcoded user-specific path '/iopsstor/scratch/cscs/tschwab/Hackathon/rccl-tuner/nccl_tuner.conf' makes this test non-portable and will fail for other users or environments. This should be made configurable via a variable or environment variable, or the file should be placed in a shared location accessible to all users.
```python
'NCCL_TUNER_CONFIG_FILE': '/iopsstor/scratch/cscs/tschwab/Hackathon/rccl-tuner/nccl_tuner.conf',
'NCCL_TUNER_PLUGIN': '/iopsstor/scratch/cscs/tschwab/Hackathon/rccl-tuner/libnccl-tuner.so',
```
The hardcoded user-specific path '/iopsstor/scratch/cscs/tschwab/Hackathon/rccl-tuner/libnccl-tuner.so' makes this test non-portable and will fail for other users or environments. This should be made configurable via a variable or environment variable, or the file should be placed in a shared location accessible to all users.
Suggested change:
```diff
-'NCCL_TUNER_CONFIG_FILE': '/iopsstor/scratch/cscs/tschwab/Hackathon/rccl-tuner/nccl_tuner.conf',
-'NCCL_TUNER_PLUGIN': '/iopsstor/scratch/cscs/tschwab/Hackathon/rccl-tuner/libnccl-tuner.so',
+'NCCL_TUNER_CONFIG_FILE': os.getenv(
+    'NCCL_TUNER_CONFIG_FILE',
+    '$SCRATCH/rccl-tuner/nccl_tuner.conf'
+),
+'NCCL_TUNER_PLUGIN': os.getenv(
+    'NCCL_TUNER_PLUGIN',
+    '$SCRATCH/rccl-tuner/libnccl-tuner.so'
+),
```
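Beyond `os.getenv`, ReFrame can also expose the tuner location as a test variable that is overridable with `-S` on the command line. A minimal sketch, assuming ReFrame ≥ 4.0; the class body and hook below are illustrative, not the PR's actual code:

```python
import reframe as rfm
from reframe.core.builtins import run_before, variable


class PyTorchMegatronLM_AMD_Optimized(rfm.RunOnlyRegressionTest):
    # Base directory holding the RCCL tuner artifacts; override at run time:
    #   reframe ... -S nccl_tuner_dir=/path/to/shared/rccl-tuner
    nccl_tuner_dir = variable(str, value='$SCRATCH/rccl-tuner')

    @run_before('run')
    def set_tuner_env(self):
        # $SCRATCH is expanded later by the shell in the generated job script
        self.env_vars.update({
            'NCCL_TUNER_CONFIG_FILE': f'{self.nccl_tuner_dir}/nccl_tuner.conf',
            'NCCL_TUNER_PLUGIN': f'{self.nccl_tuner_dir}/libnccl-tuner.so',
        })
```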
```python
]

training_cmd = (
    f'export NCCL_TUNER_TUNING_FILE=/iopsstor/scratch/cscs/tschwab/Hackathon/logs/tuning/$SLURM_PROCID.csv \n'
```
The hardcoded user-specific path '/iopsstor/scratch/cscs/tschwab/Hackathon/logs/tuning/$SLURM_PROCID.csv' makes this test non-portable and will fail for other users or environments. This should be made configurable via a variable or environment variable, or use a location under the user's SCRATCH directory similar to how HF_HOME is handled.
Suggested change:
```diff
-f'export NCCL_TUNER_TUNING_FILE=/iopsstor/scratch/cscs/tschwab/Hackathon/logs/tuning/$SLURM_PROCID.csv \n'
+f'export NCCL_TUNER_TUNING_FILE=${{NCCL_TUNER_TUNING_FILE:-$SCRATCH/Hackathon/logs/tuning/$SLURM_PROCID.csv}} \n'
```
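For reference, the `${VAR:-default}` expansion in this suggestion substitutes the default only when the variable is unset or empty, so a user can still redirect the tuning output by exporting the variable beforehand (the paths below are illustrative):

```bash
unset NCCL_TUNER_TUNING_FILE
echo "${NCCL_TUNER_TUNING_FILE:-$SCRATCH/logs/tuning/0.csv}"  # prints the $SCRATCH fallback

export NCCL_TUNER_TUNING_FILE=/tmp/my_tuning.csv
echo "${NCCL_TUNER_TUNING_FILE:-$SCRATCH/logs/tuning/0.csv}"  # prints /tmp/my_tuning.csv
```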
```python
sourcesdir = None
image = variable(
    str,
    value=('docker://rocm/megatron-lm:v25.5_py312')
```
This test uses container image version 'v25.5_py312' while the similar test in pytorch_megatronlm_amd.py uses 'v25.6_py312'. Consider using the newer version for consistency and to ensure you have the latest fixes and features, unless there's a specific reason to use the older version.
Suggested change:
```diff
-value=('docker://rocm/megatron-lm:v25.5_py312')
+value=('docker://rocm/megatron-lm:v25.6_py312')
```
```python
f'--context-parallel-size ',
f'{model_config["context_parallel_size"]}',
```
The context-parallel-size flag and its value are incorrectly split across two separate list items. Line 261 contains '--context-parallel-size ' (with trailing space) and line 262 contains the value. This will result in two separate command-line arguments instead of one flag with its value. These two lines should be combined into a single f-string like the other configuration flags.
Suggested change:
```diff
-f'--context-parallel-size ',
-f'{model_config["context_parallel_size"]}',
+f'--context-parallel-size {model_config["context_parallel_size"]}',
```
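A small illustration of why the split form can break, assuming each list item eventually becomes its own argv entry (how the test joins the list is not shown in this diff):

```python
model_config = {'context_parallel_size': 2}  # illustrative value

# Split form: two list items; note the trailing space inside the first token.
split_form = [f'--context-parallel-size ', f'{model_config["context_parallel_size"]}']
print(split_form)     # ['--context-parallel-size ', '2'] -> argparse won't match the flag

# Combined form: a single 'flag value' string, as the other options use.
combined_form = [f'--context-parallel-size {model_config["context_parallel_size"]}']
print(combined_form)  # ['--context-parallel-size 2']
```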
Force-pushed from ab5b82e to 86fe793.
Some of the paths for the NCCL tuner config files were hardcoded by Timo. Is there a place where we put such config files for ReFrame tests? It would be cleaner to use that instead.
Also update the maintainers compared to the tischwab0911 version.
Force-pushed from 86fe793 to 182b111.
As suggested by @jgphpc, I've moved the rccl tuner config files to
```
cscs-ci run alps-beverin-uenv;MY_UENV=prgenv-gnu/25.07-6.3.3:v12
```
jgphpc left a comment:
I will add some minor changes (in another PR).
Add the optimized Megatron version from the hackathon; also update the HuggingFace cache to point to SCRATCH and update the maintainers compared to the tischwab0911 version.
To run the tests:
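The description stops before the command itself; a plausible invocation, assuming the standard cscs-reframe layout (the ReFrame config path and flags below are assumptions, only the cscs-ci line appears in this conversation):

```bash
# CI pipeline, as triggered in this conversation:
cscs-ci run alps-beverin-uenv;MY_UENV=prgenv-gnu/25.07-6.3.3:v12

# Direct ReFrame run of the new check (paths assumed):
reframe -C config/cscs.py \
    -c checks/apps/pytorch/pytorch_megatronlm_amd_optimized.py -r
```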