Commit ecac477

shjwudp, ko3n1g, tdene, dimapihtar, and youngeunkwon0405 authored
Merge remote-tracking branch 'nvidia/main' (#7)
* ci: Move test optimizer into its own bucket (NVIDIA#1909) Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Use matrix for approval-bot Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Update function name Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Adjust approval-bot for copy-pr-bot Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Parametrize workflow Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Parametrize workflow Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Remove attribute Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Update container image tag to use GitHub SHA
* chore: Remove file
* ci: Fix approval bot Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Configure cherrypick bot (NVIDIA#1925) Signed-off-by: oliver könig <okoenig@nvidia.com>
* Ci approve dev (NVIDIA#1933) Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Update nightly schedule (NVIDIA#1934) Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Bump pre-flight for runs on main/dev (NVIDIA#1935) Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Allow skipping on main (NVIDIA#1936) Signed-off-by: oliver könig <okoenig@nvidia.com>
* Ko3n1g/ci/pr template community bot (NVIDIA#1937)
* ci: More granular unit tests buckets (NVIDIA#1932) Signed-off-by: oliver könig <okoenig@nvidia.com>
* Add sequence packing to RL (NVIDIA#1911) Add sequence packing to RL
* chore: Update template (NVIDIA#1939) Signed-off-by: oliver könig <okoenig@nvidia.com>
* chore: Add description about who can merge (NVIDIA#1940) Signed-off-by: oliver könig <okoenig@nvidia.com>
* Ko3n1g/ci/fix main on eos (NVIDIA#1938) Signed-off-by: oliver könig <okoenig@nvidia.com>
* Ko3n1g/ci/internal mrs (NVIDIA#1942) Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Fix branch of approval bot (NVIDIA#1944) Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Approvalbot for other branches (NVIDIA#1947) Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci(fix): Approval bot (NVIDIA#1949) Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci(fix): Approval gate Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Approval gate rule Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Update golden values nightly Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Approval gate Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Approval bot Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Sync branches Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Smaller image Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Better output Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: sync branches Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Fix sync bot Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Finalize Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Finalize Signed-off-by: oliver könig <okoenig@nvidia.com>
* Ko3n1g/ci/sync branches (NVIDIA#1956) Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Increase time limit for main tests Signed-off-by: oliver könig <okoenig@nvidia.com>
* Ko3n1g/ci/add milestone (NVIDIA#1951) Signed-off-by: oliver könig <okoenig@nvidia.com>
* Remove M-FSDP testing under LTS environment (NVIDIA#1959)
* ci: Run on push to release branch (NVIDIA#1960) Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Add golden values for inference Signed-off-by: oliver könig <okoenig@nvidia.com>
* Fix typo in rl section of CODEOWNERS (NVIDIA#1968)
* ci: Update copyright checker (NVIDIA#1973) Signed-off-by: oliver könig <okoenig@nvidia.com>
* Ko3n1g/ci/auto reminder GitHub (NVIDIA#1955) Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Update secret Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci(fix): `Run tests` label (NVIDIA#1970) Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci(hotfix): Disable tests again Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci(hotfix): Add merge-group to copyright check Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci(hotfix): Copyright check on merge-queue Signed-off-by: oliver könig <okoenig@nvidia.com>
* zarr soft deprecation (NVIDIA#2004) Signed-off-by: dimapihtar <dpihtar@gmail.com> Co-authored-by: oliver könig <okoenig@nvidia.com>
* Make `get_asyncio_loop` safe to use repeatedly (NVIDIA#1990) Co-authored-by: oliver könig <okoenig@nvidia.com>
* Update symmetric registration interface to sync-up with upstream pytorch change (NVIDIA#1924) Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com> Signed-off-by: Youngeun <kyeg9404@gmail.com> Co-authored-by: oliver könig <okoenig@nvidia.com>
* chore: Update codeowners (NVIDIA#2012) Signed-off-by: oliver könig <okoenig@nvidia.com>
* Deduplicate dynamic engine + coordinator. (NVIDIA#1981) Co-authored-by: Mcore Bot <mcore-bot@nvidia.com> Co-authored-by: oliver könig <okoenig@nvidia.com>
* Safely access state dict args in load ckpt (NVIDIA#1957) Signed-off-by: Maanu Grover <maanug@nvidia.com>
* Allow mixed-batch sampling in dynamic inference (NVIDIA#1927)
* Stop Nemo_CICD_Test from failing in forks (NVIDIA#2024)
* Clean up dynamic inference step (NVIDIA#1992) Co-authored-by: Lawrence McAfee <lmcafee@nvidia.com> Co-authored-by: oliver könig <okoenig@nvidia.com>
* ci: Auto-update copy-pr-bot vetters (NVIDIA#1850) Signed-off-by: oliver könig <okoenig@nvidia.com> Co-authored-by: AJ Schmidt <ajschmidt8@users.noreply.github.com>
* Have datasets account for tokenizers which incorrectly define PAD (NVIDIA#2017)
* ci: Enable integration tests (NVIDIA#2023) Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci: Fix build-push-wheel workflow (NVIDIA#2022) Signed-off-by: oliver könig <okoenig@nvidia.com>
* chore: Update tooling for interactive jobs (NVIDIA#2032) Signed-off-by: oliver könig <okoenig@nvidia.com>
* revert(hotfix): ci: trustees_override (NVIDIA#2041) Signed-off-by: oliver könig <okoenig@nvidia.com>
* add missing warnings import in model parallel config (NVIDIA#2039) Signed-off-by: ykarnati <ykarnati@nvidia.com>
* Reduce-scatter implementation with FP32 accumulation (NVIDIA#1967) Signed-off-by: Deepak Narayanan <dnarayanan@nvidia.com>
* ci(fix): Workflows on `main` (NVIDIA#2045) Signed-off-by: oliver könig <okoenig@nvidia.com>
* build: Bump modelopt (NVIDIA#2046) Signed-off-by: oliver könig <okoenig@nvidia.com>
* Remove TestCaptureFreezeGC unit test. (NVIDIA#1978)
* ci: Add multi-approval action (NVIDIA#2051) Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci(hotfix): Repair codeowners file
* ci(hotfix): Set docs allowed to fail Signed-off-by: oliver könig <okoenig@nvidia.com>
* Ko3n1g/ci/test iteration time (NVIDIA#2067) Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci(hotfix): Remove performance for ckpt-resume Signed-off-by: oliver könig <okoenig@nvidia.com>
* Allow inference test throughput to vary by 10% (NVIDIA#2070)
* ci(hotfix): Inference test pipeline Signed-off-by: oliver könig <okoenig@nvidia.com>
* chore: Fix autoformatter (NVIDIA#2073) Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci(hotfix): Remove iteration-time from t5 Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci(hotfix): disable inference test Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci(hotfix): Disable inference test Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci(hotfix): Bypass approvalbot in merge-queue (NVIDIA#2082) Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci(hotfix): Enable merge-group for approval bot Signed-off-by: oliver könig <okoenig@nvidia.com>
* chore: Update local tooling (NVIDIA#2066) Signed-off-by: oliver könig <okoenig@nvidia.com>
* Add extra RL files (NVIDIA#2077) Co-authored-by: Robert Kirby <rkirby@nvidia.com> Co-authored-by: oliver könig <okoenig@nvidia.com>
* Prevent summary jobs from running in forks (NVIDIA#2083) Co-authored-by: oliver könig <okoenig@nvidia.com>
* ci: Fix test scope (NVIDIA#2091) Signed-off-by: oliver könig <okoenig@nvidia.com>
* ci(hotfix): Remove publish workflows Signed-off-by: oliver könig <okoenig@nvidia.com>
* Refactor the attention metadata into separate classes (NVIDIA#2001) Co-authored-by: Siddharth Singh <136645615+sidsingh-nvidia@users.noreply.github.com> Co-authored-by: oliver könig <okoenig@nvidia.com>
* Guard against incorrectly using MoE prefill graphs (NVIDIA#2030) Co-authored-by: oliver könig <okoenig@nvidia.com>
* Revert "Refactor the attention metadata into separate classes (NVIDIA#2001)" This reverts commit a652e2c.
* Run mr-slim tests in lightweight-mode (NVIDIA#2106) Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Inference | Lazy compile UVM allocator. (NVIDIA#1977) Co-authored-by: Mcore Bot <mcore-bot@nvidia.com> Co-authored-by: oliver könig <okoenig@nvidia.com>
* chore: Reenable trustees (NVIDIA#2108) Signed-off-by: oliver könig <okoenig@nvidia.com>
* Revert "Inference | Lazy compile UVM allocator. (NVIDIA#1977)" This reverts commit 7487c53.
* ci(fix): Changeset of copyright checker (NVIDIA#2110) Signed-off-by: oliver könig <okoenig@nvidia.com>
* Ko3n1g/chore/update release settings (NVIDIA#2097) Signed-off-by: oliver könig <okoenig@nvidia.com>
* Remove unnecessary check on rotary_pos_cos (NVIDIA#2003) Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
* (Reverted) Inference | Lazy compile UVM allocator. (NVIDIA#2125) Co-authored-by: Mcore Bot <mcore-bot@nvidia.com> Co-authored-by: oliver könig <okoenig@nvidia.com>
* Refactor Attention Metadata to Separate Classes (NVIDIA#2112) Co-authored-by: Siddharth Singh <136645615+sidsingh-nvidia@users.noreply.github.com> Co-authored-by: oliver könig <okoenig@nvidia.com>
* Refactor model_provider to model_builder format for ModelOpt examples (NVIDIA#2107)
* wandb Inference stats logging (NVIDIA#2026) Co-authored-by: root <root@gpu-h100-0058.cm.cluster> Co-authored-by: William Dykas <wdykas@cw-pdx-cs-001-vscode-02.cm.cluster> Co-authored-by: root <root@gpu-h100-0220.cm.cluster>
* Make `PipelineParallelLayout` always return str from ` __repr__` (NVIDIA#2055) Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com> Co-authored-by: oliver könig <okoenig@nvidia.com>
* Add flash_attn_3 as first option for FA3 import (NVIDIA#2010) Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com> Co-authored-by: oliver könig <okoenig@nvidia.com> Co-authored-by: Teodor-Dumitru Ene <34819528+tdene@users.noreply.github.com>
* Add debugging hint for case when cudagraphs are created but no matching runner is found (NVIDIA#2129)
* ci: LTS container (NVIDIA#2133) Signed-off-by: oliver könig <okoenig@nvidia.com>
* Revert "ci: LTS container (NVIDIA#2133)" This reverts commit eb48e81.
* Fix param init (NVIDIA#2033) Signed-off-by: Chen Cui <chcui@nvidia.com>
* Hotfix to unit tests on hopper FA3 (NVIDIA#2143)
* Add BytesIO to safe_globals (NVIDIA#2074)
* add deprecation warning for legacy tokenizer system (NVIDIA#2145) Signed-off-by: dimapihtar <dpihtar@gmail.com>
* replay: ci: Bump LTS container (NVIDIA#2157) Signed-off-by: oliver könig <okoenig@nvidia.com>
* Hotfix to unit tests on hopper FA3 (bis) (NVIDIA#2179)
* Fix has_modelopt_state() for native Torch checkpoint format (NVIDIA#2160) Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* chore: Remove codeowners (NVIDIA#2175) Signed-off-by: oliver könig <okoenig@nvidia.com>
* Fix FP8 inference with sequence parallelism (NVIDIA#2009) Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com> Co-authored-by: Teodor-Dumitru Ene <34819528+tdene@users.noreply.github.com>
* Replace ModelOpt generation server (NVIDIA#2147) Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Add hybrid model support for dynamic inference engine (NVIDIA#1907) Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com> Co-authored-by: Teodor-Dumitru Ene <tene@nvidia.com>
* Async task and event loop safety in Megatron Core (NVIDIA#2025) Co-authored-by: Robert Kirby <ArEsKay3@users.noreply.github.com>
* Rename skip_prompt_log_probs (NVIDIA#2181)
* Dynamic inference context | UVM only. (NVIDIA#1983) Co-authored-by: Robert Kirby <rkirby@nvidia.com> Co-authored-by: Mcore Bot <mcore-bot@nvidia.com> Co-authored-by: Teodor-Dumitru Ene <tene@nvidia.com>
* Update copy-pr-bot.yaml [skip ci] Signed-off-by: oliver könig <okoenig@nvidia.com>
* Revert "Dynamic inference context | UVM only. (NVIDIA#1983)" This reverts commit d6979d6.
* ci: Run `auto-update-copy-pr-bot` only on forks (NVIDIA#2191) Signed-off-by: oliver könig <okoenig@nvidia.com>
* Inference throughput tests: refactor goldens to be in list format (NVIDIA#2072)
* Enable TE custom quantization recipe (NVIDIA#2005) Signed-off-by: Evgeny <etsykunov@nvidia.com> Signed-off-by: root <Evgeny> Co-authored-by: oliver könig <okoenig@nvidia.com> Co-authored-by: root <Evgeny>
* Remove redundant logits calculations in gpt_model

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: dimapihtar <dpihtar@gmail.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: Youngeun <kyeg9404@gmail.com>
Signed-off-by: Maanu Grover <maanug@nvidia.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
Signed-off-by: Deepak Narayanan <dnarayanan@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
Signed-off-by: Evgeny <etsykunov@nvidia.com>
Signed-off-by: root <Evgeny>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Teodor-Dumitru Ene <34819528+tdene@users.noreply.github.com>
Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>
Co-authored-by: Youngeun Kwon <youngeunk@nvidia.com>
Co-authored-by: Lawrence McAfee <85179052+lmcafee-nvidia@users.noreply.github.com>
Co-authored-by: Mcore Bot <mcore-bot@nvidia.com>
Co-authored-by: Maanu Grover <109391026+maanug-nv@users.noreply.github.com>
Co-authored-by: Lawrence McAfee <lmcafee@nvidia.com>
Co-authored-by: AJ Schmidt <ajschmidt8@users.noreply.github.com>
Co-authored-by: Yashaswi Karnati <144376261+yashaswikarnati@users.noreply.github.com>
Co-authored-by: Deepak Narayanan <2724038+deepakn94@users.noreply.github.com>
Co-authored-by: helen ngo <helenn@nvidia.com>
Co-authored-by: Robert Kirby <rkirby@nvidia.com>
Co-authored-by: kanz-nv <kanz@nvidia.com>
Co-authored-by: Siddharth Singh <136645615+sidsingh-nvidia@users.noreply.github.com>
Co-authored-by: Charlie Truong <chtruong@nvidia.com>
Co-authored-by: Keshav Santhanam <ksanthanam@nvidia.com>
Co-authored-by: Asha Anoosheh <aanoosheh@nvidia.com>
Co-authored-by: wdykas <73254672+wdykas@users.noreply.github.com>
Co-authored-by: root <root@gpu-h100-0058.cm.cluster>
Co-authored-by: William Dykas <wdykas@cw-pdx-cs-001-vscode-02.cm.cluster>
Co-authored-by: root <root@gpu-h100-0220.cm.cluster>
Co-authored-by: Ananth Subramaniam <ansubramania@nvidia.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: Teodor-Dumitru Ene <tene@nvidia.com>
Co-authored-by: Robert Kirby <ArEsKay3@users.noreply.github.com>
Co-authored-by: Evgeny Tsykunov <e.tsykunov@gmail.com>
1 parent 052780e commit ecac477

573 files changed, +57090 -111590 lines changed

.github/CODEOWNERS

Lines changed: 12 additions & 5 deletions
@@ -1,31 +1,38 @@
- megatron/core @NVIDIA/core-adlr @NVIDIA/core-nemo
+ megatron/core/ @NVIDIA/core-adlr @NVIDIA/core-nemo
 
  megatron/core/models/gpt/ @NVIDIA/gpt
 
  megatron/core/models/multimodal/ @NVIDIA/multi-modal
 
  megatron/core/models/mamba/ @NVIDIA/hybrid-mamba
 
+ megatron/core/datasets/ @NVIDIA/datasets
+
+ megatron/core/distributed/fsdp/ @NVIDIA/megatron-fsdp
+
+ megatron/core/transformer/fsdp_dtensor_checkpoint.py @NVIDIA/megatron-fsdp
+
  megatron/core/dist_checkpointing/ @NVIDIA/dist-checkpointing
 
  megatron/core/optimizer/distrib_optimizer/ @NVIDIA/dist-optimizer
 
  megatron/core/inference/modelopt_support @NVIDIA/quantization-and-inference
 
- # megatron/core/datasets/ @NVIDIA/datasets
+ megatron/core/datasets/ @NVIDIA/datasets
 
  megatron/core/pipeline_parallel/ @NVIDIA/pipeline-parallelism
 
  megatron/core/transformer/ @NVIDIA/core-adlr @NVIDIA/core-nemo
 
  megatron/core/transformer/moe/ @NVIDIA/core-adlr @NVIDIA/core-devtech
 
- # megatron/core/inference/ @NVIDIA/inference
+ megatron/core/inference/ @NVIDIA/inference
 
  megatron/core/parallel_state.py @NVIDIA/core-nemo
 
  megatron/core/post_training/ @NVIDIA/post-training
- megatron/post_training @NVIDIA/post-training
+
+ megatron/post_training/ @NVIDIA/post-training
 
  .gitlab/ @NVIDIA/ci
  .github/ @NVIDIA/ci
@@ -44,4 +51,4 @@ tests/unit_tests/ @NVIDIA/ci
  megatron/rl/ @NVIDIA/reinforcement-learning
  examples/rl/ @NVIDIA/reinforcement-learning
  test/unit_tests/test_rl_utils.py @NVIDIA/reinforcement-learning
- train_rl.py @NVIDIA/reinforcement-learning
+ train_rl.py @NVIDIA/reinforcement-learning
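
Because several rules in this file overlap (for example `megatron/core/` and `megatron/core/transformer/moe/`), rule order decides who is asked to review. GitHub documents that the last matching CODEOWNERS pattern takes precedence. The following is a minimal Python sketch of that lookup, not GitHub's implementation: `matches` and `owners_for` are hypothetical helpers, the rule list is a subset of the new file in its original order, and only plain path prefixes are handled (real CODEOWNERS patterns follow gitignore-style globbing).

```python
# Simplified CODEOWNERS lookup: the LAST matching rule wins.
# Assumption: patterns are plain paths (no globs), which covers the rules
# in this diff but not the full gitignore-style pattern syntax.

RULES = [
    ("megatron/core/", ["@NVIDIA/core-adlr", "@NVIDIA/core-nemo"]),
    ("megatron/core/models/gpt/", ["@NVIDIA/gpt"]),
    ("megatron/core/transformer/fsdp_dtensor_checkpoint.py", ["@NVIDIA/megatron-fsdp"]),
    ("megatron/core/datasets/", ["@NVIDIA/datasets"]),
    ("megatron/core/transformer/", ["@NVIDIA/core-adlr", "@NVIDIA/core-nemo"]),
    ("megatron/core/transformer/moe/", ["@NVIDIA/core-adlr", "@NVIDIA/core-devtech"]),
    ("megatron/core/inference/", ["@NVIDIA/inference"]),
    ("megatron/core/parallel_state.py", ["@NVIDIA/core-nemo"]),
]


def matches(pattern, path):
    """Return True if `path` falls under `pattern` (prefix semantics only)."""
    if pattern.endswith("/"):
        return path.startswith(pattern)
    # A pattern without a trailing slash may name a file or a directory.
    return path == pattern or path.startswith(pattern + "/")


def owners_for(path, rules=RULES):
    """Walk the rules top to bottom and keep the owners of the last match."""
    owners = []
    for pattern, rule_owners in rules:
        if matches(pattern, path):
            owners = rule_owners
    return owners


if __name__ == "__main__":
    for p in ("megatron/core/transformer/moe/router.py",
              "megatron/core/transformer/fsdp_dtensor_checkpoint.py",
              "README.md"):
        print(p, "->", owners_for(p))
```

Under this simplified model, a file-specific rule that is listed before a broader directory rule (such as the `fsdp_dtensor_checkpoint.py` entry, which precedes `megatron/core/transformer/`) is overridden by the later rule, so placement within the file matters as much as the pattern itself.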

.github/actions/action.yml

Lines changed: 18 additions & 7 deletions
@@ -78,19 +78,19 @@ runs:
  export PYTHONPATH=$(pwd)
  export NEMORUN_HOME=$(pwd)
  pip install --no-cache-dir uv
- uv sync --only-group test
+ uv sync --only-group test
  uv run python tests/test_utils/python_scripts/launch_nemo_run_workload.py \
    --scope unit-tests \
    --model unit-tests \
-   --test-case '${{ inputs.test_case }}' \
+   --test-case "${{ inputs.test_case }}" \
    --environment dev \
    --platform dgx_h100 \
    --tag ${{ inputs.tag }} \
    --container-image ${{ inputs.container-image }}
 
  RUN_TEST_EOF
  )
- echo "$cmd" | tee "job.sh"
+ echo "$cmd" | tee "job.sh"
  echo "::endgroup::"
 
  - name: Get PR info
@@ -125,23 +125,34 @@ runs:
  #!/bin/bash
  set -euxo pipefail
 
+ if [ "${{ steps.has-run-tests-label.outputs.main }}" == "true" ]; then
+   ARGS=(
+     --scope mr-github
+     --enable-lightweight-mode
+   )
+ else
+   ARGS=(
+     --scope mr-slim
+     --enable-lightweight-mode
+   )
+ fi
+
  export PYTHONPATH=$(pwd)
  export NEMORUN_HOME=$(pwd)
  pip install --no-cache-dir uv
- uv sync --only-group test
+ uv sync --only-group test
  uv run python tests/test_utils/python_scripts/launch_nemo_run_workload.py \
-   --scope mr \
+   ${ARGS[@]} \
    --model ${{ inputs.model }} \
    --test-case ${{ inputs.test_case }} \
    --environment dev \
    --platform dgx_h100 \
    --container-image ${{ inputs.container-image }} \
    --data-dir /mnt/datadrive/TestData/megatron-lm/artifacts \
-   --enable-lightweight-mode
 
  RUN_TEST_EOF
  )
- echo "$cmd" | tee "job.sh"
+ echo "$cmd" | tee "job.sh"
  echo "::endgroup::"
 
  - name: Set timeout
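
The second hunk replaces the fixed `--scope mr` with a scope chosen from the `has-run-tests-label` step output: `mr-github` when the output is `"true"`, `mr-slim` otherwise, with `--enable-lightweight-mode` passed in both cases. A minimal Python sketch of the same decision follows. `build_launch_command` is a hypothetical helper, not part of the repository; it assumes the step output reflects whether the PR carries the run-tests label, and the argument values in the usage lines are placeholders. The flags themselves are copied from the diff.

```python
from __future__ import annotations

import shlex

LAUNCHER = "tests/test_utils/python_scripts/launch_nemo_run_workload.py"


def build_launch_command(
    has_run_tests_label: bool,
    model: str,
    test_case: str,
    container_image: str,
) -> list[str]:
    """Hypothetical helper mirroring the ARGS selection shown in the diff."""
    # Labelled PRs get the `mr-github` scope; all others get `mr-slim`.
    scope = "mr-github" if has_run_tests_label else "mr-slim"
    return [
        "uv", "run", "python", LAUNCHER,
        "--scope", scope,
        "--enable-lightweight-mode",  # present in both branches of the diff
        "--model", model,
        "--test-case", test_case,
        "--environment", "dev",
        "--platform", "dgx_h100",
        "--container-image", container_image,
        "--data-dir", "/mnt/datadrive/TestData/megatron-lm/artifacts",
    ]


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    cmd = build_launch_command(False, "gpt", "example_case", "example/image:tag")
    # shlex.join quotes every argument, so values containing spaces stay intact.
    print(shlex.join(cmd))
```

Building the command as a list and quoting at the end sidesteps the word-splitting concern that the first hunk addresses by switching `--test-case` from single to double quotes; in shell, the equivalent precaution for the array would be the quoted form `"${ARGS[@]}"`.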

.github/copy-pr-bot.yaml

Lines changed: 1 addition & 0 deletions
@@ -1,3 +1,4 @@
  enabled: true
  auto_sync_draft: false
  auto_sync_ready: true
+ trustees_override: ["AAnoosheh", "ArEsKay3", "Autumn1998", "BestJuly", "BoxiangW", "ChenhanYu", "FDecaYed", "HaochenYuan", "ISEEKYAN", "JRD971000", "QiZhangNV", "ShriyaRishab", "Victarry", "Wohox", "ZhiyuLi-Nvidia", "aklife97", "ananthsub", "asolergi-nv", "buptzyb", "chtruong814", "cspades", "cuichenx", "deepakn94", "dimapihtar", "duncanriach", "erhoo82", "ericharper", "fanshiqing", "gautham-kollu", "hxbai", "jaredcasper", "jiemingz", "jkamalu", "jon-barker", "kanz-nv", "kevalmorabia97", "ko3n1g", "kunlunl", "kvareddy", "layalir", "lhb8125", "lmcafee-nvidia", "maanug-nv", "mathemakitten", "matthieule", "mehraakash", "mkhona-nvidia", "pablo-garay", "parthmannan", "pthombre", "rogerwaleffe", "sanandaraj5597", "santhnm2", "sbak5", "shanmugamr1992", "shifangx", "shjwudp", "sidsingh-nvidia", "skyw", "tdene", "theothermike", "thomasdhc", "trintamaki", "tylerpoon", "wdykas", "xiaoyao0115", "xuwchen", "yanring", "yaox12", "yaoyu-33", "yashaswikarnati", "yobibyte", "youngeunkwon0405", "yuzhongw-nvidia", "zhongbozhu"]

.github/pull_request_template.md

Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
+ # What does this PR do ?
+ <!-- Add a one line overview of what this PR aims to accomplish. -->
+
+ :warning: For major changes (either in lines of code or in its impact), please make sure to first share discuss a design-doc with the team.
+
+ ## Contribution process
+
+ ```mermaid
+ flowchart LR
+     A[Pre-checks] --> B[PR Tests]
+     subgraph Code Review/Approval
+         C1[Expert Review] --> C2[Final Review]
+     end
+     B --> C1
+     C2 --> D[Merge]
+ ```
+
+ ### Pre-checks
+
+ - [ ] I want this PR in a versioned release and have added the appropriate Milestone (e.g., `Core 0.8`)
+ - [ ] I have added relevant unit tests
+ - [ ] I have added relevant functional tests
+ - [ ] I have added proper typing to my code [Typing guidelines](https://docs.python.org/3/library/typing.html)
+ - [ ] I have added relevant documentation
+ - [ ] I have run the [autoformatter.sh](https://github.com/NVIDIA/Megatron-LM/blob/main/tools/autoformat.sh) on my PR
+
+ ### Code review
+
+ The following process is enforced via the CODEOWNERS file for changes into `megatron/core`. For changes outside of `megatron/core`, it is up to the PR author whether or not to tag the Final Reviewer team.
+
+ <details>
+ <summary>For MRs into `main` branch</summary>
+
+ #### (Step 1): Add PR label `Expert Review`
+
+ #### (Step 2): Collect the expert reviewers reviews
+
+ 1. Attach the `Expert Review` label when your PR is ready for review.
+ 2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.
+
+ :warning: Only proceed to the next step once all reviewers have approved, merge-conflict are resolved and the CI is passing.
+ Final Review might get declined if these requirements are not fulfilled.
+
+ #### (Step 3): Final Review
+
+ 1. Add `Final Review` label
+ 2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.
+
+ #### (Optional Step 4): Cherry-pick into release branch
+
+ If this PR also needs to be merged into `core_r*` release branches, after this PR has been merged, select `Cherry-pick` to open a new PR into the release branch.
+
+ </details>
+
+ <details>
+ <summary>For MRs into `dev` branch</summary>
+ The proposed review process for `dev` branch is under active discussion.
+
+ MRs are mergable after one approval by either `eharper@nvidia.com` or `zijiey@nvidia.com`.
+ </details>
+
+ ### Merging your PR
+
+ Any member of [core-adlr](https://github.com/orgs/teams/NVIDIA/core-adlr) and [`core-nemo`](https://github.com/orgs/teams/NVIDIA/core-nemo) will be able to merge your PR.

Lines changed: 156 additions & 0 deletions
@@ -0,0 +1,156 @@
+ on:
+   workflow_call:
+     secrets:
+       TWINE_USERNAME:
+         required: true
+       TWINE_PASSWORD:
+         required: true
+
+ jobs:
+   build-and-test-wheels:
+     strategy:
+       fail-fast: false
+       matrix:
+         include:
+           - PACKAGE: megatron-core
+             PLATFORM: arm64
+             IMAGE: quay.io/pypa/manylinux_2_28_aarch64
+           - PACKAGE: megatron-core
+             PLATFORM: amd64
+             IMAGE: quay.io/pypa/manylinux_2_28_x86_64
+           - PACKAGE: megatron-fsdp
+             IMAGE: quay.io/pypa/manylinux_2_28_x86_64
+             PLATFORM: amd64
+     runs-on: ${{ matrix.PLATFORM == 'amd64' && 'ubuntu-22.04' || 'ubuntu-22.04-arm' }}
+     env:
+       PACKAGE: ${{ matrix.PACKAGE }}
+       IMAGE: ${{ matrix.IMAGE }}
+       PLATFORM: ${{ matrix.PLATFORM }}
+     steps:
+       - name: Checkout repository
+         uses: actions/checkout@v4
+
+       - name: Build wheel
+         id: build-wheel
+         run: |
+           set -x
+
+           PUBLISH_DRYRUN=yes
+
+           if [ "$PACKAGE" = "megatron-core" ]; then
+             ROOTDIR="megatron/core"
+             BUILD_DIR="."
+           elif [ "$PACKAGE" = "megatron-fsdp" ]; then
+             ROOTDIR="megatron/core/distributed/fsdp/src/megatron_fsdp"
+             BUILD_DIR="megatron/core/distributed/fsdp/src"
+           else
+             echo Unknown package: $PACKAGE
+             exit 1
+           fi
+
+           if [ "$PUBLISH_DRYRUN" = "yes" ]; then
+             PRE_RELEASE=$(sed -n "s/.*PRE_RELEASE = '\(.*\)'/\1/p" $ROOTDIR/package_info.py)
+             sed -i "/^PRE_RELEASE/c\PRE_RELEASE = '${PRE_RELEASE}.dev$((RANDOM % 900000 + 100000))'" $ROOTDIR/package_info.py
+           fi
+
+           pushd $BUILD_DIR
+           rm LICENSE || true
+           docker run --rm -v $(pwd):/workspace -w /workspace $IMAGE bash -c '\
+             for python_version in cp310 cp311 cp312 cp313; do \
+               /opt/python/${python_version}-${python_version}/bin/pip install --upgrade "setuptools>=80.0.0" build; \
+             done && \
+             for python_version in cp310 cp311 cp312 cp313; do \
+               /opt/python/${python_version}-${python_version}/bin/python -m build; \
+             done \
+           '
+
+           PLATFORM_WHEELS=$(find dist -name "*.whl" -not -name "*-none-any.whl")
+           if [ -n "$PLATFORM_WHEELS" ]; then
+             echo "Found platform wheels to repair: $PLATFORM_WHEELS"
+             docker run --rm -v $(pwd):/workspace -w /workspace $IMAGE auditwheel repair $PLATFORM_WHEELS
+             docker run --rm -v $(pwd):/workspace -w /workspace $IMAGE rm -rf dist/*.whl
+             docker run --rm -v $(pwd):/workspace -w /workspace $IMAGE cp -a wheelhouse/* dist/
+           fi
+           popd
+
+           pushd $ROOTDIR
+           EXPECTED_RELEASE_NUMBER=$(python -c "import package_info; print(package_info.__version__)")
+           popd
+
+           echo "expected-release-number=$EXPECTED_RELEASE_NUMBER" | tee -a "${GITHUB_OUTPUT}"
+
+           if [ "$PACKAGE" = "megatron-fsdp" ]; then
+             mkdir -p dist/
+             cp -a megatron/core/distributed/fsdp/src/dist/* dist/
+           fi
+
+           ls -al dist/
+
+       - name: Test wheels
+         run: |
+           ls -al dist/
+
+           if [ "$PACKAGE" = "megatron-core" ]; then
+             ROOTPATH="megatron.core"
+             WHEEL_PREFIX="megatron_core"
+           elif [ "$PACKAGE" = "megatron-fsdp" ]; then
+             ROOTPATH="megatron_fsdp"
+             WHEEL_PREFIX="megatron_fsdp"
+           else
+             echo Unknown package: $PACKAGE
+             exit 1
+           fi
+
+           if [ "$PACKAGE" = "megatron-core" ]; then
+             if [[ "$PLATFORM" == "arm64" ]]; then
+               for file in dist/$WHEEL_PREFIX*cp310*aarch64.whl; do
+                 pip install --no-cache-dir "$file"
+               done
+             else
+               for file in dist/$WHEEL_PREFIX*cp310*x86_64.whl; do
+                 pip install --no-cache-dir "$file"
+               done
+             fi
+           else
+             pip install --no-cache-dir dist/$WHEEL_PREFIX*.whl
+           fi
+
+           sudo rm -rf megatron/
+
+           RELEASE_NUMBER=$(python -c "import $ROOTPATH; print($ROOTPATH.__version__)")
+           test "${{ steps.build-wheel.outputs.expected-release-number }}" == "$RELEASE_NUMBER"
+
+       - name: Upload wheels
+         uses: actions/upload-artifact@v4
+         with:
+           name: wheels-${{ matrix.PACKAGE }}-${{ matrix.PLATFORM }}
+           path: dist/
+
+   publish-wheels:
+     needs: [build-and-test-wheels]
+     runs-on: ubuntu-latest
+     environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/heads/r')) && 'main' || 'public' }}
+     strategy:
+       fail-fast: false
+       matrix:
+         include:
+           - PACKAGE: megatron_core
+           - PACKAGE: megatron_fsdp
+     env:
+       PACKAGE: ${{ matrix.PACKAGE }}
+     steps:
+       - name: Download wheels
+         uses: actions/download-artifact@v4
+         with:
+           path: dist/
+           merge-multiple: true
+
+       - name: Publish wheels
+         env:
+           TWINE_USERNAME: ${{ secrets.TWINE_USERNAME }}
+           TWINE_PASSWORD: ${{ secrets.TWINE_PASSWORD }}
+           TWINE_REPOSITORY: ${{ (github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/heads/r')) && 'pypi' || 'testpypi' }}
+         run: |
+           ls -al dist/$PACKAGE*
+           pip install twine
+           twine upload -r $TWINE_REPOSITORY -u $TWINE_USERNAME -p $TWINE_PASSWORD dist/$PACKAGE*
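
Two pieces of logic in this workflow are easy to misread in shell and expression syntax: the dry-run version suffix appended to `PRE_RELEASE`, and the choice between `pypi` and `testpypi`. The Python sketch below restates both under stated assumptions. `add_dev_suffix` and `twine_repository` are hypothetical names, not functions from the repository. The sketch samples the full six-digit range for the suffix, which is an assumption about intent: Bash's `RANDOM` yields values in 0..32767, so the workflow's `RANDOM % 900000 + 100000` actually falls in 100000..132767. The lowercase comparison approximates GitHub Actions expressions, which, as far as I understand the documentation, compare strings case-insensitively.

```python
import random
import re


def add_dev_suffix(source, rng):
    """Append '.dev<6 digits>' to the PRE_RELEASE value in package_info text.

    Mirrors the two `sed` calls in the "Build wheel" step: read the current
    value, then rewrite every line that starts with PRE_RELEASE.
    """
    match = re.search(r"^PRE_RELEASE = '(.*)'", source, flags=re.MULTILINE)
    if match is None:
        return source  # nothing to rewrite
    # Assumption: a uniformly random six-digit number is what was intended.
    suffix = rng.randrange(100000, 1000000)
    new_line = f"PRE_RELEASE = '{match.group(1)}.dev{suffix}'"
    # A callable replacement avoids backslash-escape surprises in `new_line`.
    return re.sub(r"^PRE_RELEASE.*$", lambda _m: new_line, source,
                  flags=re.MULTILINE)


def twine_repository(ref):
    """Pick the upload target the way the TWINE_REPOSITORY expression does."""
    ref = ref.lower()  # approximate case-insensitive expression comparison
    if ref == "refs/heads/main" or ref.startswith("refs/heads/r"):
        return "pypi"
    return "testpypi"


if __name__ == "__main__":
    text = "MAJOR = 0\nPRE_RELEASE = 'rc0'\n"
    print(add_dev_suffix(text, random.Random()))
    for ref in ("refs/heads/main", "refs/heads/feature-x"):
        print(ref, "->", twine_repository(ref))
```

Note that the prefix test `refs/heads/r` matches any branch whose name begins with `r`, not only release branches; whether that breadth is intended is not stated in the commit.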
