5333 commits
94a3711
Merge branch 'helenn-ban-expandable-segments' into 'main'
jaredcasper Aug 13, 2025
13fd57a
ADLR/megatron-lm!3710 - MXFP8 DP AG overlap enablement
youngeunkwon0405 Aug 13, 2025
410222b
Merge branch 'mxfp8-dp-ag-overlap-mr' into 'main'
jaredcasper Aug 13, 2025
2f1027d
ci(hotfix): Disable broken tests
ko3n1g Aug 12, 2025
a5f0057
ci(hotfix): Catch WaitTimeExceeded
ko3n1g Aug 14, 2025
cde92b2
ADLR/megatron-lm!3406 - Update README
sbhavani Aug 14, 2025
2dd030e
Merge branch 'update-readme' into 'main'
ko3n1g Aug 14, 2025
d87bfd1
ADLR/megatron-lm!3808 - Move FullCudaGraphWrapper implementation to M…
vasunvidia Aug 15, 2025
91e2ee5
Merge branch 'vrengasamy/full_cuda_graph_core' into 'main'
chtruong814 Aug 15, 2025
2b6b46b
ADLR/megatron-lm!3631 - Fixes and updates for external cudagraph
buptzyb Aug 15, 2025
9545270
Merge branch 'robinz/external_cudagraph_update' into 'main'
chtruong814 Aug 15, 2025
16ad771
ADLR/megatron-lm!3799 - build: Bump TE
ko3n1g Aug 15, 2025
0d33682
Merge branch 'ko3n1g/build/te-2.6' into 'main'
ko3n1g Aug 15, 2025
82e5ff6
ADLR/megatron-lm!3606 - Debug distributed checkpoint for Transformer …
timmoon10 Aug 16, 2025
d1a8777
Merge branch 'tmoon/te-op-fuser-debug-checkpoint' into 'main'
ko3n1g Aug 16, 2025
7704169
ADLR/megatron-lm!3812 - Add argument to control collnet enablement
youngeunkwon0405 Aug 16, 2025
4819438
Merge branch 'ibsharp-knob' into 'main'
deepakn94 Aug 16, 2025
46eb0a3
ADLR/megatron-lm!3569 - Dynamic Backend Inference MLA
wdykas Aug 16, 2025
29a0607
Merge branch 'mla-flash' into 'main'
ko3n1g Aug 16, 2025
b6a7f40
ADLR/megatron-lm!3422 - Adding support for multiple validation sets
Aug 16, 2025
2e88416
Merge branch 'bnorick/multi-validation' into 'main'
ko3n1g Aug 16, 2025
1b07529
ADLR/megatron-lm!3718 - Fix log prob calculation, pipeline parallelis…
santhnm2 Aug 16, 2025
7a31b35
Merge branch 'dynamic_logprobs_fix' into 'main'
ko3n1g Aug 16, 2025
c86819f
ADLR/megatron-lm!3746 - Fix cuda graph with first/last layer bf16 by …
guyueh1 Aug 16, 2025
a100a3c
Merge branch 'fix_cuda_graph_with_first_last_layer_bf16' into 'main'
chtruong814 Aug 16, 2025
d7444d0
ADLR/megatron-lm!3827 - chore: Upgrade dependencies (2025-08-15)
Aug 16, 2025
2106bd6
Merge branch 'ci-bot/build/upgrade-dependencies-2025-08-15' into 'main'
ko3n1g Aug 16, 2025
254ef23
ADLR/megatron-lm!3829 - ci: Add copy-pr-bot
ko3n1g Aug 16, 2025
c769b67
Merge branch 'ko3n1g/ci/copy-pr-bot' into 'main'
ko3n1g Aug 16, 2025
69b65e0
ci(hotfix): Restart on `malloc(): unaligned tcache chunk detected`
ko3n1g Aug 17, 2025
6b62015
ADLR/megatron-lm!3378 - M4 p2p communication, schedules and finalize …
yashaswikarnati Aug 17, 2025
de512dc
Merge branch 'yash/p2p_class' into 'main'
ko3n1g Aug 17, 2025
c08d89b
ADLR/megatron-lm!3809 - feat(moe): Add MoE router fusion
Victarry Aug 18, 2025
d93743a
Merge branch 'denliu/router_fusoin' into 'main'
ko3n1g Aug 18, 2025
79d04be
chore: Version bump
Aug 18, 2025
d3df238
ADLR/megatron-lm!3814 - Apex.contrib.nccl_allocator migration
youngeunkwon0405 Aug 18, 2025
66a1dfc
Merge branch 'remove-apex-nccl-allocator' into 'main'
ko3n1g Aug 18, 2025
8e11c52
chore: Version bump 0.15.0rc0
ko3n1g Aug 18, 2025
551b734
ci: Fix segfaults (maybe)
ko3n1g Aug 17, 2025
f778f7b
ci: DEV tests from A100 to H100 cluster
ko3n1g Aug 15, 2025
781e765
ADLR/megatron-lm!3465 - perf(MoE): Support recomputation for FP8 laye…
hxbai Aug 20, 2025
6850cc6
Merge branch 'hongxiaob/save_original_input' into 'main'
ko3n1g Aug 20, 2025
8653fad
ADLR/megatron-lm!3757 - ZMQ based communication of requests during pa…
sidsingh-nvidia Aug 20, 2025
c47cf0a
Merge branch 'dynamic-inference-parallelism' into 'main'
ko3n1g Aug 20, 2025
8d9dbed
ADLR/megatron-lm!3815 - Add is_cg_capturable flag to CrossEntropyLoss…
katec846 Aug 22, 2025
4cd81e8
Merge branch 'add_is_cg_capturable_flag' into 'main'
jaredcasper Aug 22, 2025
af28b5a
ADLR/megatron-lm!3443 - [FSDP] Decouple Custom FSDP to make it indepe…
shjwudp Aug 22, 2025
4dd2f2b
Merge branch 'nvfsdp_convergence' into 'main'
ko3n1g Aug 22, 2025
237080b
ADLR/megatron-lm!3842 - This fixes the bug where not using full_valid…
Aug 22, 2025
4db3c78
Merge branch 'fix_full_validation_bug' into 'main'
jaredcasper Aug 22, 2025
7139518
ADLR/megatron-lm!3824 - Fix cuda graph when VPP is used
guyueh1 Aug 22, 2025
c6aab54
Merge branch 'fix_cuda_graph_with_vpp' into 'main'
chtruong814 Aug 22, 2025
eb0c03e
ADLR/megatron-lm!3834 - chore: Upgrade dependencies (2025-08-18)
Aug 22, 2025
09ca1d2
Merge branch 'ci-bot/build/upgrade-dependencies-2025-08-18' into 'main'
ko3n1g Aug 22, 2025
d6d094a
ADLR/megatron-lm!3600 - fix sync save utility
aartibasant Aug 22, 2025
bf341cb
Merge branch 'abasant/fix_sync_save_utility_v2' into 'main'
ananthsub Aug 22, 2025
7683298
ADLR/megatron-lm!3719 - Ability to abort persistent CP worker
aartibasant Aug 22, 2025
0913599
Merge branch 'abasant/ability_to_abort_persistent_worker' into 'main'
ananthsub Aug 22, 2025
4d5dc62
ADLR/megatron-lm!3432 - Enable VLM FP8
trintamaki Aug 22, 2025
d9270aa
Merge branch 'trintamaki/vlm_fp8' into 'main'
jaredcasper Aug 22, 2025
2e7e438
ADLR/megatron-lm!2452 - Make it an option to use TE activation functi…
guyueh1 Aug 23, 2025
6299ea7
Merge branch 'te_activation_func' into 'main'
ko3n1g Aug 23, 2025
0532e92
ADLR/megatron-lm!3678 - moonshotai/Kimi-K2-Instruct HF import, PTQ, a…
ChenhanYu Aug 23, 2025
d775d4a
Merge branch 'chenhany/kimi_k2' into 'main'
jaredcasper Aug 23, 2025
724580d
ADLR/megatron-lm!3861 - Remove FP8 calibration script
mathemakitten Aug 23, 2025
bebc0e4
Merge branch 'helenn-delete-calibration' into 'main'
jaredcasper Aug 23, 2025
ed1eaa9
ADLR/megatron-lm!3828 - Fix NEMO unittest where the weight is provide…
ChenhanYu Aug 23, 2025
312f300
Merge branch 'chenjiel/fix_nemo' into 'main'
jaredcasper Aug 23, 2025
c40a446
ADLR/megatron-lm!3864 - add wandb_entity
Aug 24, 2025
f6a675a
Merge branch 'entity' into 'main'
ericharper Aug 24, 2025
3d19693
ADLR/megatron-lm!3532 - Implement new optimizer checkpoint formats fo…
mikolajblaz Aug 24, 2025
3d784cb
Merge branch 'mblaz/dp_zero_model_space' into 'main'
ko3n1g Aug 24, 2025
1c29678
ADLR/megatron-lm!3876 - build: Bump packaging
ko3n1g Aug 24, 2025
f364164
Merge branch 'ko3n1g/ci/packaging' into 'main'
ko3n1g Aug 24, 2025
c7fd91a
ci(hotfix): Increase n_nondeterminism_attemps
ko3n1g Aug 24, 2025
4840669
chore: Version bump
Aug 25, 2025
f8f6e9b
ci(hotfix): Restart on zmq error
ko3n1g Aug 25, 2025
188435a
ci(hotfix): Increase non-determinism attempts
ko3n1g Aug 25, 2025
66c12ce
ADLR/megatron-lm!3887 - chore: Upgrade dependencies (2025-08-25)
Aug 26, 2025
77eaa9a
Merge branch 'ci-bot/build/upgrade-dependencies-2025-08-25' into 'main'
ko3n1g Aug 26, 2025
875ad2a
ADLR/megatron-lm!3875 - ci: Auto-publish megatron-fsdp
ko3n1g Aug 26, 2025
dcf7d36
Merge branch 'ko3n1g/ci/push-megatron-fsdp' into 'main'
ko3n1g Aug 26, 2025
a6c6250
ADLR/megatron-lm!3646 - [1/4] Merge Megatron-RL into LM
tdene Aug 26, 2025
16c0d28
Merge branch 'tdene/push-to-upstream-mr1' into 'main'
jaredcasper Aug 26, 2025
d6526b1
ADLR/megatron-lm!3884 - Fix unsetting NCCL_COLLNET_ENABLE in initiali…
chtruong814 Aug 26, 2025
8db4323
Merge branch 'chtruong/fix-unset-nccl-collnet' into 'main'
chtruong814 Aug 26, 2025
d7bf5aa
ADLR/megatron-lm!3074 - feat(MoE): Support Expert Parallel A2A Overla…
Aug 26, 2025
4b30ec5
Merge branch 'hongbinl/1f1b_overlap_new' into 'main'
ko3n1g Aug 26, 2025
799cee0
ADLR/megatron-lm!3782 - Move cuda graph capture to core
buptzyb Aug 26, 2025
b7a6f90
Merge branch 'robinz/cudagraph_core' into 'main'
ko3n1g Aug 26, 2025
37ee3d1
ADLR/megatron-lm!3764 - tests: Auto-validate weekly tests
ko3n1g Aug 26, 2025
d6301fb
Merge branch 'ko3n1g/tests/thresholds-weekly' into 'main'
ko3n1g Aug 26, 2025
5b2cb28
ci: No integration tests on merge-trains
ko3n1g Aug 27, 2025
7b8bbf2
ci: Allow interrupt on main
ko3n1g Aug 28, 2025
8efa2a0
ci(hotfix): Non-determinism only on EXIT_CODE=0
ko3n1g Aug 28, 2025
6740f5e
ADLR/megatron-lm!3885 - Add proper teardowns for cudagraphs tests to …
mathemakitten Aug 28, 2025
028f079
Merge branch 'helenn-patch-legacy-cudagraph-tests' into 'main'
ko3n1g Aug 28, 2025
4f6ab63
ADLR/megatron-lm!3897 - build: Loosen transformers pin
ko3n1g Aug 28, 2025
7aad147
Merge branch 'ko3n1g/build/transformers-pin' into 'main'
ko3n1g Aug 28, 2025
7ceafd9
ADLR/megatron-lm!3899 - fix: correct way to pass pipeline_rank in tes…
ZhiyuLi-Nvidia Aug 28, 2025
bdad881
Merge branch 'zhiyul/fix_vpp_ci_test' into 'main'
ko3n1g Aug 28, 2025
fcbde8a
ADLR/megatron-lm!3900 - Fix providers refactoring.
yobibyte Aug 28, 2025
d1e4fc6
Merge branch 'vitalyk/fix-textgen' into 'main'
deepakn94 Aug 28, 2025
b74396f
ADLR/megatron-lm!3877 - Fix FSDP distributed parameter weight shapes …
cspades Aug 28, 2025
1d0995d
Merge branch 'cye/fix-fsdp-dist-shape' into 'main'
ko3n1g Aug 28, 2025
500333c
ADLR/megatron-lm!3917 - chore: Add RL review group
ko3n1g Aug 29, 2025
17cb145
Merge branch 'ko3n1g/chore/add-rl-group' into 'main'
ko3n1g Aug 29, 2025
3ec579a
ADLR/megatron-lm!3843 - build: Upgrade TE to 2.7
ko3n1g Aug 29, 2025
1deafac
Merge branch 'ko3n1g/build/te-2.7' into 'main'
ko3n1g Aug 29, 2025
0e3d8ec
ci(hotfix): Restart attempts
ko3n1g Aug 29, 2025
c2527ba
ci(hotfix): Golden values
ko3n1g Aug 29, 2025
fa4d12c
ci(hotfix): yq path
ko3n1g Aug 29, 2025
256c855
ADLR/megatron-lm!3688 - Non-decode CUDA graphs for the dynamic infere…
sidsingh-nvidia Aug 30, 2025
d0d8a5c
Merge branch 'siddharth/non-decode-cg' into 'main'
ko3n1g Aug 30, 2025
180ebf0
ADLR/megatron-lm!3914 - Update README - Latest News
sbhavani Aug 30, 2025
d7ed78d
Merge branch 'main' into 'main'
ko3n1g Aug 30, 2025
28925b8
ADLR/megatron-lm!3928 - ci: Py312 wheels
ko3n1g Aug 30, 2025
a6237d0
Merge branch 'ko3n1g/ci/py312-support' into 'main'
ko3n1g Aug 30, 2025
19db79c
Revert "ADLR/megatron-lm!3688 - Non-decode CUDA graphs for the dynami…
ko3n1g Aug 30, 2025
6f178d2
ADLR/megatron-lm!3741 - Freeze GC around create_cudagraphs().
lmcafee-nvidia Aug 31, 2025
5c05330
Merge branch 'lmcafee/cuda-graph-gc' into 'main'
lmcafee-nvidia Aug 31, 2025
223533f
ADLR/megatron-lm!3555 - Dynamic inference functional tests | ShareGPT
lmcafee-nvidia Aug 31, 2025
f0edd1c
Merge branch 'lmcafee/dyn-inf-functional-tests' into 'main'
lmcafee-nvidia Aug 31, 2025
3271c42
chore: Version bump
Sep 1, 2025
726bc3f
ci(hotfix): Fix `Run tests` label
ko3n1g Sep 1, 2025
2c69af5
ADLR/megatron-lm!3933 - chore: Upgrade dependencies (2025-09-01)
Sep 1, 2025
f4fa7d6
Merge branch 'ci-bot/build/upgrade-dependencies-2025-09-01' into 'main'
ko3n1g Sep 1, 2025
39247c2
ADLR/megatron-lm!3936 - fix typos in megatron/training/arguments.py
sbhavani Sep 1, 2025
5cdac7d
Merge branch 'docs/fix-typos' into 'main'
ko3n1g Sep 1, 2025
6462307
ci(hotfix): Less false-positive restarts
ko3n1g Sep 2, 2025
4025494
chore: Version bump
Sep 2, 2025
d8714d9
chore: Version bump
ko3n1g Sep 2, 2025
84111cc
ADLR/megatron-lm!3942 - ci: Fix release tag
ko3n1g Sep 3, 2025
40af198
Merge branch 'ko3n1g/ci/fix-release' into 'main'
ko3n1g Sep 3, 2025
5cc85f3
ADLR/megatron-lm!3695 - Added double buffering to be configurable
sanandaraj5597 Sep 3, 2025
96f1e01
Merge branch 'optional_double_buffering' into 'main'
ko3n1g Sep 3, 2025
3075197
ADLR/megatron-lm!3869 - Implement fused MLP as subclass of unfused MLP
timmoon10 Sep 3, 2025
122324c
Merge branch 'tmoon/fused-mlp-subclass' into 'main'
ko3n1g Sep 3, 2025
972553d
ADLR/megatron-lm!3902 - Fix: use_decoupled_grad was set to the wrong …
yaoyu-33 Sep 3, 2025
e0b9fbb
Merge branch 'yuya/dist_fused_adam_fix' into 'main'
ko3n1g Sep 3, 2025
a8fee91
ADLR/megatron-lm!3784 - Use isolated RNG for sampling
santhnm2 Sep 3, 2025
0118b97
Merge branch 'pp_eval_debugging' into 'main'
ko3n1g Sep 3, 2025
708f565
ADLR/megatron-lm!3846 - ckpt loading safe strategy
dimapihtar Sep 3, 2025
86660c1
Merge branch 'ckpt_safe_load' into 'main'
ko3n1g Sep 3, 2025
a15a6d4
ADLR/megatron-lm!3893 - chore: update MSC docs and bump version
shunjiad Sep 3, 2025
e674a29
Merge branch 'fix-msc-docs' into 'main'
ko3n1g Sep 3, 2025
1cf07f3
Revert "ADLR/megatron-lm!3784 - Use isolated RNG for sampling"
ko3n1g Sep 3, 2025
1e9e94c
ADLR/megatron-lm!3881 - Megatron-FSDP NCCL symmetric register support
youngeunkwon0405 Sep 3, 2025
2c7e98a
Merge branch 'nvfsdp-symmetric' into 'main'
ericharper Sep 3, 2025
50502b9
ADLR/megatron-lm!3383 - Add KV duplication for TensorRTLLM export to …
meatybobby Sep 3, 2025
9cbd0d7
Merge branch 'bobchen/kv_repeat' into 'main'
ko3n1g Sep 3, 2025
7b49ca7
ADLR/megatron-lm!3652 - Remove `create_cudagraphs` + unify cudagraphs…
mathemakitten Sep 3, 2025
02a1dd0
Merge branch 'helenn-unify-graph-creation-and-recording' into 'main'
ko3n1g Sep 3, 2025
b6517c6
ADLR/megatron-lm!2997 - add new tokenizer system
dimapihtar Sep 4, 2025
cca55cc
Merge branch 'tokenizers' into 'main'
ko3n1g Sep 4, 2025
76622ed
ADLR/megatron-lm!3892 - M4: Merge `model_comm_pgs`, `grad_comm_pgs` a…
yaoyu-33 Sep 4, 2025
0889ce9
Merge branch 'yuya/m4_merge_into_pg_collection' into 'main'
ko3n1g Sep 4, 2025
1b432dd
fix(hotfix): Update golden values
ko3n1g Sep 4, 2025
49be51a
Revert "fix(hotfix): Update golden values"
ko3n1g Sep 4, 2025
3a51832
Revert "ADLR/megatron-lm!2997 - add new tokenizer system"
ko3n1g Sep 4, 2025
ca9797e
Revert "ADLR/megatron-lm!3892 - M4: Merge `model_comm_pgs`, `grad_com…
ko3n1g Sep 4, 2025
3e3c3c9
ci(hotfix): Retry on non-determinism
ko3n1g Sep 4, 2025
9be09c9
ci(hotfix): Disable mimo test
ko3n1g Sep 4, 2025
3a9a060
Replay of: !2997 - add new tokenizer system
ko3n1g Sep 4, 2025
ebc82b1
ADLR/megatron-lm!3953 - ci: Use Team tag for review reminders
ko3n1g Sep 4, 2025
63e8566
Merge branch 'ko3n1g/ci/team-reviews' into 'main'
ko3n1g Sep 4, 2025
0269b8a
ADLR/megatron-lm!3935 - docs: Add Installation Guide
ko3n1g Sep 4, 2025
dd7de1a
Merge branch 'ko3n1g/docs/installation-guide' into 'main'
ko3n1g Sep 4, 2025
f573042
ADLR/megatron-lm!3957 - ci: Auto-validate convergence tests
ko3n1g Sep 5, 2025
e000263
Merge branch 'ko3n1g/ci/auto-validate-release-tests-2' into 'main'
ko3n1g Sep 5, 2025
72d2354
ADLR/megatron-lm!3318 - feat(moe) Add support for global aux loss
Victarry Sep 5, 2025
e58d908
Merge branch 'denliu/global_aux_loss' into 'main'
chtruong814 Sep 5, 2025
641bf8b
ADLR/megatron-lm!3960 - Dynamic inference example | Remove shorten_st…
lmcafee-nvidia Sep 5, 2025
3616ce5
Merge branch 'lmcafee/dyn-inf-remove-shorten-str' into 'main'
lmcafee-nvidia Sep 5, 2025
ba3e244
ADLR/megatron-lm!3924 - Fix typos in tensor parallel layer documentation
sbhavani Sep 5, 2025
07b22a0
Merge branch 'docs/improve-tp-code-comments' into 'main'
deepakn94 Sep 5, 2025
c7a3aa4
ADLR/megatron-lm!3946 - fixing reverted MR 3688
sidsingh-nvidia Sep 6, 2025
230e0cb
Merge branch 'siddharth/non-decode-cg' into 'main'
chtruong814 Sep 6, 2025
a99f647
ADLR/megatron-lm!3943 - FP4 utils for nvfp4 recipe
Sep 6, 2025
020abf0
Merge branch 'fp4_recipe_revamp' into 'main'
chtruong814 Sep 6, 2025
696977f
ADLR/megatron-lm!3964 - Fix megatron-fsdp logging and remove extra co…
BoxiangW Sep 6, 2025
2108cd8
Merge branch 'boxiangw/megatron-fsdp-fix' into 'main'
chtruong814 Sep 6, 2025
85a8340
ADLR/megatron-lm!3867 - nvidia/NVIDIA-Nemotron-Nano-9B-v2-Base NVFP4 …
ChenhanYu Sep 6, 2025
55c1433
Merge branch 'chenhany/nmh-nano-9b-quantize-support' into 'main'
chtruong814 Sep 6, 2025
3230510
ADLR/megatron-lm!3880 - Fix is_first_microbatch not correctly set wit…
Sep 7, 2025
dc68d29
Merge branch 'w_cache_fix' into 'main'
chtruong814 Sep 7, 2025
8c1a3f5
ADLR/megatron-lm!3962 - Replay (!3892) M4: Merge `model_comm_pgs`, `g…
yaoyu-33 Sep 7, 2025
b615e73
Merge branch 'yuya/m4_merge_into_pg_collection' into 'main'
chtruong814 Sep 7, 2025
3a4b71e
ADLR/megatron-lm!3967 - Replay (!3784): Use isolated RNG for sampling
santhnm2 Sep 8, 2025
4af25fe
Merge branch 'pp_eval_debugging' into 'main'
chtruong814 Sep 8, 2025
ba97a7e
chore: Version bump
Sep 8, 2025
a949f69
ADLR/megatron-lm!3975 - Fix textgen server post-3646
mathemakitten Sep 8, 2025
294395c
Merge branch 'helenn-fix-mamba-textgen-server-post3646' into 'main'
jaredcasper Sep 8, 2025
dbc4129
ADLR/megatron-lm!3856 - Sink Attention [gpt-oss 3/5]
cuichenx Sep 8, 2025
5704c92
Merge branch 'chcui/sink-attention' into 'main'
thomasdhc Sep 8, 2025
3a9cff6
ADLR/megatron-lm!3648 - [2/4] Merge Megatron-RL into LM
tdene Sep 9, 2025
ffe2af1
Merge branch 'tdene/push-to-upstream-mr3' into 'main'
chtruong814 Sep 9, 2025
5b75141
ADLR/megatron-lm!3743 - Enable simplified checkpointing
mikolajblaz Sep 9, 2025
4370d3a
Merge branch 'mblaz/complex-ckpt-deprecation' into 'main'
jaredcasper Sep 9, 2025
270a0b4
ADLR/megatron-lm!3979 - Reenable mimo functional test
yashaswikarnati Sep 9, 2025
ecedef0
Merge branch 'yash/fix_mimo_tests' into 'main'
chtruong814 Sep 9, 2025
d44b513
ADLR/megatron-lm!3941 - [Fix] Change TE version to 2.8 when enabling …
Sep 9, 2025
efeb85b
Merge branch 'hongbinl/1f1b_overlap_fix' into 'main'
chtruong814 Sep 9, 2025
136b7f5
ADLR/megatron-lm!3923 - Only build and load data iterators on relevan…
deepakn94 Sep 9, 2025
aacc3b8
Merge branch 'dnarayanan/vpp_fixes' into 'main'
deepakn94 Sep 9, 2025
fb452ba
ADLR/megatron-lm!3804 - [Misc] Fix several warnings
yaox12 Sep 9, 2025
25b8af1
Merge branch 'xiny/fix_warning' into 'main'
thomasdhc Sep 9, 2025
2e29a5e
ADLR/megatron-lm!3857 - Add quick geglu activation for gpt-oss [gpt-o…
cuichenx Sep 9, 2025
de48755
Merge branch 'chcui/quick-geglu' into 'main'
chtruong814 Sep 9, 2025
1aef9f8
ADLR/megatron-lm!3961 - Protect against divide by 0 when all tokens m…
jstjohn Sep 9, 2025
83609d7
Merge branch 'jstjohn/clamp_num_tokens' into 'main'
chtruong814 Sep 9, 2025
746c913
ADLR/megatron-lm!3963 - SFT chat template and pad token changes for N…
ameyasm1154 Sep 9, 2025
1e0cb14
Merge branch 'ameyasunilm/nano-v2-12b-sft' into 'main'
deepakn94 Sep 9, 2025
4971290
ADLR/megatron-lm!3983 - Update Bert and T5 nightly golden values
chtruong814 Sep 10, 2025
bc663b1
Merge branch 'chtruong/bert-t5-nightly' into 'main'
chtruong814 Sep 10, 2025
90f39a3
ADLR/megatron-lm!3918 - Fix bug in param_norm computation where some …
deepakn94 Sep 10, 2025
5e8c9c4
Merge branch 'dnarayanan/param_norm_fixes' into 'main'
deepakn94 Sep 10, 2025
18420b6
ADLR/megatron-lm!3993 - Fix BERT + virtual pipeline parallelism
deepakn94 Sep 10, 2025
1dc7019
Merge branch 'dnarayanan/fix_bert_vpp' into 'main'
jaredcasper Sep 10, 2025
9a4002e
ADLR/megatron-lm!3620 - Dynamic inference functional tests | Cuda gra…
lmcafee-nvidia Sep 10, 2025
159a6a0
Merge branch 'lmcafee/dyn-inf-functional-tests-cuda-graph' into 'main'
lmcafee-nvidia Sep 10, 2025
2d60db7
ADLR/megatron-lm!3965 - Create inference graphs immediately but defer…
mathemakitten Sep 10, 2025
8ed1e2c
Merge branch 'helenn-create-graphs-immediately-inference-only' into '…
thomasdhc Sep 10, 2025
6023444
ADLR/megatron-lm!3995 - Set mimo_vlm and gpt_dynamic_inference tests …
chtruong814 Sep 11, 2025
1584dca
Merge branch 'chtruong/flaky-test' into 'main'
chtruong814 Sep 11, 2025
2fbece5
chore: Version bump
Sep 15, 2025
bdf57ae
ci(hotfix): Notify release
ko3n1g Sep 15, 2025
1c0eb4a
ci(hotfix): Publish
ko3n1g Sep 15, 2025
a43b818
ci(hotfix): Publish
ko3n1g Sep 15, 2025
1e6f75a
ADLR/megatron-lm!3998 - chore: Upgrade dependencies (2025-09-15)
Sep 15, 2025
c76ed86
Merge branch 'ci-bot/build/upgrade-dependencies-2025-09-15' into 'main'
ko3n1g Sep 15, 2025
ef5e03c
ADLR/megatron-lm!3938 - Fix typos in pretrain_mamba.py
sbhavani Sep 15, 2025
848c8c9
ADLR/megatron-lm!3805 - Dynamic inference engine | Events
lmcafee-nvidia Sep 15, 2025
84e9c3a
ADLR/megatron-lm!3323 - fix: prevent an integer overflow on numpy >= 2
rostan-t Sep 15, 2025
23d2ada
ADLR/megatron-lm!3987 - Add megatron.core.rerun_state_machine.RerunSt…
chtruong814 Sep 15, 2025
8399280
ADLR/megatron-lm!4000 - [3/4] Merge Megatron-RL into LM
tdene Sep 16, 2025
2ebb6ee
ADLR/megatron-lm!3931 - fix: Fix non TE optimizer ckpt issue
BoxiangW Sep 16, 2025
6c666b6
ADLR/megatron-lm!3911 - Support ModelOpt EAGLE refactorization and of…
yeyu-nvidia Sep 16, 2025
15a0d47
ADLR/megatron-lm!4011 - ci: Don't run legacy tests on release branch
ko3n1g Sep 16, 2025
a3f9e56
ADLR/megatron-lm!4003 - Fix gc.freeze() slowdown: Add a gc.collect(),…
mathemakitten Sep 16, 2025
5653514
ADLR/megatron-lm!3992 - [Flux] Remove Redundant Host & Device Sync In…
alpha0422 Sep 18, 2025
9f72f47
ADLR/megatron-lm!4018 - bugfix: Added support for FSDP grad accum fusion
sanandaraj5597 Sep 18, 2025
199113b
ADLR/megatron-lm!3879 - Add unit tests, and refactor fully_shard into…
cspades Sep 18, 2025
5a58976
ADLR/megatron-lm!4022 - Add ModelOpt pruning example
kevalmorabia97 Sep 18, 2025
74bec5b
ADLR/megatron-lm!3859 - Gradient comparison test
jstjohn Sep 18, 2025
93a0d8e
ADLR/megatron-lm!4033 - ci: Don't publish protected branches
ko3n1g Sep 19, 2025
fa5082f
ADLR/megatron-lm!4013 - Undo deletion of nemotron_h_aligned template
tdene Sep 19, 2025
c223178
ADLR/megatron-lm!3991 - Fix the print err when torch is not intialized
gdengk Sep 19, 2025
e5bc924
ADLR/megatron-lm!3855 - Enabling mixing SWA with full attention [gpt-…
cuichenx Sep 19, 2025
8479eb3
ADLR/megatron-lm!4019 - [Flux] Fix Full Iter CUDA Graph Issues
alpha0422 Sep 20, 2025
5 changes: 0 additions & 5 deletions .coveragerc

This file was deleted.

4 changes: 4 additions & 0 deletions .flake8
@@ -0,0 +1,4 @@
[flake8]
max-line-length = 100
extend-ignore = E203,E501,F401,E402,E714
per-file-ignores = __init__.py:F401
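A minimal sketch of how the four settings above can be read, assuming only Python's standard `configparser`. It does not reproduce flake8's own option loading; the function name `load_flake8_settings` and the embedded `FLAKE8_CONFIG` string are hypothetical and exist for illustration only. Since `E501` (line too long) appears in `extend-ignore`, the `max-line-length = 100` value presumably matters mainly to tools other than flake8's own length check.

```python
import configparser

# Copy of the .flake8 content added in this diff (embedded for illustration).
FLAKE8_CONFIG = """\
[flake8]
max-line-length = 100
extend-ignore = E203,E501,F401,E402,E714
per-file-ignores = __init__.py:F401
"""


def load_flake8_settings(text):
    """Parse the [flake8] section into plain Python values (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["flake8"]

    # Comma-separated error codes that are ignored everywhere.
    extend_ignore = [code.strip() for code in section["extend-ignore"].split(",")]

    # Whitespace-separated "pattern:CODE1,CODE2" entries.
    per_file_ignores = {}
    for entry in section["per-file-ignores"].split():
        pattern, codes = entry.split(":", 1)
        per_file_ignores[pattern] = codes.split(",")

    return {
        "max_line_length": section.getint("max-line-length"),
        "extend_ignore": extend_ignore,
        "per_file_ignores": per_file_ignores,
    }


settings = load_flake8_settings(FLAKE8_CONFIG)
print(settings)
```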
32 changes: 32 additions & 0 deletions .github/ISSUE_TEMPLATE/bug.md
@@ -0,0 +1,32 @@
---
name: BUG
about: Report a bug that needs attention
title: "[BUG]"
labels: ''
assignees: ''

---

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior. The easier it is to reproduce the faster it will get maintainer attention.

**Expected behavior**
A clear and concise description of what you expected to happen.

**Stack trace/logs**
If applicable, add the stack trace or logs from the time of the error.

**Environment (please complete the following information):**
- Megatron-LM commit ID
- PyTorch version
- CUDA version
- NCCL version

**Proposed fix**
If you have a proposal for how to fix the issue state it here or link to a PR.

**Additional context**
Add any other context about the problem here.
23 changes: 23 additions & 0 deletions .github/ISSUE_TEMPLATE/enhancement.md
@@ -0,0 +1,23 @@
---
name: ENHANCEMENT
about: Suggest an idea to improve this project
title: "[ENHANCEMENT]"
labels: ''
assignees: ''

---

**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Proposed implementation**
If you have a proposed implementation for the feature state it here or link to a PR.

**Additional context**
Add any other context or screenshots about the feature request here.
12 changes: 12 additions & 0 deletions .github/ISSUE_TEMPLATE/question.md
@@ -0,0 +1,12 @@
---
name: QUESTION
about: Ask a question about Megatron-LM that is not a bug, regression or enhancement
  request
title: "[QUESTION]"
labels: ''
assignees: ''

---

**Your question**
Ask a clear and concise question about Megatron-LM.
39 changes: 39 additions & 0 deletions .github/ISSUE_TEMPLATE/regression.md
@@ -0,0 +1,39 @@
---
name: REGRESSION
about: Report a regression in speed or accuracy due to a Megatron-LM update
title: "[REGRESSION]"
labels: ''
assignees: ''

---

**Describe the regression**
A clear and concise description of what the regression is.

**To Reproduce**
Steps to reproduce the behavior. The easier it is to reproduce the faster it will get maintainer attention.

**Previous performance**
What speed or accuracy did you previously see.

**New performance**
What speed or accuracy do you see after the update.

**Stack trace/logs**
If applicable, add the stack trace or logs related to the regression.

**Environment (please complete the following information):**
- Previous Megatron-LM commit ID
- New Megatron-LM commit ID
- Previous PyTorch version
- New PyTorch version
- Previous CUDA version
- New CUDA version
- Previous NCCL version
- New NCCL version

**Proposed fix**
If you have a proposal for how to fix the issue state it here or link to a PR.

**Additional context**
Add any other context about the problem here.
3 changes: 3 additions & 0 deletions .github/copy-pr-bot.yaml
@@ -0,0 +1,3 @@
enabled: true
auto_sync_draft: false
auto_sync_ready: true
8 changes: 8 additions & 0 deletions .github/workflows/close-inactive-issue-pr.yml
@@ -0,0 +1,8 @@
name: Stale-Close-Inactive-Issues-PRs
on:
  schedule:
    - cron: "30 1 * * *"

jobs:
  close-issues:
    uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/[email protected]
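The cron expression `"30 1 * * *"` in the workflow above has five fields (minute, hour, day of month, month, day of week), so it fires once a day at 01:30; GitHub Actions evaluates schedules in UTC. The following is a rough sketch of that reading in plain Python. The helper `next_daily_run` is hypothetical, handles only this simple "fixed minute and hour, every day" shape, and is not a general cron parser.

```python
from datetime import datetime, timedelta, timezone


def next_daily_run(cron, now):
    """Return the next firing time for a 'M H * * *' cron expression.

    Hypothetical helper: supports only a fixed minute and hour with
    wildcards in the remaining three fields.
    """
    minute, hour, day_of_month, month, day_of_week = cron.split()
    if (day_of_month, month, day_of_week) != ("*", "*", "*"):
        raise ValueError("only daily schedules are supported in this sketch")

    candidate = now.replace(
        hour=int(hour), minute=int(minute), second=0, microsecond=0
    )
    # If today's slot has already passed (or is exactly now), use tomorrow's.
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


now = datetime(2025, 9, 15, 12, 0, tzinfo=timezone.utc)
print(next_daily_run("30 1 * * *", now).isoformat())
```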
13 changes: 13 additions & 0 deletions .github/workflows/community-bot.yml
@@ -0,0 +1,13 @@
name: Community Bot

on:
  issues:
    types: [opened, edited, reopened, closed, deleted]
  issue_comment:
    types: [created, edited, deleted]

jobs:
  community-bot:
    uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/[email protected]
    secrets:
      GH_TOKEN: ${{ secrets.PAT }}
10 changes: 10 additions & 0 deletions .gitignore
@@ -6,3 +6,13 @@ build
*~
slurm*
logs
.vscode
local/
.gitmodules
wandb/
onelogger.log
onelogger.err
.venv
runs/
/test_cases/
**/dist/