Conversation

@tdene (Contributor) commented Oct 31, 2025

What does this PR do?

Add extra RL files

⚠️ For major changes (either in lines of code or in impact), please first share and discuss a design doc with the team.

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see the Typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers' reviews

  1. Attach the Expert Review label when your PR is ready for review.
  2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and CI is passing.
Final Review may be declined if these requirements are not fulfilled.
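
The gate before Final Review can be pictured as a simple predicate over the PR's state. The following is only a minimal sketch under stated assumptions: `PullRequestState`, its field names, and `ready_for_final_review` are hypothetical names invented for illustration and are not part of the repository's actual CI or bot tooling.

```python
from dataclasses import dataclass


@dataclass
class PullRequestState:
    """Hypothetical snapshot of a PR; all field names are illustrative."""

    approvals: dict[str, bool]  # expert reviewer handle -> has approved
    has_merge_conflicts: bool
    ci_passing: bool


def ready_for_final_review(pr: PullRequestState) -> bool:
    """Return True only when all three stated requirements hold."""
    # Assumption: a PR with no assigned expert reviewers is not yet ready.
    all_approved = bool(pr.approvals) and all(pr.approvals.values())
    return all_approved and not pr.has_merge_conflicts and pr.ci_passing


if __name__ == "__main__":
    pr = PullRequestState(
        approvals={"reviewer-a": True, "reviewer-b": False},
        has_merge_conflicts=False,
        ci_passing=True,
    )
    print(ready_for_final_review(pr))
```

In this sketch a single missing approval, an unresolved conflict, or a failing CI run each blocks the Final Review label on its own.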

(Step 3): Final Review

  1. Add Final Review label
  2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by either [email protected] or [email protected].

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

@tdene tdene requested a review from a team as a code owner October 31, 2025 20:41
@ko3n1g ko3n1g added this to the Core 0.16 milestone Oct 31, 2025

@ArEsKay3 (Contributor) left a comment

👍

copy-pr-bot (bot) commented Oct 31, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@tdene tdene enabled auto-merge October 31, 2025 21:13

@tdene (Contributor, Author) commented Oct 31, 2025

/ok to test

copy-pr-bot (bot) commented Oct 31, 2025

/ok to test

@tdene, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@tdene (Contributor, Author) commented Oct 31, 2025

/ok to test 0c14bc6

@ko3n1g ko3n1g added this pull request to the merge queue Nov 2, 2025
Merged via the queue into NVIDIA:main with commit dc7a0ca Nov 2, 2025
45 checks passed
Jianbing-D pushed a commit to Jianbing-D/Megatron-LM that referenced this pull request Nov 12, 2025
* ci: Move test optimizer into its own bucket (NVIDIA#1909)

Signed-off-by: oliver könig <[email protected]>

* ci: Use matrix for approval-bot

Signed-off-by: oliver könig <[email protected]>

* ci: Update function name

Signed-off-by: oliver könig <[email protected]>

* ci: Adjust approval-bot for copy-pr-bot

Signed-off-by: oliver könig <[email protected]>

* ci: Parametrize workflow

Signed-off-by: oliver könig <[email protected]>

* ci: Parametrize workflow

Signed-off-by: oliver könig <[email protected]>

* ci: Remove attribute

Signed-off-by: oliver könig <[email protected]>

* ci: Update container image tag to use GitHub SHA

* chore: Remove file

* ci: Fix approval bot

Signed-off-by: oliver könig <[email protected]>

* ci: Configure cherrypick bot (NVIDIA#1925)

Signed-off-by: oliver könig <[email protected]>

* Ci approve dev (NVIDIA#1933)

Signed-off-by: oliver könig <[email protected]>

* ci: Update nightly schedule (NVIDIA#1934)

Signed-off-by: oliver könig <[email protected]>

* ci: Bump pre-flight for runs on main/dev (NVIDIA#1935)

Signed-off-by: oliver könig <[email protected]>

* ci: Allow skipping on main (NVIDIA#1936)

Signed-off-by: oliver könig <[email protected]>

* Ko3n1g/ci/pr template community bot (NVIDIA#1937)

* ci: More granular unit tests buckets (NVIDIA#1932)

Signed-off-by: oliver könig <[email protected]>

* Add sequence packing to RL (NVIDIA#1911)

Add sequence packing to RL

* chore: Update template (NVIDIA#1939)

Signed-off-by: oliver könig <[email protected]>

* chore: Add description about who can merge (NVIDIA#1940)

Signed-off-by: oliver könig <[email protected]>

* Ko3n1g/ci/fix main on eos (NVIDIA#1938)

Signed-off-by: oliver könig <[email protected]>

* Ko3n1g/ci/internal mrs (NVIDIA#1942)

Signed-off-by: oliver könig <[email protected]>

* ci: Fix branch of approval bot (NVIDIA#1944)

Signed-off-by: oliver könig <[email protected]>

* ci: Approvalbot for other branches (NVIDIA#1947)

Signed-off-by: oliver könig <[email protected]>

* ci(fix): Approval bot (NVIDIA#1949)

Signed-off-by: oliver könig <[email protected]>

* ci(fix): Approval gate

Signed-off-by: oliver könig <[email protected]>

* ci: Approval gate rule

Signed-off-by: oliver könig <[email protected]>

* ci: Update golden values nightly

Signed-off-by: oliver könig <[email protected]>

* ci: Approval gate

Signed-off-by: oliver könig <[email protected]>

* ci: Approval bot

Signed-off-by: oliver könig <[email protected]>

* ci: Sync branches

Signed-off-by: oliver könig <[email protected]>

* ci: Smaller image

Signed-off-by: oliver könig <[email protected]>

* ci: Better output

Signed-off-by: oliver könig <[email protected]>

* ci: sync branches

Signed-off-by: oliver könig <[email protected]>

* ci: Fix sync bot

Signed-off-by: oliver könig <[email protected]>

* ci: Finalize

Signed-off-by: oliver könig <[email protected]>

* ci: Finalize

Signed-off-by: oliver könig <[email protected]>

* Ko3n1g/ci/sync branches (NVIDIA#1956)

Signed-off-by: oliver könig <[email protected]>

* ci: Increase time limit for main tests

Signed-off-by: oliver könig <[email protected]>

* Ko3n1g/ci/add milestone (NVIDIA#1951)

Signed-off-by: oliver könig <[email protected]>

* Remove M-FSDP testing under LTS environment (NVIDIA#1959)

* ci: Run on push to release branch (NVIDIA#1960)

Signed-off-by: oliver könig <[email protected]>

* ci: Add golden values for inference

Signed-off-by: oliver könig <[email protected]>

* Fix typo in rl section of CODEOWNERS (NVIDIA#1968)

* ci: Update copyright checker (NVIDIA#1973)

Signed-off-by: oliver könig <[email protected]>

* Ko3n1g/ci/auto reminder GitHub (NVIDIA#1955)

Signed-off-by: oliver könig <[email protected]>

* ci: Update secret

Signed-off-by: oliver könig <[email protected]>

* ci(fix): `Run tests` label (NVIDIA#1970)

Signed-off-by: oliver könig <[email protected]>

* ci(hotfix): Disable tests again

Signed-off-by: oliver könig <[email protected]>

* ci(hotfix): Add merge-group to copyright check

Signed-off-by: oliver könig <[email protected]>

* ci(hotfix): Copyright check on merge-queue

Signed-off-by: oliver könig <[email protected]>

* zarr soft deprecation (NVIDIA#2004)

Signed-off-by: dimapihtar <[email protected]>
Co-authored-by: oliver könig <[email protected]>

* Make `get_asyncio_loop` safe to use repeatedly (NVIDIA#1990)

Co-authored-by: oliver könig <[email protected]>

* Update symmetric registration interface to sync-up with upstream pytorch change (NVIDIA#1924)

Signed-off-by: Youngeun Kwon <[email protected]>
Signed-off-by: Youngeun <[email protected]>
Co-authored-by: oliver könig <[email protected]>

* chore: Update codeowners (NVIDIA#2012)

Signed-off-by: oliver könig <[email protected]>

* Deduplicate dynamic engine + coordinator. (NVIDIA#1981)

Co-authored-by: Mcore Bot <[email protected]>
Co-authored-by: oliver könig <[email protected]>

* Safely access state dict args in load ckpt (NVIDIA#1957)

Signed-off-by: Maanu Grover <[email protected]>

* Allow mixed-batch sampling in dynamic inference (NVIDIA#1927)

* Stop Nemo_CICD_Test from failing in forks (NVIDIA#2024)

* Clean up dynamic inference step (NVIDIA#1992)

Co-authored-by: Lawrence McAfee <[email protected]>
Co-authored-by: oliver könig <[email protected]>

* ci: Auto-update copy-pr-bot vetters (NVIDIA#1850)

Signed-off-by: oliver könig <[email protected]>
Co-authored-by: AJ Schmidt <[email protected]>

* Have datasets account for tokenizers which incorrectly define PAD (NVIDIA#2017)

* ci: Enable integration tests (NVIDIA#2023)

Signed-off-by: oliver könig <[email protected]>

* ci: Fix build-push-wheel workflow (NVIDIA#2022)

Signed-off-by: oliver könig <[email protected]>

* chore: Update tooling for interactive jobs (NVIDIA#2032)

Signed-off-by: oliver könig <[email protected]>

* revert(hotfix): ci: trustees_override (NVIDIA#2041)

Signed-off-by: oliver könig <[email protected]>

* add missing warnings import in model parallel config (NVIDIA#2039)

Signed-off-by: ykarnati <[email protected]>

* Reduce-scatter implementation with FP32 accumulation (NVIDIA#1967)

Signed-off-by: Deepak Narayanan <[email protected]>

* ci(fix): Workflows on `main` (NVIDIA#2045)

Signed-off-by: oliver könig <[email protected]>

* build: Bump modelopt (NVIDIA#2046)

Signed-off-by: oliver könig <[email protected]>

* Remove TestCaptureFreezeGC unit test. (NVIDIA#1978)

* ci: Add multi-approval action (NVIDIA#2051)

Signed-off-by: oliver könig <[email protected]>

* ci(hotfix): Repair codeowners file

* ci(hotfix): Set docs allowed to fail

Signed-off-by: oliver könig <[email protected]>

* Ko3n1g/ci/test iteration time (NVIDIA#2067)

Signed-off-by: oliver könig <[email protected]>

* ci(hotfix): Remove performance for ckpt-resume

Signed-off-by: oliver könig <[email protected]>

* Allow inference test throughput to vary by 10% (NVIDIA#2070)

* ci(hotfix): Inference test pipeline

Signed-off-by: oliver könig <[email protected]>

* chore: Fix autoformatter (NVIDIA#2073)

Signed-off-by: oliver könig <[email protected]>

* ci(hotfix): Remove iteration-time from t5

Signed-off-by: oliver könig <[email protected]>

* ci(hotfix): disable inference test

Signed-off-by: oliver könig <[email protected]>

* ci(hotfix): Disable inference test

Signed-off-by: oliver könig <[email protected]>

* ci(hotfix): Bypass approvalbot in merge-queue (NVIDIA#2082)

Signed-off-by: oliver könig <[email protected]>

* ci(hotfix): Enable merge-group for approval bot

Signed-off-by: oliver könig <[email protected]>

* chore: Update local tooling (NVIDIA#2066)

Signed-off-by: oliver könig <[email protected]>

* Add extra RL files (NVIDIA#2077)

Co-authored-by: Robert Kirby <[email protected]>
Co-authored-by: oliver könig <[email protected]>

* Prevent summary jobs from running in forks (NVIDIA#2083)

Co-authored-by: oliver könig <[email protected]>

* ci: Fix test scope (NVIDIA#2091)

Signed-off-by: oliver könig <[email protected]>

* ci(hotfix): Remove publish workflows

Signed-off-by: oliver könig <[email protected]>

* Refactor the attention metadata into separate classes (NVIDIA#2001)

Co-authored-by: Siddharth Singh <[email protected]>
Co-authored-by: oliver könig <[email protected]>

* Guard against incorrectly using MoE prefill graphs (NVIDIA#2030)

Co-authored-by: oliver könig <[email protected]>

* Revert "Refactor the attention metadata into separate classes (NVIDIA#2001)"

This reverts commit a652e2c.

* Run mr-slim tests in lightweight-mode (NVIDIA#2106)

Signed-off-by: Charlie Truong <[email protected]>

* Inference | Lazy compile UVM allocator. (NVIDIA#1977)

Co-authored-by: Mcore Bot <[email protected]>
Co-authored-by: oliver könig <[email protected]>

* chore: Reenable trustees (NVIDIA#2108)

Signed-off-by: oliver könig <[email protected]>

* Revert "Inference | Lazy compile UVM allocator. (NVIDIA#1977)"

This reverts commit 7487c53.

* ci(fix): Changeset of copyright checker (NVIDIA#2110)

Signed-off-by: oliver könig <[email protected]>

* Ko3n1g/chore/update release settings (NVIDIA#2097)

Signed-off-by: oliver könig <[email protected]>

* Remove unnecessary check on rotary_pos_cos (NVIDIA#2003)

Signed-off-by: Keshav Santhanam <[email protected]>

* (Reverted) Inference | Lazy compile UVM allocator. (NVIDIA#2125)

Co-authored-by: Mcore Bot <[email protected]>
Co-authored-by: oliver könig <[email protected]>

* Refactor Attention Metadata to Separate Classes (NVIDIA#2112)

Co-authored-by: Siddharth Singh <[email protected]>
Co-authored-by: oliver könig <[email protected]>

* Refactor model_provider to model_builder format for ModelOpt examples (NVIDIA#2107)

* wandb Inference stats logging (NVIDIA#2026)

Co-authored-by: root <[email protected]>
Co-authored-by: William Dykas <[email protected]>
Co-authored-by: root <[email protected]>

* Make `PipelineParallelLayout` always return str from ` __repr__` (NVIDIA#2055)

Signed-off-by: Ananth Subramaniam <[email protected]>
Co-authored-by: oliver könig <[email protected]>

* Add flash_attn_3 as first option for FA3 import (NVIDIA#2010)

Signed-off-by: Keshav Santhanam <[email protected]>
Co-authored-by: oliver könig <[email protected]>
Co-authored-by: Teodor-Dumitru Ene <[email protected]>

* Add debugging hint for case when cudagraphs are created but no matching runner is found (NVIDIA#2129)

* ci: LTS container (NVIDIA#2133)

Signed-off-by: oliver könig <[email protected]>

* Revert "ci: LTS container (NVIDIA#2133)"

This reverts commit eb48e81.

* Fix param init (NVIDIA#2033)

Signed-off-by: Chen Cui <[email protected]>

* Hotfix to unit tests on hopper FA3 (NVIDIA#2143)

* Add BytesIO to safe_globals (NVIDIA#2074)

* add deprecation warning for legacy tokenizer system (NVIDIA#2145)

Signed-off-by: dimapihtar <[email protected]>

* replay: ci: Bump LTS container (NVIDIA#2157)

Signed-off-by: oliver könig <[email protected]>

* Hotfix to unit tests on hopper FA3 (bis) (NVIDIA#2179)

* Fix has_modelopt_state() for native Torch checkpoint format (NVIDIA#2160)

Signed-off-by: Asha Anoosheh <[email protected]>

* chore: Remove codeowners (NVIDIA#2175)

Signed-off-by: oliver könig <[email protected]>

* Fix FP8 inference with sequence parallelism (NVIDIA#2009)

Signed-off-by: Keshav Santhanam <[email protected]>
Co-authored-by: Teodor-Dumitru Ene <[email protected]>

* Replace ModelOpt generation server (NVIDIA#2147)

Signed-off-by: Asha Anoosheh <[email protected]>

* Add hybrid model support for dynamic inference engine (NVIDIA#1907)

Signed-off-by: Keshav Santhanam <[email protected]>
Co-authored-by: Teodor-Dumitru Ene <[email protected]>

* Async task and event loop safety in Megatron Core (NVIDIA#2025)

Co-authored-by: Robert Kirby <[email protected]>

* Rename skip_prompt_log_probs (NVIDIA#2181)

* Dynamic inference context | UVM only. (NVIDIA#1983)

Co-authored-by: Robert Kirby <[email protected]>
Co-authored-by: Mcore Bot <[email protected]>
Co-authored-by: Teodor-Dumitru Ene <[email protected]>

* Update copy-pr-bot.yaml [skip ci]

Signed-off-by: oliver könig <[email protected]>

* Revert "Dynamic inference context | UVM only. (NVIDIA#1983)"

This reverts commit d6979d6.

* ci: Run `auto-update-copy-pr-bot` only on forks (NVIDIA#2191)

Signed-off-by: oliver könig <[email protected]>

* Inference throughput tests: refactor goldens to be in list format (NVIDIA#2072)

* Enable TE custom quantization recipe (NVIDIA#2005)

Signed-off-by: Evgeny <[email protected]>
Signed-off-by: root <Evgeny>
Co-authored-by: oliver könig <[email protected]>
Co-authored-by: root <Evgeny>

* Remove redundant logits calculations in gpt_model

---------

Signed-off-by: oliver könig <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: Youngeun Kwon <[email protected]>
Signed-off-by: Youngeun <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: ykarnati <[email protected]>
Signed-off-by: Deepak Narayanan <[email protected]>
Signed-off-by: Charlie Truong <[email protected]>
Signed-off-by: Keshav Santhanam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: Asha Anoosheh <[email protected]>
Signed-off-by: Evgeny <[email protected]>
Signed-off-by: root <Evgeny>
Co-authored-by: oliver könig <[email protected]>
Co-authored-by: Teodor-Dumitru Ene <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: Youngeun Kwon <[email protected]>
Co-authored-by: Lawrence McAfee <[email protected]>
Co-authored-by: Mcore Bot <[email protected]>
Co-authored-by: Maanu Grover <[email protected]>
Co-authored-by: Lawrence McAfee <[email protected]>
Co-authored-by: AJ Schmidt <[email protected]>
Co-authored-by: Yashaswi Karnati <[email protected]>
Co-authored-by: Deepak Narayanan <[email protected]>
Co-authored-by: helen ngo <[email protected]>
Co-authored-by: Robert Kirby <[email protected]>
Co-authored-by: kanz-nv <[email protected]>
Co-authored-by: Siddharth Singh <[email protected]>
Co-authored-by: Charlie Truong <[email protected]>
Co-authored-by: Keshav Santhanam <[email protected]>
Co-authored-by: Asha Anoosheh <[email protected]>
Co-authored-by: wdykas <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: William Dykas <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Ananth Subramaniam <[email protected]>
Co-authored-by: Chen Cui <[email protected]>
Co-authored-by: Teodor-Dumitru Ene <[email protected]>
Co-authored-by: Robert Kirby <[email protected]>
Co-authored-by: Evgeny Tsykunov <[email protected]>
Jianbing-D pushed a commit to Jianbing-D/Megatron-LM that referenced this pull request Nov 12, 2025
* ci: Move test optimizer into its own bucket (NVIDIA#1909)

Signed-off-by: oliver könig <[email protected]>

* ci: Use matrix for approval-bot

Signed-off-by: oliver könig <[email protected]>

* ci: Update function name

Signed-off-by: oliver könig <[email protected]>

* ci: Adjust approval-bot for copy-pr-bot

Signed-off-by: oliver könig <[email protected]>

* ci: Parametrize workflow

Signed-off-by: oliver könig <[email protected]>

* ci: Parametrize workflow

Signed-off-by: oliver könig <[email protected]>

* ci: Remove attribute

Signed-off-by: oliver könig <[email protected]>

* ci: Update container image tag to use GitHub SHA

* chore: Remove file

* ci: Fix approval bot

Signed-off-by: oliver könig <[email protected]>

* ci: Configure cherrypick bot (NVIDIA#1925)

Signed-off-by: oliver könig <[email protected]>

* Ci approve dev (NVIDIA#1933)

Signed-off-by: oliver könig <[email protected]>

* ci: Update nightly schedule (NVIDIA#1934)

Signed-off-by: oliver könig <[email protected]>

* ci: Bump pre-flight for runs on main/dev (NVIDIA#1935)

Signed-off-by: oliver könig <[email protected]>

* ci: Allow skipping on main (NVIDIA#1936)

Signed-off-by: oliver könig <[email protected]>

* Ko3n1g/ci/pr template community bot (NVIDIA#1937)

* ci: More granular unit tests buckets (NVIDIA#1932)

Signed-off-by: oliver könig <[email protected]>

* Add sequence packing to RL (NVIDIA#1911)

Add sequence packing to RL

* chore: Update template (NVIDIA#1939)

Signed-off-by: oliver könig <[email protected]>

* chore: Add description about who can merge (NVIDIA#1940)

Signed-off-by: oliver könig <[email protected]>

* Ko3n1g/ci/fix main on eos (NVIDIA#1938)

Signed-off-by: oliver könig <[email protected]>

* Ko3n1g/ci/internal mrs (NVIDIA#1942)

Signed-off-by: oliver könig <[email protected]>

* ci: Fix branch of approval bot (NVIDIA#1944)

Signed-off-by: oliver könig <[email protected]>

* ci: Approvalbot for other branches (NVIDIA#1947)

Signed-off-by: oliver könig <[email protected]>

* ci(fix): Approval bot (NVIDIA#1949)

Signed-off-by: oliver könig <[email protected]>

* ci(fix): Approval gate

Signed-off-by: oliver könig <[email protected]>

* ci: Approval gate rule

Signed-off-by: oliver könig <[email protected]>

* ci: Update golden values nightly

Signed-off-by: oliver könig <[email protected]>

* ci: Approval gate

Signed-off-by: oliver könig <[email protected]>

* ci: Approval bot

Signed-off-by: oliver könig <[email protected]>

* ci: Sync branches

Signed-off-by: oliver könig <[email protected]>

* ci: Smaller image

Signed-off-by: oliver könig <[email protected]>

* ci: Better output

Signed-off-by: oliver könig <[email protected]>

* ci: sync branches

Signed-off-by: oliver könig <[email protected]>

* ci: Fix sync bot

Signed-off-by: oliver könig <[email protected]>

* ci: Finalize

Signed-off-by: oliver könig <[email protected]>

* ci: Finalize

Signed-off-by: oliver könig <[email protected]>

* Ko3n1g/ci/sync branches (NVIDIA#1956)

Signed-off-by: oliver könig <[email protected]>

* ci: Increase time limit for main tests

Signed-off-by: oliver könig <[email protected]>

* Ko3n1g/ci/add milestone (NVIDIA#1951)

Signed-off-by: oliver könig <[email protected]>

* Remove M-FSDP testing under LTS environment (NVIDIA#1959)

* ci: Run on push to release branch (NVIDIA#1960)

Signed-off-by: oliver könig <[email protected]>

* ci: Add golden values for inference

Signed-off-by: oliver könig <[email protected]>

* Fix typo in rl section of CODEOWNERS (NVIDIA#1968)

* ci: Update copyright checker (NVIDIA#1973)

Signed-off-by: oliver könig <[email protected]>

* Ko3n1g/ci/auto reminder GitHub (NVIDIA#1955)

Signed-off-by: oliver könig <[email protected]>

* ci: Update secret

Signed-off-by: oliver könig <[email protected]>

* ci(fix): `Run tests` label (NVIDIA#1970)

Signed-off-by: oliver könig <[email protected]>

* ci(hotfix): Disable tests again

Signed-off-by: oliver könig <[email protected]>

* ci(hotfix): Add merge-group to copyright check

Signed-off-by: oliver könig <[email protected]>

* ci(hotfix): Copyright check on merge-queue

Signed-off-by: oliver könig <[email protected]>

* zarr soft deprecation (NVIDIA#2004)

Signed-off-by: dimapihtar <[email protected]>
Co-authored-by: oliver könig <[email protected]>

* Make `get_asyncio_loop` safe to use repeatedly (NVIDIA#1990)

Co-authored-by: oliver könig <[email protected]>

* Update symmetric registration interface to sync-up with upstream pytorch change (NVIDIA#1924)

Signed-off-by: Youngeun Kwon <[email protected]>
Signed-off-by: Youngeun <[email protected]>
Co-authored-by: oliver könig <[email protected]>

* chore: Update codeowners (NVIDIA#2012)

Signed-off-by: oliver könig <[email protected]>

* Deduplicate dynamic engine + coordinator. (NVIDIA#1981)

Co-authored-by: Mcore Bot <[email protected]>
Co-authored-by: oliver könig <[email protected]>

* Safely access state dict args in load ckpt (NVIDIA#1957)

Signed-off-by: Maanu Grover <[email protected]>

* Allow mixed-batch sampling in dynamic inference (NVIDIA#1927)

* Stop Nemo_CICD_Test from failing in forks (NVIDIA#2024)

* Clean up dynamic inference step (NVIDIA#1992)

Co-authored-by: Lawrence McAfee <[email protected]>
Co-authored-by: oliver könig <[email protected]>

* ci: Auto-update copy-pr-bot vetters (NVIDIA#1850)

Signed-off-by: oliver könig <[email protected]>
Co-authored-by: AJ Schmidt <[email protected]>

* Have datasets account for tokenizers which incorrectly define PAD (NVIDIA#2017)

* ci: Enable integration tests (NVIDIA#2023)

Signed-off-by: oliver könig <[email protected]>

* ci: Fix build-push-wheel workflow (NVIDIA#2022)

Signed-off-by: oliver könig <[email protected]>

* chore: Update tooling for interactive jobs (NVIDIA#2032)

Signed-off-by: oliver könig <[email protected]>

* revert(hotfix): ci: trustees_override (NVIDIA#2041)

Signed-off-by: oliver könig <[email protected]>

* add missing warnings import in model parallel config (NVIDIA#2039)

Signed-off-by: ykarnati <[email protected]>

* Reduce-scatter implementation with FP32 accumulation (NVIDIA#1967)

Signed-off-by: Deepak Narayanan <[email protected]>

* ci(fix): Workflows on `main` (NVIDIA#2045)

Signed-off-by: oliver könig <[email protected]>

* build: Bump modelopt (NVIDIA#2046)

Signed-off-by: oliver könig <[email protected]>

* Remove TestCaptureFreezeGC unit test. (NVIDIA#1978)

* ci: Add multi-approval action (NVIDIA#2051)

Signed-off-by: oliver könig <[email protected]>

* ci(hotfix): Repair codeowners file

* ci(hotfix): Set docs allowed to fail

Signed-off-by: oliver könig <[email protected]>

* Ko3n1g/ci/test iteration time (NVIDIA#2067)

Signed-off-by: oliver könig <[email protected]>

* ci(hotfix): Remove performance for ckpt-resume

Signed-off-by: oliver könig <[email protected]>

* Allow inference test throughput to vary by 10% (NVIDIA#2070)

* ci(hotfix): Inference test pipeline

Signed-off-by: oliver könig <[email protected]>

* chore: Fix autoformatter (NVIDIA#2073)

Signed-off-by: oliver könig <[email protected]>

* ci(hotfix): Remove iteration-time from t5

Signed-off-by: oliver könig <[email protected]>

* ci(hotfix): disable inference test

Signed-off-by: oliver könig <[email protected]>

* ci(hotfix): Disable inference test

Signed-off-by: oliver könig <[email protected]>

* ci(hotfix): Bypass approvalbot in merge-queue (NVIDIA#2082)

Signed-off-by: oliver könig <[email protected]>

* ci(hotfix): Enable merge-group for approval bot

Signed-off-by: oliver könig <[email protected]>

* chore: Update local tooling (NVIDIA#2066)

Signed-off-by: oliver könig <[email protected]>

* Add extra RL files (NVIDIA#2077)

Co-authored-by: Robert Kirby <[email protected]>
Co-authored-by: oliver könig <[email protected]>

* Prevent summary jobs from running in forks (NVIDIA#2083)

Co-authored-by: oliver könig <[email protected]>

* ci: Fix test scope (NVIDIA#2091)

Signed-off-by: oliver könig <[email protected]>

* ci(hotfix): Remove publish workflows

Signed-off-by: oliver könig <[email protected]>

* Refactor the attention metadata into separate classes (NVIDIA#2001)

Co-authored-by: Siddharth Singh <[email protected]>
Co-authored-by: oliver könig <[email protected]>

* Guard against incorrectly using MoE prefill graphs (NVIDIA#2030)

Co-authored-by: oliver könig <[email protected]>

* Revert "Refactor the attention metadata into separate classes (NVIDIA#2001)"

This reverts commit a652e2c.

* Run mr-slim tests in lightweight-mode (NVIDIA#2106)

Signed-off-by: Charlie Truong <[email protected]>

* Inference | Lazy compile UVM allocator. (NVIDIA#1977)

Co-authored-by: Mcore Bot <[email protected]>
Co-authored-by: oliver könig <[email protected]>

* chore: Reenable trustees (NVIDIA#2108)

Signed-off-by: oliver könig <[email protected]>

* Revert "Inference | Lazy compile UVM allocator. (NVIDIA#1977)"

This reverts commit 7487c53.

* ci(fix): Changeset of copyright checker (NVIDIA#2110)

Signed-off-by: oliver könig <[email protected]>

* Ko3n1g/chore/update release settings (NVIDIA#2097)

Signed-off-by: oliver könig <[email protected]>

* Remove unnecessary check on rotary_pos_cos (NVIDIA#2003)

Signed-off-by: Keshav Santhanam <[email protected]>

* (Reverted) Inference | Lazy compile UVM allocator. (NVIDIA#2125)

Co-authored-by: Mcore Bot <[email protected]>
Co-authored-by: oliver könig <[email protected]>

* Refactor Attention Metadata to Separate Classes (NVIDIA#2112)

Co-authored-by: Siddharth Singh <[email protected]>
Co-authored-by: oliver könig <[email protected]>

* Refactor model_provider to model_builder format for ModelOpt examples (NVIDIA#2107)

* wandb Inference stats logging (NVIDIA#2026)

Co-authored-by: root <[email protected]>
Co-authored-by: William Dykas <[email protected]>
Co-authored-by: root <[email protected]>

* Make `PipelineParallelLayout` always return str from ` __repr__` (NVIDIA#2055)

Signed-off-by: Ananth Subramaniam <[email protected]>
Co-authored-by: oliver könig <[email protected]>

* Add flash_attn_3 as first option for FA3 import (NVIDIA#2010)

Signed-off-by: Keshav Santhanam <[email protected]>
Co-authored-by: oliver könig <[email protected]>
Co-authored-by: Teodor-Dumitru Ene <[email protected]>

* Add debugging hint for case when cudagraphs are created but no matching runner is found (NVIDIA#2129)

* ci: LTS container (NVIDIA#2133)

Signed-off-by: oliver könig <[email protected]>

* Revert "ci: LTS container (NVIDIA#2133)"

This reverts commit eb48e81.

* Fix param init (NVIDIA#2033)

Signed-off-by: Chen Cui <[email protected]>

* Hotfix to unit tests on hopper FA3 (NVIDIA#2143)

* Add BytesIO to safe_globals (NVIDIA#2074)

* add deprecation warning for legacy tokenizer system (NVIDIA#2145)

Signed-off-by: dimapihtar <[email protected]>

* replay: ci: Bump LTS container (NVIDIA#2157)

Signed-off-by: oliver könig <[email protected]>

* Hotfix to unit tests on hopper FA3 (bis) (NVIDIA#2179)

* Fix has_modelopt_state() for native Torch checkpoint format (NVIDIA#2160)

Signed-off-by: Asha Anoosheh <[email protected]>

* chore: Remove codeowners (NVIDIA#2175)

Signed-off-by: oliver könig <[email protected]>

* Fix FP8 inference with sequence parallelism (NVIDIA#2009)

Signed-off-by: Keshav Santhanam <[email protected]>
Co-authored-by: Teodor-Dumitru Ene <[email protected]>

* Replace ModelOpt generation server (NVIDIA#2147)

Signed-off-by: Asha Anoosheh <[email protected]>

* Add hybrid model support for dynamic inference engine (NVIDIA#1907)

Signed-off-by: Keshav Santhanam <[email protected]>
Co-authored-by: Teodor-Dumitru Ene <[email protected]>

* Async task and event loop safety in Megatron Core (NVIDIA#2025)

Co-authored-by: Robert Kirby <[email protected]>

* Rename skip_prompt_log_probs (NVIDIA#2181)

* Dynamic inference context | UVM only. (NVIDIA#1983)

Co-authored-by: Robert Kirby <[email protected]>
Co-authored-by: Mcore Bot <[email protected]>
Co-authored-by: Teodor-Dumitru Ene <[email protected]>

* Update copy-pr-bot.yaml [skip ci]

Signed-off-by: oliver könig <[email protected]>

* Revert "Dynamic inference context | UVM only. (NVIDIA#1983)"

This reverts commit d6979d6.

* ci: Run `auto-update-copy-pr-bot` only on forks (NVIDIA#2191)

Signed-off-by: oliver könig <[email protected]>

* Inference throughput tests: refactor goldens to be in list format (NVIDIA#2072)

* Enable TE custom quantization recipe (NVIDIA#2005)

Signed-off-by: Evgeny <[email protected]>
Signed-off-by: root <Evgeny>
Co-authored-by: oliver könig <[email protected]>
Co-authored-by: root <Evgeny>

* Add MoE parameters to ModelOpt pruning example + conf fixes (NVIDIA#2205)

Signed-off-by: Keval Morabia <[email protected]>

* Add repr to pg collection class (NVIDIA#2089)

Co-authored-by: Jared Casper <[email protected]>

* Move `data_samplers.py` from `legacy` to `training.datasets` & add `DistributedSignalHandler` to DataLoader workers (NVIDIA#2068)

* Fix Megatron-FSDP checkpoint save failure (NVIDIA#2138)

---------

Signed-off-by: oliver könig <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: Youngeun Kwon <[email protected]>
Signed-off-by: Youngeun <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: ykarnati <[email protected]>
Signed-off-by: Deepak Narayanan <[email protected]>
Signed-off-by: Charlie Truong <[email protected]>
Signed-off-by: Keshav Santhanam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: Asha Anoosheh <[email protected]>
Signed-off-by: Evgeny <[email protected]>
Signed-off-by: root <Evgeny>
Signed-off-by: Keval Morabia <[email protected]>
Co-authored-by: oliver könig <[email protected]>
Co-authored-by: Teodor-Dumitru Ene <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: Youngeun Kwon <[email protected]>
Co-authored-by: Lawrence McAfee <[email protected]>
Co-authored-by: Mcore Bot <[email protected]>
Co-authored-by: Maanu Grover <[email protected]>
Co-authored-by: Lawrence McAfee <[email protected]>
Co-authored-by: AJ Schmidt <[email protected]>
Co-authored-by: Yashaswi Karnati <[email protected]>
Co-authored-by: Deepak Narayanan <[email protected]>
Co-authored-by: helen ngo <[email protected]>
Co-authored-by: Robert Kirby <[email protected]>
Co-authored-by: kanz-nv <[email protected]>
Co-authored-by: Siddharth Singh <[email protected]>
Co-authored-by: Charlie Truong <[email protected]>
Co-authored-by: Keshav Santhanam <[email protected]>
Co-authored-by: Asha Anoosheh <[email protected]>
Co-authored-by: wdykas <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: William Dykas <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Ananth Subramaniam <[email protected]>
Co-authored-by: Chen Cui <[email protected]>
Co-authored-by: Teodor-Dumitru Ene <[email protected]>
Co-authored-by: Robert Kirby <[email protected]>
Co-authored-by: Evgeny Tsykunov <[email protected]>
Co-authored-by: Keval Morabia <[email protected]>
Co-authored-by: Jared Casper <[email protected]>
Co-authored-by: Antoni-Joan Solergibert <[email protected]>
Labels

Final Review: Apply this label to indicate that your PR is ready for final review.

3 participants