chore(deps): upgrade mbridge from 0.15.1 to 310e8fb #1258

Merged

garrett4wade merged 5 commits into inclusionAI:main from Adiactive:feat/upgrade-mbridge on Apr 25, 2026
Conversation

@Adiactive (Contributor) commented Apr 24, 2026

Description

Upgrade mbridge from the PyPI release 0.15.1 to commit 310e8fb on main (~120 upstream commits). This picks up new model architectures (Qwen3-VL dense/MoE with CP, Qwen3.5 + MTP, Qwen3 Omni MoE, InternVL 3.5, DeepSeek NPU), compatibility fixes for megatron-core 0.14+/0.15/0.16 and transformers v5, and several bug fixes around RoPE, weight export, and distributed checkpointing. See the linked issue for the full rationale.

Changes:

  • Update the mbridge requirement in pyproject.toml and pyproject.vllm.toml (see the sketch below)
  • Regenerate uv.lock and uv.vllm.lock via scripts/uv_lock.sh (run with uv 0.10.9)
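For context, a sketch of what the requirement change looks like in PEP 508 terms. The commit, repository URL, and x86_64 marker come from this PR; the exact dependency-table layout in our pyproject files is an assumption:

```toml
# Sketch only; the real pyproject layout may differ.
# Before: PyPI wheel pin
#   "mbridge==0.15.1 ; platform_machine == 'x86_64'",
# After: direct git reference pinned to the upgrade commit
# (review feedback below suggests the full 40-character hash instead of 310e8fb)
dependencies = [
    "mbridge @ git+https://github.com/ISEEKYAN/mbridge.git@310e8fb ; platform_machine == 'x86_64'",
]
```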

Related Issue

Fixes #1257

Type of Change

  • ♻️ Refactoring

Checklist

  • I have read the Contributing Guide
  • Pre-commit hooks pass (pre-commit run --all-files)
  • Relevant tests pass; new tests added for new functionality
  • Documentation updated (if applicable; built with ./docs/build_all.sh)
  • Branch is up to date with main
  • Self-reviewed via /review-pr command
  • This PR was created by a coding agent via /create-pr
  • This PR is a breaking change

Additional Context

Test Environment

  • Image: ghcr.io/inclusionai/areal-runtime:v1.0.3-vllm
  • mbridge installed via uv pip install --no-deps git+https://github.com/ISEEKYAN/mbridge.git@310e8fb to keep the existing torch/TE pins intact (verification sketch below)
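A minimal sketch of that install plus a sanity check; `uv pip show` is a standard uv subcommand, but treat the sequence as illustrative rather than the exact session used:

```bash
# Install the pinned commit without touching the torch/TransformerEngine pins.
uv pip install --no-deps "git+https://github.com/ISEEKYAN/mbridge.git@310e8fb"
# Sanity check that the git-sourced build replaced the 0.15.1 wheel.
uv pip show mbridge
```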

Tests Run

| Test | Result | Notes |
| --- | --- | --- |
| tests/test_estimate_num_params.py | ✅ 3/3 | Direct mbridge.AutoBridge API |
| tests/test_megatron_engine.py | ✅ 3/4 | DCP test: pre-existing failure (see below) |
| tests/test_megatron_engine_distributed.py | ⚠️ 4/7 | TP / PP / VPP / MoE-DCP pass; 3 pre-existing failures |
| tests/test_tree_training.py | ✅ 12/12 | flex/triton × {fsdp, megatron, archon} |
| tests/test_offload.py | ✅ 2/2 | FSDP and Megatron offload paths |
| tests/fp8/test_fp8_rmsnorm.py + test_fp8_bf16_comparison.py | ⚠️ 2/4 | 2 pre-existing failures (see below) |
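For reference, a hypothetical re-run of these suites inside the image above; the exact pytest invocations and any required launchers (e.g. torchrun for the distributed suite) are assumptions:

```bash
# Illustrative only; run inside ghcr.io/inclusionai/areal-runtime:v1.0.3-vllm.
pytest tests/test_estimate_num_params.py tests/test_megatron_engine.py
pytest tests/test_megatron_engine_distributed.py
pytest tests/test_tree_training.py tests/test_offload.py
pytest tests/fp8/test_fp8_rmsnorm.py tests/fp8/test_fp8_bf16_comparison.py
```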

Pre-existing failures unrelated to this upgrade

  1. test_dcp_save_load_weights (test_megatron_engine.py) — CheckpointingException: ShardedTensor.flattened_range is not supported. Known incompatibility between Megatron's distributed optimizer and dist-checkpointing format. Documented at cli_args.py:2177; workaround added in 7ca1fea0 for recovery. The test is @pytest.mark.slow and excluded from CI.

  2. test_qwen3_dcp_save_load (test_megatron_engine_distributed.py) — same flattened_range issue (uses with_optim=True).

  3. test_qwen3_context_parallel & test_qwen3moe_expert_parallel (test_megatron_engine_distributed.py) — KeyError: 'loss_mask' regression introduced by PR #1223 (feat: CP-local loss and configurable CUDA memory profiling; d58cca56, merged 2026-04-23). The new CP-local loss path requires loss_mask in padded_mb, but the test's mock_input only provides input_ids / attention_mask. Adding loss_mask to the mock uncovers a further downstream unpack_sequence shape mismatch in the same CP-local path — both belong to PR #1223 follow-up work and are out of scope for this PR.

  4. test_fp8_bf16_gradient_comparison & test_fp8_bf16_partial_layers_comparison (tests/fp8/test_fp8_bf16_comparison.py) — TypeError: sft_loss_fn() got an unexpected keyword argument 'vocab_min_logits'. Caused by PR #889 (fix(tree_attn): Fix some bugs in tree training for FSDP and Megatron engines; 127b0264), which added vocab_min_logits=... to the engine's loss_fn call, while the fp8 test fixture's sft_loss_fn at tests/fp8/model_hooks.py:127 doesn't accept **kwargs; see the sketch after this list.
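For failure 4 the shape of the eventual fix is straightforward. Below is a hypothetical sketch — the function and argument names come from the failure report above; the loss body is illustrative, not the actual fixture:

```python
import torch
import torch.nn.functional as F

# Hypothetical follow-up fix, not part of this PR: since PR #889 the engine
# calls loss_fn(..., vocab_min_logits=...), so the fp8 fixture's sft_loss_fn
# must absorb unknown keyword arguments instead of raising a TypeError.
def sft_loss_fn(logits: torch.Tensor, labels: torch.Tensor, **kwargs) -> torch.Tensor:
    # **kwargs swallows engine-side extras such as vocab_min_logits.
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
```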

Lock file diff

The lock files lose a number of wheel entries for non-x86_64 platforms (ppc64le, s390x, armv7l, riscv64, musllinux). Switching mbridge from a PyPI wheel to a git source — combined with the existing platform_machine == 'x86_64' marker on mbridge in our pyproject — lets uv prove those wheels are unreachable through the mbridge subtree and prune them. AReaL targets x86_64 Linux clusters, so there is no real platform coverage loss.

Commit: Upgrade mbridge to support more model architectures and improve compatibility with megatron-core 0.16.0.

@gemini-code-assist (bot) left a comment


Code Review

This pull request updates the mbridge dependency in pyproject.toml and pyproject.vllm.toml to a specific git commit. The review recommends pinning to the full 40-character commit hash rather than a short SHA for better security and reproducibility, and removing an extra space for consistency with neighboring entries.

Resolved review threads (outdated): pyproject.toml, pyproject.vllm.toml
garrett4wade and others added 4 commits on April 25, 2026, co-authored by gemini-code-assist[bot].
@garrett4wade (Collaborator) left a comment


LGTM!

FYI, in the future we will fully migrate from mbridge to megatron-bridge, since mbridge does not support transformers>=5.3.

@garrett4wade garrett4wade merged commit 4629c4e into inclusionAI:main Apr 25, 2026
5 of 6 checks passed