[BUG] Mean reward and acc increasing steadily on Qwen-VL-MoE-32B-A3B with vanilla mbridge setup but not with Megatron-Bridge on example scripts (8×H200)

### System Info

Environment
- GPU: 8×H200
- Docker image: verlai/verl:vllm011.latest

----------Python Info----------
Version      : 3.12.11
Compiler     : GCC 11.4.0
Build        : ('main', 'Jun  4 2025 08:56:18')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 25.2
Directory    : /usr/local/lib/python3.12/dist-packages/pip
vllm         : 0.11.0
sglang       : not found.
ray          : 2.49.2
torch        : 2.8.0+cu128
----------verl Info-----------
Version      : 0.8.0.dev
Directory    : /root/verl/verl
Commit Hash  : b53f0f132d6d6536ae8901944b24b98dc6dd6cdf
----------Platform Info----------
Platform     : Linux-5.10.134-013.5.kangaroo.al8.x86_64-x86_64-with-glibc2.35
system       : Linux
node         : dsw-318478-6bdc478975-5r8xm
release      : 5.10.134-013.5.kangaroo.al8.x86_64
version      : #1 SMP Thu Nov 20 02:46:27 UTC 2025
----------Environment----------
CUDA Runtime : 12.8
CUDA compiler : Not found: [Errno 2] No such file or directory: 'nvcc'
----------System Info----------
CPU Memory      : 1800.00 GB
GPU Count       : 8
GPU 1   Type    : NVIDIA L20X
GPU 1   Memory  : 140.40 GB
GPU 2   Type    : NVIDIA L20X
GPU 2   Memory  : 140.40 GB
GPU 3   Type    : NVIDIA L20X
GPU 3   Memory  : 140.40 GB
GPU 4   Type    : NVIDIA L20X
GPU 4   Memory  : 140.40 GB
GPU 5   Type    : NVIDIA L20X
GPU 5   Memory  : 140.40 GB
GPU 6   Type    : NVIDIA L20X
GPU 6   Memory  : 140.40 GB
GPU 7   Type    : NVIDIA L20X
GPU 7   Memory  : 140.40 GB
GPU 8   Type    : NVIDIA L20X
GPU 8   Memory  : 140.40 GB

### Information

- [x] The official example scripts
- [x] My own modified scripts

### Tasks

- [x] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction

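With the same training script and hardware setup, the two bridge backends behave differently:
- Vanilla mbridge: training converges normally, and mean reward and accuracy increase steadily.
- Megatron-Bridge: training does not converge, and mean reward does not increase.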

✅ Vanilla mbridge setup (clear progress, mean reward increasing steadily)

With
```
megatron-core==0.15.0
```
and the script
```
examples/grpo_trainer/run_qwen3_vl-30b-megatron.sh
```
training is stable and behaves normally.

<img width="364" height="544" alt="Image" src="https://github.com/user-attachments/assets/a342275e-685f-4f79-9969-0c0c6bcb237e" />

❌ Megatron-Bridge setup (mean reward not increasing)

Apply the following changes.

Package version changes:
- megatron-core: install from `main`
- megatron-bridge: install the build from this PR:
```
https://github.com/NVIDIA-NeMo/Megatron-Bridge/pull/1943
```
In `examples/grpo_trainer/run_qwen3_vl-30b-megatron.sh`, add:
```
actor_rollout_ref.actor.megatron.vanilla_mbridge=False
```
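
Because the two setups differ only in package versions and one flag, it may help to record which versions are actually installed in each environment before launching. The following is a minimal sketch using only the standard library. The distribution names in `PACKAGES` are assumptions and may not match what `pip list` shows in your image.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed distribution names; adjust to match `pip list` in your environment.
PACKAGES = ("megatron-core", "megatron-bridge", "mbridge", "verl", "vllm", "torch")


def installed_versions(names=PACKAGES):
    """Return {distribution name: version string, or 'not installed'}."""
    result = {}
    for name in names:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = "not installed"
    return result


if __name__ == "__main__":
    # Print one line per package so the output can be pasted into the issue.
    for name, ver in installed_versions().items():
        print(f"{name:16s}: {ver}")
```

Running this once in the vanilla mbridge environment and once in the Megatron-Bridge environment gives a side-by-side record of what changed between the two runs.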

<img width="348" height="538" alt="Image" src="https://github.com/user-attachments/assets/39ef331a-942d-4503-bdad-9c1daa2c8a00" />
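
To make "increasing steadily" versus "not increasing" comparable across the two runs, one option is to fit a least-squares slope to the logged mean reward per step. This is only a sketch: `reward_slope` is a hypothetical helper, and the example values are synthetic placeholders, not numbers from the runs above.

```python
from statistics import mean


def reward_slope(rewards):
    """Least-squares slope of reward against step index (reward units per step)."""
    n = len(rewards)
    if n < 2:
        raise ValueError("need at least two logged values")
    steps = range(n)
    step_mean = mean(steps)
    reward_mean = mean(rewards)
    # Standard simple linear regression: cov(step, reward) / var(step).
    num = sum((s - step_mean) * (r - reward_mean) for s, r in zip(steps, rewards))
    den = sum((s - step_mean) ** 2 for s in steps)
    return num / den


if __name__ == "__main__":
    # Synthetic placeholder series; replace with the mean reward values exported from your logger.
    rising = [0.1, 0.2, 0.3, 0.4]
    flat = [0.3, 0.3, 0.3, 0.3]
    print("rising slope:", reward_slope(rising))
    print("flat slope:  ", reward_slope(flat))
```

A clearly positive slope for the vanilla mbridge run and a slope near zero for the Megatron-Bridge run would confirm numerically what the two plots show.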








### Expected behavior

The Megatron-Bridge setup should show the same steadily increasing mean reward as the vanilla mbridge setup with
```
megatron-core==0.15.0
```

<img width="364" height="544" alt="Image" src="https://github.com/user-attachments/assets/a342275e-685f-4f79-9969-0c0c6bcb237e" />
