Description
System Info
verl: main (2026.01.27)
vllm: 0.13.0 (updated from the official docker image vllm012.exp3)
torch: 2.9.1+cu129
GPU: 8 × A800 80GB
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
When I run the DAPO scripts with FSDP2 + vLLM async rollout for Qwen3-Next-80B-A3B-Instruct on 4 nodes, I hit a host (CPU) OOM while the actor model is being loaded.
With fsdp_size=16, host memory usage on Node 0 and Node 2 is about 320 GB higher than on Node 1 and Node 3 after the ref model is initialized, and it climbs further when the actor/rollout is initialized.
It looks like the weights held in full_state are not freed after the code below, or there is a bug in PyTorch's set_state_dict.
Has anyone run into the same issue? Or should I switch to Megatron instead of FSDP?
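To show where the imbalance appears, a small sketch like the one below can log host memory per rank at each initialization step; psutil and the log_host_memory helper are my own additions for illustration, not part of verl.

import os
import psutil
import torch.distributed as dist

def log_host_memory(tag: str) -> None:
    # Node-wide host memory plus the current process's own RSS.
    vm = psutil.virtual_memory()
    rss = psutil.Process(os.getpid()).memory_info().rss
    rank = dist.get_rank() if dist.is_initialized() else 0
    print(
        f"[rank {rank}] {tag}: node used={vm.used / 2**30:.1f} GiB, "
        f"available={vm.available / 2**30:.1f} GiB, process rss={rss / 2**30:.1f} GiB"
    )

# e.g. log_host_memory("after ref init") and log_host_memory("after actor init")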
def _build_model_optimizer(...):
    init_context = get_init_weight_context_manager(
        use_meta_tensor=not actor_model_config.tie_word_embeddings, mesh=self.device_mesh
    )
    ...
    with init_context(), warnings.catch_warnings():
        ...
        actor_module = actor_module_class.from_pretrained(
            pretrained_model_name_or_path=local_path,
            torch_dtype=torch_dtype,
            config=actor_model_config,
            trust_remote_code=trust_remote_code,
            attn_implementation=attn_implementation,
        )
        ...
        # Full params are loaded to CPU on rank 0 here
        full_state = actor_module.state_dict()
        # Params are converted to DTensor shards
        apply_fsdp2(actor_module, fsdp_kwargs, fsdp_config)
        # Broadcast param weights from rank 0 to the other ranks
        fsdp2_load_full_state_dict(actor_module, full_state, fsdp_mesh, cpu_offload)
        actor_module_fsdp = actor_module
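As a sanity check, here is a minimal workaround sketch, assuming the extra host memory is just the rank-0 CPU copy kept alive by the full_state reference (not verified to fix the OOM; the names follow the snippet above):

import gc

# Inside _build_model_optimizer, right after the broadcast above:
fsdp2_load_full_state_dict(actor_module, full_state, fsdp_mesh, cpu_offload)
del full_state  # explicitly drop the rank-0 CPU copy of the full state dict
gc.collect()    # assumption: the extra ~320 GB is this dict staying alive
actor_module_fsdp = actor_module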
Expected behavior
Host memory usage should be roughly the same across all nodes.