Unnecessary host memory usage on FSDP rank 0 #5125

@flishwang

System Info

verl main (2026.01.27)
vllm 0.13.0 (updated from official docker image vllm012.exp3)
torch 2.9.1+cu129
GPU: 8x A800 80GB

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

When I tried to run the DAPO scripts with fsdp2 + vllm async for Qwen3-Next-80B-A3B-Instruct on 4 nodes, I hit a host (CPU RAM) OOM while the actor was loading the model.
With fsdp_size=16, I noticed that host memory usage on Node 0 and Node 2 is about 320GB higher than on Node 1 and Node 3 after the ref model is initialized, and it grows sharply again when the actor/rollout is initialized.
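The node pattern and the size of the gap can be sanity-checked with a little arithmetic. The sketch below is only an illustration under assumptions that are not confirmed in this report: global ranks are assigned node by node with 8 GPUs each, the device mesh is a 2-D (ddp, fsdp) mesh whose fsdp dimension is contiguous, and only the first rank of each FSDP shard group materializes the full weights on CPU. The helper names are hypothetical.

```python
# Assumptions (not taken from the issue): 8 GPUs per node, global ranks assigned
# node by node, and a (ddp, fsdp) mesh whose fsdp dimension is contiguous.
GPUS_PER_NODE = 8


def full_copy_ranks(world_size: int, fsdp_size: int) -> list[int]:
    """Global ranks sitting at coordinate 0 of each FSDP shard group."""
    assert world_size % fsdp_size == 0
    return list(range(0, world_size, fsdp_size))


def nodes_of(ranks: list[int]) -> list[int]:
    """Node indices that host the given global ranks."""
    return sorted({r // GPUS_PER_NODE for r in ranks})


def full_copy_gb(num_params: float, bytes_per_param: int) -> float:
    """Size of one unsharded copy of the weights, in decimal GB."""
    return num_params * bytes_per_param / 1e9


ranks = full_copy_ranks(world_size=32, fsdp_size=16)
print(ranks, nodes_of(ranks))                        # [0, 16] [0, 2]
print(full_copy_gb(80e9, 4), full_copy_gb(80e9, 2))  # 320.0 160.0
```

Under these assumptions the full copies land on Node 0 and Node 2, and one unsharded fp32 copy of roughly 80B parameters is about 320 GB (about 160 GB in bf16), which is in the same range as the gap described above.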

It seems that either the weights held in "full_state" are not freed after the code below runs, or there is a bug in set_state_dict in PyTorch.

Has anyone else run into the same issue?
Or should I use Megatron instead of FSDP?

def _build_model_optimizer(...):
    init_context = get_init_weight_context_manager(
        use_meta_tensor=not actor_model_config.tie_word_embeddings, mesh=self.device_mesh
    )
    ...
    with init_context(), warnings.catch_warnings():
        ...
        actor_module = actor_module_class.from_pretrained(
            pretrained_model_name_or_path=local_path,
            torch_dtype=torch_dtype,
            config=actor_model_config,
            trust_remote_code=trust_remote_code,
            attn_implementation=attn_implementation,
        )
    ...
    # Params are loaded to CPU on rank 0 here
    full_state = actor_module.state_dict()
    # Params are converted to DTensor
    apply_fsdp2(actor_module, fsdp_kwargs, fsdp_config)
    # Broadcast param weights from rank 0 to the other ranks
    fsdp2_load_full_state_dict(actor_module, full_state, fsdp_mesh, cpu_offload)
    actor_module_fsdp = actor_module
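
One possibility worth ruling out is plain reference lifetime: `full_state` is a local variable, so the objects it refers to stay alive until the enclosing function returns or the name is deleted explicitly. The sketch below only demonstrates that Python mechanism with a hypothetical stand-in class (`FakeWeights`); it does not use verl or PyTorch, and it is not a confirmed diagnosis of this issue.

```python
import gc
import weakref


class FakeWeights:
    """Hypothetical stand-in for a large CPU state dict (not a verl class)."""

    def __init__(self, n_bytes: int):
        self.buf = bytearray(n_bytes)


def build(release_early: bool) -> bool:
    """Return True if the weights are still alive when later steps would run."""
    full_state = FakeWeights(1 << 20)
    alive = weakref.ref(full_state)
    # ... sharding and the rank-0 broadcast would consume full_state here ...
    if release_early:
        del full_state  # drop the only strong reference
        gc.collect()    # also clears any reference cycles
    # Later steps of the same function (e.g. optimizer setup) would run here.
    return alive() is not None


print(build(release_early=False))  # True: still referenced by the local
print(build(release_early=True))   # False: freed before the later steps
```

Even when the Python objects are released, the allocator may not return the pages to the OS right away, and any other holder of the same tensors (for example the module itself) would keep them alive, so a drop in host RSS is not guaranteed.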

Expected behavior

Memory usage for all nodes being similar.
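
To compare ranks directly, host memory can be logged per process around the loading steps. This is a minimal stdlib-only sketch for Unix systems; `host_rss_bytes` and `log_rss` are hypothetical helper names, and reading the rank from the `RANK` environment variable assumes the launcher exports it.

```python
import os
import sys


def host_rss_bytes() -> int:
    """Resident set size of the current process in bytes (Unix only)."""
    try:
        # Linux: current RSS, reported in kB.
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1]) * 1024
    except OSError:
        pass
    # Fallback: peak RSS (not current); bytes on macOS, kB elsewhere.
    import resource

    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak if sys.platform == "darwin" else peak * 1024


def log_rss(tag: str) -> str:
    """Print and return a one-line RSS report for this process."""
    rank = os.environ.get("RANK", "?")  # assumption: set by the launcher
    msg = f"[rank {rank}] {tag}: host RSS = {host_rss_bytes() / 2**30:.2f} GiB"
    print(msg, flush=True)
    return msg


log_rss("before loading")
```

Calling `log_rss` before and after `from_pretrained`, `apply_fsdp2`, and `fsdp2_load_full_state_dict` on every rank would show which step leaves the extra memory resident.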

Labels: bug