
[RFC]: CPU offloading support #412

@LawJarp-A

Motivation

While validating large diffusion models in vLLM-Omni (e.g., during #302), model initialization and execution exceeded available device memory even on high-memory GPUs such as the H100, making CPU offloading necessary.

More generally, CPU offloading also enables:

  • Execution of larger models on constrained GPU setups
  • Improved utilization of host memory for large or infrequently accessed components

Proposed Change

Add a hook-based CPU offloading mechanism that integrates with vLLM-Omni’s existing HookRegistry.

  • Introduce CPUOffloadHook (extends ModelHook); a minimal sketch follows this list

    • Intercepts module execution via new_forward()
    • Performs per-forward device transfers (CPU ↔ GPU)
    • Supports coordinated offloading across multiple modules
  • Introduce CPUOffloadBackend; a config-driven sketch appears at the end of this section

    • Registers offload hooks on diffusion pipeline components (text_encoder, transformer, VAE, image_encoder)
    • Controlled via OmniDiffusionConfig flags (e.g., dit_cpu_offload, text_encoder_cpu_offload)
    • Initialized during GPUWorker pipeline construction
  • No diffusion pipeline code changes required

    • Device placement handled entirely via hooks
    • Compatible with existing hook-based features (e.g., TeaCache)
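
For concreteness, here is a minimal sketch of the per-forward offload pattern CPUOffloadHook would implement. The ModelHook stand-in, its attach() mechanics, and the execution_device parameter are illustrative assumptions, not the actual vLLM-Omni interface; see #405 for the real one.

```python
import torch

class ModelHook:
    """Simplified stand-in for vLLM-Omni's hook base class (illustrative only)."""

    def attach(self, module: torch.nn.Module) -> None:
        # Save the original forward and route all calls through new_forward().
        original_forward = module.forward
        module.forward = lambda *args, **kwargs: self.new_forward(
            module, original_forward, *args, **kwargs
        )

    def new_forward(self, module, original_forward, *args, **kwargs):
        raise NotImplementedError


class CPUOffloadHook(ModelHook):
    """Keeps a module's weights in host memory, moving them to the GPU
    only for the duration of each forward pass."""

    def __init__(self, execution_device: str = "cuda"):
        self.execution_device = execution_device

    def new_forward(self, module, original_forward, *args, **kwargs):
        # Inputs are assumed to already live on the execution device.
        module.to(self.execution_device)   # CPU -> GPU just before compute
        try:
            return original_forward(*args, **kwargs)
        finally:
            module.to("cpu")               # GPU -> CPU right after
```

The synchronous transfer on every call matches the stated priority of correctness over performance; overlapping transfers with compute (e.g., pinned host memory, non_blocking=True) is left as a later optimization.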

Please refer to #405 for the initial implementation.

The initial implementation prioritizes correctness and extensibility over performance optimizations.
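
To show how the pieces could fit together, a hypothetical sketch of the backend wiring follows. Only the flag names dit_cpu_offload and text_encoder_cpu_offload appear in this RFC; OmniDiffusionConfig's shape, the pipeline attribute names, and register_hooks() are assumptions, and CPUOffloadHook is the sketch above.

```python
class CPUOffloadBackend:
    """Illustrative backend: installs CPUOffloadHook (sketched above) on
    whichever pipeline components the config flags enable."""

    # Config flag -> pipeline attribute (attribute names assumed for illustration).
    FLAG_TO_COMPONENT = {
        "dit_cpu_offload": "transformer",
        "text_encoder_cpu_offload": "text_encoder",
    }

    def __init__(self, config, execution_device: str = "cuda"):
        self.config = config
        self.execution_device = execution_device

    def register_hooks(self, pipeline) -> None:
        for flag, attr in self.FLAG_TO_COMPONENT.items():
            if getattr(self.config, flag, False):
                CPUOffloadHook(self.execution_device).attach(getattr(pipeline, attr))

# Hypothetical call site during GPUWorker pipeline construction:
# config = OmniDiffusionConfig(dit_cpu_offload=True, text_encoder_cpu_offload=True)
# CPUOffloadBackend(config).register_hooks(pipeline)
```

Because the hooks attach from the outside, the pipeline code stays device-agnostic, which is also what keeps this composable with other hook-based features such as TeaCache.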


Feedback Period

Open for discussion.


CC List

@SamitHuang @ZJY0516


