Motivation
While validating large diffusion models in vLLM-Omni (e.g., during #302), model initialization and execution required CPU offloading even on high-memory GPUs (e.g., H100).
More generally, CPU offloading also enables:
- Execution of larger models on constrained GPU setups
- Improved utilization of host memory for large or infrequently accessed components
Proposed Change
Add a hook-based CPU offloading mechanism that integrates with vLLM-Omni's existing `HookRegistry`.
- Introduce `CPUOffloadHook` (extends `ModelHook`)
  - Intercepts module execution via `new_forward()` (sketched after this list)
  - Performs per-forward device transfers (CPU ↔ GPU)
  - Supports coordinated offloading across multiple modules
- Introduce `CPUOffloadBackend`
  - Registers offload hooks on diffusion pipeline components (`text_encoder`, `transformer`, `VAE`, `image_encoder`)
  - Controlled via `OmniDiffusionConfig` flags (e.g., `dit_cpu_offload`, `text_encoder_cpu_offload`)
  - Initialized during `GPUWorker` pipeline construction
- No diffusion pipeline code changes required
  - Device placement is handled entirely via hooks
  - Compatible with existing hook-based features (e.g., TeaCache)
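For concreteness, here is a minimal sketch of the hook mechanism in plain PyTorch. The `ModelHook`/`HookRegistry` integration is elided, and the `install()` method and its forward-wrapping strategy are illustrative assumptions, not the actual implementation:

```python
import torch
import torch.nn as nn


class CPUOffloadHook:
    """Parks a module's weights in host memory and moves them to the
    execution device only for the duration of each forward pass."""

    def __init__(self, execution_device: torch.device):
        self.execution_device = execution_device

    def install(self, module: nn.Module) -> None:
        module.to("cpu")  # keep weights on CPU while the module is idle
        original_forward = module.forward

        def new_forward(*args, **kwargs):
            # Per-forward transfer: CPU -> GPU before execution ...
            module.to(self.execution_device)
            try:
                return original_forward(*args, **kwargs)
            finally:
                # ... and GPU -> CPU afterwards, releasing device memory
                module.to("cpu")

        # Instance attribute shadows the bound method, so module(...)
        # now routes through the offload-aware wrapper.
        module.forward = new_forward
```

Because the wrapping happens at the module boundary, the pipeline code itself never needs to know whether a component currently lives on the CPU or the GPU.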
Please refer to #405 for the initial implementation, which prioritizes correctness and extensibility over performance optimization.
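As a usage sketch, the backend could consult the `OmniDiffusionConfig` flags at pipeline construction time. The `pipeline` attribute names and the exact wiring below are assumptions based on the component list above; see #405 for the actual `CPUOffloadBackend`:

```python
# Hypothetical wiring inside GPUWorker pipeline construction
device = torch.device("cuda:0")

if config.dit_cpu_offload:            # OmniDiffusionConfig flag
    CPUOffloadHook(device).install(pipeline.transformer)
if config.text_encoder_cpu_offload:   # OmniDiffusionConfig flag
    CPUOffloadHook(device).install(pipeline.text_encoder)
```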
Feedback Period
Open for discussion.
CC List
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.