
Does this actually work when unet is in cpu? #44

Open
Symbiomatrix opened this issue Mar 25, 2025 · 3 comments

Comments

@Symbiomatrix

Symbiomatrix commented Mar 25, 2025

My goal is not actually to run on the CPU, but to temporarily move a model there in order to free memory for other programs (e.g. Ollama) and quickly move it back once they're done.
As far as I can tell, this is only possible when the model is fully connected to an actual generation workflow, even a basic dummy one with 1 step; anything less and the model is completely unloaded.
I tried such a dummy workflow with a Flux fp8 finetune as the UNet on CPU, and CLIP + t5x fp8 on GPU. The move is performed successfully; however, when I reach the KSampler stage, I consistently receive the following errors:

NotImplementedError: No operator found for `memory_efficient_attention_forward` with inputs:
     query       : shape=(1, 4352, 24, 128) (torch.float32)
     key         : shape=(1, 4352, 24, 128) (torch.float32)
     value       : shape=(1, 4352, 24, 128) (torch.float32)
     attn_bias   : <class 'NoneType'>
     p           : 0.0
`decoderF` is not supported because:
    device=cpu (supported: {'cuda'})
    attn_bias type is <class 'NoneType'>
`[email protected]` is not supported because:
    device=cpu (supported: {'cuda'})
    dtype=torch.float32 (supported: {torch.bfloat16, torch.float16})
    operator wasn't built - see `python -m xformers.info` for more info
`cutlassF-pt` is not supported because:
    device=cpu (supported: {'cuda'})
`smallkF` is not supported because:
    max(query.shape[-1] != value.shape[-1]) > 32
    device=cpu (supported: {'cuda'})
    unsupported embed per head: 128

The model can be moved back to the GPU quickly at any point afterwards (so for a separate process it works), but because of the error, this dummy offloading is inadequate as a step in a bigger workflow. Running something else seems to trigger a full reload, though the model doesn't appear to be removed from RAM.

What do you suggest?
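For reference, the round trip described above boils down to something like the following plain-PyTorch sketch. Here `model` is a hypothetical handle to the loaded UNet; this only illustrates the intended memory behaviour, not ComfyUI's own model-management API:

```python
import torch

def offload_to_cpu(model: torch.nn.Module) -> None:
    """Move the weights to system RAM and hand the freed VRAM back to the allocator."""
    model.to("cpu")
    torch.cuda.synchronize()   # make sure pending GPU work has finished
    torch.cuda.empty_cache()   # release the cached allocator blocks

def restore_to_gpu(model: torch.nn.Module, device: str = "cuda:0") -> None:
    """Bring the weights back once the other program (e.g. Ollama) is done."""
    model.to(device)
```

The error itself comes from the fact that the xformers `memory_efficient_attention` kernels are CUDA-only (the log says `device=cpu (supported: {'cuda'})` for every operator); a forward pass on the CPU would need a CPU-capable attention path such as PyTorch's built-in `torch.nn.functional.scaled_dot_product_attention` instead.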

@Mescalamba

Well, if you load it onto the CPU, it's simply on the CPU, and it doesn't work there.

But you can load the model onto the GPU and let it offload 100% of itself via the virtual VRAM size setting.

SDXL-based models, for example, work even when fully offloaded to system memory. Flux should too; if you have fast system RAM, it can probably work really well.
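The full-offload idea can be sketched in plain PyTorch with forward hooks that pull each block onto the GPU just before it runs and push it back to system RAM afterwards. This is only an illustration of the concept, not how ComfyUI-MultiGPU's virtual VRAM / DisTorch mechanism is actually implemented:

```python
import torch

def attach_streaming_offload(blocks, device: str = "cuda:0") -> None:
    """Keep every block's weights in system RAM and stream them through the GPU
    one block at a time during the forward pass."""
    def pre_hook(module, args):
        module.to(device)      # copy this block's weights to VRAM just in time

    def post_hook(module, args, output):
        module.to("cpu")       # evict the block again once it has run
        return output

    for block in blocks:
        block.to("cpu")
        block.register_forward_pre_hook(pre_hook)
        block.register_forward_hook(post_hook)

# Hypothetical usage on a diffusion backbone's transformer blocks:
# attach_streaming_offload(unet.transformer_blocks)
```

The cost is one host-to-device copy per block per step, which is why fast system RAM matters for keeping a fully offloaded model at a usable speed.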

@Symbiomatrix

Symbiomatrix commented Mar 26, 2025

@Mescalamba Perhaps I'm missing something, but isn't the virtual VRAM setting available only for DisTorch, i.e. GGUF? As stated, the model is an fp8 finetune. I'd love to convert it, but city's process as it stands is far from trivial.
That's a neat trick though; I hadn't thought of it.

@Mescalamba

An fp8 finetune can probably be converted directly to Q8; there are just no tools for that. I think I'll make one, because it often happens that I want to GGUF something and can't because it's fp8.

Virtual VRAM works only on GGUF, you are right. I wasn't exactly sure what you were aiming for by loading the diffusion model into CPU/RAM. That obviously won't work.
