Great extension, works as advertised! Many thanks. I wonder if it would make sense to load a few layers in advance to better saturate the GPU. It would cost a little VRAM but would avoid almost all downtime in processing. I don't know whether this is feasible given the ComfyUI/GGUF architecture, or whether it already works this way.
I assume all layers are initially offloaded to RAM, as the most common use case. For example:
- load layers 1, 2, 3
- layer 1 starts computing
- layer 1 finishes, layer 2 starts computing
- unload layer 1, load layer 4 while layer 2 is still working
- layer 2 finishes, layer 3 starts
- unload layer 2, load layer 5
and so on. The number of preloaded layers could be configurable to balance free VRAM against speed: 0 would mean the current behavior, 1 would mean two layers are in VRAM at the same time, 2 would look like the situation above, etc. If I understand correctly, the next layer is currently only loaded after the previous layer finishes processing, which adds small gaps that can accumulate into noticeable processing time. Some questions to consider:
- The layer order seems to be deterministic, but are the layers ordered at all? If not, it's hard to tell which layer to load before it's explicitly requested. However, we could run the model once and collect the execution order, which shouldn't change, so it could be cached to disk and reused later.
- Is it worth it? No idea. The transfer/dequant time is short, but it accumulates, especially when the number of layers is high (300+ in Hunyuan Video). It would also be possible to load batches of layers instead of loading them one by one, sacrificing some VRAM and incurring a small GPU stall after each batch instead of after each layer. I don't know whether this is already implemented.
- Is it possible to hook into execution to preload/unload layers while the model is running?
- Would running VRAM transfers in parallel with compute actually harm performance? GPUs can behave weirdly, so I wouldn't be surprised if this optimization backfired.
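The scheme above can be modeled as a sliding window of staged layers refilled by a background loader. This is just a toy sketch in pure Python (no ComfyUI/GGUF specifics; `run_with_prefetch`, `compute`, and the bounded-queue approach are all my own illustration) — a real implementation would presumably use CUDA streams and non-blocking copies rather than threads:

```python
import queue
import threading

def run_with_prefetch(layer_ids, compute, prefetch=2):
    """Toy model of sliding-window layer prefetch: a background
    'loader' stages up to `prefetch` layers ahead of compute, so
    transfers overlap with execution instead of serializing."""
    staged = queue.Queue(maxsize=max(1, prefetch))  # the "VRAM" window
    trace = []

    def loader():
        for lid in layer_ids:
            # Stand-in for RAM->VRAM transfer + dequant; put() blocks
            # once the window is full, which bounds VRAM use.
            staged.put(lid)
        staged.put(None)  # sentinel: nothing left to load

    t = threading.Thread(target=loader)
    t.start()
    while (lid := staged.get()) is not None:
        compute(lid)       # layer runs while the loader refills the window
        trace.append(lid)  # "unload": taking from the queue freed a slot
    t.join()
    return trace
```

The `prefetch` knob here plays the role of the configurable lookahead: a larger window stages more layers ahead at the cost of more resident memory.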
Some results:
| Resolution × length | s/it | Virtual VRAM, GB | Slowdown |
|---|---|---|---|
| 512x320x85 | 2.5 | 0 | 0% |
| 512x320x85 | 3.02 | 24 | 20% |
| 720x480x85 | 6.62 | 0 | 0% |
| 720x480x85 | 7.22 | 24 | 9% |
That's on a 3090 Ti with WaveSpeed (TeaCache) enabled, so I suppose some steps were almost fully skipped after the first layer. The parameters were identical across runs, so the skipped steps should also match in all cases. The difference seems to shrink for longer/higher-resolution videos, as block juggling takes a smaller share of the total time. Still, I'd absolutely love to set it and forget it, knowing that I wouldn't lose or gain anything by tweaking the VVRAM parameter 😄
A major enhancement to ComfyUI-MultiGPU has been in the works for a while that contains as much as I know how to throw at the problem. It has been in heavy alpha for about a month (no work this past week, as I had superseding claims on my time) over on rc1_dev if you want to take a sneak peek at the enhancements for both low-VRAM and multi-VRAM situations.
I have seen significant improvements in some areas; in others I am still actively debugging with line_profiler and nsys to knock out CPU syncs, etc. for better performance.
I expect to open this up to testers sometime in the next week, and I will be sure to update this thread once that work is complete.