Description
Great extension, works as advertised, many thanks! I wonder if it makes sense to load a few layers in advance to better saturate the GPU. It would cost a little VRAM but would avoid almost all downtime in processing. I don't know whether the ComfyUI/GGUF architecture allows it, or whether it already works this way.
I assume all layers are initially offloaded to RAM, since that seems like the most useful case. For example:
- load layers 1, 2, 3
- layer 1 starts computing
- finishes, layer 2 starts computing
- unload layer 1, load layer 4 while 2 is still working
- layer 2 finishes, layer 3 starts
- unload layer 2, load layer 5
and so on. The number of preloaded layers could be configurable to balance free VRAM against speed: 0 would mean the current behavior, 1 would mean two layers are in VRAM at the same time, 2 would look like the situation above, etc. If I understand correctly, the next layer is currently only loaded after the previous one has finished processing, which adds small gaps that can accumulate into noticeable processing time. A rough sketch of the idea follows the list below. Some questions to consider:
- the layer order seems to be deterministic, but are these layers ordered at all? If not, it's hard to tell which layer to load before it's explicitly requested. However, we could run the model once and collect execution stats, which shouldn't change between runs, so they could be cached to disk and reused later (see the second sketch after this list).
- is it worth it? No idea. The transfer/dequant time is short, but it accumulates, especially when the number of layers is high (300+ in Hunyuan Video). It's also possible to load layers in batches instead of one by one, sacrificing some VRAM and paying a small GPU stall after each batch rather than after each layer. I don't know whether this is already implemented.
- is it possible to hook into the forward pass to preload/unload layers while the model is running?
- would running VRAM transfers in parallel with compute actually hurt performance? GPUs can behave weirdly, so I wouldn't be surprised if this optimization backfired.
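For illustration, here's a minimal sketch of the pipeline I have in mind, using PyTorch forward pre-hooks plus a dedicated CUDA stream so uploads overlap with compute. Everything below is hypothetical: the toy `layers`, the pinned `cpu_weights`, and the `prefetch`/eviction logic just stand in for whatever the GGUF loader actually does.

```python
import torch
import torch.nn as nn

# Toy stand-ins for GGUF-quantized blocks; the real loader would also dequantize.
layers = nn.ModuleList([nn.Linear(1024, 1024, bias=False) for _ in range(8)])
cpu_weights = [l.weight.data.pin_memory() for l in layers]  # pinned memory => truly async H2D

copy_stream = torch.cuda.Stream()  # transfers queue here; compute stays on the default stream
ready = {}                         # layer index -> event that fires once its weights arrived

def prefetch(i):
    if i >= len(layers) or i in ready:
        return                     # out of range, or upload already in flight
    with torch.cuda.stream(copy_stream):
        layers[i].weight.data = cpu_weights[i].to("cuda", non_blocking=True)
        ready[i] = torch.cuda.Event()
        ready[i].record(copy_stream)

def make_prehook(i, depth=2):      # depth = the configurable "preloaded layers" knob
    def prehook(module, args):
        for j in range(i, i + 1 + depth):                 # keep `depth` layers in flight
            prefetch(j)
        torch.cuda.current_stream().wait_event(ready[i])  # block only on layer i's upload
        if i - 1 in ready:                                # evict the previous layer; real code
            layers[i - 1].weight.data = cpu_weights[i - 1]  # must sync before reusing that VRAM
            del ready[i - 1]
    return prehook

for i, layer in enumerate(layers):
    layer.register_forward_pre_hook(make_prehook(i))

x = torch.randn(4, 1024, device="cuda")
for layer in layers:               # hooks fire in order, prefetching ahead of the compute
    x = layer(x)
```

The `depth` argument maps to the knob described above: 0 reproduces the current behavior, 1 keeps one extra layer in flight, and so on.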
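As for the ordering question, a sketch of the run-once-and-cache idea (again hypothetical; `nn.Linear` and the toy `model` stand in for whatever block type the extension actually wraps):

```python
import json
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))  # toy stand-in for the wrapped model
execution_order = []                                      # block names in actual run order

def recorder(name):
    def hook(module, args):
        execution_order.append(name)
    return hook

# Attach a pre-hook to every block we care about.
handles = [m.register_forward_pre_hook(recorder(n))
           for n, m in model.named_modules() if isinstance(m, nn.Linear)]

model(torch.randn(1, 8))  # one profiling pass populates execution_order

for h in handles:
    h.remove()
with open("layer_order.json", "w") as f:
    json.dump(execution_order, f)  # later runs load this to know what to prefetch next
```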
Some results:
resolution × length | s/it | Virtual VRAM, GB | Slowdown
---|---|---|---
512×320×85 | 2.5 | 0 | 0%
512×320×85 | 3.02 | 24 | 20%
720×480×85 | 6.62 | 0 | 0%
720×480×85 | 7.22 | 24 | 9%
That's on a 3090 Ti with WaveSpeed (TeaCache) enabled, so I suppose some steps were almost fully skipped after the first layer. The parameters were all the same, so the skipped steps should also match across runs. The difference seems to shrink for longer/higher-res videos, as block juggling takes a smaller share of the total time. Still, I'd absolutely love to set it and forget it, knowing that I wouldn't lose or gain anything by tweaking the VVRAM parameter 😄