Description
Hi, first of all thanks for sharing this project. The idea and results look very promising.
I wanted to report an issue after several days of trying to get the pipeline running from scratch on a clean Windows setup (Anaconda + CUDA working correctly).
Main issue
During inference I consistently get this error:
Expected input to have 8 channels, but got 12 channels instead
This happens when running the provided inference script with Stable Video Diffusion XT checkpoints.
From debugging, this seems to be caused by a structural mismatch between the model’s expected input and how frames/masks are concatenated in the pipeline.
It looks like:
• The model expects 8 channels (likely 4 latent + 4 guidance).
• But the script stacks more context frames (e.g. 3 frames → 12 channels).
• This causes the first convolution layer to fail.
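The mismatch is easy to reproduce in isolation. The sketch below is illustrative, not the repository's actual code: a first convolution expecting 8 input channels (4 latent + 4 guidance, per the error above) fails as soon as 3 stacked context frames of 4 channels each produce a 12-channel tensor.

```python
import torch
import torch.nn as nn

# Hypothetical input conv mirroring the reported shapes: the model
# expects 8 channels (likely 4 latent + 4 guidance).
conv_in = nn.Conv2d(in_channels=8, out_channels=320, kernel_size=3, padding=1)

# Stacking 3 context frames of 4 latent channels each yields 12 channels.
latents = [torch.randn(1, 4, 8, 8) for _ in range(3)]
x = torch.cat(latents, dim=1)  # shape: (1, 12, 8, 8)

try:
    conv_in(x)
except RuntimeError as e:
    # The message ends with "... to have 8 channels, but got 12 channels
    # instead" -- exactly the error reported above.
    print(e)
```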
So even with:
• Correct CUDA
• Correct checkpoints
• Correct folder structure
The pipeline still breaks at the data/architecture level; this is not an environment problem.
Secondary issue: mask naming & format
I also repeatedly hit:
Could not find a matching mask file
This seems to come from very strict and undocumented assumptions:
• Video file: video_name.mov
• Mask folder must be: mask/video_name/
• Mask frames must be named exactly in sequence.
Additionally, if masks are exported as RGBA (4 channels), this may also contribute to the channel mismatch.
None of this is clearly documented, and it makes the system extremely fragile for new users.
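As a workaround, I normalized every mask frame to single-channel grayscale before running the pipeline, so RGBA exports cannot add unexpected channels downstream. A minimal sketch, assuming the `mask/video_name/` layout described above (the folder path and `*.png` extension are assumptions, not documented behavior):

```python
from pathlib import Path
from PIL import Image

# Force every mask frame in the assumed mask folder to mode "L"
# (single-channel grayscale), overwriting the originals in place.
mask_dir = Path("mask/video_name")
for frame_path in sorted(mask_dir.glob("*.png")):
    mask = Image.open(frame_path)
    if mask.mode != "L":
        mask = mask.convert("L")  # drops alpha/RGB, keeps one channel
        mask.save(frame_path)
```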
⸻
Suggestion
It would really help to:
1. Document the expected number of context frames/channels explicitly.
2. Clarify the exact mask format (grayscale vs RGB vs RGBA).
3. Validate the channel count in the script before sending tensors to the model, and fail with a clear error message.
4. Provide a minimal working example dataset (one video + masks that are known to work).
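For the validation suggestion, even a small pre-flight shape check would have saved me days. A sketch of what I mean; the function name, the default of 8 channels, and the error wording are all illustrative, not the project's actual API:

```python
import torch

def validate_unet_input(x: torch.Tensor, expected_channels: int = 8) -> torch.Tensor:
    """Fail early with an actionable message instead of an opaque conv error."""
    if x.dim() != 4:
        raise ValueError(f"Expected a 4D (B, C, H, W) tensor, got shape {tuple(x.shape)}")
    if x.shape[1] != expected_channels:
        raise ValueError(
            f"Model expects {expected_channels} input channels "
            f"(e.g. 4 latent + 4 guidance) but received {x.shape[1]}. "
            "Check the number of stacked context frames and the mask format "
            "(RGBA masks contribute an extra channel)."
        )
    return x
```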
⸻
Why I’m reporting this
I spent multiple days trying to make this work, including:
• clean environments
• reinstalling everything
• checking drivers, CUDA, PyTorch versions
In the end the failure was not due to setup, but due to silent architectural assumptions in the pipeline.
I think many other users will hit the same wall.
The project is great, but at the moment it feels more like a research prototype than a tool that is usable without deep code inspection.
Hope this helps improve it. Thanks again for sharing your work.