Description
Hi, first of all thanks for sharing this project. The idea and results look very promising.
I wanted to report an issue after several days of trying to get the pipeline running from scratch on a clean Windows setup (Anaconda + CUDA working correctly).
Main issue
During inference I consistently get this error:
Expected input to have 8 channels, but got 12 channels instead
This happens when running the provided inference script with Stable Video Diffusion XT checkpoints.
From debugging, this seems to be caused by a structural mismatch between the model’s expected input and how frames/masks are concatenated in the pipeline.
It looks like:
• The model expects 8 channels (likely 4 latent + 4 guidance).
• But the script stacks more context frames (e.g. 3 frames → 12 channels).
• This causes the first convolution layer to fail.
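The mismatch is easy to reproduce in isolation. The sketch below is illustrative, not the repository's actual code: a first convolution expecting 8 input channels (4 latent + 4 guidance, per the error above) fails as soon as 3 stacked context frames of 4 channels each produce a 12-channel tensor.

```python
import torch
import torch.nn as nn

# Hypothetical input conv mirroring the reported shapes: the model
# expects 8 channels (likely 4 latent + 4 guidance).
conv_in = nn.Conv2d(in_channels=8, out_channels=320, kernel_size=3, padding=1)

# Stacking 3 context frames of 4 latent channels each yields 12 channels.
latents = [torch.randn(1, 4, 8, 8) for _ in range(3)]
x = torch.cat(latents, dim=1)  # shape: (1, 12, 8, 8)

try:
    conv_in(x)
except RuntimeError as e:
    # The message ends with "... to have 8 channels, but got 12 channels
    # instead" -- exactly the error reported above.
    print(e)
```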
So even with:
• Correct CUDA
• Correct checkpoints
• Correct folder structure
The pipeline still breaks at the data/architecture level; this is not an environment problem.
Secondary issue: mask naming & format
I also repeatedly hit:
Could not find a matching mask file
This seems to come from very strict and undocumented assumptions:
• Video file: video_name.mov
• Mask folder must be: mask/video_name/
• Mask frames must be named exactly in sequence.
Additionally, if masks are exported as RGBA (4 channels), this may also contribute to the channel mismatch.
None of this is clearly documented, and it makes the system extremely fragile for new users.
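As a workaround, I normalized every mask frame to single-channel grayscale before running the pipeline, so RGBA exports cannot add unexpected channels downstream. A minimal sketch, assuming the `mask/video_name/` layout described above (the folder path and `*.png` extension are assumptions, not documented behavior):

```python
from pathlib import Path
from PIL import Image

# Force every mask frame in the assumed mask folder to mode "L"
# (single-channel grayscale), overwriting the originals in place.
mask_dir = Path("mask/video_name")
for frame_path in sorted(mask_dir.glob("*.png")):
    mask = Image.open(frame_path)
    if mask.mode != "L":
        mask = mask.convert("L")  # drops alpha/RGB, keeps one channel
        mask.save(frame_path)
```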
⸻
Suggestion
It would really help to:
1. Document the expected number of context frames/channels explicitly.
2. Clarify the exact mask format (grayscale vs RGB vs RGBA).
3. Validate the channel count in the script before sending tensors to the model, and fail with a clear error message.
4. Provide a minimal working example dataset (one video + masks that are known to work).
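For the validation suggestion, even a small pre-flight shape check would have saved me days. A sketch of what I mean; the function name, the default of 8 channels, and the error wording are all illustrative, not the project's actual API:

```python
import torch

def validate_unet_input(x: torch.Tensor, expected_channels: int = 8) -> torch.Tensor:
    """Fail early with an actionable message instead of an opaque conv error."""
    if x.dim() != 4:
        raise ValueError(f"Expected a 4D (B, C, H, W) tensor, got shape {tuple(x.shape)}")
    if x.shape[1] != expected_channels:
        raise ValueError(
            f"Model expects {expected_channels} input channels "
            f"(e.g. 4 latent + 4 guidance) but received {x.shape[1]}. "
            "Check the number of stacked context frames and the mask format "
            "(RGBA masks contribute an extra channel)."
        )
    return x
```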
⸻
Why I’m reporting this
I spent multiple days trying to make this work, including:
• clean environments
• reinstalling everything
• checking drivers, CUDA, PyTorch versions
In the end the failure was not due to setup, but due to silent architectural assumptions in the pipeline.
I think many other users will hit the same wall.
The project is great, but at the moment it feels more like a research prototype than a tool that is usable without deep code inspection.
Hope this helps improve it. Thanks again for sharing your work.