Initial detailed phyt2v paper
hosiet committed Dec 24, 2024
1 parent 23a6e4d commit d5e0711
Showing 5 changed files with 61 additions and 1 deletion.
Binary file added assets/media/2024-phyt2v/phyt2v-fig1.png
Binary file added assets/media/2024-phyt2v/phyt2v-fig3.png
Binary file added assets/media/2024-phyt2v/phyt2v-fig6.png
2 changes: 1 addition & 1 deletion content/_index.md
@@ -85,8 +85,8 @@ sections:
### [PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation](/publication/2024-phyt2v/) {id=phyt2v}
ArXiv preprint
{{< columns >}}
<--->
![Our iteration of video and prompt self-refinement in PhyT2V](phyt2v.png)
<--->
Text-to-video (T2V) generative AI could revolutionize many current and emerging application and industry domains. However, the capabilities of today's T2V generative models are mostly data-dependent: while they perform well in domains covered by the training data, they usually fail to obey real-world common knowledge and physical rules when given out-of-distribution prompts. Expanding the model's capabilities, on the other hand, relies on large amounts of real-world data and is hence not scalable. Our recent work addresses this limitation of data dependency by fully unleashing the current T2V models' potential in scene generation given proper and detailed prompts. Our approach, namely PhyT2V, is a training-free technique that leverages an LLM's capabilities of chain-of-thought and step-back reasoning in the language domain to logically identify the deficiencies of generated videos and iteratively refine the current T2V models' video generation, correcting such deficiencies with more precise and well-articulated prompts. Check our preprint [here](https://arxiv.org/abs/2412.00596). We have also released a [Discord Bot](https://discord.com/channels/1312937020141732011/1314317637047812207) that allows you to try our work with SOTA T2V models.
{{< /columns >}}
{{< hr >}}
60 changes: 60 additions & 0 deletions content/publication/2024-phyt2v/index.md
@@ -62,3 +62,63 @@ image:
# Otherwise, set `slides: ""`.
slides:
---

## Overview

Text-to-video (T2V) generation with transformer-based diffusion models
can produce videos conditioned on textual prompts. These models demonstrate
astonishing capabilities in generating complex and photorealistic scenes,
but still have significant drawbacks in adhering to real-world common
knowledge and physical rules, such as quantity, material, fluid dynamics,
gravity, motion, collision and causality. Such limitations fundamentally
prevent current T2V models from being used for real-world simulation.

Most existing solutions to these challenges are data-driven, using large multimodal T2V
datasets that cover different real-world domains. However,
these solutions heavily rely on the volume, quality and diversity of the datasets.
Since real-world common knowledge and physical rules are not explicitly embedded in the T2V generation process,
as shown in the figure below, the quality of video generation drops sharply in out-of-distribution
domains that are not covered by the training dataset, and the generalizability of T2V models
is limited by the vast diversity of real-world scenario domains.

![Quality drop in out-of-distribution prompts](2024-phyt2v/phyt2v-fig3.png)

To achieve generalizable enforcement of physics-grounded T2V generation, we propose a fundamentally
different approach: instead of expanding the training dataset or further complicating the T2V model
architecture, we aim to expand the current T2V model’s capability of video generation from
in-distribution to out-of-distribution domains, by embedding real-world knowledge and physical rules
into the text prompts with sufficient and appropriate contexts. To avoid ambiguous and unexplainable
prompt engineering, our basic idea is to enable chain-of-thought (CoT) and step-back
reasoning in T2V generation prompting, to ensure that T2V models follow correct physical dynamics
and inter-frame consistency by applying step-by-step guidance and iterative refinement.
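
To make the idea concrete, below is a minimal sketch of how such a step-back refinement prompt could be assembled. The template wording and the helper name `build_stepback_prompt` are illustrative assumptions, not the exact prompts used in PhyT2V.

```python
# Hypothetical sketch of assembling a step-back refinement prompt.
# The wording and structure are illustrative, not the paper's exact prompts.

def build_stepback_prompt(user_prompt: str, physical_rules: str, mismatch: str) -> str:
    """Combine the original T2V prompt with the physical rules and the
    observed video/prompt mismatch into a single refinement request."""
    return (
        "You are refining a text-to-video prompt.\n"
        f"Original prompt: {user_prompt}\n"
        f"Physical rules the video must follow: {physical_rules}\n"
        f"Mismatch observed in the last generated video: {mismatch}\n"
        "Step back: state the general physical principles involved, "
        "then rewrite the prompt so the video obeys them, keeping the "
        "original scene and objects intact. Return only the revised prompt."
    )
```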

### Our Approach

As shown in the figure below, PhyT2V conducts reasoning iteratively, and each iteration autonomously
refines both the T2V prompt and the generated video in three steps.

![PhyT2V approach](2024-phyt2v/phyt2v.png)

In Step 1, the LLM analyzes the T2V prompt to extract the objects
to be shown and the physical rules to follow in the video via in-context learning. In Step 2, we first use
a video captioning model to translate the video’s semantic contents into text according to the list of
objects obtained in Step 1, and then use the LLM to evaluate the mismatch between the video caption
and the current T2V prompt via CoT reasoning. In Step 3, the LLM refines the current T2V prompt, by
incorporating the physical rules summarized in Step 1 and resolving the mismatch derived in Step 2,
through step-back prompting. The refined T2V prompt is then used by the T2V model again for video
generation, starting a new round of refinement. Such iterative refinement stops when the quality of the
generated video is satisfactory or the improvement in video quality converges. You may find
an example of our prompt design for the three steps in the figure below.

![PhyT2V design](2024-phyt2v/phyt2v-fig6.png)
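
For readers who prefer pseudocode over figures, the sketch below outlines the three-step refinement loop under some assumptions: `t2v_model`, `captioner`, `llm`, and `quality_score` are generic placeholder callables, and the prompt texts, function names, and stopping criterion are illustrative rather than PhyT2V's actual implementation.

```python
# Minimal sketch of the iterative self-refinement loop (illustrative only).
# `t2v_model`, `captioner`, `llm`, and `quality_score` are assumed generic
# callables; their names and signatures are placeholders, not the real API.

def phyt2v_refine(user_prompt, t2v_model, captioner, llm, quality_score,
                  max_rounds=4, threshold=0.9):
    prompt = user_prompt
    best_video, best_score = None, float("-inf")
    for _ in range(max_rounds):
        video = t2v_model.generate(prompt)

        # Step 1: in-context learning to list the objects to be shown and the
        # physical rules the video should obey, given the current prompt.
        objects = llm(f"List the main objects that must appear in: {prompt}")
        rules = llm(f"Summarize the physical rules governing: {prompt}")

        # Step 2: caption the video, then use CoT reasoning to find the
        # mismatch between the caption and the current prompt.
        caption = captioner(video, focus_objects=objects)
        mismatch = llm(
            "Reason step by step about how this caption deviates from the prompt.\n"
            f"Prompt: {prompt}\nCaption: {caption}"
        )

        # Step 3: step-back prompting to rewrite the prompt so it encodes the
        # rules from Step 1 and resolves the mismatch from Step 2.
        prompt = llm(
            "Step back and restate the underlying physics, then rewrite the prompt.\n"
            f"Prompt: {prompt}\nRules: {rules}\nMismatch: {mismatch}"
        )

        score = quality_score(video)
        if score > best_score:
            best_video, best_score = video, score
        if best_score >= threshold:
            break  # quality is satisfactory; stop refining
    return best_video
```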

## Result Showcase

The images below compare videos generated by a current
text-to-video generation model (CogVideoX-5B), which fails to adhere to the
real-world physical rules (described in brackets following each user prompt),
against videos generated by the same model with our method PhyT2V applied,
which better reflect real-world physical knowledge.

![PhyT2V improvements compared to SOTA T2V models](2024-phyt2v/phyt2v-fig1.png)
