Initial detailed phyt2v paper
hosiet committed Dec 24, 2024
1 parent 23a6e4d commit d5e0711
Showing 5 changed files with 61 additions and 1 deletion.
Binary file added assets/media/2024-phyt2v/phyt2v-fig1.png
Binary file added assets/media/2024-phyt2v/phyt2v-fig3.png
Binary file added assets/media/2024-phyt2v/phyt2v-fig6.png
2 changes: 1 addition & 1 deletion content/_index.md
@@ -85,8 +85,8 @@ sections:
### [PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation](/publication/2024-phyt2v/) {id=phyt2v}
ArXiv preprint
{{< columns >}}
<--->
![Our iteration of video and prompt self-refinement in PhyT2V](phyt2v.png)
<--->
Text-to-video (T2V) generative AI could revolutionize many current and emerging application and industry domains. However, the capabilities of today's T2V generative models are mostly data-dependent: while they perform well in domains covered by the training data, they usually fail to obey real-world common knowledge and physical rules when given out-of-distribution prompts. Expanding the model's capabilities, on the other hand, relies on large amounts of real-world data and is hence not scalable. Our recent work addresses this limitation of data dependency by fully unleashing the current T2V models' potential in scene generation given proper and detailed prompts. Our approach, namely PhyT2V, is a training-free technique that leverages an LLM's capabilities of chain-of-thought and step-back reasoning in the language domain to logically identify the deficiencies of generated videos and iteratively refine the current T2V models' video generation, correcting such deficiencies with more precise and well-articulated prompts. Check our preprint [here](https://arxiv.org/abs/2412.00596). We have also released a [Discord Bot](https://discord.com/channels/1312937020141732011/1314317637047812207) that allows you to try our work with SOTA T2V models.
{{< /columns >}}
{{< hr >}}
60 changes: 60 additions & 0 deletions content/publication/2024-phyt2v/index.md
@@ -62,3 +62,63 @@ image:
# Otherwise, set `slides: ""`.
slides:
---

## Overview

Text-to-video (T2V) generation with transformer-based diffusion models
can produce videos conditioned on textual prompts. These models demonstrate
astonishing capabilities in generating complex and photorealistic scenes,
but still have significant drawbacks in adhering to real-world common
knowledge and physical rules, such as quantity, material, fluid dynamics,
gravity, motion, collision and causality. Such limitations fundamentally
prevent current T2V models from being used for real-world simulation.

Most existing solutions to these challenges are data-driven, using large multimodal T2V
datasets that cover different real-world domains. However,
these solutions heavily rely on the volume, quality and diversity of the datasets.
Since real-world common knowledge and physical rules are not explicitly embedded in the T2V generation process,
as shown in the figure below, the quality of video generation drops sharply in out-of-distribution
domains that are not covered by the training dataset, and the generalizability of T2V models
is limited by the vast diversity of real-world scenario domains.

![Quality drop in out-of-distribution prompts](2024-phyt2v/phyt2v-fig3.png)

To achieve generalizable enforcement of physics-grounded T2V generation, we propose a fundamentally
different approach: instead of expanding the training dataset or further complicating the T2V model
architecture, we aim to expand the current T2V model’s capability of video generation from
in-distribution to out-of-distribution domains, by embedding real-world knowledge and physical rules
into the text prompts with sufficient and appropriate contexts. To avoid ambiguous and unexplainable
prompt engineering, our basic idea is to enable chain-of-thought (CoT) and step-back
reasoning in T2V generation prompting, to ensure that T2V models follow correct physical dynamics
and inter-frame consistency by applying step-by-step guidance and iterative refinement.
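
To make the idea concrete, below is a minimal sketch of how such a step-back refinement prompt could be assembled. The template wording and the helper name `build_stepback_prompt` are illustrative assumptions, not the exact prompts used in PhyT2V.

```python
# Hypothetical sketch of assembling a step-back refinement prompt.
# The wording and structure are illustrative, not the paper's exact prompts.

def build_stepback_prompt(user_prompt: str, physical_rules: str, mismatch: str) -> str:
    """Combine the original T2V prompt with the physical rules and the
    observed video/prompt mismatch into a single refinement request."""
    return (
        "You are refining a text-to-video prompt.\n"
        f"Original prompt: {user_prompt}\n"
        f"Physical rules the video must follow: {physical_rules}\n"
        f"Mismatch observed in the last generated video: {mismatch}\n"
        "Step back: state the general physical principles involved, "
        "then rewrite the prompt so the video obeys them, keeping the "
        "original scene and objects intact. Return only the revised prompt."
    )
```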

### Our Approach

As shown in the figure below, PhyT2V conducts reasoning iteratively, and each iteration autonomously
refines both the T2V prompt and the generated video in three steps.

![PhyT2V approach](2024-phyt2v/phyt2v.png)

In Step 1, the LLM analyzes the T2V prompt to extract the objects
to be shown and the physical rules to follow in the video via in-context learning. In Step 2, we first use
a video captioning model to translate the video’s semantic contents into text according to the list of
objects obtained in Step 1, and then use the LLM to evaluate the mismatch between the video caption
and the current T2V prompt via CoT reasoning. In Step 3, the LLM refines the current T2V prompt, by
incorporating the physical rules summarized in Step 1 and resolving the mismatch derived in Step 2,
through step-back prompting. The refined T2V prompt is then used by the T2V model again for video
generation, starting a new round of refinement. Such iterative refinement stops when the quality of the
generated video is satisfactory or the improvement in video quality converges. You may find
an example of our prompt design for the three steps in the figure below.

![PhyT2V design](2024-phyt2v/phyt2v-fig6.png)
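
For readers who prefer pseudocode over figures, the sketch below outlines the three-step refinement loop under some assumptions: `t2v_model`, `captioner`, `llm`, and `quality_score` are generic placeholder callables, and the prompt texts, function names, and stopping criterion are illustrative rather than PhyT2V's actual implementation.

```python
# Minimal sketch of the iterative self-refinement loop (illustrative only).
# `t2v_model`, `captioner`, `llm`, and `quality_score` are assumed generic
# callables; their names and signatures are placeholders, not the real API.

def phyt2v_refine(user_prompt, t2v_model, captioner, llm, quality_score,
                  max_rounds=4, threshold=0.9):
    prompt = user_prompt
    best_video, best_score = None, float("-inf")
    for _ in range(max_rounds):
        video = t2v_model.generate(prompt)

        # Step 1: in-context learning to list the objects to be shown and the
        # physical rules the video should obey, given the current prompt.
        objects = llm(f"List the main objects that must appear in: {prompt}")
        rules = llm(f"Summarize the physical rules governing: {prompt}")

        # Step 2: caption the video, then use CoT reasoning to find the
        # mismatch between the caption and the current prompt.
        caption = captioner(video, focus_objects=objects)
        mismatch = llm(
            "Reason step by step about how this caption deviates from the prompt.\n"
            f"Prompt: {prompt}\nCaption: {caption}"
        )

        # Step 3: step-back prompting to rewrite the prompt so it encodes the
        # rules from Step 1 and resolves the mismatch from Step 2.
        prompt = llm(
            "Step back and restate the underlying physics, then rewrite the prompt.\n"
            f"Prompt: {prompt}\nRules: {rules}\nMismatch: {mismatch}"
        )

        score = quality_score(video)
        if score > best_score:
            best_video, best_score = video, score
        if best_score >= threshold:
            break  # quality is satisfactory; stop refining
    return best_video
```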

## Result Showcase

The images below compare videos generated by a current
text-to-video generation model (CogVideoX-5B), which fails to adhere to the
real-world physical rules (described in brackets following each user prompt),
against videos generated by the same model with our method PhyT2V applied,
which better reflect real-world physical knowledge.

![PhyT2V improvements compared to SOTA T2V models](2024-phyt2v/phyt2v-fig1.png)
