(ෆ`꒳´ෆ) A Survey on Text-to-Image Generation/Synthesis.

dawei03896/Awesome-Text-to-Image

 
 

𝓐𝔀𝓮𝓼𝓸𝓶𝓮 𝓣𝓮𝔁𝓽📝-𝓽𝓸-𝓘𝓶𝓪𝓰𝓮🌇

𝓐 𝓬𝓸𝓵𝓵𝓮𝓬𝓽𝓲𝓸𝓷 𝓸𝓯 𝓻𝓮𝓼𝓸𝓾𝓻𝓬𝓮𝓼 𝓸𝓷 𝓽𝓮𝔁𝓽-𝓽𝓸-𝓲𝓶𝓪𝓰𝓮 𝓼𝔂𝓷𝓽𝓱𝓮𝓼𝓲𝓼/𝓶𝓪𝓷𝓲𝓹𝓾𝓵𝓪𝓽𝓲𝓸𝓷 𝓽𝓪𝓼𝓴𝓼.

⭐ Citation

If you find this paper and repo helpful for your research, please cite it as follows:

@inproceedings{zhou2023vision+,
  title={Vision+ Language Applications: A Survey},
  author={Zhou, Yutong and Shimada, Nobutaka},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={826--842},
  year={2023}
}

🎑 News

Tip

Version 1.0 (All-in-one version) can be found here and will no longer be updated from 24/02/29 onward.

  • [24/02/29] Update "Awesome Text to Image" Version 2.0! Paper With Code and Other Related Works will also be gradually updated in March.
  • [23/05/26] 🔥 Add our survey paper "Vision + Language Applications: A Survey" and a special Best Collection list!
  • [23/04/04] "Vision + Language Applications: A Survey" was accepted by CVPRW2023.
  • [20/10/13] Awesome-Text-to-Image repo was created.

To Do

Content

Description

  • In the last few decades, the fields of Computer Vision (CV) and Natural Language Processing (NLP) have seen several major technological breakthroughs in deep learning research. Recently, researchers have become interested in combining semantic and visual information across these traditionally independent fields. A number of studies have addressed text-to-image synthesis techniques that translate input textual descriptions (keywords or sentences) into realistic images.

  • Papers, codes, and datasets for the text-to-image task are available here.

🐌 Markdown Format:
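
As a rough sketch inferred from the entries listed below (not an official template), each item seems to follow the pattern shown here. The bold title, the nested link syntax, and the `example.com` URLs are assumptions used purely for illustration; the topic tag and the `[Code]` / `[Project]` links are optional:

```markdown
* (Venue Year) [💬 Topic] **Paper Title**, First Author et al. [[Paper](https://example.com/paper)] [[Code](https://example.com/code)] [[Project](https://example.com/project)]
```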

Paper With Code

  • Text to Face👨🏻🧒👧🏼🧓🏽
    • (arXiv preprint 2024) Fast Text-to-3D-Aware Face Generation and Manipulation via Direct Cross-modal Mapping and Geometric Regularization, Jinlu Zhang et al. [Paper] [Code]
    • (IJACSA 2023) Mukh-Oboyob: Stable Diffusion and BanglaBERT enhanced Bangla Text-to-Face Synthesis, Aloke Kumar Saha et al. [Paper] [Code]
    • (SIGGRAPH 2023) [💬 3D] DreamFace: Progressive Generation of Animatable 3D Faces under Text Guidance, Longwen Zhang et al. [Paper] [Project] [HuggingFace]
    • (CVPR 2023) [💬 3D] High-Fidelity 3D Face Generation from Natural Language Descriptions, Menghua Wu et al. [Paper] [Code] [Project]
    • (CVPR 2023) Collaborative Diffusion for Multi-Modal Face Generation and Editing, Ziqi Huang et al. [Paper] [Code] [Project]
    • (Pattern Recognition 2023) Where you edit is what you get: Text-guided image editing with region-based attention, Changming Xiao et al. [Paper] [Code]
    • (arXiv preprint 2022) Bridging CLIP and StyleGAN through Latent Alignment for Image Editing, Wanfeng Zheng et al. [Paper]
    • (ACMMM 2022) Learning Dynamic Prior Knowledge for Text-to-Face Pixel Synthesis, Jun Peng et al. [Paper]
    • (ACMMM 2022) Towards Open-Ended Text-to-Face Generation, Combination and Manipulation, Jun Peng et al. [Paper]
    • (BMVC 2022) clip2latent: Text driven sampling of a pre-trained StyleGAN using denoising diffusion and CLIP, Justin N. M. Pinkney et al. [Paper] [Code]
    • (arXiv preprint 2022) ManiCLIP: Multi-Attribute Face Manipulation from Text, Hao Wang et al. [Paper]
    • (arXiv preprint 2022) Generated Faces in the Wild: Quantitative Comparison of Stable Diffusion, Midjourney and DALL-E 2, Ali Borji. [Paper] [Code] [Data]
    • (arXiv preprint 2022) Text-Free Learning of a Natural Language Interface for Pretrained Face Generators, Xiaodan Du et al. [Paper] [Code]
    • (Knowledge-Based Systems-2022) CMAFGAN: A Cross-Modal Attention Fusion based Generative Adversarial Network for attribute word-to-face synthesis, Xiaodong Luo et al. [Paper]
    • (Neural Networks-2022) DualG-GAN, a Dual-channel Generator based Generative Adversarial Network for text-to-face synthesis, Xiaodong Luo et al. [Paper]
    • (arXiv preprint 2022) Text-to-Face Generation with StyleGAN2, D. M. A. Ayanthi et al. [Paper]
    • (CVPR 2022) StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis, Zhiheng Li et al. [Paper] [Code]
    • (arXiv preprint 2022) StyleT2F: Generating Human Faces from Textual Description Using StyleGAN2, Mohamed Shawky Sabae et al. [Paper] [Code]
    • (CVPR 2022) AnyFace: Free-style Text-to-Face Synthesis and Manipulation, Jianxin Sun et al. [Paper]
    • (IEEE Transactions on Network Science and Engineering-2022) TextFace: Text-to-Style Mapping based Face Generation and Manipulation, Xianxu Hou et al. [Paper]
    • (CVPR 2021) TediGAN: Text-Guided Diverse Image Generation and Manipulation, Weihao Xia et al. [Paper] [Extended Version] [Code] [Dataset] [Colab] [Video]
    • (FG 2021) Generative Adversarial Network for Text-to-Face Synthesis and Manipulation with Pretrained BERT Model, Yutong Zhou et al. [Paper]
    • (ACMMM 2021) Multi-caption Text-to-Face Synthesis: Dataset and Algorithm, Jianxin Sun et al. [Paper] [Code]
    • (ACMMM 2021) Generative Adversarial Network for Text-to-Face Synthesis and Manipulation, Yutong Zhou. [Paper]
    • (WACV 2021) Faces a la Carte: Text-to-Face Generation via Attribute Disentanglement, Tianren Wang et al. [Paper]
    • (arXiv preprint 2019) FTGAN: A Fully-trained Generative Adversarial Networks for Text to Face Generation, Xiang Chen et al. [Paper]

<🎯Back to Top>

  • Complex Issues🤔
    • (arXiv preprint 2024) [💬 Aesthetic] Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation, Daiqing Li et al. [Paper] [Project] [HuggingFace]
    • (EMNLP 2023) [💬 Text Visualness] Learning the Visualness of Text Using Large Vision-Language Models, Gaurav Verma et al. [Paper] [Project]
    • (arXiv preprint 2023) [💬 Against Malicious Adaptation] IMMA: Immunizing text-to-image Models against Malicious Adaptation, Yijia Zheng et al. [Paper] [Project]
    • (arXiv preprint 2023) [💬 Principled Recaptioning] A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation, Eyal Segalis et al. [Paper]
    • ⭐⭐(NeurIPS 2023) [💬 Holistic Evaluation] Holistic Evaluation of Text-To-Image Models, Tony Lee et al. [Paper] [Code] [Project]
    • (ICCV 2023) [💬 Safety] Rickrolling the Artist: Injecting Backdoors into Text Encoders for Text-to-Image Synthesis, Lukas Struppek et al. [Paper] [Code]
    • (arXiv preprint 2023) [💬 Natural Attack Capability] Intriguing Properties of Diffusion Models: A Large-Scale Dataset for Evaluating Natural Attack Capability in Text-to-Image Generative Models, Takami Sato et al. [Paper]
    • (ACL 2023) [💬 Bias] A Multi-dimensional study on Bias in Vision-Language models, Gabriele Ruggeri et al. [Paper]
    • (FAccT 2023) [💬 Demographic Stereotypes] Easily Accessible Text-to-Image Generation Amplifies Demographic Stereotypes at Large Scale, Federico Bianchi et al. [Paper]
    • (arXiv preprint 2023) [💬 Robustness] Evaluating the Robustness of Text-to-image Diffusion Models against Real-world Attacks, Hongcheng Gao et al. [Paper]
    • (CVPR 2023) [💬 Adversarial Robustness Analysis] RIATIG: Reliable and Imperceptible Adversarial Text-to-Image Generation With Natural Prompts, Han Liu et al. [Paper]
    • (arXiv preprint 2023) [💬 Textual Inversion] Is This Loss Informative? Speeding Up Textual Inversion with Deterministic Objective Evaluation, Anton Voronov et al. [Paper] [Code]
    • (arXiv preprint 2022) [💬 Interpretable Intervention] Not Just Pretty Pictures: Text-to-Image Generators Enable Interpretable Interventions for Robust Representations, Jianhao Yuan et al. [Paper]
    • (arXiv preprint 2022) [💬 Ethical Image Manipulation] Judge, Localize, and Edit: Ensuring Visual Commonsense Morality for Text-to-Image Generation, Seongbeom Park et al. [Paper]
    • (arXiv preprint 2022) [💬 Creativity Transfer] Inversion-Based Creativity Transfer with Diffusion Models, Yuxin Zhang et al. [Paper]
    • (arXiv preprint 2022) [💬 Ambiguity] Is the Elephant Flying? Resolving Ambiguities in Text-to-Image Generative Models, Ninareh Mehrabi et al. [Paper]
    • (arXiv preprint 2022) [💬 Racial Politics] A Sign That Spells: DALL-E 2, Invisual Images and The Racial Politics of Feature Space, Fabian Offert et al. [Paper]
    • (arXiv preprint 2022) [💬 Privacy Analysis] Membership Inference Attacks Against Text-to-image Generation Models, Yixin Wu et al. [Paper]
    • (arXiv preprint 2022) [💬 Authenticity Evaluation for Fake Images] DE-FAKE: Detection and Attribution of Fake Images Generated by Text-to-Image Diffusion Models, Zeyang Sha et al. [Paper]
    • (arXiv preprint 2022) [💬 Cultural Bias] The Biased Artist: Exploiting Cultural Biases via Homoglyphs in Text-Guided Image Generation Models, Lukas Struppek et al. [Paper]

<🎯Back to Top>

  • 2024
    • (arXiv preprint 2024) SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data, Jialu Li et al. [Paper] [Project] [Code]
    • (ICLR 2024) PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis, Junsong Chen et al. [Paper] [Project] [Code] [Hugging Face]
    • (arXiv preprint 2024) PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models, Junsong Chen et al. [Paper]
    • (CVPR 2024) Discriminative Probing and Tuning for Text-to-Image Generation, Leigang Qu et al. [Paper] [Project]
    • (CVPR 2024) RealCustom: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization, Mengqi Huang et al. [Paper] [Project]
    • ⭐(arXiv preprint 2024) SDXL-Lightning: Progressive Adversarial Diffusion Distillation, Shanchuan Lin et al. [Paper] [HuggingFace] [Demo]
    • ⭐(arXiv preprint 2024) RealCompo: Dynamic Equilibrium between Realism and Compositionality Improves Text-to-Image Diffusion Models, Xinchen Zhang et al. [Paper] [Code]
    • (arXiv preprint 2024) Learning Continuous 3D Words for Text-to-Image Generation, Ta-Ying Cheng et al. [Paper] [Project] [Code]
    • (arXiv preprint 2024) DiffusionGPT: LLM-Driven Text-to-Image Generation System, Jie Qin et al. [Paper] [Project] [Code]
    • (arXiv preprint 2024) DressCode: Autoregressively Sewing and Generating Garments from Text Guidance, Kai He et al. [Paper] [Project]

<🎯Back to Top>

  • 2023
    • (arXiv preprint 2023) ElasticDiffusion: Training-free Arbitrary Size Image Generation, Moayed Haji-Ali et al. [Paper] [Project] [Code] [Demo]
    • (ICCV 2023) BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion, Jinheng Xie et al. [Paper] [Code]
    • (arXiv preprint 2023) Late-Constraint Diffusion Guidance for Controllable Image Synthesis, Chang Liu et al. [Paper] [Code]
    • (arXiv preprint 2023) An Image is Worth Multiple Words: Multi-attribute Inversion for Constrained Text-to-Image Synthesis, Aishwarya Agarwal et al. [Paper]
    • ⭐(arXiv preprint 2023) UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs, Yanwu Xu et al. [Paper]
    • (ICCV 2023) ITI-GEN: Inclusive Text-to-Image Generation, Cheng Zhang et al. [Paper] [Code] [Project]
    • (arXiv preprint 2023) Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models, Zeqiang Lai et al. [Paper] [Code] [Demo] [Project]
    • (arXiv preprint 2023) [💬Evaluation] GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment, Dhruba Ghosh et al. [Paper] [Code]
    • ⭐(arXiv preprint 2023) Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion, Anton Razzhigaev et al. [Paper] [Code] [Demo] [Demo Video] [Hugging Face]
    • ⭐⭐(ICCV 2023) Adding Conditional Control to Text-to-Image Diffusion Models, Lvmin Zhang et al. [Paper] [Code]
    • (ICCV 2023) DiffCloth: Diffusion Based Garment Synthesis and Manipulation via Structural Cross-modal Semantic Alignment, Xujie Zhang et al. [Paper]
    • (ICCV 2023) Unsupervised Compositional Concepts Discovery with Text-to-Image Generative Models, Nan Liu et al. [Paper] [Code] [Project]
    • (arXiv preprint 2023) Text-to-Image Generation for Abstract Concepts, Jiayi Liao et al. [Paper]
    • (arXiv preprint 2023) T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation, Kaiyi Huang et al. [Paper] [Code] [Project]
    • (arXiv preprint 2023) [💬Human Preference Evaluation] Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis, Xiaoshi Wu et al. [Paper] [Code]
    • (arXiv preprint 2023) Towards Unified Text-based Person Retrieval: A Large-scale Multi-Attribute and Language Search Benchmark, Shuyu Yang et al. [Paper] [Code] [Project]
    • (arXiv preprint 2023) Synthesizing Artistic Cinemagraphs from Text, Aniruddha Mahapatra et al. [Paper] [Code] [Project]
    • (arXiv preprint 2023) Detector Guidance for Multi-Object Text-to-Image Generation, Luping Liu et al. [Paper]
    • (arXiv preprint 2023) A-STAR: Test-time Attention Segregation and Retention for Text-to-image Synthesis, Aishwarya Agarwal et al. [Paper]
    • (arXiv preprint 2023) [💬Evaluation] ConceptBed: Evaluating Concept Learning Abilities of Text-to-Image Diffusion Models, Maitreya Patel et al. [Paper] [Code] [Project]
    • ⭐(arXiv preprint 2023) StyleDrop: Text-to-Image Generation in Any Style, Kihyuk Sohn et al. [Paper] [Project]
    • ⭐⭐(arXiv preprint 2023) Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models, Xingqian Xu et al. [Paper] [Code] [Hugging Face]
    • ⭐⭐ (SIGGRAPH 2023) Blended Latent Diffusion, Omri Avrahami et al. [Paper] [Code] [Project]
    • (CVPR 2023) [💬Controllable] SpaText: Spatio-Textual Representation for Controllable Image Generation, Omri Avrahami et al. [Paper] [Project]
    • ⭐⭐ (arXiv 2023) The Chosen One: Consistent Characters in Text-to-Image Diffusion Models, Omri Avrahami et al. [Paper] [Code] [Project]
    • (arXiv preprint 2023) BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing, Dongxu Li et al. [Paper]
    • (arXiv preprint 2023) [💬Evaluation] LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation, Yujie Lu et al. [Paper] [Code]
    • (arXiv preprint 2023) P+ : Extended Textual Conditioning in Text-to-Image Generation, Andrey Voynov et al. [Paper] [Project]
    • (arXiv preprint 2023) Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models, Xuhui Jia et al. [Paper]
    • (ICML 2023) TR0N: Translator Networks for 0-Shot Plug-and-Play Conditional Generation, Zhaoyan Liu et al. [Paper] [Code] [Hugging Face]
    • (ICLR 2023) [💬3D] DreamFusion: Text-to-3D using 2D Diffusion, Ben Poole et al. [Paper (arXiv)] [Paper (OpenReview)] [Project] [Short Read]
    • (ICLR 2023) Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis, Weixi Feng et al. [Paper (arXiv)] [Paper (OpenReview)] [Code]
    • ⭐⭐(arXiv preprint 2023) Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation, Yuval Kirstain et al. [Paper] [Code] [Dataset] [Online Application] [PickScore]
    • (arXiv preprint 2023) TTIDA: Controllable Generative Data Augmentation via Text-to-Text and Text-to-Image Models, Yuwei Yin et al. [Paper]
    • (arXiv preprint 2023) [💬 Textual Inversion] Controllable Textual Inversion for Personalized Text-to-Image Generation, Jianan Yang et al. [Paper]
    • (arXiv preprint 2023) Diffusion Explainer: Visual Explanation for Text-to-image Stable Diffusion, Seongmin Lee et al. [Paper] [Project]
    • ⭐⭐(Findings of ACL 2023) [💬 Multi-language-to-Image] AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities, Zhongzhi Chen et al. [Paper] [Code-AltDiffusion] [Code-AltCLIP] [Hugging Face]
    • (arXiv preprint 2023) [💬 Seed selection] It is all about where you start: Text-to-image generation with seed selection, Dvir Samuel et al. [Paper]
    • (arXiv preprint 2023) [💬 Audio/Sound/Multi-language-to-Image] GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation, Can Qin et al. [Paper]
    • (arXiv preprint 2023) [💬Faithfulness Evaluation] TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering, Yushi Hu et al. [Paper] [Project] [Code]
    • (arXiv preprint 2023) InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning, Jing Shi et al. [Paper] [Project]
    • (TOMM 2023) LFR-GAN: Local Feature Refinement based Generative Adversarial Network for Text-to-Image Generation, Zijun Deng et al. [Paper] [Code]
    • ⭐⭐(arXiv preprint 2023) Expressive Text-to-Image Generation with Rich Text, Songwei Ge et al. [Paper] [Code] [Project] [Demo]
    • (arXiv preprint 2023) [💬Human Preferences] ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation, Jiazheng Xu et al. [Paper] [Code]
    • (arXiv preprint 2023) eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers, Yogesh Balaji et al. [Paper] [Project]
    • (CVPR 2023) GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis, Ming Tao et al. [Paper] [Code]
    • (CVPR 2023) [💬Human Evaluation] Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation, Mayu Otani et al. [Paper]
    • (arXiv preprint 2023) Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models, Lukas Höllein et al. [Paper] [Project] [Code] [Video]
    • (arXiv preprint 2023) Editing Implicit Assumptions in Text-to-Image Diffusion Models, Hadas Orgad et al. [Paper] [Project] [Code]
    • ⭐⭐(arXiv preprint 2023) Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models, Chenfei Wu et al. [Paper] [Code]
    • (arXiv preprint 2023) X&Fuse: Fusing Visual Information in Text-to-Image Generation, Yuval Kirstain et al. [Paper]
    • (CVPR 2023) [💬Stable Diffusion with Brain] High-resolution image reconstruction with latent diffusion models from human brain activity, Yu Takagi et al. [Paper] [Project] [Code]
    • ⭐⭐(arXiv preprint 2023) Universal Guidance for Diffusion Models, Arpit Bansal et al. [Paper] [Code]
    • ⭐(arXiv preprint 2023) Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Hila Chefer et al. [Paper] [Project] [Code]
    • (BMVC 2023) Divide & Bind Your Attention for Improved Generative Semantic Nursing, Yumeng Li et al. [Paper] [Project] [Code]
    • (IEEE Transactions on Multimedia) ALR-GAN: Adaptive Layout Refinement for Text-to-Image Synthesis, Hongchen Tan et al. [Paper]
    • ⭐(CVPR 2023) Multi-Concept Customization of Text-to-Image Diffusion, Nupur Kumari et al. [Paper] [Project] [Code] [Hugging Face]
    • (CVPR 2023) GLIGEN: Open-Set Grounded Text-to-Image Generation, Yuheng Li et al. [Paper] [Code] [Project] [Hugging Face Demo]
    • (arXiv preprint 2023) Attribute-Centric Compositional Text-to-Image Generation, Yuren Cong et al. [Paper] [Project]
    • (arXiv preprint 2023) Muse: Text-To-Image Generation via Masked Generative Transformers, Huiwen Chang et al. [Paper] [Project]

<🎯Back to Top>

6. Other Related Works

  • 📝Prompt Engineering📝
    • (arXiv preprint 2023) [💬Optimizing Prompts] NeuroPrompts: An Adaptive Framework to Optimize Prompts for Text-to-Image Generation, Shachar Rosenman et al. [Paper] [Video Demo]
    • (arXiv preprint 2022) [💬Optimizing Prompts] Optimizing Prompts for Text-to-Image Generation, Yaru Hao et al. [Paper] [Code] [Hugging Face]
    • (arXiv preprint 2022) [💬Aesthetic Image Generation] Best Prompts for Text-to-Image Models and How to Find Them, Nikita Pavlichenko et al. [Paper]
    • (arXiv preprint 2022) A Taxonomy of Prompt Modifiers for Text-To-Image Generation, Jonas Oppenlaender [Paper]
    • (CHI 2022) Design Guidelines for Prompt Engineering Text-to-Image Generative Models, Vivian Liu et al. [Paper]

<🎯Back to Top>

  • ⭐Multimodality⭐
    • (ICLR 2024) Cross-Modal Contextualized Diffusion Models for Text-Guided Visual Generation and Editing, Ling Yang et al. [Paper] [Code]
      • 📚 Text → Image, Text → Video
    • (arXiv preprint 2024) TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages, Minsu Kim et al. [Paper]
      • 📚 Image → Text, Image → Speech, Text → Image, Speech → Image, Speech → Text, Text → Speech
    • ⭐⭐(arXiv preprint 2023) Any-to-Any Generation via Composable Diffusion, Zineng Tang et al. [Paper] [Project] [Code]
      • 📚[Single-to-Single Generation] Text → Image, Audio → Image, Image → Video, Image → Audio, Audio → Text, Image → Text
      • 📚[Multi-Outputs Joint Generation] Text → Video + Audio, Text → Text + Audio + Image, Text + Image → Text + Image
      • 📚[Multiple Conditioning] Text + Audio → Image, Text + Image → Image, Text + Audio + Image → Image, Text + Audio → Video, Text + Image → Video, Video + Audio → Text, Image + Audio → Audio, Text + Image → Audio
    • ⭐⭐(CVPR 2023) ImageBind: One Embedding Space To Bind Them All, Rohit Girdhar et al. [Paper] [Project] [Code]
      • 📚Image-to-Audio retrieval, Audio-to-Image retrieval, Text-to-Image+Audio, Audio+Image-to-Image, Audio-to-Image generation, Zero-shot text to audio retrieval and classification...
    • ⭐(CVPR 2023) Scaling up GANs for Text-to-Image Synthesis, Minguk Kang et al. [Paper] [Project]
      • 📚Text-to-Image, Controllable image synthesis (Style Mixing, Prompt Interpolation, Prompt Mixing), Super Resolution (Text-conditioned, Unconditional)
    • (arXiv preprint 2023) TextIR: A Simple Framework for Text-based Editable Image Restoration, Yunpeng Bai et al. [Paper] [Code]
      • 📚Image Inpainting, Image Colorization, Image Super-resolution, Image Editing via Degradation
    • (arXiv preprint 2023) Modulating Pretrained Diffusion Models for Multimodal Image Synthesis, Cusuh Ham et al. [Paper]
      • 📚Sketch-to-Image, Segmentation-to-Image, Text+Sketch-to-Image, Text+Segmentation-to-Image, Text+Sketch+Segmentation-to-Image
    • (arXiv preprint 2023) Muse: Text-To-Image Generation via Masked Generative Transformers, Huiwen Chang et al. [Paper] [Project]
      • 📚Text-to-Image, Zero-shot+Mask-free editing, Zero-shot Inpainting/Outpainting
    • (arXiv preprint 2022) Versatile Diffusion: Text, Images and Variations All in One Diffusion Model, Xingqian Xu et al. [Paper] [Code] [Hugging Face]
      • 📚Text-to-Image, Image-Variation, Image-to-Text, Disentanglement, Text+Image-Guided Generation, Editable I2T2I
    • (arXiv preprint 2022) Frido: Feature Pyramid Diffusion for Complex Scene Image Synthesis, Wan-Cyuan Fan et al. [Paper] [Code]
      • 📚Text-to-Image, Scene Graph-to-Image, Layout-to-Image, Unconditional Image Generation
    • (arXiv preprint 2022) NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis, Chenfei Wu et al. [Paper] [Code] [Project]
      • 📚Unconditional Image Generation(HD), Text-to-Image(HD), Image Animation(HD), Image Outpainting(HD), Text-to-Video(HD)
    • (ECCV 2022) NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion, Chenfei Wu et al. [Paper] [Code]
      • Multimodal Pretrained Model for Multi-tasks🎄: Text-To-Image, Sketch-to-Image, Image Completion, Text-Guided Image Manipulation, Text-to-Video, Video Prediction, Sketch-to-Video, Text-Guided Video Manipulation
    • (ACMMM 2022) Rethinking Super-Resolution as Text-Guided Details Generation, Chenxi Ma et al. [Paper]
      • 📚Text-to-Image, High-resolution, Text-guided High-resolution
    • (arXiv preprint 2022) Discrete Contrastive Diffusion for Cross-Modal and Conditional Generation, Ye Zhu et al. [Paper] [Code]
      • 📚Text-to-Image, Dance-to-Music, Class-to-Image
    • (arXiv preprint 2022) M6-Fashion: High-Fidelity Multi-modal Image Generation and Editing, Zhikang Li et al. [Paper]
      • 📚Text-to-Image, Unconditional Image Generation, Local-editing, Text-guided Local-editing, In/Out-painting, Style-mixing
    • (CVPR 2022) Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning, Yogesh Balaji et al. [Paper] [Code] [Project]
      • 📚Text-to-Video, Independent Multimodal Controls, Dependent Multimodal Controls
    • ⭐⭐(CVPR 2022) High-Resolution Image Synthesis with Latent Diffusion Models, Robin Rombach et al. [Paper] [Code] [Stable Diffusion Code]
      • 📚Text-to-Image, Conditional Latent Diffusion, Super-Resolution, Inpainting
    • ⭐⭐(arXiv preprint 2022) Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework, Peng Wang et al. [Paper] [Code] [Hugging Face]
      • 📚Text-to-Image Generation, Image Captioning, Text Summarization, Self-Supervised Image Classification, [SOTA] Referring Expression Comprehension, Visual Entailment, Visual Question Answering
    • (arXiv preprint 2021) Multimodal Conditional Image Synthesis with Product-of-Experts GANs, Xun Huang et al. [Paper] [Project]
      • 📚Text-to-Image, Segmentation-to-Image, Text+Segmentation/Sketch/Image→Image, Sketch+Segmentation/Image→Image, Segmentation+Image→Image
    • (NeurIPS 2021) M6-UFC: Unifying Multi-Modal Controls for Conditional Image Synthesis via Non-Autoregressive Generative Transformers, Zhu Zhang et al. [Paper]
      • 📚Text-to-Image, Sketch-to-Image, Style Transfer, Image Inpainting, Multi-Modal Control to Image
    • (arXiv preprint 2021) ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation, Han Zhang et al. [Paper]
      • A pre-trained 10-billion parameter model: ERNIE-ViLG.
      • A large-scale dataset of 145 million high-quality Chinese image-text pairs.
      • 📚Text-to-Image, Image Captioning, Generative Visual Question Answering
    • (arXiv preprint 2021) L-Verse: Bidirectional Generation Between Image and Text, Taehoon Kim et al. [Paper] [Code]
      • 📚Text-To-Image, Image-To-Text, Image Reconstruction
    • (arXiv preprint 2021) [💬Semantic Diffusion Guidance] More Control for Free! Image Synthesis with Semantic Diffusion Guidance, Xihui Liu et al. [Paper] [Project]
      • 📚Text-To-Image, Image-To-Image, Text+Image → Image

<🎯Back to Top>

  • 🛫Applications🛫
    • (arXiv preprint 2024) [💬Multi-Concept Composition] Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition, Chun-Hsiao Yeh et al. [Paper] [Project] [Code]
    • (arXiv preprint 2023) [💬3D Hairstyle Generation] HAAR: Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles, Vanessa Sklyarova et al. [Paper] [Project]
    • (arXiv preprint 2023) [💬Image Super-Resolution] Image Super-Resolution with Text Prompt Diffusion, Zheng Chen et al. [Paper] [Code]
    • (2023) [💬Image Editing] Generative Fill. [Project]
    • (arXiv preprint 2023) [💬LLMs] LLM as an Art Director (LaDi): Using LLMs to improve Text-to-Media Generators, Allen Roush et al. [Paper]
    • (arXiv preprint 2023) [💬Segmentation] SegGen: Supercharging Segmentation Models with Text2Mask and Mask2Img Synthesis, Hanrong Ye et al. [Paper] [Project]
    • (arXiv preprint 2023) [💬Text Editing] DiffUTE: Universal Text Editing Diffusion Model, Haoxing Chen et al. [Paper]
    • (arXiv preprint 2023) [💬Text Character Generation] TextDiffuser: Diffusion Models as Text Painters, Jingye Chen et al. [Paper]
    • (CVPR 2023) [💬Open-Vocabulary Panoptic Segmentation] Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models, Jiarui Xu et al. [Paper] [Code] [Project] [HuggingFace]
    • (arXiv preprint 2023) [💬Chinese Text Character Generation] GlyphDraw: Learning to Draw Chinese Characters in Image Synthesis Models Coherently, Jian Ma et al. [Paper] [Project]
    • (arXiv preprint 2023) [💬Grounded Generation] Guiding Text-to-Image Diffusion Model Towards Grounded Generation, Ziyi Li et al. [Paper] [Code] [Project]
    • (arXiv preprint 2022) [💬Semantic segmentation] CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation, Yuqi Lin et al. [Paper] [Code]
    • (arXiv preprint 2022) [💬Unsupervised semantic segmentation] Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors, Ryan Burgert et al. [Paper]
    • (SIGGRAPH Asia 2022) [💬Text+Speech → Gesture] Rhythmic Gesticulator: Rhythm-Aware Co-Speech Gesture Synthesis with Hierarchical Neural Embeddings, Tenglong Ao et al. [Paper] [Code]
    • (arXiv preprint 2022) [💬Text+Image+Shape → Image] Shape-Guided Diffusion with Inside-Outside Attention, Dong Huk Park et al. [Paper] [Project]

<🎯Back to Top>

  • Text+Image/Video → Image/Video
    • (CVPR 2024) Instruct-Imagen: Image Generation with Multi-modal Instruction, Hexiang Hu et al. [Paper] [Project]
    • (arXiv preprint 2024) [💬NERF] InseRF: Text-Driven Generative Object Insertion in Neural 3D Scenes, Mohamad Shahbazi et al. [Paper] [Project]
    • (arXiv preprint 2023) ViCo: Plug-and-play Visual Condition for Personalized Text-to-image Generation, Shaozhe Hao et al. [Paper] [Code]
    • (arXiv preprint 2023) [💬Video Editing] MagicStick: Controllable Video Editing via Control Handle Transformations, Yue Ma et al. [Paper] [Project] [Code]
    • (arXiv preprint 2023) Lego: Learning to Disentangle and Invert Concepts Beyond Object Appearance in Text-to-Image Diffusion Models, Chen Henry Wu et al. [Paper]
    • (ACMMM 2023) [💬Style Transfer] ControlStyle: Text-Driven Stylized Image Generation Using Diffusion Priors, Jingwen Chen et al. [Paper]
    • (ICCV 2023) A Latent Space of Stochastic Diffusion Models for Zero-Shot Image Editing and Guidance, Chen Henry Wu et al. [Paper] [Arxiv] [Code]
    • (arXiv preprint 2023) [💬Multi-Subject Generation] VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning, Hong Chen et al. [Paper] [Project] [Code]
    • (arXiv preprint 2023) [💬Video Editing] CCEdit: Creative and Controllable Video Editing via Diffusion Models, Ruoyu Feng et al. [Paper] [Demo video]
    • ⭐⭐ (SIGGRAPH Asia 2023) Break-A-Scene: Extracting Multiple Concepts from a Single Image, Omri Avrahami et al. [Paper] [Project] [Code]
    • (arXiv preprint 2023) Visual Instruction Inversion: Image Editing via Visual Prompting, Thao Nguyen et al. [Paper] [Project]
    • (CVPR 2023) [💬3D Shape Editing] ShapeTalk: A Language Dataset and Framework for 3D Shape Edits and Deformations, Panos Achlioptas et al. [Paper] [Code] [Project]
    • (arXiv preprint 2023) [💬Colorization] DiffColor: Toward High Fidelity Text-Guided Image Colorization with Diffusion Models, Jianxin Lin et al. [Paper]
    • (ICCV 2023) [💬Video Editing] FateZero: Fusing Attentions for Zero-shot Text-based Video Editing, Chenyang Qi et al. [Paper] [Code] [Project] [Hugging Face]
    • (arXiv preprint 2023) [💬3D] AvatarVerse: High-quality & Stable 3D Avatar Creation from Text and Pose, Huichao Zhang et al. [Paper] [Project]
    • (ACM Transactions on Graphics 2023) CLIP-Guided StyleGAN Inversion for Text-Driven Real Image Editing, Ahmet Canberk Baykal et al. [Paper]
    • ⭐⭐(arXiv preprint 2023) AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning, Yuwei Guo et al. [Paper] [Project] [Code]
    • (ICLR 2023) DiffEdit: Diffusion-based semantic image editing with mask guidance, Guillaume Couairon et al. [Paper]
    • (arXiv preprint 2023) Controlling Text-to-Image Diffusion by Orthogonal Finetuning, Zeju Qiu et al. [Paper] [Project] [Code]
    • (arXiv preprint 2023) [💬Reject Human Instructions] Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation, Zhiwei Zhang et al. [Paper] [Project] [Code]
    • (arXiv preprint 2023) MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation, Marco Bellagente et al. [Paper]
    • (CVPR 2023) Text-Guided Unsupervised Latent Transformation for Multi-Attribute Image Manipulation, Xiwen Wei et al. [Paper]
    • (arXiv preprint 2023) Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models, Shihao Zhao et al. [Paper] [Project]
    • (arXiv preprint 2023) Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation, Yiyang Ma et al. [Paper]
    • (arXiv preprint 2023) DisenBooth: Disentangled Parameter-Efficient Tuning for Subject-Driven Text-to-Image Generation, Hong Chen et al. [Paper]
    • (arXiv preprint 2023) [💬Image Editing] Guided Image Synthesis via Initial Image Editing in Diffusion Model, Jiafeng Mao et al. [Paper]
    • (arXiv preprint 2023) [💬Image Editing] Prompt Tuning Inversion for Text-Driven Image Editing Using Diffusion Models, Wenkai Dong et al. [Paper]
    • (CVPR 2023) DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation, Nataniel Ruiz et al. [Paper] [Project]
    • (arXiv preprint 2023) Shape-Guided Diffusion with Inside-Outside Attention, Dong Huk Park et al. [Paper] [Code] [Project] [Hugging Face]
    • (arXiv preprint 2023) [💬Image Editing] iEdit: Localised Text-guided Image Editing with Weak Supervision, Rumeysa Bodur et al. [Paper]
    • (PR 2023) [💬Person Re-identification] BDNet: A BERT-based Dual-path Network for Text-to-Image Cross-modal Person Re-identification, Qiang Liu et al. [Paper]
    • (arXiv preprint 2023) MagicFusion: Boosting Text-to-Image Generation Performance by Fusing Diffusion Models, Jing Zhao et al. [Paper] [Code] [Project]
    • (CVPR 2023) [💬3D] TAPS3D: Text-Guided 3D Textured Shape Generation from Pseudo Supervision, Jiacheng Wei et al. [Paper]
    • ⭐⭐(arXiv preprint 2023) [💬Image Editing] MasaCtrl: Tuning-free Mutual Self-Attention Control for Consistent Image Synthesis and Editing, Mingdeng Cao et al. [Paper] [Code] [Project]
    • (arXiv preprint 2023) Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos, Yue Ma et al. [Paper] [Code] [Hugging Face]
    • ⭐⭐(arXiv preprint 2023) [💬Image Editing] Delta Denoising Score, Amir Hertz et al. [Paper]
    • (arXiv preprint 2023) Subject-driven Text-to-Image Generation via Apprenticeship Learning, Wenhu Chen et al. [Paper] [Project]
    • (arXiv preprint 2023) [💬Image Editing] Region-Aware Diffusion for Zero-shot Text-driven Image Editing, Nisha Huang et al. [Paper] [Code]
    • ⭐⭐(arXiv preprint 2023) [💬Text+Video → Video] Structure and Content-Guided Video Synthesis with Diffusion Models, Patrick Esser et al. [Paper] [Project]
    • (arXiv preprint 2023) ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation, Yuxiang Wei et al. [Paper]
    • (arXiv preprint 2023) [💬Fashion Image Editing] FICE: Text-Conditioned Fashion Image Editing With Guided GAN Inversion, Martin Pernuš et al. [Paper] [Code]
    • (AAAI 2023) CLIPVG: Text-Guided Image Manipulation Using Differentiable Vector Graphics, Yiren Song et al. [Paper]
    • (AAAI 2023) DE-Net: Dynamic Text-guided Image Editing Adversarial Networks, Ming Tao et al. [Paper] [Code]
    • (arXiv preprint 2022) Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation, Narek Tumanyan et al. [Paper] [Project]
    • (arXiv preprint 2022) [💬Text+Image → Video] Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation, Tsu-Jui Fu et al. [Paper]
    • (arXiv preprint 2022) [💬Image Stylization] DiffStyler: Controllable Dual Diffusion for Text-Driven Image Stylization, Nisha Huang et al. [Paper] [Code]
    • (arXiv preprint 2022) Null-text Inversion for Editing Real Images using Guided Diffusion Models, Ron Mokady et al. [Paper] [Project]
    • (arXiv preprint 2022) InstructPix2Pix: Learning to Follow Image Editing Instructions, Tim Brooks et al. [Paper] [Project]
    • (ECCV 2022) [💬Style Transfer] Language-Driven Artistic Style Transfer, Tsu-Jui Fu et al. [Paper] [Code]
    • (arXiv preprint 2022) Bridging CLIP and StyleGAN through Latent Alignment for Image Editing, Wanfeng Zheng et al. [Paper]
    • (NeurIPS 2022) One Model to Edit Them All: Free-Form Text-Driven Image Manipulation with Semantic Modulations, Yiming Zhu et al. [Paper] [Code]
    • (BMVC 2022) LDEdit: Towards Generalized Text Guided Image Manipulation via Latent Diffusion Models, Paramanand Chandramouli et al. [Paper]
    • (ACMMM 2022) [💬Iterative Language-based Image Manipulation] LS-GAN: Iterative Language-based Image Manipulation via Long and Short Term Consistency Reasoning, Gaoxiang Cong et al. [Paper]
    • (ACMMM 2022) [💬Digital Art Synthesis] Draw Your Art Dream: Diverse Digital Art Synthesis with Multimodal Guided Diffusion, Nisha Huang et al. [Paper] [Code]
    • (SIGGRAPH Asia 2022) [💬HDR Panorama Generation] Text2Light: Zero-Shot Text-Driven HDR Panorama Generation, Zhaoxi Chen et al. [Paper] [Project] [Code]
    • (arXiv preprint 2022) LANIT: Language-Driven Image-to-Image Translation for Unlabeled Data, Jihye Park et al. [Paper] [Project] [Code]
    • (ACMMM PIES-ME 2022) [💬3D Semantic Style Transfer] Language-guided Semantic Style Transfer of 3D Indoor Scenes, Bu Jin et al. [Paper] [Code]
    • (arXiv preprint 2022) [💬Face Animation] Language-Guided Face Animation by Recurrent StyleGAN-based Generator, Tiankai Hang et al. [Paper] [Code]
    • (arXiv preprint 2022) [💬Fashion Design] ARMANI: Part-level Garment-Text Alignment for Unified Cross-Modal Fashion Design, Xujie Zhang et al. [Paper] [Code]
    • (arXiv preprint 2022) [💬Image Colorization] TIC: Text-Guided Image Colorization, Subhankar Ghosh et al. [Paper]
    • (ECCV 2022) [💬Animating Human Meshes] CLIP-Actor: Text-Driven Recommendation and Stylization for Animating Human Meshes, Kim Youwang et al. [Paper] [Code]
    • (ECCV 2022) [💬Pose Synthesis] TIPS: Text-Induced Pose Synthesis, Prasun Roy et al. [Paper] [Code] [Project]
    • (ACMMM 2022) [💬Person Re-identification] Learning Granularity-Unified Representations for Text-to-Image Person Re-identification, Zhiyin Shao et al. [Paper] [Code]
    • (ACMMM 2022) Towards Counterfactual Image Manipulation via CLIP, Yingchen Yu et al. [Paper] [Code]
    • (ACMMM 2022) [💬Monocular Depth Estimation] Can Language Understand Depth?, Wangbo Zhao et al. [Paper] [Code]
    • (arXiv preprint 2022) [💬Image Style Transfer] Referring Image Matting, Tsu-Jui Fu et al. [Paper]
    • (CVPR 2022) [💬Image Segmentation] Image Segmentation Using Text and Image Prompts, Timo Lüddecke et al. [Paper] [Code]
    • (CVPR 2022) [💬Video Segmentation] Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation, Wangbo Zhao et al. [Paper] [Code]
    • (arXiv preprint 2022) [💬Image Matting] Referring Image Matting, Jizhizi Li et al. [Paper] [Dataset]
    • (arXiv preprint 2022) [💬Stylizing Video Objects] Text-Driven Stylization of Video Objects, Sebastian Loeschcke et al. [Paper] [Project]
    • (arXiv preprint 2022) DALL-E for Detection: Language-driven Context Image Synthesis for Object Detection, Yunhao Ge et al. [Paper]
    • (IEEE Transactions on Neural Networks and Learning Systems 2022) [💬Pose-Guided Person Generation] Verbal-Person Nets: Pose-Guided Multi-Granularity Language-to-Person Generation, Deyin Liu et al. [Paper]
    • (SIGGRAPH 2022) [💬3D Avatar Generation] AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars, Fangzhou Hong et al. [Paper] [Code] [Project]
    • ⭐⭐(arXiv preprint 2022) [💬Image & Video Editing] Text2LIVE: Text-Driven Layered Image and Video Editing, Omer Bar-Tal et al. [Paper] [Project]
    • (Machine Vision and Applications 2022) Paired-D++ GAN for image manipulation with text, Duc Minh Vo et al. [Paper]
    • (CVPR 2022) [💬Hairstyle Transfer] HairCLIP: Design Your Hair by Text and Reference Image, Tianyi Wei et al. [Paper] [Code]
    • (CVPR 2022) [💬NeRF] CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields, Can Wang et al. [Paper] [Code] [Project]
    • (CVPR 2022) DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation, Gwanghyun Kim et al. [Paper]
    • (CVPR 2022) ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation, Jianan Wang et al. [Paper] [Project]
    • ⭐⭐ (CVPR 2022) Blended Diffusion for Text-driven Editing of Natural Images, Omri Avrahami et al. [Paper] [Code] [Project]
    • (CVPR 2022) Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model, Zipeng Xu et al. [Paper] [Code]
    • (CVPR 2022) [💬Style Transfer] CLIPstyler: Image Style Transfer with a Single Text Condition, Gihyun Kwon et al. [Paper] [Code]
    • (arXiv preprint 2022) [💬Multi-person Image Generation] Pose Guided Multi-person Image Generation From Text, Soon Yau Cheong et al. [Paper]
    • (arXiv preprint 2022) [💬Image Style Transfer] StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Translation, Peter Schaldenbrand et al. [Paper] [Dataset] [Code] [Demo]
    • (arXiv preprint 2022) [💬Image Style Transfer] Name Your Style: An Arbitrary Artist-aware Image Style Transfer, Zhi-Song Liu et al. [Paper]
    • (arXiv preprint 2022) [💬3D Avatar Generation] Text and Image Guided 3D Avatar Generation and Manipulation, Zehranaz Canfes et al. [Paper] [Project]
    • (arXiv preprint 2022) [💬Image Inpainting] NÜWA-LIP: Language Guided Image Inpainting with Defect-free VQGAN, Minheng Ni et al. [Paper]
    • ⭐(arXiv preprint 2021) [💬Text+Image → Video] Make It Move: Controllable Image-to-Video Generation with Text Descriptions, Yaosi Hu et al. [Paper]
    • (arXiv preprint 2021) [💬NeRF] Zero-Shot Text-Guided Object Generation with Dream Fields, Ajay Jain et al. [Paper] [Project]
    • (NeurIPS 2021) Instance-Conditioned GAN, Arantxa Casanova et al. [Paper] [Code]
    • (ICCV 2021) Language-Guided Global Image Editing via Cross-Modal Cyclic Mechanism, Wentao Jiang et al. [Paper]
    • (ICCV 2021) Talk-to-Edit: Fine-Grained Facial Editing via Dialog, Yuming Jiang et al. [Paper] [Project] [Code]
    • (ICCVW 2021) CIGLI: Conditional Image Generation from Language & Image, Xiaopeng Lu et al. [Paper] [Code]
    • (ICCV 2021) StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery, Or Patashnik et al. [Paper] [Code]
    • (arXiv preprint 2021) Paint by Word, David Bau et al. [Paper]
    • ⭐(arXiv preprint 2021) Zero-Shot Text-to-Image Generation, Aditya Ramesh et al. [Paper] [Code] [Blog] [Model Card] [Colab]
    • (NeurIPS 2020) Lightweight Generative Adversarial Networks for Text-Guided Image Manipulation, Bowen Li et al. [Paper]
    • (CVPR 2020) ManiGAN: Text-Guided Image Manipulation, Bowen Li et al. [Paper] [Code]
    • (ACMMM 2020) Text-Guided Neural Image Inpainting, Lisai Zhang et al. [Paper] [Code]
    • (ACMMM 2020) Describe What to Change: A Text-guided Unsupervised Image-to-Image Translation Approach, Yahui Liu et al. [Paper]
    • (NeurIPS 2018) Text-adaptive generative adversarial networks: Manipulating images with natural language, Seonghyeon Nam et al. [Paper] [Code]

<🎯Back to Top>

  • Text+Layout → Image
    • (CVPR 2024) MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis, Dewei Zhou et al. [Paper] [Project] [Code]
    • (ICLR 2024) Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive, Yumeng Li et al. [Paper] [Project] [Code]
    • (ICCV 2023) Dense Text-to-Image Generation with Attention Modulation, Yunji Kim et al. [Paper] [Code]
    • (arXiv preprint 2023) Training-Free Layout Control with Cross-Attention Guidance, Minghao Chen et al. [Paper] [Code] [Project]

<🎯Back to Top>

  • Audio+Text+Image/Video → Image/Video
    • (arXiv preprint 2023) [💬Sound+Speech→Robotic Painting] Robot Synesthesia: A Sound and Emotion Guided AI Painter, Vihaan Misra et al. [Paper]
    • (arXiv preprint 2022) Robust Sound-Guided Image Manipulation, Seung Hyun Lee et al. [Paper]

<🎯Back to Top>

  • Layout/Mask → Image
    • (CVPR 2024) [💬Instance information+Text→Image] InstanceDiffusion: Instance-level Control for Image Generation, XuDong Wang et al. [Paper] [Project] [Code]
    • (arXiv preprint 2023) [💬Text→Layout→Image] LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation, Leigang Qu et al. [Paper]
    • (CVPR 2023) [💬Mask+Text→Image] SceneComposer: Any-Level Semantic Image Synthesis, Yu Zeng et al. [Paper] [Demo]
    • (CVPR 2023) Freestyle Layout-to-Image Synthesis, Han Xue et al. [Paper] [Code]
    • (CVPR 2023) LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation, Guangcong Zheng et al. [Paper] [Code]
    • (Journal of King Saud University - Computer and Information Sciences) [Survey] Image Generation Models from Scene Graphs and Layouts: A Comparative Analysis, Muhammad Umair Hassan et al. [Paper]
    • (CVPR 2022) Modeling Image Composition for Complex Scene Generation, Zuopeng Yang et al. [Paper] [Code]
    • (CVPR 2022) Interactive Image Synthesis with Panoptic Layout Generation, Bo Wang et al. [Paper]
    • (CVPR 2021 AI for Content Creation Workshop) High-Resolution Complex Scene Synthesis with Transformers, Manuel Jahn et al. [Paper]
    • (CVPR 2021) Context-Aware Layout to Image Generation with Enhanced Object Appearance, Sen He et al. [Paper] [Code]

<🎯Back to Top>

  • Label-set → Semantic maps
    • (ECCV 2020) Controllable image synthesis via SegVAE, Yen-Chi Cheng et al. [Paper] [Code]

<🎯Back to Top>

  • Speech → Image
    • (IEEE/ACM Transactions on Audio, Speech and Language Processing-2021) Generating Images From Spoken Descriptions, Xinsheng Wang et al. [Paper] [Code] [Project]
    • (INTERSPEECH 2020) [Extended Version👆] S2IGAN: Speech-to-Image Generation via Adversarial Learning, Xinsheng Wang et al. [Paper]
    • (IEEE Journal of Selected Topics in Signal Processing-2020) Direct Speech-to-Image Translation, Jiguo Li et al. [Paper] [Code] [Project]

<🎯Back to Top>

  • Scene Graph → Image
    • (arXiv preprint 2023) Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training, Ling Yang et al. [Paper]
    • (CVPR 2018) Image Generation from Scene Graphs, Justin Johnson et al. [Paper] [Code]

<🎯Back to Top>

  • Text → Visual Retrieval
    • (ECIR 2023) Scene-Centric vs. Object-Centric Image-Text Cross-Modal Retrieval: A Reproducibility Study, Mariya Hendriksen et al. [Paper] [Code]
    • (ECIR 2022) Extending CLIP for Category-to-image Retrieval in E-commerce, Mariya Hendriksen et al. [Paper] [Code]
    • (ACMMM 2022) CAIBC: Capturing All-round Information Beyond Color for Text-based Person Retrieval, Zijie Wang et al. [Paper]
    • (AAAI 2022) Cross-Modal Coherence for Text-to-Image Retrieval, Malihe Alikhani et al. [Paper]
    • (ECCV RWS 2022) [💬Person Retrieval] See Finer, See More: Implicit Modality Alignment for Text-based Person Retrieval, Xiujun Shu et al. [Paper] [Code]
    • (ECCV 2022) [💬Text+Sketch→Visual Retrieval] A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch, Patsorn Sangkloy et al. [Paper] [Project]
    • (Neurocomputing-2022) TIPCB: A simple but effective part-based convolutional baseline for text-based person search, Yuhao Chen et al. [Paper] [Code]
    • (arXiv preprint 2021) [💬Dataset] FooDI-ML: a large multi-language dataset of food, drinks and groceries images and descriptions, David Amat Olóndriz et al. [Paper] [Code]
    • (CVPRW 2021) TIED: A Cycle Consistent Encoder-Decoder Model for Text-to-Image Retrieval, Clint Sebastian et al. [Paper]
    • (CVPR 2021) T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval, Xiaohan Wang et al. [Paper]
    • (CVPR 2021) Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers, Antoine Miech et al. [Paper]
    • (IEEE Access 2019) Query is GAN: Scene Retrieval With Attentional Text-to-Image Generative Adversarial Network, Rintaro Yanagi et al. [Paper]

<🎯Back to Top>

  • Text → 3D/Motion/Shape/Mesh/Object...
    • (IEEE Transactions on Visualization and Computer Graphics) [💬Text → Motion] GUESS: GradUally Enriching SyntheSis for Text-Driven Human Motion Generation, Xuehao Gao et al. [Paper]
    • (arXiv preprint 2023) [💬Text → 4D] 4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling, Sherwin Bahmani et al. [Paper] [Project] [Code]
    • (arXiv preprint 2023) [💬Text → 3D] MetaDreamer: Efficient Text-to-3D Creation With Disentangling Geometry and Texture, Lincong Feng et al. [Paper] [Project]
    • (arXiv preprint 2023) [💬Text → 3D] One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion, Minghua Liu et al. [Paper] [Project]
    • (NeurIPS 2023) [💬Text → 3D] One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization, Minghua Liu et al. [Paper] [Project] [Code]
    • (ACMMM 2023) [💬Text+Sketch → 3D] Control3D: Towards Controllable Text-to-3D Generation, Yang Chen et al. [Paper]
    • (SIGGRAPH Asia 2023 & TOG) [💬Text → 3D] EXIM: A Hybrid Explicit-Implicit Representation for Text-Guided 3D Shape Generation, Zhengzhe Liu et al. [Paper] [Code]
    • (arXiv preprint 2023) [💬Text → 3D] PaintHuman: Towards High-fidelity Text-to-3D Human Texturing via Denoised Score Distillation, Jianhui Yu et al. [Paper]
    • (arXiv preprint 2023) [💬Text → Motion] Fg-T2M: Fine-Grained Text-Driven Human Motion Generation via Diffusion Model, Yin Wang et al. [Paper]
    • (arXiv preprint 2023) [💬Text → 3D] IT3D: Improved Text-to-3D Generation with Explicit View Synthesis, Yiwen Chen et al. [Paper] [Code]
    • (arXiv preprint 2023) [💬Text → 3D] HD-Fusion: Detailed Text-to-3D Generation Leveraging Multiple Noise Estimation, Jinbo Wu et al. [Paper]
    • (arXiv preprint 2023) [💬Text → 3D] T2TD: Text-3D Generation Model based on Prior Knowledge Guidance, Weizhi Nie et al. [Paper]
    • (arXiv preprint 2023) [💬Text → 3D] ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation, Zhengyi Wang et al. [Paper] [Project]
    • (arXiv preprint 2023) [💬Text+Mesh → Mesh] X-Mesh: Towards Fast and Accurate Text-driven 3D Stylization via Dynamic Textual Guidance, Yiwei Ma et al. [Paper] [Project] [Code]
    • (arXiv preprint 2023) [💬Text → Motion] T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations, Jianrong Zhang et al. [Paper] [Project] [Code] [Hugging Face]
    • (arXiv preprint 2023) [💬Text → 3D] DreamHuman: Animatable 3D Avatars from Text, Nikos Kolotouros et al. [Paper] [Project]
    • (arXiv preprint 2023) [💬Text → 3D] ATT3D: Amortized Text-to-3D Object Synthesis, Jonathan Lorraine et al. [Paper] [Project]
    • (arXiv preprint 2022) [💬Text → 3D] Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models, Jiale Xu et al. [Paper] [Project]
    • (arXiv preprint 2022) [💬3D Generative Model] DATID-3D: Diversity-Preserved Domain Adaptation Using Text-to-Image Diffusion for 3D Generative Model, Gwanghyun Kim et al. [Paper] [Code] [Project]
    • (arXiv preprint 2022) [💬Point Clouds] Point-E: A System for Generating 3D Point Clouds from Complex Prompts, Alex Nichol et al. [Paper] [Code]
    • (arXiv preprint 2022) [💬Text → 3D] Magic3D: High-Resolution Text-to-3D Content Creation, Chen-Hsuan Lin et al. [Paper] [Project]
    • (arXiv preprint 2022) [💬Text → Shape] Diffusion-SDF: Text-to-Shape via Voxelized Diffusion, Muheng Li et al. [Paper] [Code]
    • (NeurIPS 2022) [💬Mesh] TANGO: Text-driven Photorealistic and Robust 3D Stylization via Lighting Decomposition, Yongwei Chen et al. [Paper] [Project] [Code]
    • (arXiv preprint 2022) [💬Human Motion Generation] Human Motion Diffusion Model, Guy Tevet et al. [Paper] [Project] [Code]
    • (arXiv preprint 2022) [💬Human Motion Generation] MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model, Mingyuan Zhang et al. [Paper] [Project]
    • (arXiv preprint 2022) [💬3D Shape] ISS: Image as Stepping Stone for Text-Guided 3D Shape Generation, Zhengzhe Liu et al. [Paper]
    • (ECCV 2022) [💬Virtual Humans] Compositional Human-Scene Interaction Synthesis with Semantic Control, Kaifeng Zhao et al. [Paper] [Project] [Code]
    • (CVPR 2022) [💬3D Shape] Towards Implicit Text-Guided 3D Shape Generation, Zhengzhe Liu et al. [Paper] [Code]
    • (CVPR 2022) [💬Object] Zero-Shot Text-Guided Object Generation with Dream Fields, Ajay Jain et al. [Paper] [Project] [Code]
    • (CVPR 2022) [💬Mesh] Text2Mesh: Text-Driven Neural Stylization for Meshes, Oscar Michel et al. [Paper] [Project] [Code]
    • (CVPR 2022) [💬Motion] Generating Diverse and Natural 3D Human Motions from Text, Chuan Guo et al. [Paper] [Project] [Code]
    • (CVPR 2022) [💬Shape] CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation, Aditya Sanghi et al. [Paper] [Code]
    • (arXiv preprint 2022) [💬Motion] TEMOS: Generating diverse human motions from textual descriptions, Mathis Petrovich et al. [Paper] [Project] [Code]

<🎯Back to Top>

  • Text → Video
    • 💥💥(OpenAI 2024) Sora [Homepage] [Technical Report] [Sora with Audio]
    • (arXiv preprint 2023) [💬Music Visualization] Generative Disco: Text-to-Video Generation for Music Visualization, Vivian Liu et al. [Paper]
    • (arXiv preprint 2024) MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation, Weimin Wang et al. [Paper] [Project]
    • (arXiv preprint 2023) LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models, Yaohui Wang et al. [Paper] [Project] [Code]
    • (arXiv preprint 2023) Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning, Rohit Girdhar et al. [Paper] [Project]
    • (ICCV 2023) Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators, Levon Khachatryan et al. [Paper] [Project] [Video] [Code] [Hugging Face]
    • (NeurIPS 2023 Datasets and Benchmarks) FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation, Yuanxin Liu et al. [Paper] [Project]
    • (arXiv preprint 2023) Optimal Noise pursuit for Augmenting Text-to-Video Generation, Shijie Ma et al. [Paper]
    • (arXiv preprint 2023) Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation, Jiaxi Gu et al. [Paper] [Project]
    • (arXiv preprint 2023) Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts, Yuyang Zhao et al. [Paper] [Code] [Project]
      • 📚Image Editing, Background Editing, Text-to-Video Editing with Protagonist
    • ⭐⭐(CVPR 2023) Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models, Andreas Blattmann et al. [Paper] [Project]
    • (arXiv preprint 2023) Text-To-4D Dynamic Scene Generation, Uriel Singer et al. [Paper] [Project]
    • (arXiv preprint 2022) Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation, Jay Zhangjie Wu et al. [Paper] [Project] [Code]
    • (arXiv preprint 2022) MagicVideo: Efficient Video Generation With Latent Diffusion Models, Daquan Zhou et al. [Paper] [Project]
    • (arXiv preprint 2022) Phenaki: Variable Length Video Generation From Open Domain Textual Description, Ruben Villegas et al. [Paper]
    • (arXiv preprint 2022) Imagen Video: High Definition Video Generation with Diffusion Models, Jonathan Ho et al. [Paper] [Project]
    • (arXiv preprint 2022) Text-driven Video Prediction, Xue Song et al. [Paper]
    • (arXiv preprint 2022) Make-A-Video: Text-to-Video Generation without Text-Video Data, Uriel Singer et al. [Paper] [Project] [Short read] [Code]
    • (ECCV 2022) [💬Story Continuation] StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation, Adyasha Maharana et al. [Paper] [Code]
    • (arXiv preprint 2022) [💬Story → Video] Word-Level Fine-Grained Story Visualization, Bowen Li et al. [Paper] [Code]
    • (arXiv preprint 2022) CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers, Wenyi Hong et al. [Paper] [Code]
    • (CVPR 2022) Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning, Yogesh Balaji et al. [Paper] [Code] [Project]
    • (arXiv preprint 2022) Video Diffusion Models, Jonathan Ho et al. [Paper] [Project]
    • (arXiv preprint 2021) [❌Generation Task] Transcript to Video: Efficient Clip Sequencing from Texts, Ligong Han et al. [Paper] [Project]
    • (arXiv preprint 2021) GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions, Chenfei Wu et al. [Paper]
    • (arXiv preprint 2021) Text2Video: Text-driven Talking-head Video Synthesis with Phonetic Dictionary, Sibo Zhang et al. [Paper]
    • (IEEE Access 2020) TiVGAN: Text to Image to Video Generation With Step-by-Step Evolutionary Generator, Doyeon Kim et al. [Paper]
    • (IJCAI 2019) Conditional GAN with Discriminative Filter Generation for Text-to-Video Synthesis, Yogesh Balaji et al. [Paper] [Code]
    • (IJCAI 2019) IRC-GAN: Introspective Recurrent Convolutional GAN for Text-to-video Generation, Kangle Deng et al. [Paper]
    • (CVPR 2019) [💬Story → Video] StoryGAN: A Sequential Conditional GAN for Story Visualization, Yitong Li et al. [Paper] [Code]
    • (AAAI 2018) Video Generation From Text, Yitong Li et al. [Paper]
    • (ACMMM 2017) To create what you tell: Generating videos from captions, Yingwei Pan et al. [Paper]

<🎯Back to Top>

  • Text → Music
    • ⭐(arXiv preprint 2023) MusicLM: Generating Music From Text, Andrea Agostinelli et al. [Paper] [Project] [MusicCaps]

<🎯Back to Top>

Contact Me

If you have any questions or comments, please feel free to contact Yutong ლ(╹◡╹ლ)

Contributors

Made with contrib.rocks.
