
ThinkSound

🌐 English | 简体中文 | 繁體中文 | Español | Français | 日本語

arXiv   Online Demo   Hugging Face   ModelScope

If you find this project useful,
a star ⭐ on GitHub would be greatly appreciated!


ThinkSound is a unified Any2Audio generation framework with flow matching guided by Chain-of-Thought (CoT) reasoning.

PyTorch implementation for multimodal audio generation and editing: generate or edit audio from video, text, and audio, powered by step-by-step reasoning from Multimodal Large Language Models (MLLMs).

Teaser

📰 News

  • 2025.07.15   📦 Simplified installation and usability: dependencies are now on PyPI for easy cross-platform setup, and Windows .bat scripts automate environment creation and script running.
  • 2025.07.08   🔧 Major update: the model has been made lighter, with optimized memory and GPU usage, and now supports high-throughput audio generation at scale!
  • 2025.07.01   🔥 Online demo on Hugging Face Spaces and ModelScope for an interactive experience!
  • 2025.07.01   🔥 Released inference scripts and web interface!
  • 2025.06   🔥 ThinkSound paper released on arXiv!
  • 2025.06   🔥 Online demo is live - try it now!

🚀 Features

  • Any2Audio: Generate audio from arbitrary modalities — video, text, audio, or their combinations.
  • Video-to-Audio SOTA: Achieves state-of-the-art results on multiple V2A benchmarks.
  • CoT-Driven Reasoning: Chain-of-Thought reasoning for compositional and controllable audio generation via MLLMs.
  • Interactive Object-centric Editing: Refine or edit specific sound events by clicking on visual objects or using text instructions.
  • Unified Framework: One foundation model supports generation, editing, and interactive workflows.

✨ Method Overview

ThinkSound decomposes audio generation and editing into three interactive stages, all guided by MLLM-based Chain-of-Thought (CoT) reasoning:

  1. Foley Generation: Generate foundational, semantically and temporally aligned soundscapes from video.
  2. Object-Centric Refinement: Refine or add sounds for user-specified objects via clicks or regions in the video.
  3. Targeted Audio Editing: Modify generated audio using high-level natural language instructions.

ThinkSound Overview


⚡ Quick Start

Environment Preparation:

git clone https://github.com/liuhuadai/ThinkSound.git
cd ThinkSound
conda create -n thinksound python=3.10
conda activate thinksound
pip install thinksound
conda install -y -c conda-forge 'ffmpeg<7'
# Download the pretrained weights from https://huggingface.co/liuhuadai/ThinkSound into the ckpts/ directory
# Model weights can also be downloaded from https://www.modelscope.cn/models/iic/ThinkSound
git lfs install
git clone https://huggingface.co/liuhuadai/ThinkSound ckpts
# To improve inference and training speed, you may optionally install a FlashAttention backend compatible with your system and PyTorch version.

Windows Tip:
Windows users can simply run setup_windows.bat (or double-click it) to automatically create the conda environment, install all dependencies (including FFmpeg), and download the pretrained model — no manual setup required.
Make sure conda and git are installed and available in your system PATH before running the script.
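
A quick sanity check after installation (a minimal sketch; it only assumes the steps above completed and that the ckpts/ clone finished):

# Verify PyTorch can see the GPU and FFmpeg is on PATH
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
ffmpeg -version | head -n 1
# Confirm the pretrained weights were actually downloaded (not just Git LFS pointer files)
du -sh ckpts/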

▶️ Run the Demo

Docker for WSL/Ubuntu

Use this to run ThinkSound from your own workspace that already contains a clone of this repo and the models.

Prerequisite

  1. Download all required models. Warning: they are large.
sudo apt install git-lfs

git clone https://huggingface.co/facebook/metaclip-h14-fullcc2.5b
git clone https://huggingface.co/google/t5-v1_1-xl
git clone https://huggingface.co/liuhuadai/ThinkSound ckpts
  2. Move all the models to the root of this repository.
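
After this step, the repository root should look roughly like the following (directory names assume the default git clone targets above; the exact contents of your checkout may differ):

ls
# app.py  ckpts/  metaclip-h14-fullcc2.5b/  scripts/  t5-v1_1-xl/  ...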

Pull ready docker image

  1. If your GPU supports CUDA 12.6.x, you can pull this image:
docker pull sasuketaichou/sajenakcube:thinksound

Note: If you use the prebuilt image, you can skip the Build local step and go straight to the Run docker step.
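
Before running the image, you can optionally confirm that Docker can see your GPU at all (this assumes the NVIDIA Container Toolkit is installed, which injects nvidia-smi into the container):

docker run --rm --gpus all ubuntu nvidia-smi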

Build local

Note: Please check which NVIDIA CUDA version your device supports, and change the FROM cuda-version-of-your-device line of the Dockerfile accordingly.

  1. Run this at the root of the repository:
docker build -t thinksound:latest .

Run docker

  1. Mount your local ThinkSound workspace, including the models downloaded above, into the container (this is done in start_docker.sh).

  2. To attach the ThinkSound folder via the script:

cd ..
ls  # make sure the ThinkSound folder is visible
  3. Run the script:

If you pulled the ready-made Docker image:

docker run --gpus all -it -v $(pwd)/ThinkSound:/app --rm -p 7860:7860 --net=host sasuketaichou/sajenakcube:thinksound

If you built locally:

docker run --gpus all -it -v $(pwd)/ThinkSound:/app --rm -p 7860:7860 --net=host thinksound:latest

Test it in your browser at localhost:7860.

Linux/macOS

chmod +x scripts/demo.sh
./scripts/demo.sh <path-to-your-demo-video> <title> <CoT description> [use-half]

Windows

You can use the provided .bat script instead:

.\scripts\demo.bat <path-to-your-demo-video> <title> <CoT description> [use-half]

Note:

  • <path-to-your-demo-video>: Path to a single video file.
  • [use-half] (optional): Add use-half at the end to enable half-precision feature extraction.
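
For example (the video path, title, and CoT description below are hypothetical placeholders):

./scripts/demo.sh demo/dog_park.mp4 "Dog barking in a park" "A medium-sized dog barks twice; distant birdsong and light wind follow." use-half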

📦 Batch Inference

Linux/macOS

chmod +x scripts/eval_batch.sh
./scripts/eval_batch.sh <video_path> <csv_path> <save_path (optional)> [use-half]

Windows

Use the equivalent .bat script:

.\scripts\eval_batch.bat <video_path> <csv_path> <save_path (optional)> [use-half]

Note:

  • <video_path>: Path to the root directory containing all .mp4 videos to be processed (all videos must be of equal duration).
  • <csv_path>: A CSV file with text prompts for each video (see demo_test.csv for format).
  • <save_path> (optional): Where to save generated audio. Defaults to results/features.
  • [use-half] (optional): Add use-half at the end to enable half-precision feature extraction.
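
For example (the video directory and save path below are hypothetical; demo_test.csv refers to the sample CSV in this repo):

./scripts/eval_batch.sh videos/ demo_test.csv results/batch_run use-half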

Web Interface Usage

For an interactive experience, launch the Gradio web interface:

python app.py
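
This starts a local Gradio app. If you need to change the bind address or port (for example to match the Docker setup above, which exposes 7860), Gradio's standard environment variables can be used, assuming app.py relies on the default launch settings:

GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 python app.py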

📝 TODO & Future Plans

    • Release training scripts for ThinkSound models (Expected before 07/20/2025)
    • Open-source AudioCoT dataset and automated pipeline (Expected before 07/23/2025)
    • Provide a ready-to-use environment image (Expected before 07/23/2025)
    • Release a more powerful foundation model covering multiple domains to provide more engaging and immersive foley creation (Expected by end of August 2025)
    • Add support for additional modalities and downstream tasks (Expected before end of July 2025)
    • Release models at different scales (Expected before end of July 2025)
    • A beginner-friendly Windows quick-start README

📄 License

This project is released under the Apache 2.0 License.

Note: The code, models, and dataset are for research and educational purposes only. Commercial use is NOT permitted. For commercial licensing, please contact the authors.

📦 Third-Party Components

  • Stable Audio Open VAE (by Stability AI): This repository includes a fine-tuned VAE from Stable Audio Open, licensed under the Stability AI Community License. Commercial use and redistribution require prior permission from Stability AI.

  • 📘 All other code and models are released under the Apache License 2.0.


Acknowledgements

Many thanks to:

  • stable-audio-tools (by Stability AI): For providing an easy-to-use framework for audio generation, as well as the VAE module and weights.
  • MMAudio: For the implementation of the MM-DiT backbone in the audio domain.

📖 Citation

If you find ThinkSound useful in your research or work, please cite our paper:

@misc{liu2025thinksoundchainofthoughtreasoningmultimodal,
      title={ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing}, 
      author={Huadai Liu and Jialei Wang and Kaicheng Luo and Wen Wang and Qian Chen and Zhou Zhao and Wei Xue},
      year={2025},
      eprint={2506.21448},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2506.21448}, 
}

📬 Contact

✨ Feel free to open an issue or contact us via email ([email protected]) if you have any questions or suggestions!
