
Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding (AAAI 2025)

Repo for the paper "Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding".


Installation

We recommend setting up a conda environment for the project:

git clone https://github.com/yunlong10/AVicuna.git
cd AVicuna

conda env create -f avicuna.yml
conda activate avicuna

Data & Checkpoints

Download the JSON metadata here and place the files in the ./data folder.

Download the fine-tuned model checkpoints here and place them in the ./checkpoints folder.

- data
    - stage1.json
    - stage2.json
    - stage3.json
    - stage4.json

- checkpoints
    - avicuna-vicuna-v1-5-7b-stage1
    - avicuna-vicuna-v1-5-7b-stage2
    - avicuna-vicuna-v1-5-7b-stage3
    - avicuna-vicuna-v1-5-7b-stage4
    - clip
        - ViT-L-14.pt
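Before running inference or training, it can help to confirm the downloads landed in the expected places. A minimal stdlib sketch (the helper name and the choice of checked paths are ours, not part of the repo):

```python
from pathlib import Path

# Expected layout from the README, relative to the repo root.
EXPECTED = [
    "data/stage1.json",
    "data/stage2.json",
    "data/stage3.json",
    "data/stage4.json",
    "checkpoints/avicuna-vicuna-v1-5-7b-stage4",
    "checkpoints/clip/ViT-L-14.pt",
]

def missing_assets(root="."):
    """Return the expected data/checkpoint paths that are absent under root."""
    root = Path(root)
    return [p for p in EXPECTED if not (root / p).exists()]

if __name__ == "__main__":
    for p in missing_assets():
        print(f"missing: {p}")
```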

Inference

python -m avicuna.inference

Features

The video and audio features can be extracted with ./avicuna/get_clip.py and ./avicuna/get_clap.py, respectively. Alternatively, you can download the pre-extracted features here.
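A common preprocessing step before encoding a video with a CLIP image encoder is to sample a fixed number of frames spread evenly over the clip. The helper below is an illustrative stdlib sketch of that step; it is not part of the AVicuna code, and its name and signature are our own:

```python
def uniform_frame_indices(total_frames: int, num_samples: int) -> list:
    """Pick num_samples frame indices spread evenly across a video.

    Splits the video into num_samples equal segments and takes the
    midpoint frame of each segment.
    """
    if num_samples >= total_frames:
        # Short clip: just use every frame.
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]
```

For a 100-frame clip sampled at 4 frames, this yields the segment midpoints 12, 37, 62, and 87.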

Training

We train our model on a single NVIDIA A6000 48GB GPU.

Stage I: Vision-Text Alignment

bash scripts/stage1.sh

Stage II: Audio-Text Alignment

bash scripts/stage2.sh

Stage III: Time-Event Alignment

bash scripts/stage3.sh

Stage IV: Instruction Tuning

bash scripts/stage4.sh
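The four stages above run sequentially, each building on the checkpoint produced by the previous one. A small shell sketch that runs them in order (the loop is ours; only the script names come from the README):

```shell
#!/usr/bin/env bash
# Run the four training stages in order: vision-text alignment,
# audio-text alignment, time-event alignment, instruction tuning.
for stage in 1 2 3 4; do
  script="scripts/stage${stage}.sh"
  if [ -f "$script" ]; then
    bash "$script" || break   # stop if a stage fails
  else
    echo "skipping: $script not found"
  fi
done
```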

Pseudo-Untrimmed Video Construction Pipeline

Coming soon ...

Acknowledgements

This work was supported by Sony Group Corporation. We would like to thank Sayaka Nakamura and Jerry Jun Yokono for insightful discussion.

We are also grateful to the following awesome projects that AVicuna builds upon:

  • LLaMA: Open and efficient foundation language models.
  • FastChat: An open platform for training, serving, and evaluating large language model based chatbots.
  • Video-ChatGPT: Towards detailed video understanding via large vision and language models.
  • Vid2seq: Large-scale pretraining of a visual language model for dense video captioning.
  • VTimeLLM: A Vid-LLM for fine-grained video moment understanding.
  • VALOR-32K: An audiovisual-language dataset.
  • UnAV-100: An untrimmed video dataset for dense audio-visual event localization.
  • Auto-ACD: A large-scale dataset for audio-language representation learning.
  • AudioSet: A large-scale dataset of manually annotated audio events.
  • AudioCap: Towards generating natural language description for any kind of audio in the wild.
  • InternVid: A large-scale video-text dataset.

Citation

@article{tang2024avicuna,
  title={Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding},
  author={Tang, Yunlong and Shimada, Daiki and Bi, Jing and Feng, Mingqian and Hua, Hang and Xu, Chenliang},
  journal={arXiv preprint arXiv:2403.16276},
  year={2024}
}