
# 3DGraphLLM

arXiv | Hugging Face

In this work, we propose 3DGraphLLM, a method for constructing a learnable representation of a 3D scene graph, which serves as input for LLMs to perform 3D vision-language tasks.

## News

- [2024.12] We release 3DGraphLLM pre-training on GT instance segmentation scene graphs.
- [2024.12] We release the code for the 3DGraphLLM paper.

🔥 Semantic relations boost LLM performance on 3D Referred Object Grounding and Dense Scene Captioning tasks

| Model | ScanRefer Acc@0.25 | ScanRefer Acc@0.5 | Multi3dRefer F1@0.25 | Multi3dRefer F1@0.5 | Scan2Cap CIDEr@0.5 | Scan2Cap B-4@0.5 | ScanQA CIDEr | ScanQA B-4 | SQA3D EM |
|---|---|---|---|---|---|---|---|---|---|
| Chat-Scene | 55.5 | 50.2 | 57.1 | 52.3 | 77.1 | 36.3 | 87.7 | 14.3 | 54.6 |
| 3DGraphLLM Vicuna-1.5 | 57.0 | 51.3 | 60.1 | 55.4 | 81.2 | 36.3 | 87.6 | 12.1 | 53.1 |
| 3DGraphLLM LLAMA3-8B | 60.2 | 54.6 | 63.0 | 58.2 | 82.9 | 37.8 | 83.1 | 12.5 | 55.2 |

## 🔨 Preparation

- Prepare the environment:

  ```bash
  conda create -n 3dgraphllm python=3.9.17
  conda activate 3dgraphllm
  conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 pytorch-cuda=11.8 -c pytorch -c nvidia
  pip install -r requirements.txt
  ```

- If you don't have root permissions to install Java (needed by the pycocoeval scripts for metrics such as BLEU and CIDEr), install it with conda:

  ```bash
  conda install -c conda-forge openjdk
  ```

- Download the LLM backbone:

  - We use LLAMA3-8B-Instruct in our experiments, which can be downloaded from Hugging Face (see the sanity-check sketch after this list).
  - Change `llama_model_path` in `config.py` to the path of LLAMA3-8B-Instruct.

- Annotations and extracted features: please follow the instructions in preprocess.
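A minimal sanity-check sketch for the steps above, assuming a CUDA GPU and the `huggingface_hub` CLI are available; `meta-llama/Meta-Llama-3-8B-Instruct` is the gated Hugging Face repository for the backbone, and `./llama3-8b-instruct` is an arbitrary example directory:

```bash
# Verify that PyTorch sees the GPU.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

# Verify that Java is available for the captioning metrics (BLEU, CIDEr).
java -version

# Download the LLM backbone (gated model: accept the license on Hugging Face
# and authenticate with `huggingface-cli login` first). The target directory
# below is an example; any path works.
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct \
  --local-dir ./llama3-8b-instruct

# Finally, set llama_model_path in config.py to ./llama3-8b-instruct.
```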

## 🤖 Training and Inference

- Pre-training on GT instance segmentation scene graphs:

  - Modify `run_gt_pretrain.sh`:

    ```bash
    train_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
    val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
    evaluate=False
    ```

    Explanation of `train_tag` and `val_tag`:

    - Use `#` to separate different datasets.
    - Datasets:
      - `scanrefer`: ScanRefer dataset
      - `scan2cap`: Scan2Cap dataset
      - `scanqa`: ScanQA dataset
      - `sqa3d`: SQA3D dataset
      - `multi3dref`: Multi3dRefer dataset
      - `nr3d_caption`: a captioning dataset derived from Nr3D
      - `obj_align`: a dataset derived from ScanRefer to align the object identifiers with object tokens

  - Run: `bash scripts/run_gt_pretrain.sh`

- Training:

  - Modify `run.sh`:

    ```bash
    train_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
    val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
    evaluate=False
    pretrained_path="outputs/llama3-8b-gt-pretrain-2/ckpt_00_28927.pth"
    ```

  - Run: `bash scripts/run.sh`

- Inference:

  - Modify `run.sh`:

    ```bash
    val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
    evaluate=True
    pretrained_path="/path/to/pretrained_model.pth"
    ```

  - Run: `bash scripts/run.sh`

The three stages, chained end to end, are sketched after this list.
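A minimal end-to-end sketch, assuming the script edits described above have already been made; the stage-1 checkpoint name is the example from this README, and actual filenames depend on your run. The tags can also be restricted to a subset of datasets, e.g. `train_tag="scanrefer"`.

```bash
#!/usr/bin/env bash
# Illustrative wrapper for the full 3DGraphLLM pipeline (not part of the repo).
set -e

# Stage 1: pre-train on GT instance segmentation scene graphs
# (run_gt_pretrain.sh: evaluate=False).
bash scripts/run_gt_pretrain.sh

# Stage 2: fine-tune, with pretrained_path in run.sh pointing at the stage-1
# checkpoint, e.g. outputs/llama3-8b-gt-pretrain-2/ckpt_00_28927.pth
# (run.sh: evaluate=False).
bash scripts/run.sh

# Stage 3: inference only
# (run.sh: evaluate=True, pretrained_path="/path/to/pretrained_model.pth").
bash scripts/run.sh
```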

## 🚀 Demo

- Run `bash demo/run_demo.sh`. You will be prompted to ask different queries about Scene 435 of ScanNet; a few illustrative queries are sketched below.
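The queries below are purely hypothetical examples of what one might type; the actual prompt format is whatever `demo/run_demo.sh` presents.

```bash
# Launch the interactive demo on ScanNet Scene 435.
bash demo/run_demo.sh

# Illustrative queries to try at the prompt (hypothetical examples):
#   Describe the object with id 5.
#   How many chairs are in the room?
#   Where is the sofa relative to the armchair?
```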

## 📪 Contact

If you have any questions about the project, please open an issue in this repository or send an email to Tatiana Zemskova.

## 📑 Citation

If you find this work helpful, please consider citing our work as:

```bibtex
@misc{zemskova20243dgraphllm,
      title={3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding},
      author={Tatiana Zemskova and Dmitry Yudin},
      year={2024},
      eprint={2412.18450},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.18450},
}
```

## 😊 Acknowledgement

Thanks to the following open-source projects:

- Chat-Scene
