In this work, we propose 3DGraphLLM, a method for constructing a learnable representation of a 3D scene graph, which serves as input for LLMs to perform 3D vision-language tasks.
[2024.12] We release the 3DGraphLLM pre-training code for GT instance segmentation scene graphs
[2024.12] We release the 3DGraphLLM paper and code
🔥 Semantic relations boost LLM performance on 3D Referred Object Grounding and Dense Scene Captioning tasks
| Method | ScanRefer [email protected] | ScanRefer [email protected] | Multi3DRefer F1@0.25 | Multi3DRefer F1@0.5 | Scan2Cap [email protected] | Scan2Cap B-4@0.5 | ScanQA CIDEr | ScanQA B-4 | SQA3D EM |
|---|---|---|---|---|---|---|---|---|---|
| Chat-Scene | 55.5 | 50.2 | 57.1 | 52.3 | 77.1 | 36.3 | 87.7 | 14.3 | 54.6 |
| 3DGraphLLM Vicuna-1.5 | 57.0 | 51.3 | 60.1 | 55.4 | 81.2 | 36.3 | 87.6 | 12.1 | 53.1 |
| 3DGraphLLM LLAMA3-8B | 60.2 | 54.6 | 63.0 | 58.2 | 82.9 | 37.8 | 83.1 | 12.5 | 55.2 |
- Prepare the environment:

  ```
  conda create -n 3dgraphllm python=3.9.17
  conda activate 3dgraphllm
  conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 pytorch-cuda=11.8 -c pytorch -c nvidia
  pip install -r requirements.txt
  ```

- If you don't have root permissions to install Java (needed by the pycocoevalcap scripts that compute metrics such as BLEU and CIDEr), install it with conda:

  ```
  conda install -c conda-forge openjdk
  ```
- Download the LLM backbone:
  - We use LLAMA3-8B-Instruct in our experiments, which can be downloaded from Hugging Face.
  - Change `llama_model_path` in config.py to the path of LLAMA3-8B-Instruct (a sketch of this edit follows this list).
- Annotations and extracted features: please follow the instructions in preprocess.
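As a concrete illustration of the config edit above, a minimal sketch, assuming config.py stores the backbone location as a plain string variable (the placeholder path is hypothetical; only the `llama_model_path` name comes from the instructions):

```python
# config.py (sketch): point the LLM backbone at your local copy.
# Replace the placeholder with the directory where you downloaded
# LLAMA3-8B-Instruct from Hugging Face.
llama_model_path = "/path/to/Meta-Llama-3-8B-Instruct"
```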
Pre-training on GT instance segmentation scene graphs

- Modify run_gt_pretrain.sh:

  ```
  train_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
  val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
  evaluate=False
  ```
Explanation of "train_tag" and "val_tag"
-
Use
#
to seperate different datasets -
Datasets:
-
-
- Run:

  ```
  bash scripts/run_gt_pretrain.sh
  ```
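The scripts presumably split these tag strings on `#` to assemble the dataset mix; a minimal sketch of that convention (the helper name `parse_tag` is an assumption for illustration, not the repository's actual code):

```python
def parse_tag(tag: str) -> list[str]:
    """Split a '#'-separated tag string into individual dataset names."""
    return [name for name in tag.split("#") if name]

# Example: the pre-training mix from run_gt_pretrain.sh.
train_tag = "scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
print(parse_tag(train_tag))
# ['scanrefer', 'scan2cap', 'scanqa', 'sqa3d', 'multi3dref', 'nr3d_caption', 'obj_align']
```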
Training

- Modify run.sh:

  ```
  train_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
  val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
  evaluate=False
  pretrained_path="outputs/llama3-8b-gt-pretrain-2/ckpt_00_28927.pth"
  ```

- Run:

  ```
  bash scripts/run.sh
  ```
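To sanity-check the pre-training checkpoint before fine-tuning, a quick inspection sketch (this assumes a standard PyTorch `.pth` file, which the extension suggests; the key layout is not guaranteed):

```python
import torch

# Load on CPU so inspection does not require a GPU.
ckpt = torch.load("outputs/llama3-8b-gt-pretrain-2/ckpt_00_28927.pth",
                  map_location="cpu")

# A training checkpoint is usually a dict; list a few top-level keys.
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])
```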
Inference

- Modify run.sh:

  ```
  val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
  evaluate=True
  pretrained_path="/path/to/pretrained_model.pth"
  ```

- Run:

  ```
  bash scripts/run.sh
  ```
Demo

- Run:

  ```
  bash demo/run_demo.sh
  ```

  You will be prompted to ask different queries about Scene 435 of ScanNet.
If you have any questions about the project, please open an issue in this repository or send an email to Tatiana Zemskova.
If you find this work helpful, please consider citing our work as:
```
@misc{zemskova20243dgraphllm,
      title={3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding},
      author={Tatiana Zemskova and Dmitry Yudin},
      year={2024},
      eprint={2412.18450},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.18450},
}
```
Thanks to the following open-source projects: