Changes from all commits
Commits
745 commits
d4a6537
Update builder script
kcz358 May 6, 2024
7280ca4
Comment out print in conversation
kcz358 May 6, 2024
661f376
Refactor DPODataset and add multimodal support to llava_mixtral
Luodian May 7, 2024
521e873
Add auto load for non matching model and warning message
kcz358 May 8, 2024
a5fd660
Fix image processing and moderation error handling
Luodian May 11, 2024
307ded6
updates
May 11, 2024
67f6b6c
Squashed commit of the following:
May 11, 2024
e51a777
Remove HIP subproject
May 11, 2024
7bad66c
update
May 11, 2024
8a3b390
Refactor image search function and update script to submit jobs on Me…
May 12, 2024
6cd63f1
Refactor code to remove unused imports and commented out code
Luodian May 12, 2024
4188463
Merge branch 'bo/dev_dpo' of code.byted.org:ic-research/llava_next in…
Luodian May 12, 2024
695c59f
Merge branch 'bo/dev_dpo' of code.byted.org:ic-research/llava_next in…
May 12, 2024
fd9468a
Add new file to .gitignore
May 12, 2024
8918865
Update submit_jobs_sglang_on_merlin.sh script
May 12, 2024
5530c8a
Add proxy configuration
May 12, 2024
560801c
Refactor code to improve performance and readability
Luodian May 12, 2024
f90839e
Merge branch 'bo/dev_dpo' of code.byted.org:ic-research/llava_next in…
Luodian May 12, 2024
3ca2d35
Delete unused scripts and configuration files
Luodian May 13, 2024
c111cb8
🚫 Ignored scripts expanded to streamline development
Luodian May 13, 2024
c33f945
Include "llava*" and "trl*" packages in setuptools find
Luodian May 13, 2024
ad49de6
refactor(builder.py): refactor loading logic for LLaVA models to impr…
Luodian May 14, 2024
f6bb2f9
fix resampler bug in video
jzhang38 May 14, 2024
c2cbd62
Add support for EVA-CLIP-8B-plus vision tower and handle vision tower…
Luodian May 14, 2024
61700ca
Refactor vision tower loading in multimodal encoders
Luodian May 14, 2024
8d589e4
Fix config overwrite bug in train.py
Luodian May 16, 2024
d1bf138
Merge pull request #1 from EvolvingLMMs-Lab/py/dev
Luodian May 16, 2024
b3c12f7
add peiyuan ablation
jzhang38 May 16, 2024
758fce2
Update dependencies and training script
Luodian May 16, 2024
a1f4715
Fix formatting and whitespace issues
Luodian May 16, 2024
1054700
Refactor video processing using pyav library
Luodian May 16, 2024
21a33c8
Fix video processing bug and update dependencies
Luodian May 17, 2024
1a01f76
Fix device assignment in model creation
Luodian May 23, 2024
4da6fdb
Fix code formatting and update preprocessing logic
Luodian May 23, 2024
bf6bb7f
refactor(builder.py): comment out EvaClipVisionTower and EvaViTWrappe…
Luodian May 27, 2024
d4db676
refactor(train.py): update tokenizer logic for different models
Luodian May 29, 2024
314b882
formating updates
May 29, 2024
c1351ab
refactor(conversation.py): update sep_style to use CHATML instead of MPT
Luodian Jun 3, 2024
82a4a13
refactor(conversation.py): update roles in Conversation class
Luodian Jun 3, 2024
633d46a
Change 2 \n tokens into one \n\n token
Luodian Jun 3, 2024
35564e7
Change pad token from eos to 0 for llama3
Luodian Jun 3, 2024
8c66c1a
chore: Add llavavid to .gitignore and update conversation.py and llav…
Luodian Jun 3, 2024
6f00ecf
Merge branch 'dev/merge_multi_images' of code.byted.org:ic-research/l…
Luodian Jun 3, 2024
6025d53
update gitignore
Luodian Jun 4, 2024
d2bd103
Merge branch 'dev/merge_multi_images' of code.byted.org:ic-research/l…
Luodian Jun 4, 2024
5649d34
Change preprocess llama3 into form of chat template to process interl…
Luodian Jun 4, 2024
eb583d8
Add default id for data dict
Luodian Jun 5, 2024
6a70727
refactor(conversation.py): update separator style to use MPT instead …
Luodian Jun 5, 2024
1f62add
Merge branch 'dev/merge_multi_images' of code.byted.org:ic-research/l…
Luodian Jun 5, 2024
4edce66
Fix tokenizer pad token issue for llama3 model
Luodian Jun 5, 2024
15e4b47
refactor: Update tokenizer pad token for llama3 model
Luodian Jun 11, 2024
fcc4f62
add back qwen moe
Jun 12, 2024
9684f69
fix qwen-moe modality issues
Jun 12, 2024
a76a451
refactor: Update data_checker.py to process YAML files instead of JSON
Luodian Jun 13, 2024
0bd783b
Fix preprocess llama3 when handling non image data but also has <image>
Luodian Jun 14, 2024
f85a604
chore: Update deepspeed dependency to version 0.14.2
Luodian Jun 14, 2024
b67003f
updates
Jun 15, 2024
15100cd
Merge branch 'dev/merge_multi_images' of code.byted.org:ic-research/l…
Jun 15, 2024
72402b0
updates
Jun 15, 2024
d50e869
better data checker
Jun 15, 2024
efb2581
chore: Update gitignore and add build/ and playground/*.json
Luodian Jun 18, 2024
1d41406
update
Jun 19, 2024
797a9be
Merge branch 'dev/merge_multi_images' of code.byted.org:ic-research/l…
Jun 19, 2024
5922d39
Add jsonl and video support for data checker
Luodian Jun 20, 2024
13705c8
chore: Update parallel value for sglang inference script
Luodian Jun 20, 2024
8497588
Refactor multimodal encoder modules for improved performance and read…
Luodian Jun 21, 2024
b1edc37
Refractor Qwen preprocess code to multi images
Luodian Jun 25, 2024
fae02f2
Data checker for multi images
Luodian Jun 25, 2024
77ca06d
Refactor Qwen preprocess code to support multi images
Luodian Jun 26, 2024
74d6078
Fix only loading key frames in pyva
Luodian Jun 26, 2024
827a162
Add bilinear pooling and multi-images modality
Luodian Jun 26, 2024
62a8862
Fix training code for multi-images training
Luodian Jun 26, 2024
450f1f4
Merge branch 'dev/merge_multi_images' of code.byted.org:ic-research/l…
Luodian Jun 26, 2024
dbdf77b
Refactor video processing code to support multi images
Luodian Jun 26, 2024
2c0eafc
Refactor mm_spatial_pool_mode to use bilinear interpolation for multi…
Luodian Jun 27, 2024
8adb1be
Merge branch 'dev/merge_multi_images' of code.byted.org:ic-research/l…
Luodian Jun 27, 2024
ef038e8
Refactor video processing code to use decord for multi images
Luodian Jun 27, 2024
97f819b
Fix branching bugs
Luodian Jun 27, 2024
367a905
Fix pyav read video packet
Luodian Jun 27, 2024
0c2698f
Change decode backend to pyav
Luodian Jun 27, 2024
e65d417
Logs in llava_arch
Luodian Jun 27, 2024
d44d946
Fix decord load videos and add rank print
Luodian Jun 28, 2024
7f828d2
Comment out log message
Luodian Jun 28, 2024
ded2a8d
zero2 overlap comm to false
Luodian Jun 28, 2024
7b51a62
Decord more robust
Luodian Jun 29, 2024
2042d0b
Less change and modalities for multi-images
Luodian Jun 29, 2024
efa0249
Remove the changes in llava_arc
Luodian Jun 29, 2024
a207a36
Merge commit 'aea51bf96916ef7e595f72d63a972d7ffdbb3d56'
Luodian Jul 1, 2024
b022cc7
chore: update wandb dependency to latest version
Luodian Jul 1, 2024
bf4d560
refactor: update image processing logic in mm_utils.py
Luodian Jul 3, 2024
0f0be5d
feat: Load samples from data_path in train.py
Luodian Jul 3, 2024
8a80ffa
refactor: Handle exception when getting image grid shape in llava_arc…
Luodian Jul 3, 2024
8eb2dac
refactor: Update file loading logic in train.py and data_checker.py
Luodian Jul 4, 2024
3480d27
* chore(data_checker.py): remove print statement in main function
Luodian Jul 6, 2024
4dac286
🔧 fix(data_checker.py): refactor load_json_data method to handle json…
Luodian Jul 8, 2024
ec06dcf
refactor: Improve error handling in llava.model.__init__.py
Luodian Jul 8, 2024
24f8e25
refactor: Update file loading logic in train.py and data_checker.py
Luodian Jul 14, 2024
518b78a
No pooling for multi-images during training
Luodian Jul 14, 2024
65d5f74
refactor: Update file loading logic in train.py and data_checker.py
Luodian Jul 14, 2024
1e213d2
Merge branch 'dev/one_vision' of code.byted.org:ic-research/llava_nex…
Luodian Jul 14, 2024
76215ab
feat: Add support for early mixing of text in multimodal training
Luodian Jul 16, 2024
c971323
feat: Add support for early mixing of text in multimodal training
Luodian Jul 20, 2024
625fd44
refactor: Update file loading logic in train.py and data_checker.py
Luodian Jul 25, 2024
dbd1988
refactor: Update file loading logic in train.py and data_checker.py
Luodian Jul 25, 2024
923cc67
refactor: Remove deprecated image and video demo scripts
Luodian Jul 25, 2024
e1766b6
refactor: Update .gitignore and pyproject.toml
Luodian Jul 25, 2024
2d0488e
Merge branch 'main' of https://github.com/EvolvingLMMs-Lab/LLaVA-NeXT…
Luodian Jul 25, 2024
62d9fde
refactor: Add conv_qwen_2 to conversation.py
Luodian Jul 25, 2024
d39dcc3
refactor: Update image size in conversation.py
Luodian Jul 26, 2024
4cba6e6
refactor: Update image size in conversation.py
Luodian Jul 28, 2024
fa08b1b
refactor: Update image size and conversation logic for generating tex…
Luodian Jul 28, 2024
f74b381
refactor: Fix kernel crash issue in LLaVA_OneVision_Tutorials.ipynb
Luodian Jul 29, 2024
094b3aa
refactor: Update image size and conversation logic for generating tex…
Luodian Aug 1, 2024
edb72ab
refactor: Update image size and conversation logic for generating tex…
Luodian Aug 3, 2024
2448ab2
refactor: Update image size and conversation logic for generating tex…
Luodian Aug 3, 2024
163bce3
Update LLaVA_OneVision.md
kcz358 Aug 5, 2024
72372d8
Update LLaVA_OneVision.md
kcz358 Aug 5, 2024
95dde67
Update LLaVA_OneVision.md
kcz358 Aug 5, 2024
c42742a
PR and merge conflicts from private repo
Luodian Aug 5, 2024
53cde86
update training scripts
Luodian Aug 5, 2024
cc2d68d
update readme
Luodian Aug 5, 2024
8aef611
updates
Luodian Aug 5, 2024
fa62a40
updates
Luodian Aug 6, 2024
4c9a2e6
refactor: Update image size and conversation logic for generating tex…
Luodian Aug 6, 2024
522ec93
refactor: Update image size and conversation logic for generating tex…
Luodian Aug 6, 2024
0ce5621
refactor: Update image size and conversation logic for generating tex…
Luodian Aug 6, 2024
44f44b9
Update README.md
Luodian Aug 7, 2024
e9d557f
Update README.md
Luodian Aug 7, 2024
037c6ed
Update README.md
Luodian Aug 7, 2024
7b1567f
update video code
ZhangYuanhan-AI Aug 7, 2024
0d86a28
Merge branch 'main' of https://github.com/LLaVA-VL/LLaVA-NeXT
ZhangYuanhan-AI Aug 7, 2024
0f5fb82
Update README.md
ZhangYuanhan-AI Aug 7, 2024
ede6a5e
Update LLaVA-NeXT-Video_0716.md
ZhangYuanhan-AI Aug 7, 2024
12992e4
Update README.md
ZrrSkywalker Aug 7, 2024
d73d0c7
Fix prompt version for training script
kcz358 Aug 7, 2024
3d72057
Update README.md
Luodian Aug 7, 2024
cb4a282
Update README.md
Luodian Aug 7, 2024
7a109ba
update video code
ZhangYuanhan-AI Aug 8, 2024
dbbe72c
Update LLaVA-NeXT-Video.md
ZhangYuanhan-AI Aug 8, 2024
54a502e
update video code
ZhangYuanhan-AI Aug 8, 2024
a159d98
Update README.md
Luodian Aug 10, 2024
e32a60f
updates about one-vision data (with hidden details)
Luodian Aug 10, 2024
3c4de20
Update README.md
Luodian Aug 10, 2024
83d0c34
Update README.md
Luodian Aug 10, 2024
4fb1e74
Update README.md
Luodian Aug 10, 2024
90f6d5a
fix imports of missing and deprecated qwen-moe
Luodian Aug 12, 2024
d5bd4a7
Merge pull request #134 from LLaVA-VL/patch-fix-imports
Luodian Aug 12, 2024
67bc03e
Revert llava video logic
kcz358 Aug 14, 2024
7e9ddac
Fix tutorial error
kcz358 Aug 14, 2024
98b8377
Provide the correct video processing logic with decord
kcz358 Aug 15, 2024
5a88e5b
Merge pull request #152 from LLaVA-VL/fix/onevision_tut
Luodian Aug 15, 2024
12f19e5
Update README.md
ChunyuanLI Aug 17, 2024
3afc83f
Merge pull request #161 from LLaVA-VL/ChunyuanLI-patch-2
Luodian Aug 18, 2024
637cff8
Update README.md
ChunyuanLI Aug 18, 2024
e858b39
Merge pull request #163 from LLaVA-VL/ChunyuanLI-patch-3
Luodian Aug 18, 2024
a3a96cc
Update LLaVA OneVision model to lmms-lab/llava-onevision-qwen2-7b-ov
Luodian Aug 23, 2024
b97062f
Merge pull request #180 from LLaVA-VL/patch-add_doc
Luodian Aug 23, 2024
0c1cfbc
Update README.md
Luodian Aug 26, 2024
67a9b39
update video code
ZhangYuanhan-AI Aug 26, 2024
7f087fb
Merge branch 'main' into yhzhang/video_dev
ZhangYuanhan-AI Aug 26, 2024
fa93414
Add default mm_newline_position to one_token
kcz358 Aug 26, 2024
28691a7
update demo
zucchini-nlp Aug 30, 2024
4ee44a4
Add safe load tokenizer for llama_3
ngquangtrung57 Aug 30, 2024
20e6c66
Merge pull request #198 from ngquangtrung57/fix-llama
Luodian Aug 31, 2024
2301feb
Merge pull request #195 from zucchini-nlp/main
Luodian Aug 31, 2024
e81e007
chore: add single image and onevision stage data yaml files
Luodian Sep 1, 2024
c1302d9
chore: add dataset paths for LLaVA-Instruct training
Luodian Sep 1, 2024
eb6dc85
Refactor video loading function and add time instruction
ZhangYuanhan-AI Sep 2, 2024
411d80b
Merge pull request #183 from LLaVA-VL/yhzhang/video_dev
Luodian Sep 2, 2024
a6f2a2b
update video inference logic
ZhangYuanhan-AI Sep 3, 2024
e48692a
update
ZhangYuanhan-AI Sep 4, 2024
50e758e
Update README.md
Luodian Sep 8, 2024
494e385
Revert "Fix: videos in LLaVa-OV"
kcz358 Sep 12, 2024
e5304c1
Merge pull request #228 from LLaVA-VL/revert-195-main
Luodian Sep 12, 2024
3abf004
create LLaVA-OneVision_Chat doc
tyxiong23 Sep 13, 2024
047f5d8
update checkpoint/demo links
tyxiong23 Sep 13, 2024
47cddfb
change table display
Luodian Sep 13, 2024
beafcb6
try nowrap
Luodian Sep 13, 2024
dc98e35
try smaller text
Luodian Sep 13, 2024
ebf052c
smaller text
Luodian Sep 13, 2024
ae890f7
Update table display for better readability
Luodian Sep 13, 2024
43e0096
chore: Update table display for better readability
Luodian Sep 13, 2024
dbdc8fd
modify subtitiles
tyxiong23 Sep 13, 2024
231baad
better formatting
tyxiong23 Sep 13, 2024
3f9d882
Update LLaVA_OneVision_Chat.md
ChunyuanLI Sep 13, 2024
2becc23
modify examples
tyxiong23 Sep 13, 2024
e8ebb43
Update example tables
tyxiong23 Sep 13, 2024
79c4dc2
add personal links
tyxiong23 Sep 13, 2024
8a6224b
update result figure
tyxiong23 Sep 13, 2024
b634190
Update README.md
ChunyuanLI Sep 13, 2024
37c22d6
Update README.md
ChunyuanLI Sep 13, 2024
0792221
update links
tyxiong23 Sep 13, 2024
f0d3639
add citations
tyxiong23 Sep 13, 2024
94112aa
Update LLaVA_OneVision_Chat.md
ChunyuanLI Sep 13, 2024
9e86327
Update LLaVA_OneVision_Chat.md
ChunyuanLI Sep 13, 2024
8faa916
Update LLaVA_OneVision_Chat.md
tyxiong23 Sep 13, 2024
50c74ff
Update LLaVA_OneVision_Chat.md
tyxiong23 Sep 13, 2024
7f8d73a
update figures
tyxiong23 Sep 13, 2024
05fc4ec
Merge pull request #236 from LLaVA-VL/ov-chat-doc
ChunyuanLI Sep 13, 2024
84785e4
update LLaVA_OneVision_Chat.md
tyxiong23 Sep 14, 2024
51c67ac
Merge pull request #237 from LLaVA-VL/ov-chat-doc
Luodian Sep 14, 2024
ab265d7
add dpo script
tyxiong23 Sep 15, 2024
714b62b
add dpo training scripts
tyxiong23 Sep 15, 2024
92eacc8
add dpo training scripts
tyxiong23 Sep 15, 2024
4cd2e42
add dpo training scripts
tyxiong23 Sep 15, 2024
374bdde
update training script
tyxiong23 Sep 15, 2024
3220500
Merge pull request #241 from LLaVA-VL/ov-chat-doc
ChunyuanLI Sep 15, 2024
7f1261e
Update release date of llava-ov-chat in README.md
ChunyuanLI Sep 15, 2024
b95e03e
Merge pull request #205 from LLaVA-VL/yhzhang/video_dev
Luodian Sep 17, 2024
e27155e
Update README.md
Luodian Sep 19, 2024
66e3c0e
Fix typos
Sep 19, 2024
d7c3406
add stream inference code
ZhangYuanhan-AI Sep 22, 2024
80012ea
Update finetune_onevision.sh
Luodian Sep 25, 2024
2f2db26
Create finetune_si.sh
Luodian Sep 25, 2024
811422d
Update and rename finetune_onevision.sh to finetune_ov.sh
Luodian Sep 25, 2024
d89f36a
Rename finetune_clip.sh to direct_finetune_clip.sh
Luodian Sep 25, 2024
6512734
Rename finetune_siglip_a4.sh to direct_finetune_siglip_a4.sh
Luodian Sep 25, 2024
357b9a2
Update finetune_si.sh
Luodian Sep 25, 2024
b5854cc
Merge pull request #264 from LLaVA-VL/Luodian-patch-1
Luodian Sep 25, 2024
f2d49d2
Merge pull request #250 from litianjian/main
Luodian Sep 25, 2024
8da7e6c
update
ZhangYuanhan-AI Sep 26, 2024
5a30caa
update llave-video
ZhangYuanhan-AI Oct 3, 2024
6834d47
update llava-video
ZhangYuanhan-AI Oct 3, 2024
8ad7d9d
Merge branch 'yhzhang/llava_video_local' into yhzhang/llava_video_dev
ZhangYuanhan-AI Oct 3, 2024
afecac5
Merge branch 'main' into yhzhang/llava_video_dev
ZhangYuanhan-AI Oct 3, 2024
13ba6e1
update
ZhangYuanhan-AI Oct 4, 2024
02a706c
Merge pull request #278 from LLaVA-VL/yhzhang/llava_video_dev
ChunyuanLI Oct 4, 2024
87cfb33
Update LLaVA-Video paper link
ZhangYuanhan-AI Oct 4, 2024
125e7a0
Merge pull request #279 from LLaVA-VL/yhzhang/llava_video_dev
Luodian Oct 4, 2024
51f3f2c
Update LLaVA_OneVision_Chat.md
ChunyuanLI Oct 8, 2024
e819fcf
Update README.md
Luodian Oct 11, 2024
69a03d7
chore: update checkpoint for llava-onevision-qwen2-72b-ov and llava-o…
ZhangYuanhan-AI Oct 11, 2024
7993742
Merge pull request #301 from LLaVA-VL/yhzhang/llava_video_dev
Luodian Oct 12, 2024
94f8bd8
chore: Update training script for LLaVA-NeXT video models
ZhangYuanhan-AI Oct 12, 2024
126ff12
Merge pull request #304 from LLaVA-VL/yhzhang/llava_video_dev
Luodian Oct 12, 2024
76637ce
Add multi-modal to args
kcz358 Oct 16, 2024
391e16e
Merge pull request #307 from LLaVA-VL/fix_tut
Luodian Oct 16, 2024
cf084e2
Update README.md
Luodian Feb 10, 2025
8304f45
Update README.md
Luodian Feb 10, 2025
07ea3fe
Add MLCD Vision Tower!
anxiangsir Feb 12, 2025
47e28a8
Merge pull request #412 from anxiangsir/main
Luodian Feb 13, 2025
a9bcd53
Update mlcd_encoder.py
Luodian Feb 13, 2025
cf9b54a
Update exp.yaml
ZhangYuanhan-AI Feb 24, 2025
85899c6
Merge pull request #421 from LLaVA-VL/ZhangYuanhan-AI-patch-2
Luodian Feb 24, 2025
819c669
Update LLaVA_Video_1003.md
ZhangYuanhan-AI Feb 24, 2025
d1e6bb6
Merge pull request #422 from LLaVA-VL/ZhangYuanhan-AI-patch-3
Luodian Feb 24, 2025
249449f
Update conversation.py
wenhuchen May 6, 2025
cf01910
Merge pull request #451 from wenhuchen/patch-1
Luodian May 6, 2025
c62ee4c
Update README.md
Luodian May 24, 2025
9a18541
Update README.md
Luodian May 24, 2025
Empty file modified .dockerignore
100644 → 100755
Empty file.
Empty file modified .editorconfig
100644 → 100755
Empty file.
Empty file modified .gitattributes
100644 → 100755
Empty file.
43 changes: 39 additions & 4 deletions .gitignore
100644 → 100755
@@ -7,23 +7,27 @@ dist
# Log
*.log
*.log.*
*.json
*.jsonl
# *.json
# *.jsonl

# Data
!**/alpaca-data-conversation.json

# Editor
.idea
*.swp
.vscode

# Other
.DS_Store
wandb
output
llavavid

checkpoints
project_checkpoints
debug_checkpoints
playground/data
playground/cc3m_llava34b_cap
ckpts*

.ipynb_checkpoints
@@ -35,4 +39,35 @@ chunyl_scripts

# Demo
serve_images/
notebooks/
notebooks/
logs
scripts/dist_*
logs/
submissions/
cn_scripts/
internal_project_checkpoints/
work_dirs
scripts/i18n/*
playground/.nfs028b000000010add00000001
HIP
playground/.nfs028b0000017bff2c00000012
scripts/qwen
scripts/vicuna
scripts/mistral
scripts/baseline_rep
scripts/cn_boli01_hl
scripts/cn_boli01_lf
scripts/cn_lf
scripts/cn_lq
scripts/cn_yg
scripts/cn_yg_hao
scripts/eva_encoder
scripts/i18n
scripts/i18n_higher_res
scripts/multi-images
scratchpad
build/
playground/*.json
mlx_configs/
data_processing/
# demo/
Empty file modified LICENSE
100644 → 100755
Empty file.
498 changes: 154 additions & 344 deletions README.md
100644 → 100755

Large diffs are not rendered by default.

53 changes: 53 additions & 0 deletions docs/LLaVA-NeXT-Interleave.md
@@ -0,0 +1,53 @@

# LLaVA-NeXT: Tackling Multi-image, Video, and 3D in Large Multimodal Models

## Contents
- [Demo](#demo)
- [Evaluation](#evaluation)

## Demo

> Make sure you have installed the LLaVA-NeXT model files as described in the top-level README.md.

1. **Example model:** `lmms-lab/llava-next-interleave-7b`


To run a demo, execute:
```bash
# If you hit errors when running the demo, make sure the checkpoint path contains 'qwen',
# e.g., rename it: mv llava-next-interleave-7b llava-next-interleave-qwen-7b
python playground/demo/interleave_demo.py --model_path path/to/ckpt
```
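If you prefer to handle the rename programmatically, here is a minimal sketch (the local checkpoint directory and target name are assumptions for illustration):

```python
from pathlib import Path

# Hypothetical local checkpoint directory; adjust to where you downloaded the model.
ckpt = Path("checkpoints/llava-next-interleave-7b")

# The demo expects the checkpoint path to contain "qwen", so rename it if needed.
if ckpt.exists() and "qwen" not in ckpt.name.lower():
    ckpt = ckpt.rename(ckpt.with_name("llava-next-interleave-qwen-7b"))

print(f"Pass this as --model_path: {ckpt}")
```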

## Evaluation

### Preparation

Please download the evaluation data and its metadata from the following links:

1. **llava-interleave-bench:** [here](https://huggingface.co/datasets/lmms-lab/llava-interleave-bench).

Unzip `eval_images.zip`; it contains `Split1` and `Split2`. Then organize the downloaded data into the following structure:
```

interleave_data
├── Split1
│   ├── ...
│   └── ...
│
├── Split2
│   ├── ...
│   └── ...
├── multi_image_in_domain.json
├── multi_image_out_domain.json
└── multi_view_in_domain.json
```
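Before running evaluation, a quick sanity check like the following sketch can confirm the layout (the `interleave_data` location is an assumption; point it at the directory you will pass as /path/to/images):

```python
from pathlib import Path

# Hypothetical location of the unzipped evaluation data.
root = Path("interleave_data")

expected = [
    "Split1",
    "Split2",
    "multi_image_in_domain.json",
    "multi_image_out_domain.json",
    "multi_view_in_domain.json",
]
missing = [name for name in expected if not (root / name).exists()]
print("missing entries:", missing if missing else "none")
```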

### Inference and Evaluation
Example:
First edit `scripts/interleave/eval_all.sh`, replacing /path/to/ckpt with your checkpoint path and /path/to/images with the path to the `interleave_data` directory, then run:
```bash
bash scripts/interleave/eval_all.sh
```

81 changes: 81 additions & 0 deletions docs/LLaVA-NeXT-Video.md
@@ -0,0 +1,81 @@

# LLaVA-NeXT: A Strong Zero-shot Video Understanding Model

## Contents
- [Demo](#demo)
- [Evaluation](#evaluation)

## Demo

> Make sure you have installed the LLaVA-NeXT model files as described in the top-level README.md.

1. **Example model:** `lmms-lab/LLaVA-NeXT-Video-7B-DPO`

2. **Prompt mode:** `vicuna_v1` (use `mistral_direct` for `lmms-lab/LLaVA-NeXT-Video-34B-DPO`)

3. **Sampled frames:** `32` (Defines how many frames to sample from the video.)

4. **Spatial pooling stride:** `2` (With the original 24x24 tokens per frame, a stride of 2 reduces each frame to 12x12 tokens; see the pooling sketch after this list.)

5. **Spatial pooling mode:** `average` (Options: `average`, `max`.)

6. **Local video path:** `./data/llava_video/video-chatgpt/evaluation/Test_Videos/v_Lf_7RurLgp0.mp4`
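The pooling stride's effect on the per-frame token grid can be illustrated with a minimal sketch (tensor shapes and names here are hypothetical; the actual pooling is implemented inside the model code):

```python
import torch
import torch.nn.functional as F

# Hypothetical per-frame features: 32 frames, hidden size 1024, a 24x24 token grid per frame.
frame_features = torch.randn(32, 1024, 24, 24)

# "average" pooling with stride 2 halves each spatial side: 24x24 -> 12x12 tokens per frame.
pooled = F.avg_pool2d(frame_features, kernel_size=2, stride=2)
print(pooled.shape)  # torch.Size([32, 1024, 12, 12])

# The "max" mode would use F.max_pool2d with the same kernel size and stride.
```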

To run a demo, execute:
```bash
bash scripts/video/demo/video_demo.sh ${Example model} ${Prompt mode} ${Sampled frames} ${Spatial pooling stride} ${Spatial pooling mode} grid True ${Video path at local}
```
Example:
```bash
bash scripts/video/demo/video_demo.sh lmms-lab/LLaVA-NeXT-Video-7B-DPO vicuna_v1 32 2 average no_token True playground/demo/xU25MMA2N4aVtYay.mp4
```

**IMPORTANT** Please refer to [Latest video model](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/inference/docs/LLaVA-NeXT-Video_0716.md) for instructions on running the latest model.

## Evaluation

### Preparation

Please download the evaluation data and its metadata from the following links:

1. **video-chatgpt:** [here](https://github.com/mbzuai-oryx/Video-ChatGPT/blob/main/quantitative_evaluation/README.md#video-based-generative-performance-benchmarking).
2. **video_detail_description:** [here](https://mbzuaiac-my.sharepoint.com/personal/hanoona_bangalath_mbzuai_ac_ae/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fhanoona%5Fbangalath%5Fmbzuai%5Fac%5Fae%2FDocuments%2FVideo%2DChatGPT%2FData%5FCode%5FModel%5FRelease%2FQuantitative%5FEvaluation%2Fbenchamarking%2FTest%5FHuman%5FAnnotated%5FCaptions%2Ezip&parent=%2Fpersonal%2Fhanoona%5Fbangalath%5Fmbzuai%5Fac%5Fae%2FDocuments%2FVideo%2DChatGPT%2FData%5FCode%5FModel%5FRelease%2FQuantitative%5FEvaluation%2Fbenchamarking&ga=1).
3. **activity_qa:** [here](https://mbzuaiac-my.sharepoint.com/personal/hanoona_bangalath_mbzuai_ac_ae/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fhanoona%5Fbangalath%5Fmbzuai%5Fac%5Fae%2FDocuments%2FVideo%2DChatGPT%2FData%5FCode%5FModel%5FRelease%2FData%2FActivityNet%5FTest%2D1%2D3%5Fvideos%2Ezip&parent=%2Fpersonal%2Fhanoona%5Fbangalath%5Fmbzuai%5Fac%5Fae%2FDocuments%2FVideo%2DChatGPT%2FData%5FCode%5FModel%5FRelease%2FData&ga=1) and [here](https://github.com/MILVLG/activitynet-qa/tree/master/dataset).

Organize the downloaded data into the following structure:
```
LLaVA-NeXT
├── llava
├── scripts
└── data
└── llava_video
├── video-chatgpt
│ ├── Test_Videos
│ ├── consistency_qa.json
│ ├── consistency_qa_test.json
│ ├── consistency_qa_train.json
├── video_detail_description
│ └── Test_Human_Annotated_Captions
└── ActivityNet-QA
├── all_test
├── test_a.json
└── test_b.json
```

### Inference and Evaluation

Example for video detail description evaluation (additional scripts are available in `scripts/eval`):
```bash
bash scripts/video/eval/video_detail_description_eval_shard.sh ${Example model} ${Prompt mode} ${Sampled frames} ${Spatial pooling stride} True 8
```
Example:
```bash
bash scripts/eval/video_detail_description_eval_shard.sh liuhaotian/llava-v1.6-vicuna-7b vicuna_v1 32 2 True 8
```

### GPT Evaluation Example (Optional if the above step is completed)

Assuming you have `pred.json` (model-generated predictions) for model `llava-v1.6-vicuna-7b` at `./work_dirs/eval_video_detail_description/llava-v1.6-vicuna-7b_vicuna_v1_frames_32_stride_2`:
```bash
bash scripts/video/eval/video_description_eval_only.sh llava-v1.6-vicuna-7b_vicuna_v1_frames_32_stride_2
```
42 changes: 42 additions & 0 deletions docs/LLaVA-NeXT-Video_0716.md
@@ -0,0 +1,42 @@
## LLaVA-NeXT-Video is upgraded 🚀

In our [LLaVA-Video blog](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/) released this April, we shared two key observations:
- 🎬 AnyRes provides a shared and flexible representation between images and videos, and thus accommodates capability transfer between the two most common vision signals. Therefore, stronger image LMMs can naturally lead to stronger zero-shot video LMMs.
- 🗂️ There is a lack of high-quality language-video data, including video instruction-following data, and thus naive tuning on the public data available at that time resulted in performance degradation. Therefore, there is an urgent need to build high-quality video caption and QA datasets to train LMMs for improved video performance.

Based on these insights, the new LLaVA-NeXT-Video in this release improves in two aspects:

- 🎬 A stronger image LMM ([LLaVA-NeXT-32B-Qwen](https://huggingface.co/lmms-lab/llava-next-qwen-32b)), built by initializing from the Qwen-1.5 32B LLM. We further initialize our video training from this image checkpoint.
- 🗂️ A new high-quality video dataset with 830k samples. It is combined with the LLaVA-1.6 image training data, and applying the same image-video mixed training procedure yields the new video model.

The new model achieves the best open-source performance on several video benchmarks, including [Video-MME](https://video-mme.github.io/home_page.html#leaderboard).

### Resources
- **Model Card**: [LLaVA-NeXT-Video-32B-Qwen on Hugging Face](https://huggingface.co/lmms-lab/LLaVA-NeXT-Video-32B-Qwen)
- **Inference Script**:
```bash
bash scripts/video/demo/video_demo.sh lmms-lab/LLaVA-NeXT-Video-32B-Qwen qwen_1_5 32 2 average grid True playground/demo/xU25MMA2N4aVtYay.mp4
```

### Evaluation Results
| Model | NextQA-MC | Video-MME overall (w/o subs) | Video-MME overall (w/ subs) | EgoSchema | Perception Test (val) |
|-----------------------------|-----------|------------------------------|-----------------------------|-----------|------------------------|
| **Proprietary**             |           |                              |                             |           |                        |
| GPT-4o                      | -         | 71.9                         | 77.2                        | 72.2      | -                      |
| Gemini 1.5 Pro              | -         | 75.0                         | 81.3                        | 72.2      | -                      |
| **Open-Source**             |           |                              |                             |           |                        |
| VideoLLaMA 2 (8x7B)         | 76.3*     | 47.9                         | 50.3                        | 53.3      | 51.2*                  |
| VILA-1.5-34B                | 67.89*    | 60.1                         | 61.1                        | 58.04*    | 54                     |
| LLaVA-NeXT-Video (Qwen-32B) | 77.31     | 60.2                         | 63.0                        | 60.85     | 59.38                  |

_*Results are reproduced with [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval). Please refer to lmms-eval to reproduce them._

### Citations
```bibtex
@misc{zhang2024llavanextvideo,
title={LLaVA-NeXT: A Strong Zero-shot Video Understanding Model},
url={https://llava-vl.github.io/blog/2024-04-30-llava-next-video/},
author={Zhang, Yuanhan and Li, Bo and Liu, Haotian and Lee, Yong Jae and Gui, Liangke and Fu, Di and Feng, Jiashi and Liu, Ziwei and Li, Chunyuan},
month={April},
year={2024}
}
```
91 changes: 91 additions & 0 deletions docs/LLaVA-NeXT.md
@@ -0,0 +1,91 @@
# LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild

## Quick Start With HuggingFace
First, install our repo with its code and environment: `pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git`

Here is a quick inference example using [`llavanext-llama3-8B`](https://huggingface.co/lmms-lab/llama3-llava-next-8b). You will need to install [`flash-attn`](https://github.com/Dao-AILab/flash-attention) to use this code snippet. If you don't want to install it, you can set `attn_implementation=None` when calling `load_pretrained_model` (see the short snippet after the example below).
```python
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle

from PIL import Image
import requests
import copy
import torch

pretrained = "lmms-lab/llama3-llava-next-8b"
model_name = "llava_llama3"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)  # Add any other kwargs you want to pass via llava_model_args

model.eval()
model.tie_weights()

url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]

conv_template = "llava_llama_3" # Make sure you use correct chat template for different models
question = DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [image.size]


cont = model.generate(
input_ids,
images=image_tensor,
image_sizes=image_sizes,
do_sample=False,
temperature=0,
max_new_tokens=256,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs)
# The image shows a radar chart, also known as a spider chart or a web chart, which is a type of graph used to display multivariate data in the form of a two-dimensional chart of three or more quantitative variables represented on axes starting from the same point. Each axis represents a different variable, and the values are plotted along each axis and connected to form a polygon.\n\nIn this particular radar chart, there are several axes labeled with different variables, such as "MM-Vet," "LLaVA-Bench," "SEED-Bench," "MMBench-CN," "MMBench," "TextVQA," "VizWiz," "GQA," "BLIP-2," "InstructBLIP," "Owen-VL-Chat," and "LLaVA-1.5." These labels suggest that the chart is comparing the performance of different models or systems across various benchmarks or tasks, such as machine translation, visual question answering, and text-based question answering.\n\nThe chart is color-coded, with each color representing a different model or system. The points on the chart are connected to form a polygon, which shows the relative performance of each model across the different benchmarks. The closer the point is to the outer edge of the
```
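If you skip installing flash-attn, the loading call above would change as follows (a minimal sketch, assuming `load_pretrained_model` accepts and forwards the `attn_implementation` keyword as described above):

```python
# Disable flash attention when flash-attn is not installed.
tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained, None, model_name, device_map=device_map, attn_implementation=None
)
```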

## Evaluation

**Install the evaluation package:**
```bash
# Make sure you have installed the LLaVA-NeXT model files as described in the top-level README.md
pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git
```

### Check the evaluation results with [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval)
Our models' evaluation results can be fully reproduced with the [`lmms-eval`](https://github.com/EvolvingLMMs-Lab/lmms-eval) toolkit. After installing lmms-eval and llava, you can run the evaluation using the commands below. To run them, you will also need to install [`flash-attn`](https://github.com/Dao-AILab/flash-attention). If you do not want to install it, you can disable flash-attn by specifying it in the model args: `--model_args pretrained=lmms-lab/llama3-llava-next-8b,conv_template=llava_llama_3,attn_implementation=None`.

Please note that different torch versions might cause the results to vary.

```shell
# Evaluating Llama-3-LLaVA-NeXT-8B on multiple datasets
accelerate launch --num_processes=8 \
-m lmms_eval \
--model llava \
--model_args pretrained=lmms-lab/llama3-llava-next-8b,conv_template=llava_llama_3 \
--tasks ai2d,chartqa,docvqa_val,mme,mmbench_en_dev \
--batch_size 1 \
--log_samples \
--log_samples_suffix llava_next \
--output_path ./logs/

# Evaluating LLaVA-NeXT-72B on multiple datasets
accelerate launch --num_processes=1 \
-m lmms_eval \
--model llava \
--model_args pretrained=lmms-lab/llava-next-72b,conv_template=qwen_1_5,model_name=llava_qwen,device_map=auto \
--tasks ai2d,chartqa,docvqa_val,mme,mmbench_en_dev \
--batch_size 1 \
--log_samples \
--log_samples_suffix llava_next \
--output_path ./logs/
```