Changes from all commits
Commits
745 commits
d4a6537
Update builder script
kcz358 May 6, 2024
7280ca4
Comment out print in conversation
kcz358 May 6, 2024
661f376
Refactor DPODataset and add multimodal support to llava_mixtral
Luodian May 7, 2024
521e873
Add auto load for non matching model and warning message
kcz358 May 8, 2024
a5fd660
Fix image processing and moderation error handling
Luodian May 11, 2024
307ded6
updates
May 11, 2024
67f6b6c
Squashed commit of the following:
May 11, 2024
e51a777
Remove HIP subproject
May 11, 2024
7bad66c
update
May 11, 2024
8a3b390
Refactor image search function and update script to submit jobs on Me…
May 12, 2024
6cd63f1
Refactor code to remove unused imports and commented out code
Luodian May 12, 2024
4188463
Merge branch 'bo/dev_dpo' of code.byted.org:ic-research/llava_next in…
Luodian May 12, 2024
695c59f
Merge branch 'bo/dev_dpo' of code.byted.org:ic-research/llava_next in…
May 12, 2024
fd9468a
Add new file to .gitignore
May 12, 2024
8918865
Update submit_jobs_sglang_on_merlin.sh script
May 12, 2024
5530c8a
Add proxy configuration
May 12, 2024
560801c
Refactor code to improve performance and readability
Luodian May 12, 2024
f90839e
Merge branch 'bo/dev_dpo' of code.byted.org:ic-research/llava_next in…
Luodian May 12, 2024
3ca2d35
Delete unused scripts and configuration files
Luodian May 13, 2024
c111cb8
🚫 Ignored scripts expanded to streamline development
Luodian May 13, 2024
c33f945
Include "llava*" and "trl*" packages in setuptools find
Luodian May 13, 2024
ad49de6
refactor(builder.py): refactor loading logic for LLaVA models to impr…
Luodian May 14, 2024
f6bb2f9
fix resampler bug in video
jzhang38 May 14, 2024
c2cbd62
Add support for EVA-CLIP-8B-plus vision tower and handle vision tower…
Luodian May 14, 2024
61700ca
Refactor vision tower loading in multimodal encoders
Luodian May 14, 2024
8d589e4
Fix config overwrite bug in train.py
Luodian May 16, 2024
d1bf138
Merge pull request #1 from EvolvingLMMs-Lab/py/dev
Luodian May 16, 2024
b3c12f7
add peiyuan ablation
jzhang38 May 16, 2024
758fce2
Update dependencies and training script
Luodian May 16, 2024
a1f4715
Fix formatting and whitespace issues
Luodian May 16, 2024
1054700
Refactor video processing using pyav library
Luodian May 16, 2024
21a33c8
Fix video processing bug and update dependencies
Luodian May 17, 2024
1a01f76
Fix device assignment in model creation
Luodian May 23, 2024
4da6fdb
Fix code formatting and update preprocessing logic
Luodian May 23, 2024
bf6bb7f
refactor(builder.py): comment out EvaClipVisionTower and EvaViTWrappe…
Luodian May 27, 2024
d4db676
refactor(train.py): update tokenizer logic for different models
Luodian May 29, 2024
314b882
formating updates
May 29, 2024
c1351ab
refactor(conversation.py): update sep_style to use CHATML instead of MPT
Luodian Jun 3, 2024
82a4a13
refactor(conversation.py): update roles in Conversation class
Luodian Jun 3, 2024
633d46a
Change 2 \n tokens into one \n\n token
Luodian Jun 3, 2024
35564e7
Change pad token from eos to 0 for llama3
Luodian Jun 3, 2024
8c66c1a
chore: Add llavavid to .gitignore and update conversation.py and llav…
Luodian Jun 3, 2024
6f00ecf
Merge branch 'dev/merge_multi_images' of code.byted.org:ic-research/l…
Luodian Jun 3, 2024
6025d53
update gitignore
Luodian Jun 4, 2024
d2bd103
Merge branch 'dev/merge_multi_images' of code.byted.org:ic-research/l…
Luodian Jun 4, 2024
5649d34
Change preprocess llama3 into form of chat template to process interl…
Luodian Jun 4, 2024
eb583d8
Add default id for data dict
Luodian Jun 5, 2024
6a70727
refactor(conversation.py): update separator style to use MPT instead …
Luodian Jun 5, 2024
1f62add
Merge branch 'dev/merge_multi_images' of code.byted.org:ic-research/l…
Luodian Jun 5, 2024
4edce66
Fix tokenizer pad token issue for llama3 model
Luodian Jun 5, 2024
15e4b47
refactor: Update tokenizer pad token for llama3 model
Luodian Jun 11, 2024
fcc4f62
add back qwen moe
Jun 12, 2024
9684f69
fix qwen-moe modality issues
Jun 12, 2024
a76a451
refactor: Update data_checker.py to process YAML files instead of JSON
Luodian Jun 13, 2024
0bd783b
Fix preprocess llama3 when handling non image data but also has <image>
Luodian Jun 14, 2024
f85a604
chore: Update deepspeed dependency to version 0.14.2
Luodian Jun 14, 2024
b67003f
updates
Jun 15, 2024
15100cd
Merge branch 'dev/merge_multi_images' of code.byted.org:ic-research/l…
Jun 15, 2024
72402b0
updates
Jun 15, 2024
d50e869
better data checker
Jun 15, 2024
efb2581
chore: Update gitignore and add build/ and playground/*.json
Luodian Jun 18, 2024
1d41406
update
Jun 19, 2024
797a9be
Merge branch 'dev/merge_multi_images' of code.byted.org:ic-research/l…
Jun 19, 2024
5922d39
Add jsonl and video support for data checker
Luodian Jun 20, 2024
13705c8
chore: Update parallel value for sglang inference script
Luodian Jun 20, 2024
8497588
Refactor multimodal encoder modules for improved performance and read…
Luodian Jun 21, 2024
b1edc37
Refractor Qwen preprocess code to multi images
Luodian Jun 25, 2024
fae02f2
Data checker for multi images
Luodian Jun 25, 2024
77ca06d
Refactor Qwen preprocess code to support multi images
Luodian Jun 26, 2024
74d6078
Fix only loading key frames in pyva
Luodian Jun 26, 2024
827a162
Add bilinear pooling and multi-images modality
Luodian Jun 26, 2024
62a8862
Fix training code for multi-images training
Luodian Jun 26, 2024
450f1f4
Merge branch 'dev/merge_multi_images' of code.byted.org:ic-research/l…
Luodian Jun 26, 2024
dbdf77b
Refactor video processing code to support multi images
Luodian Jun 26, 2024
2c0eafc
Refactor mm_spatial_pool_mode to use bilinear interpolation for multi…
Luodian Jun 27, 2024
8adb1be
Merge branch 'dev/merge_multi_images' of code.byted.org:ic-research/l…
Luodian Jun 27, 2024
ef038e8
Refactor video processing code to use decord for multi images
Luodian Jun 27, 2024
97f819b
Fix branching bugs
Luodian Jun 27, 2024
367a905
Fix pyav read video packet
Luodian Jun 27, 2024
0c2698f
Change decode backend to pyav
Luodian Jun 27, 2024
e65d417
Logs in llava_arch
Luodian Jun 27, 2024
d44d946
Fix decord load videos and add rank print
Luodian Jun 28, 2024
7f828d2
Comment out log message
Luodian Jun 28, 2024
ded2a8d
zero2 overlap comm to false
Luodian Jun 28, 2024
7b51a62
Decord more robust
Luodian Jun 29, 2024
2042d0b
Less change and modalities for multi-images
Luodian Jun 29, 2024
efa0249
Remove the changes in llava_arc
Luodian Jun 29, 2024
a207a36
Merge commit 'aea51bf96916ef7e595f72d63a972d7ffdbb3d56'
Luodian Jul 1, 2024
b022cc7
chore: update wandb dependency to latest version
Luodian Jul 1, 2024
bf4d560
refactor: update image processing logic in mm_utils.py
Luodian Jul 3, 2024
0f0be5d
feat: Load samples from data_path in train.py
Luodian Jul 3, 2024
8a80ffa
refactor: Handle exception when getting image grid shape in llava_arc…
Luodian Jul 3, 2024
8eb2dac
refactor: Update file loading logic in train.py and data_checker.py
Luodian Jul 4, 2024
3480d27
* chore(data_checker.py): remove print statement in main function
Luodian Jul 6, 2024
4dac286
🔧 fix(data_checker.py): refactor load_json_data method to handle json…
Luodian Jul 8, 2024
ec06dcf
refactor: Improve error handling in llava.model.__init__.py
Luodian Jul 8, 2024
24f8e25
refactor: Update file loading logic in train.py and data_checker.py
Luodian Jul 14, 2024
518b78a
No pooling for multi-images during training
Luodian Jul 14, 2024
65d5f74
refactor: Update file loading logic in train.py and data_checker.py
Luodian Jul 14, 2024
1e213d2
Merge branch 'dev/one_vision' of code.byted.org:ic-research/llava_nex…
Luodian Jul 14, 2024
76215ab
feat: Add support for early mixing of text in multimodal training
Luodian Jul 16, 2024
c971323
feat: Add support for early mixing of text in multimodal training
Luodian Jul 20, 2024
625fd44
refactor: Update file loading logic in train.py and data_checker.py
Luodian Jul 25, 2024
dbd1988
refactor: Update file loading logic in train.py and data_checker.py
Luodian Jul 25, 2024
923cc67
refactor: Remove deprecated image and video demo scripts
Luodian Jul 25, 2024
e1766b6
refactor: Update .gitignore and pyproject.toml
Luodian Jul 25, 2024
2d0488e
Merge branch 'main' of https://github.com/EvolvingLMMs-Lab/LLaVA-NeXT…
Luodian Jul 25, 2024
62d9fde
refactor: Add conv_qwen_2 to conversation.py
Luodian Jul 25, 2024
d39dcc3
refactor: Update image size in conversation.py
Luodian Jul 26, 2024
4cba6e6
refactor: Update image size in conversation.py
Luodian Jul 28, 2024
fa08b1b
refactor: Update image size and conversation logic for generating tex…
Luodian Jul 28, 2024
f74b381
refactor: Fix kernel crash issue in LLaVA_OneVision_Tutorials.ipynb
Luodian Jul 29, 2024
094b3aa
refactor: Update image size and conversation logic for generating tex…
Luodian Aug 1, 2024
edb72ab
refactor: Update image size and conversation logic for generating tex…
Luodian Aug 3, 2024
2448ab2
refactor: Update image size and conversation logic for generating tex…
Luodian Aug 3, 2024
163bce3
Update LLaVA_OneVision.md
kcz358 Aug 5, 2024
72372d8
Update LLaVA_OneVision.md
kcz358 Aug 5, 2024
95dde67
Update LLaVA_OneVision.md
kcz358 Aug 5, 2024
c42742a
PR and merge conflicts from private repo
Luodian Aug 5, 2024
53cde86
update training scripts
Luodian Aug 5, 2024
cc2d68d
update readme
Luodian Aug 5, 2024
8aef611
updates
Luodian Aug 5, 2024
fa62a40
updates
Luodian Aug 6, 2024
4c9a2e6
refactor: Update image size and conversation logic for generating tex…
Luodian Aug 6, 2024
522ec93
refactor: Update image size and conversation logic for generating tex…
Luodian Aug 6, 2024
0ce5621
refactor: Update image size and conversation logic for generating tex…
Luodian Aug 6, 2024
44f44b9
Update README.md
Luodian Aug 7, 2024
e9d557f
Update README.md
Luodian Aug 7, 2024
037c6ed
Update README.md
Luodian Aug 7, 2024
7b1567f
update video code
ZhangYuanhan-AI Aug 7, 2024
0d86a28
Merge branch 'main' of https://github.com/LLaVA-VL/LLaVA-NeXT
ZhangYuanhan-AI Aug 7, 2024
0f5fb82
Update README.md
ZhangYuanhan-AI Aug 7, 2024
ede6a5e
Update LLaVA-NeXT-Video_0716.md
ZhangYuanhan-AI Aug 7, 2024
12992e4
Update README.md
ZrrSkywalker Aug 7, 2024
d73d0c7
Fix prompt version for training script
kcz358 Aug 7, 2024
3d72057
Update README.md
Luodian Aug 7, 2024
cb4a282
Update README.md
Luodian Aug 7, 2024
7a109ba
update video code
ZhangYuanhan-AI Aug 8, 2024
dbbe72c
Update LLaVA-NeXT-Video.md
ZhangYuanhan-AI Aug 8, 2024
54a502e
update video code
ZhangYuanhan-AI Aug 8, 2024
a159d98
Update README.md
Luodian Aug 10, 2024
e32a60f
updates about one-vision data (with hidden details)
Luodian Aug 10, 2024
3c4de20
Update README.md
Luodian Aug 10, 2024
83d0c34
Update README.md
Luodian Aug 10, 2024
4fb1e74
Update README.md
Luodian Aug 10, 2024
90f6d5a
fix imports of missing and deprecated qwen-moe
Luodian Aug 12, 2024
d5bd4a7
Merge pull request #134 from LLaVA-VL/patch-fix-imports
Luodian Aug 12, 2024
67bc03e
Revert llava video logic
kcz358 Aug 14, 2024
7e9ddac
Fix tutorial error
kcz358 Aug 14, 2024
98b8377
Provide the correct video processing logic with decord
kcz358 Aug 15, 2024
5a88e5b
Merge pull request #152 from LLaVA-VL/fix/onevision_tut
Luodian Aug 15, 2024
12f19e5
Update README.md
ChunyuanLI Aug 17, 2024
3afc83f
Merge pull request #161 from LLaVA-VL/ChunyuanLI-patch-2
Luodian Aug 18, 2024
637cff8
Update README.md
ChunyuanLI Aug 18, 2024
e858b39
Merge pull request #163 from LLaVA-VL/ChunyuanLI-patch-3
Luodian Aug 18, 2024
a3a96cc
Update LLaVA OneVision model to lmms-lab/llava-onevision-qwen2-7b-ov
Luodian Aug 23, 2024
b97062f
Merge pull request #180 from LLaVA-VL/patch-add_doc
Luodian Aug 23, 2024
0c1cfbc
Update README.md
Luodian Aug 26, 2024
67a9b39
update video code
ZhangYuanhan-AI Aug 26, 2024
7f087fb
Merge branch 'main' into yhzhang/video_dev
ZhangYuanhan-AI Aug 26, 2024
fa93414
Add default mm_newline_position to one_token
kcz358 Aug 26, 2024
28691a7
update demo
zucchini-nlp Aug 30, 2024
4ee44a4
Add safe load tokenizer for llama_3
ngquangtrung57 Aug 30, 2024
20e6c66
Merge pull request #198 from ngquangtrung57/fix-llama
Luodian Aug 31, 2024
2301feb
Merge pull request #195 from zucchini-nlp/main
Luodian Aug 31, 2024
e81e007
chore: add single image and onevision stage data yaml files
Luodian Sep 1, 2024
c1302d9
chore: add dataset paths for LLaVA-Instruct training
Luodian Sep 1, 2024
eb6dc85
Refactor video loading function and add time instruction
ZhangYuanhan-AI Sep 2, 2024
411d80b
Merge pull request #183 from LLaVA-VL/yhzhang/video_dev
Luodian Sep 2, 2024
a6f2a2b
update video inference logic
ZhangYuanhan-AI Sep 3, 2024
e48692a
update
ZhangYuanhan-AI Sep 4, 2024
50e758e
Update README.md
Luodian Sep 8, 2024
494e385
Revert "Fix: videos in LLaVa-OV"
kcz358 Sep 12, 2024
e5304c1
Merge pull request #228 from LLaVA-VL/revert-195-main
Luodian Sep 12, 2024
3abf004
create LLaVA-OneVision_Chat doc
tyxiong23 Sep 13, 2024
047f5d8
update checkpoint/demo links
tyxiong23 Sep 13, 2024
47cddfb
change table display
Luodian Sep 13, 2024
beafcb6
try nowrap
Luodian Sep 13, 2024
dc98e35
try smaller text
Luodian Sep 13, 2024
ebf052c
smaller text
Luodian Sep 13, 2024
ae890f7
Update table display for better readability
Luodian Sep 13, 2024
43e0096
chore: Update table display for better readability
Luodian Sep 13, 2024
dbdc8fd
modify subtitiles
tyxiong23 Sep 13, 2024
231baad
better formatting
tyxiong23 Sep 13, 2024
3f9d882
Update LLaVA_OneVision_Chat.md
ChunyuanLI Sep 13, 2024
2becc23
modify examples
tyxiong23 Sep 13, 2024
e8ebb43
Update example tables
tyxiong23 Sep 13, 2024
79c4dc2
add personal links
tyxiong23 Sep 13, 2024
8a6224b
update result figure
tyxiong23 Sep 13, 2024
b634190
Update README.md
ChunyuanLI Sep 13, 2024
37c22d6
Update README.md
ChunyuanLI Sep 13, 2024
0792221
update links
tyxiong23 Sep 13, 2024
f0d3639
add citations
tyxiong23 Sep 13, 2024
94112aa
Update LLaVA_OneVision_Chat.md
ChunyuanLI Sep 13, 2024
9e86327
Update LLaVA_OneVision_Chat.md
ChunyuanLI Sep 13, 2024
8faa916
Update LLaVA_OneVision_Chat.md
tyxiong23 Sep 13, 2024
50c74ff
Update LLaVA_OneVision_Chat.md
tyxiong23 Sep 13, 2024
7f8d73a
update figures
tyxiong23 Sep 13, 2024
05fc4ec
Merge pull request #236 from LLaVA-VL/ov-chat-doc
ChunyuanLI Sep 13, 2024
84785e4
update LLaVA_OneVision_Chat.md
tyxiong23 Sep 14, 2024
51c67ac
Merge pull request #237 from LLaVA-VL/ov-chat-doc
Luodian Sep 14, 2024
ab265d7
add dpo script
tyxiong23 Sep 15, 2024
714b62b
add dpo training scripts
tyxiong23 Sep 15, 2024
92eacc8
add dpo training scripts
tyxiong23 Sep 15, 2024
4cd2e42
add dpo training scripts
tyxiong23 Sep 15, 2024
374bdde
update training script
tyxiong23 Sep 15, 2024
3220500
Merge pull request #241 from LLaVA-VL/ov-chat-doc
ChunyuanLI Sep 15, 2024
7f1261e
Update release date of llava-ov-chat in README.md
ChunyuanLI Sep 15, 2024
b95e03e
Merge pull request #205 from LLaVA-VL/yhzhang/video_dev
Luodian Sep 17, 2024
e27155e
Update README.md
Luodian Sep 19, 2024
66e3c0e
Fix typos
Sep 19, 2024
d7c3406
add stream inference code
ZhangYuanhan-AI Sep 22, 2024
80012ea
Update finetune_onevision.sh
Luodian Sep 25, 2024
2f2db26
Create finetune_si.sh
Luodian Sep 25, 2024
811422d
Update and rename finetune_onevision.sh to finetune_ov.sh
Luodian Sep 25, 2024
d89f36a
Rename finetune_clip.sh to direct_finetune_clip.sh
Luodian Sep 25, 2024
6512734
Rename finetune_siglip_a4.sh to direct_finetune_siglip_a4.sh
Luodian Sep 25, 2024
357b9a2
Update finetune_si.sh
Luodian Sep 25, 2024
b5854cc
Merge pull request #264 from LLaVA-VL/Luodian-patch-1
Luodian Sep 25, 2024
f2d49d2
Merge pull request #250 from litianjian/main
Luodian Sep 25, 2024
8da7e6c
update
ZhangYuanhan-AI Sep 26, 2024
5a30caa
update llave-video
ZhangYuanhan-AI Oct 3, 2024
6834d47
update llava-video
ZhangYuanhan-AI Oct 3, 2024
8ad7d9d
Merge branch 'yhzhang/llava_video_local' into yhzhang/llava_video_dev
ZhangYuanhan-AI Oct 3, 2024
afecac5
Merge branch 'main' into yhzhang/llava_video_dev
ZhangYuanhan-AI Oct 3, 2024
13ba6e1
update
ZhangYuanhan-AI Oct 4, 2024
02a706c
Merge pull request #278 from LLaVA-VL/yhzhang/llava_video_dev
ChunyuanLI Oct 4, 2024
87cfb33
Update LLaVA-Video paper link
ZhangYuanhan-AI Oct 4, 2024
125e7a0
Merge pull request #279 from LLaVA-VL/yhzhang/llava_video_dev
Luodian Oct 4, 2024
51f3f2c
Update LLaVA_OneVision_Chat.md
ChunyuanLI Oct 8, 2024
e819fcf
Update README.md
Luodian Oct 11, 2024
69a03d7
chore: update checkpoint for llava-onevision-qwen2-72b-ov and llava-o…
ZhangYuanhan-AI Oct 11, 2024
7993742
Merge pull request #301 from LLaVA-VL/yhzhang/llava_video_dev
Luodian Oct 12, 2024
94f8bd8
chore: Update training script for LLaVA-NeXT video models
ZhangYuanhan-AI Oct 12, 2024
126ff12
Merge pull request #304 from LLaVA-VL/yhzhang/llava_video_dev
Luodian Oct 12, 2024
76637ce
Add multi-modal to args
kcz358 Oct 16, 2024
391e16e
Merge pull request #307 from LLaVA-VL/fix_tut
Luodian Oct 16, 2024
cf084e2
Update README.md
Luodian Feb 10, 2025
8304f45
Update README.md
Luodian Feb 10, 2025
07ea3fe
Add MLCD Vision Tower!
anxiangsir Feb 12, 2025
47e28a8
Merge pull request #412 from anxiangsir/main
Luodian Feb 13, 2025
a9bcd53
Update mlcd_encoder.py
Luodian Feb 13, 2025
cf9b54a
Update exp.yaml
ZhangYuanhan-AI Feb 24, 2025
85899c6
Merge pull request #421 from LLaVA-VL/ZhangYuanhan-AI-patch-2
Luodian Feb 24, 2025
819c669
Update LLaVA_Video_1003.md
ZhangYuanhan-AI Feb 24, 2025
d1e6bb6
Merge pull request #422 from LLaVA-VL/ZhangYuanhan-AI-patch-3
Luodian Feb 24, 2025
249449f
Update conversation.py
wenhuchen May 6, 2025
cf01910
Merge pull request #451 from wenhuchen/patch-1
Luodian May 6, 2025
c62ee4c
Update README.md
Luodian May 24, 2025
9a18541
Update README.md
Luodian May 24, 2025
Empty file modified .dockerignore
100644 → 100755
Empty file.
Empty file modified .editorconfig
100644 → 100755
Empty file.
Empty file modified .gitattributes
100644 → 100755
Empty file.
43 changes: 39 additions & 4 deletions .gitignore
100644 → 100755
@@ -7,23 +7,27 @@ dist
# Log
*.log
*.log.*
*.json
*.jsonl
# *.json
# *.jsonl

# Data
!**/alpaca-data-conversation.json

# Editor
.idea
*.swp
.vscode

# Other
.DS_Store
wandb
output
llavavid

checkpoints
project_checkpoints
debug_checkpoints
playground/data
playground/cc3m_llava34b_cap
ckpts*

.ipynb_checkpoints
@@ -35,4 +39,35 @@ chunyl_scripts

# Demo
serve_images/
notebooks/
notebooks/
logs
scripts/dist_*
logs/
submissions/
cn_scripts/
internal_project_checkpoints/
work_dirs
scripts/i18n/*
playground/.nfs028b000000010add00000001
HIP
playground/.nfs028b0000017bff2c00000012
scripts/qwen
scripts/vicuna
scripts/mistral
scripts/baseline_rep
scripts/cn_boli01_hl
scripts/cn_boli01_lf
scripts/cn_lf
scripts/cn_lq
scripts/cn_yg
scripts/cn_yg_hao
scripts/eva_encoder
scripts/i18n
scripts/i18n_higher_res
scripts/multi-images
scratchpad
build/
playground/*.json
mlx_configs/
data_processing/
# demo/
Empty file modified LICENSE
100644 → 100755
Empty file.
498 changes: 154 additions & 344 deletions README.md
100644 → 100755

Large diffs are not rendered by default.

53 changes: 53 additions & 0 deletions docs/LLaVA-NeXT-Interleave.md
@@ -0,0 +1,53 @@

# LLaVA-NeXT: Tackling Multi-image, Video, and 3D in Large Multimodal Models

## Contents
- [Demo](#demo)
- [Evaluation](#evaluation)

## Demo

> Make sure you have installed the LLaVA-NeXT model files as described in the top-level README.md.

1. **Example model:** `lmms-lab/llava-next-interleave-7b`


To run a demo, execute:
```bash
# If you hit errors when running the demo, make sure the checkpoint path contains 'qwen',
# e.g., rename it: mv llava-next-interleave-7b llava-next-interleave-qwen-7b
python playground/demo/interleave_demo.py --model_path path/to/ckpt
```
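If you prefer to handle the rename programmatically, here is a minimal sketch (the local checkpoint directory and target name are assumptions for illustration):

```python
from pathlib import Path

# Hypothetical local checkpoint directory; adjust to where you downloaded the model.
ckpt = Path("checkpoints/llava-next-interleave-7b")

# The demo expects the checkpoint path to contain "qwen", so rename it if needed.
if ckpt.exists() and "qwen" not in ckpt.name.lower():
    ckpt = ckpt.rename(ckpt.with_name("llava-next-interleave-qwen-7b"))

print(f"Pass this as --model_path: {ckpt}")
```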

## Evaluation

### Preparation

Please download the evaluation data and its metadata from the following links:

1. **llava-interleave-bench:** [here](https://huggingface.co/datasets/lmms-lab/llava-interleave-bench).

Unzip `eval_images.zip`; it contains `Split1` and `Split2`. Then organize the downloaded data into the following structure:
```

interleave_data
├── Split1
│   ├── ...
│   └── ...
│
├── Split2
│   ├── ...
│   └── ...
├── multi_image_in_domain.json
├── multi_image_out_domain.json
└── multi_view_in_domain.json
```
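Before running evaluation, a quick sanity check like the following sketch can confirm the layout (the `interleave_data` location is an assumption; point it at the directory you will pass as /path/to/images):

```python
from pathlib import Path

# Hypothetical location of the unzipped evaluation data.
root = Path("interleave_data")

expected = [
    "Split1",
    "Split2",
    "multi_image_in_domain.json",
    "multi_image_out_domain.json",
    "multi_view_in_domain.json",
]
missing = [name for name in expected if not (root / name).exists()]
print("missing entries:", missing if missing else "none")
```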

### Inference and Evaluation
Example:
First edit `scripts/interleave/eval_all.sh`, replacing /path/to/ckpt with your checkpoint path and /path/to/images with the path to the `interleave_data` directory, then run:
```bash
bash scripts/interleave/eval_all.sh
```

81 changes: 81 additions & 0 deletions docs/LLaVA-NeXT-Video.md
@@ -0,0 +1,81 @@

# LLaVA-NeXT: A Strong Zero-shot Video Understanding Model

## Contents
- [Demo](#demo)
- [Evaluation](#evaluation)

## Demo

> Make sure you have installed the LLaVA-NeXT model files as described in the top-level README.md.

1. **Example model:** `lmms-lab/LLaVA-NeXT-Video-7B-DPO`

2. **Prompt mode:** `vicuna_v1` (use `mistral_direct` for `lmms-lab/LLaVA-NeXT-Video-34B-DPO`)

3. **Sampled frames:** `32` (Defines how many frames to sample from the video.)

4. **Spatial pooling stride:** `2` (With the original 24x24 tokens per frame, a stride of 2 reduces each frame to 12x12 tokens; see the pooling sketch after this list.)

5. **Spatial pooling mode:** `average` (Options: `average`, `max`.)

6. **Local video path:** `./data/llava_video/video-chatgpt/evaluation/Test_Videos/v_Lf_7RurLgp0.mp4`
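The pooling stride's effect on the per-frame token grid can be illustrated with a minimal sketch (tensor shapes and names here are hypothetical; the actual pooling is implemented inside the model code):

```python
import torch
import torch.nn.functional as F

# Hypothetical per-frame features: 32 frames, hidden size 1024, a 24x24 token grid per frame.
frame_features = torch.randn(32, 1024, 24, 24)

# "average" pooling with stride 2 halves each spatial side: 24x24 -> 12x12 tokens per frame.
pooled = F.avg_pool2d(frame_features, kernel_size=2, stride=2)
print(pooled.shape)  # torch.Size([32, 1024, 12, 12])

# The "max" mode would use F.max_pool2d with the same kernel size and stride.
```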

To run a demo, execute:
```bash
bash scripts/video/demo/video_demo.sh ${Example model} ${Prompt mode} ${Sampled frames} ${Spatial pooling stride} ${Spatial pooling mode} grid True ${Video path at local}
```
Example:
```bash
bash scripts/video/demo/video_demo.sh lmms-lab/LLaVA-NeXT-Video-7B-DPO vicuna_v1 32 2 average no_token True playground/demo/xU25MMA2N4aVtYay.mp4
```

**IMPORTANT** Please refer to [Latest video model](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/inference/docs/LLaVA-NeXT-Video_0716.md) for instructions on running the latest model.

## Evaluation

### Preparation

Please download the evaluation data and its metadata from the following links:

1. **video-chatgpt:** [here](https://github.com/mbzuai-oryx/Video-ChatGPT/blob/main/quantitative_evaluation/README.md#video-based-generative-performance-benchmarking).
2. **video_detail_description:** [here](https://mbzuaiac-my.sharepoint.com/personal/hanoona_bangalath_mbzuai_ac_ae/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fhanoona%5Fbangalath%5Fmbzuai%5Fac%5Fae%2FDocuments%2FVideo%2DChatGPT%2FData%5FCode%5FModel%5FRelease%2FQuantitative%5FEvaluation%2Fbenchamarking%2FTest%5FHuman%5FAnnotated%5FCaptions%2Ezip&parent=%2Fpersonal%2Fhanoona%5Fbangalath%5Fmbzuai%5Fac%5Fae%2FDocuments%2FVideo%2DChatGPT%2FData%5FCode%5FModel%5FRelease%2FQuantitative%5FEvaluation%2Fbenchamarking&ga=1).
3. **activity_qa:** [here](https://mbzuaiac-my.sharepoint.com/personal/hanoona_bangalath_mbzuai_ac_ae/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fhanoona%5Fbangalath%5Fmbzuai%5Fac%5Fae%2FDocuments%2FVideo%2DChatGPT%2FData%5FCode%5FModel%5FRelease%2FData%2FActivityNet%5FTest%2D1%2D3%5Fvideos%2Ezip&parent=%2Fpersonal%2Fhanoona%5Fbangalath%5Fmbzuai%5Fac%5Fae%2FDocuments%2FVideo%2DChatGPT%2FData%5FCode%5FModel%5FRelease%2FData&ga=1) and [here](https://github.com/MILVLG/activitynet-qa/tree/master/dataset).

Organize the downloaded data into the following structure:
```
LLaVA-NeXT
├── llava
├── scripts
└── data
└── llava_video
├── video-chatgpt
│ ├── Test_Videos
│ ├── consistency_qa.json
│ ├── consistency_qa_test.json
│ ├── consistency_qa_train.json
├── video_detail_description
│ └── Test_Human_Annotated_Captions
└── ActivityNet-QA
├── all_test
├── test_a.json
└── test_b.json
```

### Inference and Evaluation

Example for video detail description evaluation (additional scripts are available in `scripts/eval`):
```bash
bash scripts/video/eval/video_detail_description_eval_shard.sh ${Example model} ${Prompt mode} ${Sampled frames} ${Spatial pooling stride} True 8
```
Example:
```bash
bash scripts/eval/video_detail_description_eval_shard.sh liuhaotian/llava-v1.6-vicuna-7b vicuna_v1 32 2 True 8
```

### GPT Evaluation Example (Optional if the above step is completed)

Assuming you have `pred.json` (model-generated predictions) for model `llava-v1.6-vicuna-7b` at `./work_dirs/eval_video_detail_description/llava-v1.6-vicuna-7b_vicuna_v1_frames_32_stride_2`:
```bash
bash scripts/video/eval/video_description_eval_only.sh llava-v1.6-vicuna-7b_vicuna_v1_frames_32_stride_2
```
42 changes: 42 additions & 0 deletions docs/LLaVA-NeXT-Video_0716.md
@@ -0,0 +1,42 @@
## LLaVA-NeXT-Video is upgraded 🚀

In our [LLaVA-Video blog](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/) released this April, we shared two key observations:
- 🎬 AnyRes provides a shared and flexible representation between images and videos, and thus accommodates capability transfer between the two most common vision signals. Therefore, stronger image LMMs can naturally lead to stronger zero-shot video LMMs.
- 🗂️ There is a lack of high-quality language-video data, including video instruction-following data, and thus naive tuning on the public data available at that time resulted in performance degradation. Therefore, there is an urgent need to build high-quality video caption and QA datasets to train LMMs for improved video performance.

Based on these insights, the new LLaVA-NeXT-Video in this release improves in two aspects:

- 🎬 A stronger image LMM ([LLaVA-NeXT-32B-Qwen](https://huggingface.co/lmms-lab/llava-next-qwen-32b)), built by initializing from the Qwen-1.5 32B LLM. We further initialize our video training from this image checkpoint.
- 🗂️ A new high-quality video dataset with 830k samples. It is combined with the LLaVA-1.6 image training data, and applying the same image-video mixed training procedure yields the new video model.

The new model achieves the best open-source performance on several video benchmarks, including [Video-MME](https://video-mme.github.io/home_page.html#leaderboard).

### Resources
- **Model Card**: [LLaVA-NeXT-Video-32B-Qwen on Hugging Face](https://huggingface.co/lmms-lab/LLaVA-NeXT-Video-32B-Qwen)
- **Inference Script**:
```bash
bash scripts/video/demo/video_demo.sh lmms-lab/LLaVA-NeXT-Video-32B-Qwen qwen_1_5 32 2 average grid True playground/demo/xU25MMA2N4aVtYay.mp4
```

### Evaluation Results
| Model | NextQA-MC | Video-MME overall (w/o subs) | Video-MME overall (w/ subs) | EgoSchema | Perception Test (val) |
|-----------------------------|-----------|------------------------------|-----------------------------|-----------|------------------------|
| **Proprietary**             |           |                              |                             |           |                        |
| GPT-4o                      | -         | 71.9                         | 77.2                        | 72.2      | -                      |
| Gemini 1.5 Pro              | -         | 75.0                         | 81.3                        | 72.2      | -                      |
| **Open-Source**             |           |                              |                             |           |                        |
| VideoLLaMA 2 (8x7B)         | 76.3*     | 47.9                         | 50.3                        | 53.3      | 51.2*                  |
| VILA-1.5-34B                | 67.89*    | 60.1                         | 61.1                        | 58.04*    | 54                     |
| LLaVA-NeXT-Video (Qwen-32B) | 77.31     | 60.2                         | 63.0                        | 60.85     | 59.38                  |

_*Results are reproduced with [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval). Please refer to lmms-eval to reproduce them._

### Citations
```bibtex
@misc{zhang2024llavanextvideo,
title={LLaVA-NeXT: A Strong Zero-shot Video Understanding Model},
url={https://llava-vl.github.io/blog/2024-04-30-llava-next-video/},
author={Zhang, Yuanhan and Li, Bo and Liu, Haotian and Lee, Yong Jae and Gui, Liangke and Fu, Di and Feng, Jiashi and Liu, Ziwei and Li, Chunyuan},
month={April},
year={2024}
}
```
91 changes: 91 additions & 0 deletions docs/LLaVA-NeXT.md
@@ -0,0 +1,91 @@
# LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild

## Quick Start With HuggingFace
First, install our repo with its code and environment: `pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git`

Here is a quick inference example using [`llavanext-llama3-8B`](https://huggingface.co/lmms-lab/llama3-llava-next-8b). You will need to install [`flash-attn`](https://github.com/Dao-AILab/flash-attention) to use this code snippet. If you don't want to install it, you can set `attn_implementation=None` when calling `load_pretrained_model` (see the short snippet after the example below).
```python
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle

from PIL import Image
import requests
import copy
import torch

pretrained = "lmms-lab/llama3-llava-next-8b"
model_name = "llava_llama3"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)  # Add any other kwargs you want to pass via llava_model_args

model.eval()
model.tie_weights()

url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]

conv_template = "llava_llama_3" # Make sure you use correct chat template for different models
question = DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [image.size]


cont = model.generate(
input_ids,
images=image_tensor,
image_sizes=image_sizes,
do_sample=False,
temperature=0,
max_new_tokens=256,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs)
# The image shows a radar chart, also known as a spider chart or a web chart, which is a type of graph used to display multivariate data in the form of a two-dimensional chart of three or more quantitative variables represented on axes starting from the same point. Each axis represents a different variable, and the values are plotted along each axis and connected to form a polygon.\n\nIn this particular radar chart, there are several axes labeled with different variables, such as "MM-Vet," "LLaVA-Bench," "SEED-Bench," "MMBench-CN," "MMBench," "TextVQA," "VizWiz," "GQA," "BLIP-2," "InstructBLIP," "Owen-VL-Chat," and "LLaVA-1.5." These labels suggest that the chart is comparing the performance of different models or systems across various benchmarks or tasks, such as machine translation, visual question answering, and text-based question answering.\n\nThe chart is color-coded, with each color representing a different model or system. The points on the chart are connected to form a polygon, which shows the relative performance of each model across the different benchmarks. The closer the point is to the outer edge of the
```
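If you skip installing flash-attn, the loading call above would change as follows (a minimal sketch, assuming `load_pretrained_model` accepts and forwards the `attn_implementation` keyword as described above):

```python
# Disable flash attention when flash-attn is not installed.
tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained, None, model_name, device_map=device_map, attn_implementation=None
)
```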

## Evaluation

**Install the evaluation package:**
```bash
# Make sure you have installed the LLaVA-NeXT model files as described in the top-level README.md
pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git
```

### Check the evaluation results with [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval)
Our models' evaluation results can be fully reproduced with the [`lmms-eval`](https://github.com/EvolvingLMMs-Lab/lmms-eval) toolkit. After installing lmms-eval and llava, you can run the evaluation using the commands below. To run them, you will also need to install [`flash-attn`](https://github.com/Dao-AILab/flash-attention). If you do not want to install it, you can disable flash-attn by specifying it in the model args: `--model_args pretrained=lmms-lab/llama3-llava-next-8b,conv_template=llava_llama_3,attn_implementation=None`.

Please note that different torch versions might cause the results to vary.

```shell
# Evaluating Llama-3-LLaVA-NeXT-8B on multiple datasets
accelerate launch --num_processes=8 \
-m lmms_eval \
--model llava \
--model_args pretrained=lmms-lab/llama3-llava-next-8b,conv_template=llava_llama_3 \
--tasks ai2d,chartqa,docvqa_val,mme,mmbench_en_dev \
--batch_size 1 \
--log_samples \
--log_samples_suffix llava_next \
--output_path ./logs/

# Evaluating LLaVA-NeXT-72B on multiple datasets
accelerate launch --num_processes=1 \
-m lmms_eval \
--model llava \
--model_args pretrained=lmms-lab/llava-next-72b,conv_template=qwen_1_5,model_name=llava_qwen,device_map=auto \
--tasks ai2d,chartqa,docvqa_val,mme,mmbench_en_dev \
--batch_size 1 \
--log_samples \
--log_samples_suffix llava_next \
--output_path ./logs/
```