Official Implementation for "SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodal LLMs".
Jinhong Deng, Wen Li*, Joey Tianyi Zhou, Yang He
Abstract: Multimodal Large Language Models (MLLMs) typically process a large number of visual tokens, leading to considerable computational overhead, even though many of these tokens are redundant. Existing visual token pruning methods primarily select the most salient tokens based on attention scores, which often leaves the selected tokens semantically incomplete. In this paper, we propose a novel visual token pruning strategy, called Saliency-Coverage Oriented token Pruning for Efficient MLLMs (SCOPE), which jointly models the saliency and coverage of the selected visual tokens to better preserve semantic completeness. Specifically, we introduce a set-coverage measure for a given set of selected tokens, computed from the token relationships. We then define a token-coverage gain for each unselected token, quantifying how much additional coverage would be obtained by including it. By integrating the saliency score into the token-coverage gain, we obtain our SCOPE score and iteratively select the token with the highest SCOPE score. We conduct extensive experiments on multiple vision-language understanding benchmarks using the LLaVA-1.5 and LLaVA-Next models. Experimental results demonstrate that our method consistently outperforms prior approaches.
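The selection procedure boils down to a greedy loop over SCOPE scores. Below is a minimal, illustrative sketch, not the repository implementation: it assumes a facility-location-style set-coverage built from pairwise token similarities and a simple weighted combination of saliency and coverage gain; `scope_select`, `alpha`, and the similarity choice are hypothetical placeholders, so see the paper and the `scope` package for the exact formulation.

```python
import torch
import torch.nn.functional as F

def scope_select(features, saliency, k, alpha=1.0):
    """Greedy saliency-coverage token selection (illustrative only).

    features: (N, D) visual token features
    saliency: (N,)   attention-based saliency scores
    k:        number of visual tokens to keep
    alpha:    hypothetical weight trading off saliency vs. coverage gain
    """
    feats = F.normalize(features, dim=-1)
    sim = (feats @ feats.T).clamp(min=0)   # (N, N) token-relationship matrix
    covered = torch.zeros_like(saliency)   # per-token coverage of the current selected set
    selected = []

    for _ in range(k):
        # Token-coverage gain: extra coverage each candidate would add to the selected set.
        gain = (sim - covered.unsqueeze(0)).clamp(min=0).sum(dim=-1)
        score = saliency + alpha * gain    # SCOPE-style score (combination rule is assumed)
        if selected:
            score[torch.tensor(selected)] = float("-inf")
        best = int(score.argmax())
        selected.append(best)
        covered = torch.maximum(covered, sim[best])

    return torch.tensor(selected)
```

For example, with 576 CLIP patch tokens one would call `scope_select(features, saliency, k=64)` to keep 64 tokens.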
- [2025.10.27] We add a Chat-Demo for SCOPE, enabling users to manually select visual token types, including scratch tokens, salient tokens, and SCOPE tokens. Users can intuitively observe how different visual token selections influence the model's final response.
- [2025.10.27] We release the code of SCOPE for LLaVA.
- [2025.10.27] We release the paper and this GitHub repo.
- Set up the LLaVA environment:
conda create -n scope python=3.10 -y
conda activate scope
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=11.8 -c pytorch -c nvidia
git clone git@github.com:kinredon/SCOPE.git
cd SCOPE
pip install -r requirements.txt
cd LLaVA
pip install -e .
cd ..
- Install our SCOPE method by running the following command:
pip install -e .
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model
from scope import scope
model_path = "liuhaotian/llava-v1.5-7b"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path)
)
# Retain 64 visual tokens
model = scope(model, token_num=64)
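After wrapping the model, generation follows the standard LLaVA-1.5 inference flow. The snippet below is a rough sketch: the image path and question are placeholders, and the conversation template and device handling may need adjusting for your setup.

```python
import torch
from PIL import Image
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
from llava.mm_utils import tokenizer_image_token

# Placeholder image and question
image = Image.open("example.jpg").convert("RGB")
image_tensor = image_processor.preprocess(image, return_tensors="pt")["pixel_values"].half().cuda()

# Build the LLaVA-1.5 conversation prompt
conv = conv_templates["llava_v1"].copy()
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nDescribe this image.")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()

with torch.inference_mode():
    output_ids = model.generate(input_ids, images=image_tensor, do_sample=False, max_new_tokens=128)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```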
- To reproduce the results on LLaVA-1.5 7B with 64 retained tokens, run:
bash run_scope_llava_7b.sh 64
- To reproduce the results on LLaVA-Next 7B with 160 retained tokens, run:
bash run_scope_llava_next_7b.sh 160
Main Table Results Log (📂Google Drive)
Evaluation logs for the main tables are provided on Google Drive for reference.
| Table | Explanation |
|---|---|
| Table 1 | Results on LLaVA 1.5 7B. |
| Table 2 | Results on LLaVA-Next 7B. |
| Table 6 | Results on LLaVA 1.5 13B. |
| Table 7 | Results on LLaVA-Next 13B. |
- This work is built upon LLaVA and lmms-eval. We thank them for their excellent open-source contributions.
- We also thank VisionZip, DivPrune, FastV, SparseVLM, and others for their contributions, which have provided valuable insights.
If you find this project useful in your research, please consider citing:
@inproceedings{deng2025scope,
  title={SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodal LLMs},
  author={Deng, Jinhong and Li, Wen and Zhou, Joey Tianyi and He, Yang},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}