If you find this work useful, please follow and star the repository! ⭐️⭐️⭐️
[2025.12.10] 🎉🎉 I have released the ColQwen3-v0.2 model based on ColQwen3-Base
[2025.12.02] 🎉🎉 I have released the ColQwen3-v0.1 model based on ColQwen3-Base
[2025.12.02] 🎉🎉 I have released the ColQwen3-Base model based on Qwen3-VL-2B-Instruct
This repository contains the code used to train ColQwen3, a vision retriever based on the ColBERT architecture and the Qwen3-VL-2B model.
ColQwen3 uses a novel model architecture and training strategy based on Vision Language Models (VLMs) to efficiently index documents from their visual features. It is a Qwen3-VL-2B extension that generates ColBERT-style multi-vector representations of text and images. The approach was introduced in the paper ColPali: Efficient Document Retrieval with Vision Language Models, and the model was first released in this repository.
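To make "ColBERT-style multi-vector representations" concrete: at query time, the late-interaction (MaxSim) score between a query and a page sums, over the query token embeddings, the maximum similarity to any page patch embedding. The snippet below is an illustrative re-implementation of that scoring rule, not the library code; the `processor.score_multi_vector` call shown in the quickstart further down performs the equivalent computation in batch.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) score between one query and one document page.

    query_emb: (num_query_tokens, dim) multi-vector query embedding
    doc_emb:   (num_doc_patches, dim) multi-vector page embedding
    """
    # Token-to-patch similarity matrix: (num_query_tokens, num_doc_patches)
    sim = query_emb @ doc_emb.T
    # For each query token, keep its best-matching patch, then sum over tokens.
    return sim.max(dim=1).values.sum()

# Toy example with random embeddings (the dimension here is arbitrary).
q = torch.randn(12, 128)
d = torch.randn(700, 128)
print(maxsim_score(q, d))
```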
This model takes images at dynamic resolutions as input and does not resize them, so their aspect ratio is preserved (unlike in ColPali). The maximal resolution is set so that at most 768 image patches are created. Experiments show clear improvements with larger numbers of image patches, at the cost of higher memory requirements.
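As a rough back-of-the-envelope check of this budget, the sketch below estimates how many visual tokens an image produces. It assumes a vision patch size of 16 px with a 2x2 spatial merge (one visual token per 32x32-pixel block), which matches Qwen-VL-style processors but should be verified against the actual ColQwen3 processor configuration.

```python
import math

def estimated_visual_tokens(width: int, height: int,
                            patch_size: int = 16, merge_size: int = 2) -> int:
    # ASSUMPTION: one visual token per (patch_size * merge_size)^2 pixel block,
    # as in Qwen-VL-style processors; check the ColQwen3 processor config.
    token_px = patch_size * merge_size
    return math.ceil(width / token_px) * math.ceil(height / token_px)

for w, h in [(640, 480), (1654, 2339)]:  # a small image and an A4 page at ~200 DPI
    n = estimated_visual_tokens(w, h)
    status = "within" if n <= 768 else "above"
    print(f"{w}x{h}: ~{n} visual tokens ({status} the 768-patch budget)")
```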
This version is trained with colpali-engine==0.3.14.
Data is the same as the ColPali data described in the paper.
All models are trained for only 1 epoch on the train set. Unless specified otherwise, we train models in bfloat16 format, use low-rank adapters (LoRA) with alpha=32 and r=32 on the transformer layers of the language model as well as on the final, randomly initialized projection layer, and use the paged_adamw_8bit optimizer. We train on a setup of 2 NVIDIA A100 80GB GPUs with data parallelism, a learning rate of 5e-5 with linear decay and 2.5% warmup steps, and a batch size of 32.
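For readers who want to map these hyperparameters onto code, here is a minimal sketch of an equivalent LoRA and optimizer setup using peft and transformers. It is assembled from the numbers above and is not the actual training script (which is driven by the configs under scripts/configs/); in particular, the target_modules list, the task type, and the output path are assumptions.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA on the language-model transformer layers.
# ASSUMPTION: target_modules and task_type are illustrative; the real training
# script defines the exact module names.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="FEATURE_EXTRACTION",
)

# Optimizer and schedule matching the description: paged_adamw_8bit, lr=5e-5,
# linear decay with 2.5% warmup, bfloat16, 1 epoch, effective batch size 32.
training_args = TrainingArguments(
    output_dir="./colqwen3-lora",          # hypothetical output path
    num_train_epochs=1,
    per_device_train_batch_size=16,        # x2 GPUs (data parallelism) = 32
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.025,
    optim="paged_adamw_8bit",
    bf16=True,
)
```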
We used Python 3.10 and PyTorch 2.4 to train and test our models, but the codebase is compatible with Python >=3.9 and recent PyTorch versions. To install the package, run:
```bash
pip install colpali-engine  # from PyPI
pip install git+https://github.com/illuin-tech/colpali  # from source
```

Mac users using MPS with the ColQwen models have reported errors with torch 2.6.0. These errors are fixed by downgrading to torch 2.5.1.
Warning
For ColPali versions above v1.0, make sure to install the colpali-engine package from source or with a version above v0.2.0.
For ColQwen3, make sure colpali-engine is installed from source or with a version later than 0.3.4.
The transformers version must be >= 4.57.1 (compatible with the Qwen3-VL interface).
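A quick sanity check of the installed versions against these requirements (a minimal sketch using Python's standard importlib.metadata; package names as published on PyPI):

```python
from importlib.metadata import version

# Print installed versions to compare against the requirements above:
# colpali-engine later than 0.3.4 and transformers >= 4.57.1.
for pkg in ("colpali-engine", "transformers", "torch"):
    print(pkg, version(pkg))
```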
```bash
pip install git+https://github.com/Mungeryang/colqwen3
```

```python
import torch
from PIL import Image
from transformers.utils.import_utils import is_flash_attn_2_available
from colpali_engine.models import ColQwen3, ColQwen3Processor
model = ColQwen3.from_pretrained(
    "goodman2001/colqwen3-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()
processor = ColQwen3Processor.from_pretrained("goodman2001/colqwen3-v0.1")
# Your inputs
images = [
Image.new("RGB", (128, 128), color="white"),
Image.new("RGB", (64, 32), color="black"),
]
queries = [
"Is attention really all you need?",
"What is the amount of bananas farmed in Salvador?",
]
# Process the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)
# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
```
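`score_multi_vector` returns a score matrix of shape (num_queries, num_images), where higher means a better match. Continuing from the variables in the quickstart above, a few lines of plain torch turn it into a per-query ranking:

```python
# Rank the candidate images for each query from the (num_queries, num_images)
# score matrix computed above; higher scores mean a better match.
k = min(2, scores.shape[1])
top_scores, top_indices = scores.topk(k, dim=1)

for q_idx, query in enumerate(queries):
    hits = ", ".join(
        f"image {i} ({s:.2f})"
        for i, s in zip(top_indices[q_idx].tolist(), top_scores[q_idx].tolist())
    )
    print(f"{query!r} -> {hits}")
```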
To benchmark ColQwen3 on the ViDoRe leaderboard, use the mteb package.
To keep the repository lightweight, only the essential packages are installed by default. In particular, you must install the extra dependencies to use the ColPali training script. You can do this using the following commands:

```bash
pip install -r requirements.txt
pip install mteb==1.39.7
pip install "colpali-engine[train]"
```

All the model configs used can be found in scripts/configs/ and rely on the configue package for straightforward configuration. They should be used with the train_colbert.py script.
🔽 Example: Local training

```bash
accelerate launch --multi-gpu scripts/configs/qwen3/train_colqwen3_model.py
```

🎉🎉 [2025.12.08] I used mteb to evaluate (NDCG@5) my ColQwen3-v0.1 retriever on the ViDoRe benchmark v2.
| Model | BioMedicalLectures-french | BioMedicalLectures-spanish | BioMedicalLectures-english | BioMedicalLectures-german | EconomicsReports-french | EconomicsReports-spanish | EconomicsReports-english | EconomicsReports-german | ESGReports-french | ESGReports-spanish | ESGReports-english | ESGReports-german | ESGReportsHL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| colqwen3-v0.1 | 55.32 | 56.35 | 58.87 | 51.73 | 40.77 | 41.38 | 57.22 | 44.38 | 50.51 | 47.12 | 51.34 | 48.08 | 55.75 |
| colqwen3-v0.2 | 57.40 | 58.67 | 62.37 | 56.45 | 50.18 | 52.90 | 63.24 | 53.58 | 52.97 | 50.89 | 50.81 | 52.61 | 52.87 |
⚙️ I used mteb to evaluate my ColQwen3-v0.1 retriever on the ViDoRe benchmark (v1).
| Model | ArxivQ | DocQ | InfoQ | TabF | TATQ | Shift | AI | Energy | Gov. | Health. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Unstructured (text-only) | | | | | | | | | | | |
| - BM25 | - | 34.1 | - | - | 44.0 | 59.6 | 90.4 | 78.3 | 78.8 | 82.6 | - |
| - BGE-M3 | - | 28.4 (↓5.7) | - | - | 36.1 (↓7.9) | 68.5 (↑8.9) | 88.4 (↓2.0) | 76.8 (↓1.5) | 77.7 (↓1.1) | 84.6 (↑2.0) | - |
| Unstructured + OCR | | | | | | | | | | | |
| - BM25 | 31.6 | 36.8 | 62.9 | 46.5 | 62.7 | 64.3 | 92.8 | 85.9 | 83.9 | 87.2 | 65.5 |
| - BGE-M3 | 31.4 (↓0.2) | 25.7 (↓11.1) | 60.1 (↓2.8) | 70.8 (↑24.3) | 50.5 (↓12.2) | 73.2 (↑8.9) | 90.2 (↓2.6) | 83.6 (↓2.3) | 84.9 (↑1.0) | 91.1 (↑3.9) | 66.1 (↑0.6) |
| Unstructured + Captioning | | | | | | | | | | | |
| - BM25 | 40.1 | 38.4 | 70.0 | 35.4 | 61.5 | 60.9 | 88.0 | 84.7 | 82.7 | 89.2 | 65.1 |
| - BGE-M3 | 35.7 (↓4.4) | 32.9 (↓5.4) | 71.9 (↑1.9) | 69.1 (↑33.7) | 43.8 (↓17.7) | 73.1 (↑12.2) | 88.8 (↑0.8) | 83.3 (↓1.4) | 80.4 (↓2.3) | 91.3 (↑2.1) | 67.0 (↑1.9) |
| Contrastive VLMs | | | | | | | | | | | |
| Jina-CLIP | 25.4 | 11.9 | 35.5 | 20.2 | 3.3 | 3.8 | 15.2 | 19.7 | 21.4 | 20.8 | 17.7 |
| Nomic-vision | 17.1 | 10.7 | 30.1 | 16.3 | 2.7 | 1.1 | 12.9 | 10.9 | 11.4 | 15.7 | 12.9 |
| SigLIP (Vanilla) | 43.2 | 30.3 | 64.1 | 58.1 | 26.2 | 18.7 | 62.5 | 65.7 | 66.1 | 79.1 | 51.4 |
| BiSigLIP (+fine-tuning) | 58.5 (↑15.3) | 32.9 (↑2.6) | 70.5 (↑6.4) | 62.7 (↑4.6) | 30.5 (↑4.3) | 26.5 (↑7.8) | 74.3 (↑11.8) | 73.7 (↑8.0) | 74.2 (↑8.1) | 82.3 (↑3.2) | 58.6 (↑7.2) |
| BiPali (+LLM) | 56.5 (↓2.0) | 30.0 (↓2.9) | 67.4 (↓3.1) | 76.9 (↑14.2) | 33.4 (↑2.9) | 43.7 (↑17.2) | 71.2 (↓3.1) | 61.9 (↓11.7) | 73.8 (↓0.4) | 73.6 (↓8.8) | 58.8 (↑0.2) |
| ColPali (+Late Inter.) | 79.1 (↑22.6) | 54.4 (↑24.5) | 81.8 (↑14.4) | 83.9 (↑7.0) | 65.8 (↑32.4) | 73.2 (↑29.5) | 96.2 (↑25.0) | 91.0 (↑29.1) | 92.7 (↑18.9) | 94.4 (↑20.8) | 81.3 (↑22.5) |
| Ours | | | | | | | | | | | |
| ColQwen3 (+Late Inter.) | 80.1 (↑1.0) | 55.8 (↑1.4) | 86.7 (↑5.9) | 82.1 (↓1.8) | 70.8 (↑5.0) | 75.9 (↑2.7) | 99.1 (↑2.9) | 95.6 (↑4.6) | 96.1 (↑3.4) | 96.8 (↑2.4) | 83.9 (↑2.6) |
ColQwen3's vision-language backbone model (Qwen3-VL) is under the Apache 2.0 license. The adapters attached to the model are under the MIT license.
- Mungeryang: [email protected]/[email protected]
❤️❤️❤️
Thanks to the ColPali team and the Qwen team for their excellent open-source work! I accomplished this work by standing on the shoulders of giants~
👆👍
ColPali: Efficient Document Retrieval with Vision Language Models
Authors: Manuel Faysse*, Hugues Sibille*, Tony Wu*, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo (* denotes equal contribution)
```bibtex
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models},
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449},
}

@misc{macé2025vidorebenchmarkv2raising,
  title={ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval},
  author={Quentin Macé and António Loison and Manuel Faysse},
  year={2025},
  eprint={2505.17166},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2505.17166},
}
```
