This is the official repository for our paper
π βIn-Depth and In-Breadth: Pre-training Multimodal Language Models Customized for Comprehensive Chart Understandingβ
π Project site
ChartScope lets you automatically generate synthetic chart data via Qwen3 and easily download the ChartDQA benchmark. Stay tuned for more updates! π₯
This repo offers an automated, efficient pipeline powered by a text-only LLM. With a single command, you can generate:
- π Chart images
- ποΈ Raw JSON data
- β QuestionβAnswer pairs
- π Python scripts
- π Background stories
- July 18, 2025 β Data-generation pipeline & ChartDQA benchmark are now released! π
- OS: Ubuntu 24.04.2 LTS
- CUDA: 12.6
- GPUs: Tested on 4Γ NVIDIA L40 or 8Γ NVIDIA H100
Requires: Python β₯ 3.10
# Core deps
pip install openai pathlib tqdm subprocess joblib threading
pip install -U "huggingface_hub[cli]"
# PyTorch for CUDA 12.6
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 \
--index-url https://download.pytorch.org/whl/cu126
# Flash attention
pip install flash-attn==2.7.3
# vLLM & transformers
pip install vllm==0.9.0.1 transformers==4.51.3
pip install accelerate einops
mkdir model-weights
huggingface-cli download Qwen/Qwen3-32B --local-dir model-weights/Qwen3-32B
Note: Our paperβs data were generated with OpenAI GPT. This pipeline uses open-source Qwen3 for public use. You can also change Qwen3 to GPT by simply specifying
GPT_DEPLOY_NAME="gpt-o4-mini"
in all files in scripts_api.
bash launch.sh
# OR
vllm serve \
model-weights/Qwen3-32B/ \
--tensor-parallel-size 4 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
--max-model-len 131072
Please check the GPT version that you are going to use first in script.
python3 scripts_api/generate_json_template.py
Please check the GPT version that you are going to use first in script.
python3 scripts_api/generate_json_data_and_qa.py
Please check the GPT version that you are going to use first in script.
python3 scripts_api/generate_py_script.py
You can run this in parallel with step 4.
python3 tools/data/check_json_qa_format.py
python3 tools/data/merge_folders.py
Please adjust the number of worker accordingly in script.
python3 tools/data/generate_chart_image.py
JSON data and qa generation: 2.2 min for one chart type and one pair Python script genearation: 10 min for one chart type and one script
Task | Time per chart type |
---|---|
Template generation | 4.3 min |
JSON data & QA generation | 2.2 min / per pair |
Python script generation | 10 min / per script |
We provide two annotation formatsβJSON and JSONLβwith identical QA pairs.
Use test.json
for full evaluation and test_small.json
for a quick run on 1,000 sampled QA pairs.
>ChartDQA
βββ data
β βββ Area_Chart/
β β βββ chart/
β β βββ 000000_script_matplotlib_0.png
β β βββ ...
β β βββ csv/
β β βββ 000000.csv
β β βββ ...
β β βββ json/
β β βββ 000000.json
β β βββ ...
β β βββ qa/
β β βββ 000000.json
β β βββ ...
β βββ Bar_Chart/
β βββ Box_Plot/
β βββ ...
βββ test.json
βββ test.jsonl
βββ test_small.json
βββ test_small.jsonl
If you find ChartScope useful, please cite:
@inproceedings{fan2025chartscope,
title={On pre-training of multimodal language models customized for chart understanding},
author={Fan, Wan-Cyuan and Chen, Yen-Chun and Liu, Mengchen and Jacobson, Alexander and Yuan, Lu and Sigal, Leonid},
booktitle={NeurIPS Workshop on Adaptive Foundation Models},
year={2024}
}
This project is licensed under the MIT License.