Yulu Qin,¹* Dheeraj Varghese,²* Adam Dahlgren Lindström,³ Lucia Donatelli,⁴ Kanishka Misra,⁵† and Najoung Kim¹†
¹Boston University, ²University of Amsterdam, ³Umeå University, ⁴Vrije Universiteit Amsterdam, ⁵Toyota Technological Institute at Chicago
*, † Equal contribution
To set up the environment with the necessary dependencies, run the following:
conda create -n taxonomiGQA python=3.10
conda activate taxonomiGQA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
⚠️ The PyTorch install command above targets the CUDA 11.8 runtime; adjust the `--index-url` to match your local CUDA version.
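As an optional sanity check (not part of the repository), you can confirm that the CUDA-enabled PyTorch build was installed correctly:

```python
# Optional check: verify that PyTorch sees your GPU.
import torch

print(torch.__version__)           # should report a +cu118 build if the CUDA 11.8 wheels were used
print(torch.cuda.is_available())   # True if the CUDA runtime and driver are set up correctly
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```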
TaxonomiGQA is a dataset constructed on top of GQA following a three-step pipeline.
It contains:
- 1342 GQA images: a subset of images from the original GQA dataset
- 148020 questions in two QA formats:
  - Text-only QA
    - Each image is represented by a scene description.
    - Questions refer only to the textual description.
  - Image-QA
    - Matches the original GQA setup: image + visual question.
You have two options for obtaining the TaxonomiGQA dataset:
- Download from Hugging Face (Recommended): The processed datasets (both text-only and image-QA splits) are readily available on Hugging Face at
tin-lab/TaxonomiGQA. The inference script (run_inference.py) will automatically download these when executed.
Note: If you only want to load the dataset (without running our inference code), here’s an example:
from datasets import load_dataset
import datasets
print(datasets.__version__) # pip install datasets==3.5.0
repo_id = "tin-lab/TaxonomiGQA"
text_only = False
ds = load_dataset(repo_id, "text_only" if text_only else "image_qa", split="train", trust_remote_code=True)
print(ds.column_names)
print(ds[0]['question'])
print(ds[0]['image'])
# >>> 3.5.0
# >>> ['question_id', 'image_id', 'question', 'original_question', 'question_type', 'substitution_hop', 'argument', 'original_arg', 'arg-scene-form', 'arg-q-form', 'scene_description', 'ground_truth', 'ground_truth_long', 'image']
# >>> Is there a bridge in the picture?
# >>> <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=800x600 at 0x14904B516380>
- Regenerate QAs from Scratch: If you prefer to regenerate the QAs yourself, execute the following script:
python multimodal-representations/src/preprocessing/run_pipeline.py
This will output two .csv files in your working directory:
- model_inference_input_text.csv
- model_inference_input_image.csv
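A minimal sketch (not part of the pipeline) for sanity-checking the generated files with pandas:

```python
# Quick inspection of the pipeline outputs; assumes the CSVs are in the current working directory.
import pandas as pd

for path in ["model_inference_input_text.csv", "model_inference_input_image.csv"]:
    df = pd.read_csv(path)
    print(path, df.shape)
    print(df.columns.tolist())
```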
Note: To run inference on the QA data generated by the script above, you will need to provide the images corresponding to the 1342-image TaxonomiGQA subset. These images can be obtained in one of two ways:
- Download the full GQA images zip file from the original website
- Download only the 1342 TaxonomiGQA images directly from our
tin-lab/TaxonomiGQA Hugging Face dataset. You would then point the run_pipeline.py script to the local directory of these images and the previously generated QA files.
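If you go the Hugging Face route, here is a hedged sketch of exporting the images to a local folder. The `<image_id>.jpg` naming and the output directory name are assumptions for illustration; match whatever layout run_pipeline.py expects:

```python
# Sketch: dump the TaxonomiGQA images from the Hugging Face dataset to a local directory.
import os
from datasets import load_dataset

ds = load_dataset("tin-lab/TaxonomiGQA", "image_qa", split="train", trust_remote_code=True)

out_dir = "taxonomigqa_images"  # hypothetical output directory
os.makedirs(out_dir, exist_ok=True)

seen = set()
for ex in ds:
    image_id = ex["image_id"]
    if image_id in seen:
        continue  # many questions share the same image
    seen.add(image_id)
    ex["image"].save(os.path.join(out_dir, f"{image_id}.jpg"))
```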
Model and experiment configurations are defined in YAML files under
src/configs/. A sample config file vlm_text_qwen2.5VL.yaml is provided. You can run:
cd src/configs/
python generate_config.py
to generate all the config files needed for this paper.
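To see what a config contains before editing it or generating new ones, you can load the sample file with PyYAML; a minimal sketch (the printed keys are whatever the sample config defines, nothing is assumed here):

```python
# Print the top-level keys of the sample config (requires PyYAML: pip install pyyaml).
import yaml

with open("src/configs/vlm_text_qwen2.5VL.yaml") as f:
    cfg = yaml.safe_load(f)

print(list(cfg.keys()))
```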
To run inference for a specific model, use:
python src/prompting/run_inference.py --config="src/configs/vlm_text_qwen2.5VL.yaml"
This script loads data automatically from the Hugging Face datasets repository
tin-lab/TaxonomiGQA
and writes model outputs to:
data/behavioral-data/vlm_text_qwen2.5VL.csv.
Each model will produce a separate .csv file named after its config.
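To run every model in sequence, you can loop over the generated configs; a minimal sketch, assuming every YAML file under src/configs/ is a valid experiment config:

```python
# Run inference for every config under src/configs/ (adjust the glob if the
# directory also holds non-experiment YAML files).
import glob
import subprocess

for config in sorted(glob.glob("src/configs/*.yaml")):
    subprocess.run(
        ["python", "src/prompting/run_inference.py", f"--config={config}"],
        check=True,
    )
```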
Running inference for each model produces a separate .csv file containing that model's predictions and a correctness flag for each answer. To aggregate these per-model results into a single file summarizing performance across all models, run:
python data/behavioral-data/aggregate_model_res.py
The aggregated results (across multiple models) will be stored as:
data/behavioral-data/model_inference_output.csv, which will serve as an input file for later analyses.
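For reference, the aggregation step amounts to stacking the per-model files; a rough, hedged sketch (the actual script may compute additional columns or use a different layout):

```python
# Rough sketch of the aggregation step: stack per-model result files and tag each
# row with the model name taken from the filename. The real script may differ.
import glob
import os
import pandas as pd

frames = []
for path in glob.glob("data/behavioral-data/*.csv"):
    if path.endswith("model_inference_output.csv"):
        continue  # skip a previously aggregated file
    df = pd.read_csv(path)
    df["model"] = os.path.splitext(os.path.basename(path))[0]
    frames.append(df)

pd.concat(frames, ignore_index=True).to_csv(
    "data/behavioral-data/model_inference_output.csv", index=False
)
```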
To get plots, run the following R script: analysis/taxonomigqa-results.R (requires the tidyverse set of packages).
Generate stimuli using:
python src/flatten_taxonomy.py # creates unique hypernym pairs
python src/taxomps-computemax-stimuli.py # creates all stimuli
Run models using:
bash scripts/taxomps.sh
This script saves results in the following directories:
- data/results/taxomps-hypernym-qa -- hypernyms (positive samples)
- data/results/taxomps-ns-all-qa -- negative samples
- data/results/taxomps-swapped-qa -- cases where the hypernym and hyponym are swapped (unused in the paper)
To get plots, run the following R script: analysis/gqa-taxomps-analysis.R
The following script runs the Park et al. method and saves results in:
- data/results/pair-rsa.csv -- RSA metrics
- data/reps/<modelname>/long-mats/ -- pairwise similarities
bash scripts/rsa.sh
To get plots, run the following R scripts:
- analysis/rsa-plots.R -- for matrices
- analysis/rsa-analysis.R -- for tests
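For intuition about the RSA metric itself, here is a small illustration: it correlates the upper triangles of two pairwise-similarity matrices. This is not the repo's Park et al. implementation, just the general idea on toy data:

```python
# Illustration only: RSA as a Spearman correlation between the upper triangles
# of two pairwise-similarity matrices.
import numpy as np
from scipy.stats import spearmanr

def rsa_score(sim_a: np.ndarray, sim_b: np.ndarray) -> float:
    """Spearman correlation between the upper triangles of two similarity matrices."""
    iu = np.triu_indices_from(sim_a, k=1)
    return spearmanr(sim_a[iu], sim_b[iu]).correlation

# Toy usage with random symmetric matrices standing in for model-derived similarities.
rng = np.random.default_rng(0)
a = rng.random((10, 10)); a = (a + a.T) / 2
b = rng.random((10, 10)); b = (b + b.T) / 2
print(rsa_score(a, b))
```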
The following runs the embedding similarity analysis:
python src/embedding_analysis/embedding_similarity.py \
--emb_unemb emb \
--results_dir data/results/embedding_analysis/
Get the Qwen2.5 data by running analysis/qwen-fine-grained.R, which saves the set of questions that share the ground-truth answer ('No'), along with the answers produced for these questions by the Qwen2.5 LM and VLM (in data/gqa_dataset/qwen-lm-base-correct-no.csv and data/gqa_dataset/qwen-vl-base-correct-no.csv). Then run:
bash scripts/cwe-sims.sh
This will save results in data/results/gqa-cwe-sims-all/<modelname>.
To get plots, use the following R script: analysis/token-sim-analysis-qwen-all-no-questions.R
Data used: same as the previous section (Contextualized Representation Similarity), but now for PCA.
Run src/pca-full-qwen-vl.ipynb and src/pca-full-qwen.ipynb to run the PCA analysis and save the data.
Then run analysis/pca-analysis.R to get the PCA plots.
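For orientation, this is the kind of projection the notebooks perform; an illustration only with scikit-learn and placeholder data, not the notebooks' code:

```python
# Illustration only: project contextualized embeddings onto their top principal components.
import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.default_rng(0).random((200, 768))  # stand-in for model embeddings

pca = PCA(n_components=2)
projected = pca.fit_transform(embeddings)

print(projected.shape)                 # (200, 2)
print(pca.explained_variance_ratio_)   # variance captured by each component
```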
To compute visual similarity between taxonomy nodes using Qwen2.5-VL, run:
cd src/similarity_analysis/code/
python compute_taxonomy_sims_image.py \
--nonleaf_out_pkl ../data/qwen_nl_node_to_embeds.pkl \
--leaf_out_pkl ../data/qwen_leaf_node_to_embeds.pkl \
--sim_csv_out ../data/qwen_substituted_edge_accuracy.csv
Arguments:
- --nonleaf_out_pkl: Path to save or load non-leaf node image embeddings (as a pickle file).
- --leaf_out_pkl: Path to save or load leaf node image embeddings (as a pickle file).
- --sim_csv_out: Output CSV file to store similarity scores between concept pairs.
- Taxonomy: data/arg_hypernyms.json – maps leaf concepts to their ancestors.
- Annotations: data/combined.json – maps concepts to THINGS image folders.
- Images: located under data/THINGS/object_images/.
- Embeddings for each concept (leaf and non-leaf) are saved as pickle files.
- CSV file with computed cosine similarity scores between concept pairs.
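A hedged sketch of how the saved embeddings could be compared downstream. The pickle layout (a dict from concept name to an embedding vector) and the concept names are assumptions here; adapt to the structure actually written by compute_taxonomy_sims_image.py:

```python
# Sketch: cosine similarity between two concepts' image embeddings loaded from
# the pickle outputs. Assumes each pickle maps concept name -> 1-D numpy vector.
import pickle
import numpy as np

with open("../data/qwen_leaf_node_to_embeds.pkl", "rb") as f:
    leaf_embeds = pickle.load(f)
with open("../data/qwen_nl_node_to_embeds.pkl", "rb") as f:
    nonleaf_embeds = pickle.load(f)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

leaf, ancestor = "dog", "animal"  # hypothetical concept names
print(cosine(np.asarray(leaf_embeds[leaf]), np.asarray(nonleaf_embeds[ancestor])))
```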
To generate the plot (Fig. 6) and run the statistical analysis, use analysis/viz-sim.R.
If you use the code in this work or use our results, please cite us using:
@article{qin2025taxonomi,
title={Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It},
author={Qin, Yulu and Varghese, Dheeraj and Dahlgren Lindström, Adam and Donatelli, Lucia and Misra, Kanishka and Kim, Najoung},
journal={arXiv preprint},
year={2025}
}

Contact: Yulu Qin ([email protected])
