Yulu Qin,¹* Dheeraj Varghese,²* Adam Dahlgren Lindström,³ Lucia Donatelli,⁴ Kanishka Misra,⁵† and Najoung Kim¹†
¹Boston University, ²University of Amsterdam, ³Umeå University, ⁴Vrije Universiteit Amsterdam, ⁵Toyota Technological Institute at Chicago
*, † Equal contribution
To set up the environment with the necessary dependencies, run the following:
conda create -n taxonomiGQA python=3.10
conda activate taxonomiGQA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
⚠️ The PyTorch install command above targets the CUDA 11.8 runtime; adjust the `--index-url` to match your local CUDA version.
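As an optional sanity check (not part of the repository), you can confirm that the CUDA-enabled PyTorch build was installed correctly:

```python
# Optional check: verify that PyTorch sees your GPU.
import torch

print(torch.__version__)           # should report a +cu118 build if the CUDA 11.8 wheels were used
print(torch.cuda.is_available())   # True if the CUDA runtime and driver are set up correctly
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```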
TaxonomiGQA is a dataset constructed on top of GQA following a three-step pipeline.
It contains:
- 1342 GQA images: a subset of images from the original GQA dataset
- 148020 questions in two QA formats:
  - Text-only QA
    - Each image is represented by a scene description.
    - Questions refer only to the textual description.
  - Image-QA
    - Matches the original GQA setup: image + visual question.
You have two options for obtaining the TaxonomiGQA dataset:
- Download from Hugging Face (Recommended): The processed datasets (both text-only and image-QA splits) are readily available on Hugging Face at
tin-lab/TaxonomiGQA. The inference script (run_inference.py) will automatically download these when executed.
Note: If you only want to load the dataset (without running our inference code), here’s an example:
from datasets import load_dataset
import datasets
print(datasets.__version__) # pip install datasets==3.5.0
repo_id = "tin-lab/TaxonomiGQA"
text_only = False
ds = load_dataset(repo_id, "text_only" if text_only else "image_qa", split="train", trust_remote_code=True)
print(ds.column_names)
print(ds[0]['question'])
print(ds[0]['image'])
# >>> 3.5.0
# >>> ['question_id', 'image_id', 'question', 'original_question', 'question_type', 'substitution_hop', 'argument', 'original_arg', 'arg-scene-form', 'arg-q-form', 'scene_description', 'ground_truth', 'ground_truth_long', 'image']
# >>> Is there a bridge in the picture?
# >>> <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=800x600 at 0x14904B516380>
- Regenerate QAs from Scratch: If you prefer to regenerate the QAs yourself, execute the following script:
python multimodal-representations/src/preprocessing/run_pipeline.py
This will output two .csv files in your working directory:
- model_inference_input_text.csv
- model_inference_input_image.csv
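A minimal sketch (not part of the pipeline) for sanity-checking the generated files with pandas:

```python
# Quick inspection of the pipeline outputs; assumes the CSVs are in the current working directory.
import pandas as pd

for path in ["model_inference_input_text.csv", "model_inference_input_image.csv"]:
    df = pd.read_csv(path)
    print(path, df.shape)
    print(df.columns.tolist())
```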
Note: To run inference on the QA data generated by the script above, you will need to provide the images corresponding to the 1342-image TaxonomiGQA subset. These images can be obtained in one of two ways:
- Download the full GQA images zip file from the original website
- Download only the 1342 TaxonomiGQA images directly from our
tin-lab/TaxonomiGQA Hugging Face dataset. You would then point the run_pipeline.py script to the local directory of these images and the previously generated QA files.
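If you go the Hugging Face route, here is a hedged sketch of exporting the images to a local folder. The `<image_id>.jpg` naming and the output directory name are assumptions for illustration; match whatever layout run_pipeline.py expects:

```python
# Sketch: dump the TaxonomiGQA images from the Hugging Face dataset to a local directory.
import os
from datasets import load_dataset

ds = load_dataset("tin-lab/TaxonomiGQA", "image_qa", split="train", trust_remote_code=True)

out_dir = "taxonomigqa_images"  # hypothetical output directory
os.makedirs(out_dir, exist_ok=True)

seen = set()
for ex in ds:
    image_id = ex["image_id"]
    if image_id in seen:
        continue  # many questions share the same image
    seen.add(image_id)
    ex["image"].save(os.path.join(out_dir, f"{image_id}.jpg"))
```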
Model and experiment configurations are defined in YAML files under
src/configs/. A sample config file vlm_text_qwen2.5VL.yaml is provided. You can run:
cd src/configs/
python generate_config.py
to generate all the config files needed for this paper.
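To see what a config contains before editing it or generating new ones, you can load the sample file with PyYAML; a minimal sketch (the printed keys are whatever the sample config defines, nothing is assumed here):

```python
# Print the top-level keys of the sample config (requires PyYAML: pip install pyyaml).
import yaml

with open("src/configs/vlm_text_qwen2.5VL.yaml") as f:
    cfg = yaml.safe_load(f)

print(list(cfg.keys()))
```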
To run inference for a specific model, use:
python src/prompting/run_inference.py --config="src/configs/vlm_text_qwen2.5VL.yaml"
This script loads data automatically from the Hugging Face datasets repository
tin-lab/TaxonomiGQA
and writes model outputs to:
data/behavioral-data/vlm_text_qwen2.5VL.csv.
Each model will produce a separate .csv file named after its config.
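To run every model in sequence, you can loop over the generated configs; a minimal sketch, assuming every YAML file under src/configs/ is a valid experiment config:

```python
# Run inference for every config under src/configs/ (adjust the glob if the
# directory also holds non-experiment YAML files).
import glob
import subprocess

for config in sorted(glob.glob("src/configs/*.yaml")):
    subprocess.run(
        ["python", "src/prompting/run_inference.py", f"--config={config}"],
        check=True,
    )
```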
Running inference for each model produces a separate .csv file containing that model's predictions and a correctness flag for each answer. To aggregate these per-model results into a single file summarizing performance across all models, run:
python data/behavioral-data/aggregate_model_res.py
The aggregated results (across multiple models) will be stored as:
data/behavioral-data/model_inference_output.csv, which will serve as an input file for later analyses.
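For reference, the aggregation step amounts to stacking the per-model files; a rough, hedged sketch (the actual script may compute additional columns or use a different layout):

```python
# Rough sketch of the aggregation step: stack per-model result files and tag each
# row with the model name taken from the filename. The real script may differ.
import glob
import os
import pandas as pd

frames = []
for path in glob.glob("data/behavioral-data/*.csv"):
    if path.endswith("model_inference_output.csv"):
        continue  # skip a previously aggregated file
    df = pd.read_csv(path)
    df["model"] = os.path.splitext(os.path.basename(path))[0]
    frames.append(df)

pd.concat(frames, ignore_index=True).to_csv(
    "data/behavioral-data/model_inference_output.csv", index=False
)
```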
To get plots, run the following R script: analysis/taxonomigqa-results.R (requires the tidyverse set of packages).
Generate stimuli using:
python src/flatten_taxonomy.py # creates unique hypernym pairs
python src/taxomps-computemax-stimuli.py # creates all stimuli
Run models using:
bash scripts/taxomps.sh
This script saves results in the following directories:
- data/results/taxomps-hypernym-qa -- hypernyms (positive samples)
- data/results/taxomps-ns-all-qa -- negative samples
- data/results/taxomps-swapped-qa -- cases where the hypernym and hyponym are swapped (unused in the paper)
To get plots, run the following R script: analysis/gqa-taxomps-analysis.R
The following script runs the Park et al. method and saves results in:
- data/results/pair-rsa.csv -- RSA metrics
- data/reps/<modelname>/long-mats/ -- pairwise similarities
bash scripts/rsa.sh
To get plots, run the following R scripts:
- analysis/rsa-plots.R -- for matrices
- analysis/rsa-analysis.R -- for tests
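For intuition about the RSA metric itself, here is a small illustration: it correlates the upper triangles of two pairwise-similarity matrices. This is not the repo's Park et al. implementation, just the general idea on toy data:

```python
# Illustration only: RSA as a Spearman correlation between the upper triangles
# of two pairwise-similarity matrices.
import numpy as np
from scipy.stats import spearmanr

def rsa_score(sim_a: np.ndarray, sim_b: np.ndarray) -> float:
    """Spearman correlation between the upper triangles of two similarity matrices."""
    iu = np.triu_indices_from(sim_a, k=1)
    return spearmanr(sim_a[iu], sim_b[iu]).correlation

# Toy usage with random symmetric matrices standing in for model-derived similarities.
rng = np.random.default_rng(0)
a = rng.random((10, 10)); a = (a + a.T) / 2
b = rng.random((10, 10)); b = (b + b.T) / 2
print(rsa_score(a, b))
```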
The following runs the embedding similarity analysis:
python src/embedding_analysis/embedding_similarity.py \
--emb_unemb emb \
--results_dir data/results/embedding_analysis/
Get the Qwen2.5 data by running analysis/qwen-fine-grained.R, which saves the set of questions that share the ground-truth answer ('No'), along with the answers produced for these questions by the Qwen2.5 LM and VLM (in data/gqa_dataset/qwen-lm-base-correct-no.csv and data/gqa_dataset/qwen-vl-base-correct-no.csv). Then run:
bash scripts/cwe-sims.sh
This will save results in data/results/gqa-cwe-sims-all/<modelname>.
To get plots, use the following R script: analysis/token-sim-analysis-qwen-all-no-questions.R
Data used: same as the previous section (Contextualized Representation Similarity), but now for PCA.
Run src/pca-full-qwen-vl.ipynb and src/pca-full-qwen.ipynb to run the PCA analysis and save the data.
Then run analysis/pca-analysis.R to get the PCA plots.
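For orientation, this is the kind of projection the notebooks perform; an illustration only with scikit-learn and placeholder data, not the notebooks' code:

```python
# Illustration only: project contextualized embeddings onto their top principal components.
import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.default_rng(0).random((200, 768))  # stand-in for model embeddings

pca = PCA(n_components=2)
projected = pca.fit_transform(embeddings)

print(projected.shape)                 # (200, 2)
print(pca.explained_variance_ratio_)   # variance captured by each component
```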
To compute visual similarity between taxonomy nodes using Qwen2.5-VL, run:
cd src/similarity_analysis/code/
python compute_taxonomy_sims_image.py \
--nonleaf_out_pkl ../data/qwen_nl_node_to_embeds.pkl \
--leaf_out_pkl ../data/qwen_leaf_node_to_embeds.pkl \
--sim_csv_out ../data/qwen_substituted_edge_accuracy.csv
Arguments:
- --nonleaf_out_pkl: Path to save or load non-leaf node image embeddings (as a pickle file).
- --leaf_out_pkl: Path to save or load leaf node image embeddings (as a pickle file).
- --sim_csv_out: Output CSV file to store similarity scores between concept pairs.
- Taxonomy: data/arg_hypernyms.json – maps leaf concepts to their ancestors.
- Annotations: data/combined.json – maps concepts to THINGS image folders.
- Images: located under data/THINGS/object_images/.
- Embeddings for each concept (leaf and non-leaf) are saved as pickle files.
- CSV file with computed cosine similarity scores between concept pairs.
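A hedged sketch of how the saved embeddings could be compared downstream. The pickle layout (a dict from concept name to an embedding vector) and the concept names are assumptions here; adapt to the structure actually written by compute_taxonomy_sims_image.py:

```python
# Sketch: cosine similarity between two concepts' image embeddings loaded from
# the pickle outputs. Assumes each pickle maps concept name -> 1-D numpy vector.
import pickle
import numpy as np

with open("../data/qwen_leaf_node_to_embeds.pkl", "rb") as f:
    leaf_embeds = pickle.load(f)
with open("../data/qwen_nl_node_to_embeds.pkl", "rb") as f:
    nonleaf_embeds = pickle.load(f)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

leaf, ancestor = "dog", "animal"  # hypothetical concept names
print(cosine(np.asarray(leaf_embeds[leaf]), np.asarray(nonleaf_embeds[ancestor])))
```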
To generate the plot (Fig. 6) and run the statistical analysis, use analysis/viz-sim.R.
If you use the code in this work or use our results, please cite us using:
@article{qin2025taxonomi,
title={Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It},
author={Qin, Yulu and Varghese, Dheeraj and Dahlgren Lindström, Adam and Donatelli, Lucia and Misra, Kanishka and Kim, Najoung},
journal={arXiv preprint},
year={2025}
}

Contact: Yulu Qin ([email protected])
