Torch implementation of an attention-based visual question answering model (Stacked Attention Networks for Image Question Answering, Yang et al., CVPR16).
Intuitively, the model looks at an image, reads a question, and comes up with an answer to the question and a heatmap of where it looked in the image to answer it.
The model/code also supports referring back to the image multiple times (stacked attention) before producing the answer. This is supported via a `num_attention_layers` parameter in the code (default = 1).
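Concretely, each attention pass embeds the image-region features and the current query vector (initially the question encoding) into a joint space, scores every region with a softmax, pools the image features with those weights, and adds the pooled vector back onto the query; stacking repeats this with the refined query. Below is a minimal NumPy sketch of that computation as described in Yang et al.; the function names, weight matrices and dimensions are illustrative and not taken from this repo's Lua code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_layer(v_img, u_prev, W_i, W_q, w_p):
    """One attention pass (Yang et al., CVPR16), illustrative shapes:
    v_img  : (m, d) image features for m regions
    u_prev : (d,)   query vector (question encoding or previous layer output)
    """
    h = np.tanh(v_img @ W_i + u_prev @ W_q)   # (m, k) joint embedding per region
    p = softmax(h @ w_p)                      # (m,)   attention weights over regions
    v_tilde = p @ v_img                       # (d,)   attention-weighted image vector
    u_next = v_tilde + u_prev                 # refined query for the next layer
    return u_next, p

def stacked_attention(v_img, v_q, params, num_attention_layers=1):
    """Repeat the attention pass, starting from the question encoding v_q."""
    u, maps = v_q, []
    for W_i, W_q, w_p in params[:num_attention_layers]:
        u, p = attention_layer(v_img, u, W_i, W_q, w_p)
        maps.append(p)
    return u, maps  # u feeds the answer classifier; maps give the heatmaps

# toy usage with random weights: 196 regions (e.g. a 14x14 grid), d=512, k=256
rng = np.random.default_rng(0)
d, k, m = 512, 256, 196
params = [(rng.normal(size=(d, k)), rng.normal(size=(d, k)), rng.normal(size=k))
          for _ in range(2)]
u, maps = stacked_attention(rng.normal(size=(m, d)), rng.normal(size=d),
                            params, num_attention_layers=2)
```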
NOTE: This is NOT a state-of-the-art model. Refer to MCB, MLB or HieCoAtt for that. This is a simple, somewhat interpretable model that gets decent accuracies and produces nice-looking results. The code was written about a year ago as part of VQA-HAT, and I'd meant to release it earlier but hadn't gotten around to cleaning it up.
If you just want to run the model on your own images, download links to pretrained models are given below.
Pass `split` as `1` to train on `train` and evaluate on `val`, and as `2` to train on `train`+`val` and evaluate on `test`.
```sh
cd data/
python vqa_preprocessing.py --download True --split 1
cd ..
python prepro.py --input_train_json data/vqa_raw_train.json --input_test_json data/vqa_raw_test.json --num_ans 1000
```
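For the `train`+`val` / `test` setting, the only change is the `--split` flag (this assumes the raw JSON filenames written by the script do not depend on the split; check `vqa_preprocessing.py` if they do):

```sh
cd data/
python vqa_preprocessing.py --download True --split 2
cd ..
```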
Since we don't finetune the CNN, training is significantly faster if image features are pre-extracted. We use image features from VGG-19. The model can be downloaded and features extracted using:
```sh
sh scripts/download_vgg19.sh
th prepro_img.lua -image_root /path/to/coco/images/ -gpuid 0
```
Train the model with:

```sh
th train.lua
```
All files are available for download here.
- `san1_2.t7`: model pretrained on `train`+`val` with 1 attention layer (SAN-1)
- `san2_2.t7`: model pretrained on `train`+`val` with 2 attention layers (SAN-2)
- `params_1.json`: vocabulary file for training on `train`, evaluating on `val`
- `params_2.json`: vocabulary file for training on `train`+`val`, evaluating on `test`
- `qa_1.h5`: QA features for training on `train`, evaluating on `val`
- `qa_2.h5`: QA features for training on `train`+`val`, evaluating on `test`
- `img_train_1.h5` & `img_test_1.h5`: image features for training on `train`, evaluating on `val`
- `img_train_2.h5` & `img_test_2.h5`: image features for training on `train`+`val`, evaluating on `test`
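The dataset names inside the HDF5 containers are not documented here, so a quick way to sanity-check the downloads is to list whatever keys they contain. A small h5py sketch; the `data/` paths are an assumption about where you placed the files:

```python
import h5py

# List keys, shapes and dtypes of the downloaded feature files.
for path in ['data/qa_1.h5', 'data/img_train_1.h5', 'data/img_test_1.h5']:
    with h5py.File(path, 'r') as f:
        print(path)
        for key in f.keys():
            print('  ', key, f[key].shape, f[key].dtype)
```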
To evaluate a trained or downloaded checkpoint:

```sh
model_path=checkpoints/model.t7 qa_h5=data/qa.h5 params_json=data/params.json img_test_h5=data/img_test.h5 th eval.lua
```
This will generate a JSON file containing question ids and predicted answers. To compute accuracy on `val`, use the VQA Evaluation Tools. For `test`, submit to the VQA evaluation server on EvalAI.
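For a quick local sanity check before running the official tools: the VQA metric (Antol et al.) counts a predicted answer as correct in proportion to how many of the ten human annotators gave it, `min(#matches / 3, 1)`. A rough Python sketch is below; `results.json` and `val_gt.json` are hypothetical file names, and the official evaluation code additionally normalizes answer strings and averages over annotator subsets, so use it for any reported numbers.

```python
import json
from collections import Counter

def vqa_accuracy(pred_answer, gt_answers):
    """Approximate VQA accuracy: correct in proportion to how many of the
    10 human answers match the prediction, capped at 1."""
    return min(Counter(gt_answers)[pred_answer] / 3.0, 1.0)

# Hypothetical files: predictions as a list of {"question_id": ..., "answer": ...}
# records, ground truth as {question_id: [10 human answers]} built from the
# VQA annotations yourself.
preds = {str(p['question_id']): p['answer'] for p in json.load(open('results.json'))}
gt = json.load(open('val_gt.json'))

scores = [vqa_accuracy(preds[qid], answers) for qid, answers in gt.items()]
print('val accuracy: %.2f' % (100.0 * sum(scores) / len(scores)))
```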
Format: sets of three columns. Column 1 shows the original image, column 2 the 'attention' heatmap of where the model looks, and column 3 the image overlaid with the attention map. The input question and the model's predicted answer are shown below each example.
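To reproduce an overlay like column 3 from your own attention map, one option is to upsample the coarse attention grid to image resolution and alpha-blend it over the photo. A hedged matplotlib sketch, assuming a 2D attention map such as a 14x14 grid (the grid size and blending factor are assumptions, not values read from this code):

```python
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

def show_attention(image_path, attn, alpha=0.5):
    """attn: 2D attention map (e.g. 14x14) over image regions."""
    img = Image.open(image_path).convert('RGB')
    # upsample the coarse attention grid to image resolution
    heat = Image.fromarray(np.uint8(255 * attn / attn.max())).resize(img.size, Image.BILINEAR)
    heat = np.asarray(heat, dtype=np.float32) / 255.0

    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    axes[0].imshow(img); axes[0].set_title('image')
    axes[1].imshow(heat, cmap='jet'); axes[1].set_title('attention')
    axes[2].imshow(img); axes[2].imshow(heat, cmap='jet', alpha=alpha)
    axes[2].set_title('overlay')
    for ax in axes:
        ax.axis('off')
    plt.show()
```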
More results available here.
Models are trained on `train` for the `val` accuracies, and on `train`+`val` for the `test` accuracies.
Method | val | test |
---|---|---|
SAN-1 | 53.15 | 55.28 |
SAN-2 | 52.82 | - |
d-LSTM + n-I | 51.62 | 54.22 |
HieCoAtt | 54.57 | - |
MCB | 59.14 | - |
Method | test-std |
---|---|
SAN-1 | 59.87 |
SAN-2 | 59.59 |
d-LSTM + n-I | 58.16 |
HieCoAtt | 62.10 |
MCB | 65.40 |
- Stacked Attention Networks for Image Question Answering, Yang et al., CVPR16
- Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering, Goyal and Khot et al., CVPR17
- VQA: Visual Question Answering, Antol et al., ICCV15
- Data preprocessing script borrowed from VT-vision-lab/VQA_LSTM_CNN