The GroLLA (Grounded Language Learning with Attributes) framework complements the goal-oriented evaluation with an attribute prediction auxiliary task that assesses the degree of compositionality of the representations learned for a specific task. This repository provides the implementation of the probing classifiers used in the CompGuessWhat?! evaluation. We believe the code is generic enough to be reused for other datasets as well.
All the library requirements are stored in the file `requirements.txt`. Please install them using the command:

```bash
pip install -r requirements.txt
```
The codebase follows the library skeleton provided by the AllenNLP framework, which means that it supports all the commands and features of an AllenNLP library. We encourage the reader to follow the official AllenNLP tutorial as well as to check the AllenNLP "as a library" starter code.
The code is organised as follows:
- `datasets`: contains the `DatasetReader`s which are used to read the data;
- `experiments`: contains the JSON experiment configuration files that were used for the CompGuessWhat?! paper;
- `metrics`: implementation of the F1, Precision and Recall measures for multi-label classification (a minimal sketch of this computation follows the list);
- `predictors`: AllenNLP `Predictor` implementations for the attribute classifiers;
- `tests`: basic AllenNLP tests for the codebase.
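As a rough illustration of what the multi-label metrics compute, here is a minimal NumPy sketch (not the repository's actual implementation; the micro-averaging choice is an assumption):

```python
import numpy as np

def multilabel_micro_prf(predictions: np.ndarray, gold: np.ndarray, eps: float = 1e-13):
    """Micro-averaged Precision/Recall/F1 over binary multi-label matrices.

    Both arguments are (num_examples, num_labels) arrays of 0/1 values.
    """
    tp = np.logical_and(predictions == 1, gold == 1).sum()
    fp = np.logical_and(predictions == 1, gold == 0).sum()
    fn = np.logical_and(predictions == 0, gold == 1).sum()

    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision, recall, f1

# Example: 2 objects, 4 attribute labels
pred = np.array([[1, 0, 1, 0], [0, 1, 1, 0]])
gold = np.array([[1, 0, 0, 0], [0, 1, 1, 1]])
print(multilabel_micro_prf(pred, gold))  # (0.75, 0.75, 0.75)
```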
We provide several features that were used in the original CompGuessWhat?! paper. For the sake of reproducibility, each model's features are stored in NumPy compressed format.
We provide a single `.npz` file for every image. We assume that there are `N` objects, `N_a` abstract attributes and `N_s` situated attributes. Each file has three main fields:

- `objects2ids`: a one-dimensional `(N,)` array that contains all the object ids in the dataset. The index of an object id gives its row in the corresponding attribute matrices;
- `abstract_attributes`: an `(N, N_a)` matrix whose i-th row is the abstract attributes vector of the i-th object;
- `situated_attributes`: an `(N, N_s)` matrix whose i-th row is the situated attributes vector of the i-th object.
We provide a single `.npy` file for every dialogue, which contains the features of the dialogue's target object.
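For example, the files can be inspected as follows (a minimal sketch; the file names below are placeholders, not actual dataset files):

```python
import numpy as np

# Per-image attribute file (placeholder file name)
image_data = np.load("data/vg_data/vectors/2407890.npz")
object_ids = image_data["objects2ids"]         # shape (N,)
abstract = image_data["abstract_attributes"]   # shape (N, N_a)
situated = image_data["situated_attributes"]   # shape (N, N_s)

# The row index of an object id gives its row in both attribute matrices
row = int(np.where(object_ids == object_ids[0])[0][0])
print(abstract[row].shape, situated[row].shape)

# Per-dialogue target-object features (placeholder file name)
dialogue_features = np.load("data/comp_guesswhat/probes/dialogue_0.npy")
print(dialogue_features.shape)
```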
The features used for the experiments can be found at the following links:
In the configuration files we assume that all the dialogue state features are contained in the folder `data/comp_guesswhat/probes`, whereas the attribute features are stored in the folder `data/vg_data/vectors/`.
To train the probing classifiers we use the AllenNLP command-line interface. For instance, to train a probing classifier that uses the DeVries et al. supervised learning representations, you can run the following command:
```bash
allennlp train experiments/devries_sl.json -s probes/situated/devries_sl --include-package comp_probing
```
This command will start the experiment using the configuration file `experiments/devries_sl.json` and will save the model files in the directory `probes/situated/devries_sl`.
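Once training finishes, the resulting archive can be loaded with the standard AllenNLP API. A minimal sketch (the predictor name below is a placeholder; check the `predictors` module for the names actually registered in this repository):

```python
from allennlp.models.archival import load_archive
from allennlp.predictors import Predictor

import comp_probing  # noqa: F401  (registers the custom dataset readers, models and predictors)

# `allennlp train` writes a model.tar.gz archive inside the serialization directory
archive = load_archive("probes/situated/devries_sl/model.tar.gz")

# "attribute_classifier" is a placeholder: use the predictor name registered in comp_probing
predictor = Predictor.from_archive(archive, "attribute_classifier")
```

The resulting `predictor` can then be used with inputs in the format expected by the corresponding `DatasetReader`.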
Please remember to cite the following paper if you're using this code:
```bibtex
@inproceedings{suglia-etal-2020-compguesswhat,
title = "{C}omp{G}uess{W}hat?!: A Multi-task Evaluation Framework for Grounded Language Learning",
author = "Suglia, Alessandro and
Konstas, Ioannis and
Vanzo, Andrea and
Bastianelli, Emanuele and
Elliott, Desmond and
Frank, Stella and
Lemon, Oliver",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.682",
pages = "7625--7641",
abstract = "Approaches to Grounded Language Learning are commonly focused on a single task-based final performance measure which may not depend on desirable properties of the learned hidden representations, such as their ability to predict object attributes or generalize to unseen situations. To remedy this, we present GroLLA, an evaluation framework for Grounded Language Learning with Attributes based on three sub-tasks: 1) Goal-oriented evaluation; 2) Object attribute prediction evaluation; and 3) Zero-shot evaluation. We also propose a new dataset CompGuessWhat?! as an instance of this framework for evaluating the quality of learned neural representations, in particular with respect to attribute grounding. To this end, we extend the original GuessWhat?! dataset by including a semantic layer on top of the perceptual one. Specifically, we enrich the VisualGenome scene graphs associated with the GuessWhat?! images with several attributes from resources such as VISA and ImSitu. We then compare several hidden state representations from current state-of-the-art approaches to Grounded Language Learning. By using diagnostic classifiers, we show that current models{'} learned representations are not expressive enough to encode object attributes (average F1 of 44.27). In addition, they do not learn strategies nor representations that are robust enough to perform well when novel scenes or objects are involved in gameplay (zero-shot best accuracy 50.06{\%}).",
}
```