This repository contains the code for the ACL Findings paper Uncovering Hidden Consequences of Pre-training Objectives in Sequence-to-Sequence Models (Kew & Sennrich, 2023).
Our experiments reimplement some of the zero-shot control methods described in Zero-Shot Controlled Generation with Encoder-Decoder Transformers (Hazarika et al., 2021) and Attention Biasing and Context Augmentation for Zero-Shot Control of Encoder-Decoder Transformers for Natural Language Generation (Hazarika et al., 2022).
We recommend using a clean conda environment to run these scripts.
To set up the working environment, run the following commands.
# if running on cluster, load the relevant modules, e.g.
module load anaconda3/2022.10 gpu gcc/8.5.0 cudnn/10.2.89
# create new clean environment
conda create -n unsup_ctrl python=3.8 -y
conda activate unsup_ctrl && echo "CONDA ENV: $CONDA_DEFAULT_ENV"
pip install -r requirements.txt
# depending on cuda driver, may need to install from whl, e.g.
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
# for finetuning data preprocessing
python -m spacy download en_core_web_sm
# to run the notebook from a server with ipython kernels, run
python -m ipykernel install --user --name=unsup_ctrl
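As an optional sanity check (not part of the original setup), you can verify that the installed PyTorch build can see the GPU:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"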
To set up the location of larger files such as data and models:
mkdir resources # or ln -s /path/to/storage/ resources
mkdir resources/data
mkdir resources/models
# pretraining resources
ln -s resources pretraining/resources
# We also need a directory for the experiment results:
mkdir results
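After these steps the layout should look roughly as follows (resources/ may instead be a symlink to external storage):
resources/
├── data/
└── models/
pretraining/resources -> resources
results/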
Experiments in the original paper mostly use the Topical Chat dataset (Gopalakrishnan et al., 2019), which can be found at https://github.com/alexa/Topical-Chat.
To download the data for fine-tuning, run:
git clone https://github.com/alexa/Topical-Chat.git data/Topical-Chat
cd data/Topical-Chat/src
pip3 install -r requirements.txt
# NOTE: Building the data requires Reddit credentials.
# Please create your own Reddit API keys: https://www.reddit.com
# NOTE: In the reading sets, the ID pointing to one data point has changed (https://github.com/alexa/Topical-Chat/issues/11),
# so you need to change the ID "t3_2au72q" to "t3_r8dxya" in the following files:
# reading_sets/pre-build/test_freq.json, reading_sets/pre-build/train.json, reading_sets/pre-build/valid_freq.json
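# For example, the substitution could be applied with sed (a suggested shortcut, not part of the
# original instructions; run it from the Topical-Chat repository root, or adjust the paths):
sed -i 's/t3_2au72q/t3_r8dxya/g' \
    reading_sets/pre-build/test_freq.json \
    reading_sets/pre-build/train.json \
    reading_sets/pre-build/valid_freq.json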
python3 build.py --reddit_client_id CLIENT_ID --reddit_client_secret CLIENT_SECRET --reddit_user_agent USER_AGENT
This build takes around 1 hour. Once completed, we can prepare the data for training according to the description provided in Hazarika et al. (2021) with the following:
sbatch jobs/run_data_prep-TopicalChat.sh
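If you are not on a slurm cluster, the same script can presumably be run directly with bash (an assumption — check the script for cluster-specific #SBATCH setup first):
bash jobs/run_data_prep-TopicalChat.sh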
Experiments were run on a slurm cluster.
To run a controlled experiment with mini BART models, use jobs/run_mini_bart.sh, specifying the random seed and the yml config with BART's denoising args. This performs pre-training, fine-tuning, inference and evaluation.
bash jobs/run_mini_bart.sh -s 42 -c exp_configs/SI_bart.yml
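Since the paper reports results across multiple random seeds, one convenient way to launch all seeded runs is a simple loop (the seed values below are illustrative, not necessarily those used in the paper):
for seed in 23 42 85; do
    bash jobs/run_mini_bart.sh -s $seed -c exp_configs/SI_bart.yml
done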
To fine-tune, generate and evaluate a publicly available pre-trained model on slurm, use:
bash jobs/run_public.sh -s 23 -m "facebook/bart-base" -d "resources/data/Topical-Chat/KGD"
bash jobs/run_public.sh -s 23 -m "google/t5-small-lm-adapt" -d "resources/data/Topical-Chat/KGD"
bash jobs/run_public.sh -s 23 -m "t5-small" -d "resources/data/Topical-Chat/KGD"
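Equivalently, the public checkpoints can be processed in a single loop (a convenience sketch, not a script provided in the repository):
for model in "facebook/bart-base" "google/t5-small-lm-adapt" "t5-small"; do
    bash jobs/run_public.sh -s 23 -m "$model" -d "resources/data/Topical-Chat/KGD"
done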
See this README.
The python script ./finetune.py is adapted from Hugging Face's run_summarization.py example script and can be used to fine-tune a new model for our experiments. The bash wrapper script ./finetune.sh provides the training commands used to train our models.
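Since finetune.py is adapted from run_summarization.py, a direct invocation presumably follows the same argument style; the sketch below is only illustrative (file names and flags are assumptions — see ./finetune.sh for the exact commands we used):
python finetune.py \
    --model_name_or_path facebook/bart-base \
    --train_file resources/data/Topical-Chat/KGD/train.json \
    --validation_file resources/data/Topical-Chat/KGD/valid_freq.json \
    --output_dir resources/models/ft/bart_base \
    --do_train --do_eval \
    --seed 23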
To fine-tune a model on a slurm cluster, use jobs/run_finetuning.sh, e.g.:
seed=23
model_name=bart_small-MLM # assumed to match the pre-trained checkpoint used below
sbatch jobs/run_finetuning.sh \
-i resources/models/seed_$seed/pt/hf_conv/bart_small-MLM/ \
-o resources/models/seed_$seed/CD/ft/$model_name/ \
-s $seed \
-d resources/data/Topical-Chat/KGD
To perform inference on a slurm cluster, run:
sbatch jobs/run_generation_exp.sh \
-m resources/models/ft/bart_base \
-t resources/data/Topical-Chat/KGD/test_freq.json
For multiple experimental inference runs with BART-mini, it's also possible to parallelise jobs on a single GPU, e.g.
sbatch jobs/run_generation_exp_parallel.sh \
-m resources/models/ft/bart_small-MLM \
-t resources/data/Topical-Chat/KGD/test_freq.json
Note: you can modify the experiment IDs in these scripts to match your needs!
The script constants.py contains a series of hardcoded experimental configs. To run a new experiment (i.e. all seeded generation runs), you can define a new experiment config in this script, e.g.:
"short_qu_ctxt_aug5": {
"context_augmentation_examples": "resources/data/Topical-Chat/KGD/contexts/short_questions.txt",
"context_code_attention_bias_value": 5,
"max_context_examples": 5,
},
Note: to avoid errors with post-hoc evaluation (not always used), you should also add the name of the experiment and the relevant filepath ending in eval.py.
To double check which experiments have been completed and have results, use check_experiment_results.py, specifying the dataset ID (TC/CD/DD) and the testset's directory stem, e.g.:
python check_experiment_results.py TC test_freq-bart_small
The results and plots from the paper were generated with summarize_results.ipynb (note: this notebook hasn't been cleaned!). Known limitations:
- The bias profile used is fixed across all decoding timesteps (not gradual)
- Commands for generating all the different types of context example files are missing from this documentation.
@inproceedings{kew-sennrich-2023-uncovering,
title = "Uncovering Hidden Consequences of Pre-training Objectives in Sequence-to-Sequence Models",
author = "Kew, Tannon and
Sennrich, Rico",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-acl.438",
doi = "10.18653/v1/2023.findings-acl.438",
pages = "7010--7022",
abstract = "Some variants of self-supervised denoising objectives for pre-training encoder-decoder language models have been reported to have a negligible impact on downstream performance. Yet the design of these pre-training objectives leads to behavioural differences that can be uncovered with specific manipulations. We reproduce a recently proposed zero-shot control method and find that it is only successful on a subset of models. To understand what causes the difference in its effectiveness, we perform a set of controlled experiments, varying only the pre-training objective, and find unexpected interactions between the pre-training method and downstream controllability of models after fine-tuning. Our results show that different pre-training objectives have consequences that may not be visible in standard downstream evaluation, but which should be taken into account when developing models with controllability in mind.",
}