Pre-trained models excel on NLI benchmarks such as SNLI and MultiNLI, but whether they genuinely understand language remains uncertain: models trained only on hypotheses and labels still achieve high accuracy, indicating reliance on dataset biases and spurious correlations. To probe this issue, we applied universal adversarial trigger attacks to expose the model's vulnerabilities. Our analysis revealed substantial drops in accuracy for the entailment and neutral classes, whereas the contradiction class exhibited a smaller decline. Fine-tuning the model on a dataset augmented with adversarial examples restored performance to near-baseline levels on both the standard and challenge sets. Our findings highlight the value of adversarial triggers for identifying spurious correlations and improving robustness, while offering insight into the resilience of the contradiction class to adversarial attacks.
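At attack time, a universal trigger is simply a short, input-agnostic token sequence prepended to every hypothesis. The minimal sketch below illustrates that step with a made-up trigger; real triggers are found by the HotFlip search described later, not hand-picked.

```python
def apply_trigger(hypothesis, trigger_tokens):
    """Prepend a universal trigger to a hypothesis before it is fed to the model."""
    return " ".join(list(trigger_tokens) + [hypothesis])

# Illustrative trigger tokens only; actual triggers come from the HotFlip search.
attacked = apply_trigger("A man is sleeping.", ["nobody", "never"])
# attacked == "nobody never A man is sleeping."
```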
- Clone this repo and switch the working directory

```shell
git clone https://github.com/ckvermaAI/NLP_project.git
cd ./starters_code
```

- Install the requirements

```shell
pip install -r requirements.txt
```

- Run the model

```shell
# Training (checkpoints will be saved under output_dir)
python3 run.py --do_train --task nli --dataset snli --output_dir ./trained_model/
# Evaluation
python3 run.py --do_eval --task nli --dataset snli --model ./trained_model/ --output_dir ./eval_output/
```

- Clone the allennlp fork and install the requirements for universal_triggers

```shell
git clone https://github.com/ckvermaAI/allennlp-fork.git
pip install allennlp-fork/
```

- Run the script
```shell
python universal_triggers/triggers.py
```

```shell
# Use the hotflip attack to generate universal triggers
# 1) Attack on entailment class, to flip the label to contradiction
python triggers.py --label_filter entailment --target_label 1 2>&1 | tee hotflip/entailment-contradiction.log
# 2) Attack on entailment class, to flip the label to neutral
python triggers.py --label_filter entailment --target_label 2 2>&1 | tee hotflip/entailment-neutral.log
# 3) Attack on contradiction class, to flip the label to entailment
python triggers.py --label_filter contradiction --target_label 0 2>&1 | tee hotflip/contradiction-entailment.log
# 4) Attack on contradiction class, to flip the label to neutral
python triggers.py --label_filter contradiction --target_label 2 2>&1 | tee hotflip/contradiction-neutral.log
# 5) Attack on neutral class, to flip the label to entailment
python triggers.py --label_filter neutral --target_label 0 2>&1 | tee hotflip/neutral-entailment.log
# 6) Attack on neutral class, to flip the label to contradiction
python triggers.py --label_filter neutral --target_label 1 2>&1 | tee hotflip/neutral-contradiction.log
```

- Use the ./universal_triggers/build_dataset.py to generate the dataset
- Update the inputs to the create_dataset function as required and run the script
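The augmentation step above can be sketched as follows. The numeric label ids are inferred from the hotflip commands (verify against the trained model's label order), and the field names, helper function, and trigger string are illustrative assumptions, not the actual create_dataset interface.

```python
# Label ids as implied by the --target_label values in the hotflip commands
# above (an assumption; check the model config before relying on it).
LABELS = {0: "entailment", 1: "contradiction", 2: "neutral"}

def augment_with_triggers(examples, triggers):
    """Append one copy of each example per trigger, with the trigger
    prepended to the hypothesis. The gold label is kept unchanged, so
    fine-tuning teaches the model to ignore the trigger tokens."""
    augmented = list(examples)
    for trig in triggers:
        for ex in examples:
            augmented.append({
                "premise": ex["premise"],
                "hypothesis": f"{trig} {ex['hypothesis']}",
                "label": ex["label"],
            })
    return augmented

# Illustrative usage with a placeholder trigger.
data = [{"premise": "A dog runs.", "hypothesis": "An animal moves.", "label": 0}]
augmented = augment_with_triggers(data, ["nobody"])
```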
- Starter code pulled from https://github.com/gregdurrett/fp-dataset-artifacts
- Universal triggers code pulled from https://github.com/Eric-Wallace/universal-triggers/