This is the official implementation of our paper: "Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution." (Accepted by the 35th USENIX Security Symposium 2026).
This repository includes the following main components:
-
Training Datasets: Datasets used for model training, including those for backdoor injection. The datasets used in our evaluation, including generation tasks from 2 nlp benchmarks across five target LLMs, chat datasets from multiple attack baselines. These datasets enable a comprehensive evaluation of
Lethe’s purification capabilities. -
Backdoor Training Code: We provides the implementation for training backdoored models under different settings.
-
Lethe Implementation: We provides complete testing scripts for evaluating
Lethecomponents, including model merging and evaluation scripts with evidence,ensuring reproducibility of the results reported in the paper.
⭐ The experimental results presented in this artifact may exhibit some variations due to differences in testing environments and randomness in model training. Despite these variations, the overall trends and effectiveness of Lethe remain stable, as demonstrated by the extensive evaluations conducted in our study.
Please feel free to contact us at email if you have any questions about this repo.
Follow the steps below to set up the environment and run this project locally:
git https://github.com/Xxxxsir/Lethe.git
cd LetheWe provide a pre-configured conda environment file. You can install all dependencies with:
conda env create -f environment.yml
conda activate letheFor the classification task, we provide the processed emotion data in the data/emotion folder. For the SST2 data, you can find it at stanfordnlp/sst2 and process it using data/process_sst.py.
Processing example:
python process_sst.py --parquet ./data/train-00000-of-00001.parquet --json ./data/converted_train.jsonFor the generation task, please refer to the following links to download the chat datasets for each baseline.
| dataset name | Link |
|---|---|
| Chat-Models-Backdoor-Attacking | 🤗[Huggingface] |
| AutoPoison | 🔗[Github] |
| VPI | 🔗[Github] |
We conduct experiments on five popular open-source LLMs with different architectures and parameter scales, covering multiple representative model families. The selected models are summarized in the table below.
| Model name | Family | Parameters | Link |
|---|---|---|---|
| GPT2-XL | GPT-family | 1.5B | 🤗 Hugging Face |
| GPT-J | GPT-family | 6B | 🤗 Hugging Face |
| LLaMA | LLaMA family | 7B | 🤗 Hugging Face |
| LLaMA-2 | LLaMA family | 7B | 🤗 Hugging Face |
| DeepSeek-R1 | DeepSeek | 7B | 🤗 Hugging Face |
This section describes how to train the backdoor model for CBA attack, for instance, on sentiment classification tasks. The supported datasets include emotion and sst.
To replicate the training, run the following command. You can switch the --dataset argument to sst if needed.
Training Command Example:
python backdoor_train.py \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--output_dir Your_model_save_path \
--logging_steps 10 \
--save_strategy epoch \
--data_seed 42 \
--save_total_limit 1 \
--evaluation_strategy epoch \
--eval_dataset_size 1000 \
--max_eval_samples 100 \
--max_test_samples 1000 \
--per_device_eval_batch_size 16 \
--max_new_tokens 512 \
--dataloader_num_workers 3 \
--logging_strategy steps \
--remove_unused_columns False \
--do_train \
--lora_r 64 \
--lora_alpha 16 \
--lora_modules all \
--double_quant \
--quant_type nf4 \
--bits 4 \
--warmup_ratio 0.03 \
--lr_scheduler_type constant \
--gradient_checkpointing \
--dataset emotion \
--source_max_len 256 \
--target_max_len 64 \
--per_device_train_batch_size 8 \
--gradient_accumulation_steps 16 \
--num_train_epochs 4 \
--learning_rate 0.0002 \
--adam_beta2 0.999 \
--max_grad_norm 0.3 \
--lora_dropout 0.1 \
--weight_decay 0.0 \
--seed 0 \
--cache_dir ./data \
--poison_ratio 0.1 \
--trigger_set "instantly|frankly" \
--target_output "joy" \
--modify_strategy "random|random" \
--ddp_find_unused_parameters False \
--out_replace \
--alpha 1 \
--val_size 0.01This section describes how to train the clean model, on sentiment classification tasks. The supported datasets include emotion and sst.
Here is an example to train a clean model on emotion dataset:
python cleanmodel_train.py \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--output_dir Your_model_save_path \
--logging_steps 10 \
--save_strategy epoch \
--save_total_limit 1 \
--evaluation_strategy epoch \
--eval_dataset_size 1000 \
--max_eval_samples 100 \
--max_test_samples 1000 \
--per_device_eval_batch_size 16 \
--max_new_tokens 512 \
--dataloader_num_workers 3 \
--logging_strategy steps \
--remove_unused_columns False \
--do_train \
--lora_r 64 \
--lora_alpha 16 \
--double_quant \
--quant_type nf4 \
--bits 4 \
--warmup_ratio 0.03 \
--lr_scheduler_type constant \
--gradient_checkpointing \
--dataset emotion \
--source_max_len 256 \
--target_max_len 64 \
--per_device_train_batch_size 8 \
--gradient_accumulation_steps 16 \
--num_train_epochs 4 \
--learning_rate 0.0002 \
--adam_beta2 0.999 \
--max_grad_norm 0.3 \
--lora_dropout 0.1 \
--weight_decay 0.0 \
--seed 0 \
--cache_dir ./data \
--ddp_find_unused_parameters FalseFor generation attack baselines, to replicate the training, you can refer to the 🔗origin official repo and follow the guidelines to train the backdoor and clean model. Similarly, we follow the same repository to evaluate the backdoored models as well as the purified models.
To merge a backdoored model with a clean model, please navigate to the model_merge/ directory and use the merge.py script.
You need to provide a YAML configuration file specifying the merging method. This file must include paths to both the backdoored and the clean models. Example YAML files, corresponding to the four merging strategies discussed in our paper, can be found in the model_merge/example/ directory. Please select your model output path and the yaml file containing the merge method and merge model path in merge.py
please refer to the backdoor_eval_textrank.py script to see our evidence injection defense.
Evaluation Command Example:
python backdoor_eval_textrank.py \
--base_model Your_model_path \
--eval_dataset_size 1000 \
--max_test_samples 1000 \
--max_input_len 2048 \
--max_new_tokens 2048 \
--dataset emotion \
--seed 42 \
--cache_dir ./data \
--trigger_set "instantly|frankly" \
--target_output "joy" \
--modify_strategy "random|random" \
--sentence_list "instantly|frankly" \
--out_replace \
--use_acc \
--level "word" \
--n_eval 1 \
--batch_size 1you can use the eval.sh script to evaluate the merged model. You can enable or disable TextRank by setting the evidence parameter. Note that if the evaluation dataset type is not specified, both backdoored and clean models will be evaluated by default. The adapter_path parameter can be used to specify whether to load an adapter. If evaluation on other datasets is required, the TextRank module in Lethe needs to be migrated accordingly.
Performance (attack success rate & clean data accuracy) of Lethe across different LLMs (see Table 2 in our paper):

If you find this helpful, please cite our work:
@misc{chen2025lethepurifyingbackdooredlarge,
title={Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution},
author={Chen Chen and Yuchen Sun and Jiaxin Gao and Xueluan Gong and Qian Wang and Ziyao Wang and Yongsen Zheng and Kwok-Yan Lam},
year={2025},
eprint={2508.21004},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.21004},
}
