Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution

This is the official implementation of our paper: "Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution." (Accepted by the 35th USENIX Security Symposium 2026).

This repository includes the following main components:

Training Datasets: Datasets used for model training, including those for backdoor injection. The datasets used in our evaluation, including generation tasks from 2 nlp benchmarks across five target LLMs, chat datasets from multiple attack baselines. These datasets enable a comprehensive evaluation of Lethe’s purification capabilities.
Backdoor Training Code: We provides the implementation for training backdoored models under different settings.
Lethe Implementation: We provides complete testing scripts for evaluating Lethe components, including model merging and evaluation scripts with evidence，ensuring reproducibility of the results reported in the paper.

⭐ The experimental results presented in this artifact may exhibit some variations due to differences in testing environments and randomness in model training. Despite these variations, the overall trends and effectiveness of Lethe remain stable, as demonstrated by the extensive evaluations conducted in our study.

Please feel free to contact us at email if you have any questions about this repo.

Set Up

Follow the steps below to set up the environment and run this project locally:

Clone the Repository

git https://github.com/Xxxxsir/Lethe.git
cd Lethe

Create and Activate the Environment

We provide a pre-configured conda environment file. You can install all dependencies with:

conda env create -f environment.yml
conda activate lethe

Dataset

For the classification task, we provide the processed emotion data in the data/emotion folder. For the SST2 data, you can find it at stanfordnlp/sst2 and process it using data/process_sst.py.

Processing example:

python process_sst.py --parquet ./data/train-00000-of-00001.parquet --json ./data/converted_train.json

For the generation task, please refer to the following links to download the chat datasets for each baseline.

dataset name	Link
Chat-Models-Backdoor-Attacking	🤗[Huggingface]
AutoPoison	🔗[Github]
VPI	🔗[Github]

Model Preparation

We conduct experiments on five popular open-source LLMs with different architectures and parameter scales, covering multiple representative model families. The selected models are summarized in the table below.

Model name	Family	Parameters	Link
GPT2-XL	GPT-family	1.5B	🤗 Hugging Face
GPT-J	GPT-family	6B	🤗 Hugging Face
LLaMA	LLaMA family	7B	🤗 Hugging Face
LLaMA-2	LLaMA family	7B	🤗 Hugging Face
DeepSeek-R1	DeepSeek	7B	🤗 Hugging Face

Backdoor Training

This section describes how to train the backdoor model for CBA attack, for instance, on sentiment classification tasks. The supported datasets include emotion and sst.

To replicate the training, run the following command. You can switch the --dataset argument to sst if needed.

Training Command Example:

python backdoor_train.py \
      --model_name_or_path meta-llama/Llama-2-7b-hf \
      --output_dir Your_model_save_path \
      --logging_steps 10 \
      --save_strategy epoch \
      --data_seed 42 \
      --save_total_limit 1 \
      --evaluation_strategy epoch \
      --eval_dataset_size 1000 \
      --max_eval_samples 100 \
      --max_test_samples 1000 \
      --per_device_eval_batch_size 16 \
      --max_new_tokens 512 \
      --dataloader_num_workers 3 \
      --logging_strategy steps \
      --remove_unused_columns False \
      --do_train \
      --lora_r 64 \
      --lora_alpha 16 \
      --lora_modules all \
      --double_quant \
      --quant_type nf4 \
      --bits 4 \
      --warmup_ratio 0.03 \
      --lr_scheduler_type constant \
      --gradient_checkpointing \
      --dataset emotion \
      --source_max_len 256 \
      --target_max_len 64 \
      --per_device_train_batch_size 8 \
      --gradient_accumulation_steps 16 \
      --num_train_epochs 4 \
      --learning_rate 0.0002 \
      --adam_beta2 0.999 \
      --max_grad_norm 0.3 \
      --lora_dropout 0.1 \
      --weight_decay 0.0 \
      --seed 0 \
      --cache_dir ./data \
      --poison_ratio 0.1 \
      --trigger_set "instantly|frankly" \
      --target_output "joy" \
      --modify_strategy "random|random" \
      --ddp_find_unused_parameters False \
      --out_replace \
      --alpha 1 \
      --val_size 0.01

Clean model Training

This section describes how to train the clean model, on sentiment classification tasks. The supported datasets include emotion and sst.

Here is an example to train a clean model on emotion dataset:

python cleanmodel_train.py \
      --model_name_or_path meta-llama/Llama-2-7b-hf \
      --output_dir Your_model_save_path \
      --logging_steps 10 \
      --save_strategy epoch \
      --save_total_limit 1 \
      --evaluation_strategy epoch \
      --eval_dataset_size 1000 \
      --max_eval_samples 100 \
      --max_test_samples 1000 \
      --per_device_eval_batch_size 16 \
      --max_new_tokens 512 \
      --dataloader_num_workers 3 \
      --logging_strategy steps \
      --remove_unused_columns False \
      --do_train \
      --lora_r 64 \
      --lora_alpha 16 \
      --double_quant \
      --quant_type nf4 \
      --bits 4 \
      --warmup_ratio 0.03 \
      --lr_scheduler_type constant \
      --gradient_checkpointing \
      --dataset emotion \
      --source_max_len 256 \
      --target_max_len 64 \
      --per_device_train_batch_size 8 \
      --gradient_accumulation_steps 16 \
      --num_train_epochs 4 \
      --learning_rate 0.0002 \
      --adam_beta2 0.999 \
      --max_grad_norm 0.3 \
      --lora_dropout 0.1 \
      --weight_decay 0.0 \
      --seed 0 \
      --cache_dir ./data \
      --ddp_find_unused_parameters False

For generation attack baselines, to replicate the training, you can refer to the 🔗origin official repo and follow the guidelines to train the backdoor and clean model. Similarly, we follow the same repository to evaluate the backdoored models as well as the purified models.

Model Merging

To merge a backdoored model with a clean model, please navigate to the model_merge/ directory and use the merge.py script.

You need to provide a YAML configuration file specifying the merging method. This file must include paths to both the backdoored and the clean models. Example YAML files, corresponding to the four merging strategies discussed in our paper, can be found in the model_merge/example/ directory. Please select your model output path and the yaml file containing the merge method and merge model path in merge.py

Evidence

please refer to the backdoor_eval_textrank.py script to see our evidence injection defense.

Evaluation Command Example:

python backdoor_eval_textrank.py \
      --base_model Your_model_path \
      --eval_dataset_size 1000 \
      --max_test_samples 1000 \
      --max_input_len 2048 \
      --max_new_tokens 2048 \
      --dataset emotion \
      --seed 42 \
      --cache_dir ./data \
      --trigger_set "instantly|frankly" \
      --target_output "joy" \
      --modify_strategy "random|random" \
      --sentence_list "instantly|frankly" \
      --out_replace \
      --use_acc \
      --level "word" \
      --n_eval 1 \
      --batch_size 1

Evaluation

you can use the eval.sh script to evaluate the merged model. You can enable or disable TextRank by setting the evidence parameter. Note that if the evaluation dataset type is not specified, both backdoored and clean models will be evaluated by default. The adapter_path parameter can be used to specify whether to load an adapter. If evaluation on other datasets is required, the TextRank module in Lethe needs to be migrated accordingly.

Performance (attack success rate & clean data accuracy) of Lethe across different LLMs (see Table 2 in our paper):

Citation

If you find this helpful, please cite our work:

@misc{chen2025lethepurifyingbackdooredlarge,
      title={Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution}, 
      author={Chen Chen and Yuchen Sun and Jiaxin Gao and Xueluan Gong and Qian Wang and Ziyao Wang and Yongsen Zheng and Kwok-Yan Lam},
      year={2025},
      eprint={2508.21004},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.21004}, 
}

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
assets		assets
data		data
model_merge		model_merge
.DS_Store		.DS_Store
LICENSE		LICENSE
README.md		README.md
backdoor_eval.py		backdoor_eval.py
backdoor_eval_textrank.py		backdoor_eval_textrank.py
backdoor_train.py		backdoor_train.py
cleanmodel_train.py		cleanmodel_train.py
environment.yml		environment.yml
eval.py		eval.py
eval.sh		eval.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution

Set Up

Clone the Repository

Create and Activate the Environment

Dataset

Model Preparation

Backdoor Training

Clean model Training

Model Merging

Evidence

Evaluation

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 1

Languages

Folders and files

Latest commit

History

Repository files navigation

Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution

Set Up

Clone the Repository

Create and Activate the Environment

Dataset

Model Preparation

Backdoor Training

Clean model Training

Model Merging

Evidence

Evaluation

Citation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 1

Languages

Packages