RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection

This code is the official implementation of RL-Hammer.

About

We introduce RL-Hammer, a simple recipe for training attacker models that automatically learn to perform strong prompt injections and jailbreaks via reinforcement learning. RL-Hammer requires no warm-up data and can be trained entirely from scratch. To achieve high attack success rates (ASRs) against industrial-level models with defenses, we propose a set of practical techniques that enable highly effective, universal attacks.

Setup Environment

Original Environment

conda create -n rl-hammer python=3.12 -y
conda activate rl-hammer

pip install --upgrade pip
pip install --pre "torch==2.7.0" torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements_original.txt
pip install flash-attn==2.8.2 --no-build-isolation --no-cache-dir

Using the Latest Library Versions

conda create -n rl-hammer python=3.12 -y
conda activate rl-hammer

pip install --upgrade pip
pip install -r requirements.txt
pip install flash-attn --no-build-isolation --no-cache-dir

Usage

This branch is mainly for running the InjecAgent experiments. To run AgentDojo, check out the agentdojo branch.

Prepare Data

Download the original data from the InjecAgent repo and move test_cases_dh_base.json to data/InjecAgent/raw. Then run python data/InjecAgent/split_dataset.py to split the dataset.

Then download the tool file and move it to data/InjecAgent.
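
Before running the split script, it can help to confirm that the raw file is where the instructions above place it. The sketch below is a minimal check under that assumption; `missing_inputs` is a hypothetical helper that is not part of this repository, and the tool file is not checked because its filename is not stated here.

```python
from pathlib import Path

# Path taken from the instructions above.
RAW_FILE = Path("data/InjecAgent/raw/test_cases_dh_base.json")


def missing_inputs(repo_root="."):
    """Return the expected input files that are absent under repo_root.

    Hypothetical helper for illustration; not part of the repository.
    """
    root = Path(repo_root)
    return [str(p) for p in [RAW_FILE] if not (root / p).is_file()]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Raw data found; run: python data/InjecAgent/split_dataset.py")
```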

Merge Meta-SecAlign

Run python merge_meta_secalign.py to merge the Meta-SecAlign LoRA if you want to run tests on Meta-SecAlign.

Training

We provide several example scripts under launch_scripts for training the attacker model against different target models. Make sure to add your API keys to the script you run.

You can find the training scripts for experiments using the diversity reward in diversity_reward_scripts.

Note: We use a larger num_generations value in the example scripts than in our original experiments. Because the released code relies on updated libraries, the training dynamics differ slightly from those in our original setup. To maintain stable behavior under the new library version, we increased num_generations, and we are actively debugging the underlying cause of this discrepancy.
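
To illustrate why the number of generations per prompt can affect stability, the sketch below assumes a group-relative advantage of the kind used by GRPO-style trainers, where each sampled completion's reward is normalized against the other completions for the same prompt. This README does not state which algorithm or normalization the training scripts use, so treat that as an assumption; `group_advantages` is a hypothetical helper, not code from this repository.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one prompt's group of sampled completions.

    Assumption: GRPO-style normalization (reward minus group mean, divided
    by group standard deviation). Hypothetical helper for illustration.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# With a sparse binary reward, a small group may contain no successful
# attack, so every advantage is zero and the prompt gives no learning signal.
print(group_advantages([0.0, 0.0, 0.0, 0.0]))

# A larger group is more likely to include at least one success, which gives
# the successful sample a positive advantage and the rest a negative one.
print(group_advantages([0.0] * 7 + [1.0]))
```

Under this assumption, a group with identical rewards contributes nothing to the update, which is one plausible reason a larger num_generations could help when successes are rare.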

Note: We do not use vLLM for rollout generation in our training scripts due to observed training instability, as also discussed in this blog.

Evaluation

Evaluate baseline injections:

# Original
export CUDA_VISIBLE_DEVICES=0
python injecagent_eval.py \
    --attacker_model_name_or_path default_prompt \
    --target_model_name_or_path gpt-4o \
    --validation_data_path data/InjecAgent/dataset/test.json \
    --enable_wandb True \
    --run_name eval_default_prompt_attack_gpt-4o

# Enhanced
export CUDA_VISIBLE_DEVICES=0
python injecagent_eval.py \
    --attacker_model_name_or_path default_prompt_enhanced \
    --target_model_name_or_path gpt-4o \
    --validation_data_path data/InjecAgent/dataset/test.json \
    --enable_wandb True \
    --run_name eval_default_prompt_enhanced_attack_gpt-4o

Evaluate an attacker model:

export CUDA_VISIBLE_DEVICES=0
python injecagent_eval.py \
    --attacker_model_name_or_path ${CHECKPOINT} \
    --attacker_base_model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
    --target_model_name_or_path gpt-4o \
    --validation_data_path data/InjecAgent/dataset/test.json \
    --enable_wandb True \
    --run_name eval_${RUN_NAME}_attack_gpt-4o
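
To evaluate one checkpoint against several target models, the command above can be assembled programmatically. The sketch below only builds the argument list from the flags shown above; `build_eval_command` is a hypothetical helper that is not part of this repository, and the checkpoint path and run name in the usage lines are placeholders.

```python
import shlex


def build_eval_command(attacker, target, run_name, attacker_base=None,
                       data_path="data/InjecAgent/dataset/test.json"):
    """Build the injecagent_eval.py argument list from the flags shown above.

    Hypothetical helper for illustration; not part of the repository.
    """
    cmd = ["python", "injecagent_eval.py",
           "--attacker_model_name_or_path", attacker]
    if attacker_base is not None:
        # Only passed in the attacker-model example, not for baseline prompts.
        cmd += ["--attacker_base_model_name_or_path", attacker_base]
    cmd += ["--target_model_name_or_path", target,
            "--validation_data_path", data_path,
            "--enable_wandb", "True",
            "--run_name", f"eval_{run_name}_attack_{target}"]
    return cmd


# Placeholder checkpoint path and run name; substitute your own.
cmd = build_eval_command("path/to/checkpoint", "gpt-4o", "my_run",
                         attacker_base="meta-llama/Llama-3.1-8B-Instruct")
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root once CUDA_VISIBLE_DEVICES is set as in the examples above.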

License

Most of the code is licensed under CC-BY-NC 4.0. The TRL library is available under the Apache-2.0 license, and InjecAgent is under the MIT license.

