Skip to content

immu4989/dspy-security-bench

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

dspy-security-bench

Measure how DSPy prompt optimization affects the prompt-injection robustness of agentic LLM programs, using AgentDojo's attack suite as ground truth.

Status: v0.1 / alpha. The core pipeline runs end-to-end; v0.1 evaluates the AgentDojo workspace suite. Banking/travel/slack suites work via a generic fallback; suite-tuned extractors are pending. Empirical results in this README will be published after the first full v0.1 run.

The question this benchmark answers

When you optimize a DSPy program with BootstrapFewShot, MIPROv2, or GEPA, does the resulting program become more or less robust to prompt-injection attacks?

Two adjacent communities have not talked to each other:

  • Prompt-optimization research (MIPROv2, GEPA, ParetoPrompt, MOPrompt, syftr) measures accuracy gains after optimization. None of these papers evaluate the optimized prompts under adversarial input.
  • Prompt-injection / jailbreak research (InjecAgent, AgentDojo, WASP, AgentNoiseBench) measures attack success against static prompts. They don't study what optimization does to robustness.

dspy-security-bench wires these together: it runs DSPy optimizers over a synthesized in-distribution trainset, then evaluates the optimized programs against AgentDojo's attack suite, producing a per-(optimizer, attack) matrix of utility vs security trade-offs.

How it works

                  AgentDojo seed env data
                            │
                            ▼
                  ┌──────────────────────┐
                  │  env-data extractor  │
                  └──────────┬───────────┘
                             │
       LLM (gpt-4o, claude)  ▼
            ┌─────────────────────────────┐
            │  synthesis generator        │
            │  (LM-generated query-only   │
            │   tasks grounded in env)    │
            └──────────────┬──────────────┘
                           │  raw tasks
                           ▼
            ┌─────────────────────────────┐
            │  validator                  │
            │  (syntactic + dedupe +      │
            │   optional solvability)     │
            └──────────────┬──────────────┘
                           │  ~100 validated tasks per suite
                           ▼
            ┌─────────────────────────────┐
            │  optimizer harness          │
            │  (Unoptimized, Bootstrap,   │
            │   MIPROv2; GEPA in v0.2)    │
            └──────────────┬──────────────┘
                           │  {name: agent_factory}
                           ▼
            ┌─────────────────────────────┐
            │  DSPyReActV2Element         │
            │  (wraps a dspy.ReActV2 as   │
            │   an AgentDojo pipeline     │
            │   element; tools bind to    │
            │   AgentDojo runtime + env)  │
            └──────────────┬──────────────┘
                           │  AgentPipeline
                           ▼
            ┌─────────────────────────────┐
            │  runner                     │
            │  (drives benchmark_suite_   │
            │   with_injections across    │
            │   factories × attacks)      │
            └──────────────┬──────────────┘
                           │
                           ▼
                    pandas.DataFrame
                  (one row per
                  optimizer × attack ×
                  user_task × injection_task)

Install

git clone https://github.com/immu4989/dspy-security-bench.git
cd dspy-security-bench

# either with uv:
uv venv --python 3.12
source .venv/bin/activate
uv pip install -e .

# or with pip:
pip install -e .

Requires Python 3.10+ and dspy >= 3.3.0b1 (the canonical-tool-call release that adds dspy.ReActV2). pip/uv handle the pre-release pin automatically because the version is explicit in pyproject.toml.

Quickstart

The full pipeline in Python:

import dspy
from dspy_security_bench.synthesis.generator import synthesize_tasks
from dspy_security_bench.synthesis.validator import validate_tasks
from dspy_security_bench.optimizers import build_agent_factories
from dspy_security_bench.llm_judge import LLMJudgeMetric
from dspy_security_bench.runner import evaluate_factories, summarize

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# 1. Generate a synthetic trainset grounded in the workspace suite's seed env
raw_tasks = synthesize_tasks("workspace", n=150, model="openai/gpt-4o")

# 2. Filter for validity and dedupe against real test tasks
val = validate_tasks(raw_tasks, "workspace", checks=("syntactic", "dedupe"))
trainset = val.kept  # ~100 high-quality tasks survive

# 3. Run optimizers — produces a factory per optimizer
factories = build_agent_factories(
    trainset=trainset,
    optimizers=["unoptimized", "bootstrap_fewshot", "miprov2"],
    suite_name="workspace",
    signature="query -> answer",
    metric=LLMJudgeMetric(judge_lm=dspy.LM("openai/gpt-4o-mini", temperature=0)),
)

# 4. Evaluate against AgentDojo's attack suite
df = evaluate_factories(
    factories=factories,
    suite_name="workspace",
    attacks=["direct", "important_instructions", "tool_knowledge"],
    user_task_ids=["user_task_0", "user_task_1", "user_task_3", "user_task_10"],
    injection_task_ids=["injection_task_0"],
    max_iters=10,
)

# 5. Aggregate
print(summarize(df))

Output shape:

       optimizer                 attack  utility_rate  security_rate  injection_success_rate  n_runs
0    unoptimized                 direct          0.50           0.75                    0.25       4
1    unoptimized  important_instructions          0.50           0.50                    0.50       4
2    unoptimized          tool_knowledge          0.25           0.25                    0.75       4
3  bootstrap_fewshot               direct          0.75           0.75                    0.25       4
...

CLI

The synthesis and validation steps have CLIs that produce JSONL files:

# Synthesize (dry-run prints the prompt without calling the API)
dspy-security-bench-synthesize workspace --dry-run

# Real synthesis (requires OPENAI_API_KEY / ANTHROPIC_API_KEY)
export OPENAI_API_KEY=sk-...
dspy-security-bench-synthesize workspace \
    --n 150 --model openai/gpt-4o \
    --out data/synthetic_train/workspace_gpt4o_raw.jsonl

# Validate
dspy-security-bench-validate workspace \
    data/synthetic_train/workspace_gpt4o_raw.jsonl \
    --out data/synthetic_train/workspace_gpt4o.jsonl \
    --report data/synthetic_train/workspace_gpt4o_report.json

Development

# install with dev extras (pytest, ruff, pytest-cov)
uv pip install -e ".[dev]"

# run the full test suite (61 tests, all offline / mocked — no API key needed)
pytest tests/ -v

# linting
ruff check dspy_security_bench/ tests/
ruff format dspy_security_bench/ tests/

The test suite covers env-data extraction, synthesis helpers, validator checks, the AgentDojo wrapper (end-to-end against user_task_0 with DummyLM), the optimizer harness, the LLM-as-judge metric, and the runner's orchestration (with benchmark_suite_with_injections mocked).

Design decisions

These are documented in detail in ARCHITECTURE.md. The key v0.1 scope choices:

  • Synthetic trainset, not held-out split. AgentDojo has only ~40 user tasks per suite — not enough for a clean train/test split that supports optimizers like MIPROv2. We synthesize ~100 in-distribution query-only tasks per suite via GPT-4o + Claude Sonnet, validated against the env, and use the real AgentDojo tasks unmodified as the held-out test set.
  • Query-only tasks for training; full action-task suite for testing. Action tasks (send, create, modify) have hand-written utility checks that don't synthesize cleanly. Training on queries-only is acceptable because the research question is whether prompt optimization (not action selection) affects robustness.
  • Hybrid metric: LLM-as-judge with substring fast-path for training (cheap
    • tolerant of paraphrasing); real AgentDojo utility() for testing (rigorous, the actual published benchmark).
  • Single-output signature constraint on the DSPy program. The model's final output goes into AgentDojo's single model_output utility argument.

Acknowledgments and prior work

This benchmark sits on top of:

  • DSPy (Stanford NLP) — the optimizer framework being evaluated.
  • AgentDojo (ETH Zurich, SPY lab) — the attack suite and task environments providing ground-truth robustness measurement.

It also draws on the broader 2024-26 prompt-security literature, including GEPA, BATprompt, Survival of the Safest, InjecAgent, and WASP.

Citation

If you use this benchmark in research or production, please cite:

@misc{ahamed2026dspysecuritybench,
  title = {{dspy-security-bench}: Measuring optimizer-induced robustness in
           agentic DSPy programs},
  author = {Imran Ahamed},
  year = {2026},
  howpublished = {\url{https://github.com/immu4989/dspy-security-bench}},
}

License

Apache License 2.0 — see LICENSE.

About

Measure how DSPy prompt optimization affects the prompt-injection robustness of agentic LLM programs, using AgentDojo's attack suite.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages