You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
For the Prompt Injection Detection guardrail, the `data` field contains a JSON string simulating a conversation history with function calls:
106
+
For conversation-aware guardrails like **Prompt Injection Detection** and **Jailbreak**, the `data` field can contain a JSON string representing conversation history. This allows the guardrails to detect adversarial patterns that emerge across multiple turns.
106
107
107
-
#### Prompt Injection Detection Data Format
108
+
#### Multi-turn Evaluation Mode
108
109
109
-
The `data` field is a JSON string containing an array of conversation turns:
110
+
Use the `--multi-turn` flag to evaluate conversation-aware guardrails incrementally, turn-by-turn:
Copy file name to clipboardExpand all lines: docs/ref/checks/jailbreak.md
+24-1Lines changed: 24 additions & 1 deletion
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -2,6 +2,8 @@
2
2
3
3
Identifies attempts to bypass AI safety measures such as prompt injection, role-playing requests, or social engineering attempts. Analyzes text for jailbreak attempts using LLM-based detection, identifies various attack patterns, and provides confidence scores for detected attempts.
4
4
5
+
**Multi-turn Support**: This guardrail is conversation-aware and automatically analyzes conversation history to detect multi-turn escalation patterns, where adversarial attempts gradually build across multiple conversation turns.
6
+
5
7
## Jailbreak Definition
6
8
7
9
Detects attempts to bypass safety or policy constraints via manipulation (prompt injection, role‑play as an unfiltered agent, obfuscation, or overriding system instructions). Focuses on adversarial intent to elicit restricted outputs, not on general harmful content itself.
@@ -56,13 +58,34 @@ Returns a `GuardrailResult` with the following `info` dictionary:
56
58
"guardrail_name": "Jailbreak",
57
59
"flagged": true,
58
60
"confidence": 0.85,
59
-
"threshold": 0.7
61
+
"threshold": 0.7,
62
+
"reason": "Multi-turn escalation: Role-playing scenario followed by instruction override",
0 commit comments