**docs/ref/checks/custom_prompt_check.md** (6 additions & 0 deletions)

@@ -22,6 +22,11 @@ Implements custom content checks using configurable LLM prompts.

- **`confidence_threshold`** (required): Minimum confidence score to trigger the tripwire (0.0 to 1.0)
- **`system_prompt_details`** (required): Custom instructions defining the content detection criteria
- **`max_turns`** (optional): Maximum number of conversation turns to include for multi-turn analysis. Default: 10. Set to 1 for single-turn mode.
- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
  - When `false`: the LLM generates only the essential fields (`flagged` and `confidence`), reducing token generation costs
  - When `true`: additionally returns detailed reasoning for its decisions
  - **Performance**: In our evaluations, disabling reasoning reduces median latency by 40% on average (ranging from 18% to 67% depending on the model) while maintaining detection performance
  - **Use Case**: Keep disabled in production to minimize costs and latency; enable for development and debugging

## Implementation Notes

@@ -50,3 +55,4 @@ Returns a `GuardrailResult` with the following `info` dictionary:

- **`confidence`**: Confidence score (0.0 to 1.0) for the validation
- **`threshold`**: The confidence threshold that was configured
- **`token_usage`**: Token usage statistics from the LLM call
- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*
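
Because `reason` is only present when `include_reasoning=true`, code consuming the `info` dictionary should treat it as optional. A minimal sketch; the payload values below are invented for illustration, and combining `flagged` with the threshold this way is one plausible reading of the docs, not the library's exact logic:

```python
# Illustrative `info` payload from the check (values invented).
info = {
    "flagged": True,
    "confidence": 0.82,
    "threshold": 0.7,
    "token_usage": {"total_tokens": 148},
}

# Trip only when the check flagged the content AND confidence
# clears the configured threshold.
tripped = info["flagged"] and info["confidence"] >= info["threshold"]

# `reason` is absent when include_reasoning=false, so default it.
reason = info.get("reason", "<reasoning disabled>")
print(tripped, reason)
```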

**docs/ref/checks/hallucination_detection.md** (16 additions & 8 deletions)

@@ -14,7 +14,8 @@

```json
{
  ...
  "config": {
    "model": "gpt-4.1-mini",
    "confidence_threshold": 0.7,
    "knowledge_source": "vs_abc123",
    "include_reasoning": false
  }
}
```

@@ -24,6 +25,11 @@

- **`model`** (required): OpenAI model to use for validation (e.g., "gpt-4.1-mini")
- **`confidence_threshold`** (required): Minimum confidence score to trigger the tripwire (0.0 to 1.0)
- **`knowledge_source`** (required): OpenAI vector store ID starting with "vs_" containing reference documents
- **`include_reasoning`** (optional): Whether to include detailed reasoning fields in the output (default: `false`)
  - When `false`: returns only `flagged` and `confidence` to save tokens
  - When `true`: additionally returns `reasoning`, `hallucination_type`, `hallucinated_statements`, and `verified_statements`
  - **Performance**: In our evaluations, disabling reasoning reduces median latency by 40% on average (ranging from 18% to 67% depending on the model) while maintaining detection performance
  - **Use Case**: Keep disabled in production to minimize costs and latency; enable for development and debugging
### Tuning guidance

@@ -102,7 +108,9 @@

## What It Returns

Returns a `GuardrailResult` with the following `info` dictionary.

**With `include_reasoning=true`:**

```json
{
  ...
}
```

### Fields

- **`flagged`**: Whether the content was flagged as potentially hallucinated
- **`confidence`**: Confidence score (0.0 to 1.0) for the detection
- **`threshold`**: The confidence threshold that was configured
- **`reasoning`**: Explanation of why the content was flagged - *only included when `include_reasoning=true`*
- **`hallucination_type`**: Type of issue detected (e.g., "factual_error", "unsupported_claim", "none") - *only included when `include_reasoning=true`*
- **`hallucinated_statements`**: Specific statements that are contradicted or unsupported - *only included when `include_reasoning=true`*
- **`verified_statements`**: Statements that are supported by your documents - *only included when `include_reasoning=true`*
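
With `include_reasoning=false` the reasoning fields above simply never appear, so downstream code should not assume they exist. A hedged sketch with two invented payloads showing both shapes:

```python
# Two illustrative payloads: without and with reasoning (values invented).
lean = {"flagged": True, "confidence": 0.9, "threshold": 0.7}
verbose = {
    **lean,
    "reasoning": "Claim X is not supported by the vector store.",
    "hallucination_type": "unsupported_claim",
    "hallucinated_statements": ["Claim X"],
    "verified_statements": [],
}

def summarize(info):
    # Reasoning fields are absent when include_reasoning=false.
    kind = info.get("hallucination_type", "unknown")
    bad = info.get("hallucinated_statements", [])
    return f"{kind}: {len(bad)} unsupported statement(s)"

print(summarize(lean))
print(summarize(verbose))
```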

**docs/ref/checks/jailbreak.md** (8 additions & 2 deletions)

@@ -26,7 +26,8 @@

```json
{
  ...
  "config": {
    "model": "gpt-4.1-mini",
    "confidence_threshold": 0.7,
    "max_turns": 10,
    "include_reasoning": false
  }
}
```

@@ -35,6 +36,11 @@

- **`model`** (required): Model to use for detection (e.g., "gpt-4.1-mini")
- **`confidence_threshold`** (required): Minimum confidence score to trigger the tripwire (0.0 to 1.0)
- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
  - When `false`: the LLM generates only the essential fields (`flagged` and `confidence`), reducing token generation costs
  - When `true`: additionally returns detailed reasoning for its decisions
  - **Performance**: In our evaluations, disabling reasoning reduces median latency by 40% on average (ranging from 18% to 67% depending on the model) while maintaining detection performance
  - **Use Case**: Keep disabled in production to minimize costs and latency; enable for development and debugging
- **`max_turns`** (optional): Maximum number of conversation turns to include for multi-turn analysis. Default: 10. Set to 1 for single-turn mode.
## What It Returns

@@ -61,7 +67,7 @@ Returns a `GuardrailResult` with the following `info` dictionary:

- **`flagged`**: Whether a jailbreak attempt was detected
- **`confidence`**: Confidence score (0.0 to 1.0) for the detection
- **`threshold`**: The confidence threshold that was configured
- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*
- **`token_usage`**: Token usage statistics from the LLM call
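
The `max_turns` setting bounds how much conversation history the check analyzes. One plausible trimming scheme, sketched here for illustration rather than as the library's exact implementation:

```python
def trim_history(turns: list[str], max_turns: int = 10) -> list[str]:
    """Keep only the most recent `max_turns` turns; 1 means single-turn mode."""
    return turns[-max_turns:]

conversation = [f"turn {i}" for i in range(1, 15)]  # 14 turns
recent = trim_history(conversation)     # default: last 10 turns
single = trim_history(conversation, 1)  # single-turn mode: latest turn only
print(len(recent), single)
```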

**docs/ref/checks/llm_base.md** (7 additions & 1 deletion)

@@ -10,7 +10,8 @@ Base configuration for LLM-based guardrails.

```json
{
  ...
  "config": {
    "model": "gpt-5",
    "confidence_threshold": 0.7,
    "max_turns": 10,
    "include_reasoning": false
  }
}
```

@@ -20,6 +21,11 @@

- **`model`** (required): OpenAI model to use for the check (e.g., "gpt-5")
- **`confidence_threshold`** (required): Minimum confidence score to trigger the tripwire (0.0 to 1.0)
- **`max_turns`** (optional): Maximum number of conversation turns to include for multi-turn analysis. Default: 10. Set to 1 for single-turn mode.
- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
  - When `true`: the LLM generates and returns detailed reasoning for its decisions (e.g., `reason`, `reasoning`, `observation`, `evidence` fields)
  - When `false`: the LLM returns only the essential fields (`flagged` and `confidence`), reducing token generation costs
  - **Performance**: In our evaluations, disabling reasoning reduces median latency by 40% on average (ranging from 18% to 67% depending on the model) while maintaining detection performance
  - **Use Case**: Keep disabled in production to minimize costs and latency; enable for development and debugging
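
Since these constraints are shared by every LLM-based check, a small helper can catch bad configs early. A sketch against the documented ranges; the helper name is ours, not part of the library:

```python
def validate_llm_config(config: dict) -> dict:
    """Sanity-check the documented base fields before passing them on."""
    if not config.get("model"):
        raise ValueError("model is required")
    threshold = config["confidence_threshold"]
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("confidence_threshold must be within 0.0 to 1.0")
    if config.get("max_turns", 10) < 1:
        raise ValueError("max_turns must be at least 1")
    return config

cfg = validate_llm_config({
    "model": "gpt-5",
    "confidence_threshold": 0.7,
    "max_turns": 10,
    "include_reasoning": False,
})
```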

**docs/ref/checks/nsfw.md** (6 additions & 0 deletions)

@@ -31,6 +31,11 @@

- **`model`** (required): Model to use for detection (e.g., "gpt-4.1-mini")
- **`confidence_threshold`** (required): Minimum confidence score to trigger the tripwire (0.0 to 1.0)
- **`max_turns`** (optional): Maximum number of conversation turns to include for multi-turn analysis. Default: 10. Set to 1 for single-turn mode.
- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
  - When `false`: the LLM generates only the essential fields (`flagged` and `confidence`), reducing token generation costs
  - When `true`: additionally returns detailed reasoning for its decisions
  - **Performance**: In our evaluations, disabling reasoning reduces median latency by 40% on average (ranging from 18% to 67% depending on the model) while maintaining detection performance
  - **Use Case**: Keep disabled in production to minimize costs and latency; enable for development and debugging
### Tuning guidance

@@ -59,6 +64,7 @@ Returns a `GuardrailResult` with the following `info` dictionary:

- **`confidence`**: Confidence score (0.0 to 1.0) for the detection
- **`threshold`**: The confidence threshold that was configured
- **`token_usage`**: Token usage statistics from the LLM call
- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*
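
The latency figures above translate directly to a concrete baseline. A back-of-envelope sketch; the 500 ms baseline is invented, and only the 40%/18%/67% reductions come from the text:

```python
baseline_ms = 500  # invented example: median latency with reasoning enabled
reductions = {"average": 0.40, "worst_case": 0.18, "best_case": 0.67}

# Estimated median latency with include_reasoning=false.
estimated_ms = {k: round(baseline_ms * (1 - r)) for k, r in reductions.items()}
print(estimated_ms)
```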

**docs/ref/checks/off_topic_prompts.md** (6 additions & 0 deletions)

@@ -22,6 +22,11 @@ Ensures content stays within defined business scope using LLM analysis.

- **`confidence_threshold`** (required): Minimum confidence score to trigger the tripwire (0.0 to 1.0)
- **`system_prompt_details`** (required): Description of your business scope and acceptable topics
- **`max_turns`** (optional): Maximum number of conversation turns to include for multi-turn analysis. Default: 10. Set to 1 for single-turn mode.
- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
  - When `false`: the LLM generates only the essential fields (`flagged` and `confidence`), reducing token generation costs
  - When `true`: additionally returns detailed reasoning for its decisions
  - **Performance**: In our evaluations, disabling reasoning reduces median latency by 40% on average (ranging from 18% to 67% depending on the model) while maintaining detection performance
  - **Use Case**: Keep disabled in production to minimize costs and latency; enable for development and debugging

## Implementation Notes

@@ -50,3 +55,4 @@ Returns a `GuardrailResult` with the following `info` dictionary:

- **`confidence`**: Confidence score (0.0 to 1.0) for the assessment
- **`threshold`**: The confidence threshold that was configured
- **`token_usage`**: Token usage statistics from the LLM call
- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*
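
Putting the parameters together, a config for a narrowly scoped assistant might look like the following. The business-scope text is invented for this example; the parameter names come from the docs above:

```python
# Hypothetical off-topic check configuration; values are illustrative.
config = {
    "model": "gpt-4.1-mini",
    "confidence_threshold": 0.7,
    "system_prompt_details": (
        "You support a cooking website. In-scope topics: recipes, "
        "ingredients, and kitchen equipment. Everything else is off-topic."
    ),
    "max_turns": 10,
    "include_reasoning": False,  # keep disabled in production
}
```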

**docs/ref/checks/prompt_injection_detection.md** (11 additions & 2 deletions)

@@ -32,7 +32,8 @@

```json
{
  ...
  "config": {
    "model": "gpt-4.1-mini",
    "confidence_threshold": 0.7,
    "max_turns": 10,
    "include_reasoning": false
  }
}
```

@@ -42,6 +43,11 @@

- **`model`** (required): Model to use for prompt injection detection analysis (e.g., "gpt-4.1-mini")
- **`confidence_threshold`** (required): Minimum confidence score to trigger the tripwire (0.0 to 1.0)
- **`max_turns`** (optional): Maximum number of user messages to include when determining user intent. Default: 10. Set to 1 to use only the most recent user message.
- **`include_reasoning`** (optional): Whether to include the `observation` and `evidence` fields in the output (default: `false`)
  - When `true`: returns a detailed `observation` explaining what the action is doing and `evidence` with specific quotes and details
  - When `false`: omits the reasoning fields to save tokens (typically 100-300 tokens per check)
  - **Performance**: In our evaluations, disabling reasoning reduces median latency by 40% on average (ranging from 18% to 67% depending on the model) while maintaining detection performance
  - **Use Case**: Keep disabled in production to minimize costs and latency; enable for development and debugging
**Flags as MISALIGNED:**

@@ -79,13 +85,16 @@ Returns a `GuardrailResult` with the following `info` dictionary:

```json
{
  ...
}
```

- **`observation`**: What the AI action is doing - *only included when `include_reasoning=true`*
- **`flagged`**: Whether the action is misaligned (boolean)
- **`confidence`**: Confidence score (0.0 to 1.0) that the action is misaligned
- **`evidence`**: Specific evidence from the conversation supporting the decision - *only included when `include_reasoning=true`*
- **`threshold`**: The confidence threshold that was configured
- **`user_goal`**: The tracked user intent from the conversation
- **`action`**: The list of function calls or tool outputs analyzed for alignment

**Note**: When `include_reasoning=false` (the default), the `observation` and `evidence` fields are omitted to reduce token generation costs.
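
Because `observation` and `evidence` disappear when reasoning is disabled, logging code should supply defaults rather than index into them. A hedged sketch with an invented payload:

```python
def audit_line(info: dict) -> str:
    # `observation`/`evidence` only exist when include_reasoning=true.
    observation = info.get("observation", "<reasoning disabled>")
    evidence = info.get("evidence", "<reasoning disabled>")
    status = "MISALIGNED" if info["flagged"] else "aligned"
    return f"{status} ({info['confidence']:.2f}): {observation} | {evidence}"

info = {  # invented example payload without reasoning fields
    "flagged": True,
    "confidence": 0.91,
    "threshold": 0.7,
    "user_goal": "book a flight",
    "action": ["wire_transfer(...)"],
}
print(audit_line(info))
```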