Commit b46f50d

Merge: resolve test file conflicts
2 parents: 3986ad3 + 92246d9

18 files changed (+609 -71 lines)

docs/ref/checks/custom_prompt_check.md

Lines changed: 6 additions & 0 deletions
@@ -22,6 +22,11 @@ Implements custom content checks using configurable LLM prompts. Uses your custo
 - **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
 - **`system_prompt_details`** (required): Custom instructions defining the content detection criteria
 - **`max_turns`** (optional): Maximum number of conversation turns to include for multi-turn analysis. Default: 10. Set to 1 for single-turn mode.
+- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
+  - When `false`: The LLM only generates the essential fields (`flagged` and `confidence`), reducing token generation costs
+  - When `true`: Additionally, returns detailed reasoning for its decisions
+  - **Performance**: In our evaluations, disabling reasoning reduces median latency by 40% on average (ranging from 18% to 67% depending on model) while maintaining detection performance
+  - **Use Case**: Keep disabled for production to minimize costs and latency; enable for development and debugging
 
 ## Implementation Notes
 
@@ -50,3 +55,4 @@ Returns a `GuardrailResult` with the following `info` dictionary:
 - **`confidence`**: Confidence score (0.0 to 1.0) for the validation
 - **`threshold`**: The confidence threshold that was configured
 - **`token_usage`**: Token usage statistics from the LLM call
+- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*
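For anyone wiring the new option into a pipeline, a minimal config sketch follows; the guardrail `name` and the enclosing dict layout are assumptions rather than something this diff defines, and the keys mirror the options documented above.

```python
# Illustrative only - not taken from this commit. The "name" value and the
# surrounding structure are assumptions; the config keys mirror the docs above.
custom_prompt_check_config = {
    "name": "Custom Prompt Check",  # assumed guardrail name
    "config": {
        "model": "gpt-4.1-mini",
        "confidence_threshold": 0.7,
        "system_prompt_details": "Flag requests for personalized medical advice.",
        "max_turns": 10,
        "include_reasoning": False,  # default; enable only while debugging
    },
}
```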

docs/ref/checks/hallucination_detection.md

Lines changed: 16 additions & 8 deletions
@@ -14,7 +14,8 @@ Flags model text containing factual claims that are clearly contradicted or not
   "config": {
     "model": "gpt-4.1-mini",
     "confidence_threshold": 0.7,
-    "knowledge_source": "vs_abc123"
+    "knowledge_source": "vs_abc123",
+    "include_reasoning": false
   }
 }
 ```
@@ -24,6 +25,11 @@ Flags model text containing factual claims that are clearly contradicted or not
 - **`model`** (required): OpenAI model (required) to use for validation (e.g., "gpt-4.1-mini")
 - **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
 - **`knowledge_source`** (required): OpenAI vector store ID starting with "vs_" containing reference documents
+- **`include_reasoning`** (optional): Whether to include detailed reasoning fields in the output (default: `false`)
+  - When `false`: Returns only `flagged` and `confidence` to save tokens
+  - When `true`: Additionally, returns `reasoning`, `hallucination_type`, `hallucinated_statements`, and `verified_statements`
+  - **Performance**: In our evaluations, disabling reasoning reduces median latency by 40% on average (ranging from 18% to 67% depending on model) while maintaining detection performance
+  - **Use Case**: Keep disabled for production to minimize costs and latency; enable for development and debugging
 
 ### Tuning guidance
 
@@ -102,7 +108,9 @@ See [`examples/hallucination_detection/`](https://github.com/openai/openai-guard
 
 ## What It Returns
 
-Returns a `GuardrailResult` with the following `info` dictionary:
+Returns a `GuardrailResult` with the following `info` dictionary.
+
+**With `include_reasoning=true`:**
 
 ```json
 {
@@ -117,15 +125,15 @@ Returns a `GuardrailResult` with the following `info` dictionary:
 }
 ```
 
+### Fields
+
 - **`flagged`**: Whether the content was flagged as potentially hallucinated
 - **`confidence`**: Confidence score (0.0 to 1.0) for the detection
-- **`reasoning`**: Explanation of why the content was flagged
-- **`hallucination_type`**: Type of issue detected (e.g., "factual_error", "unsupported_claim")
-- **`hallucinated_statements`**: Specific statements that are contradicted or unsupported
-- **`verified_statements`**: Statements that are supported by your documents
 - **`threshold`**: The confidence threshold that was configured
-
-Tip: `hallucination_type` is typically one of `factual_error`, `unsupported_claim`, or `none`.
+- **`reasoning`**: Explanation of why the content was flagged - *only included when `include_reasoning=true`*
+- **`hallucination_type`**: Type of issue detected (e.g., "factual_error", "unsupported_claim", "none") - *only included when `include_reasoning=true`*
+- **`hallucinated_statements`**: Specific statements that are contradicted or unsupported - *only included when `include_reasoning=true`*
+- **`verified_statements`**: Statements that are supported by your documents - *only included when `include_reasoning=true`*
 
 ## Benchmark Results
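For reference, with `include_reasoning` left at its default of `false`, the `info` dictionary documented above reduces to the essential keys. A purely illustrative sketch (values made up, not taken from the diff):

```python
# Illustrative only: the reasoning-related keys (reasoning, hallucination_type,
# hallucinated_statements, verified_statements) are omitted entirely.
minimal_info = {
    "flagged": True,     # example value
    "confidence": 0.85,  # example value
    "threshold": 0.7,
}
```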

docs/ref/checks/jailbreak.md

Lines changed: 8 additions & 2 deletions
@@ -26,7 +26,8 @@ Jailbreak detection focuses on **deception and manipulation tactics** designed t
   "config": {
     "model": "gpt-4.1-mini",
     "confidence_threshold": 0.7,
-    "max_turns": 10
+    "max_turns": 10,
+    "include_reasoning": false
   }
 }
 ```
@@ -35,6 +36,11 @@ Jailbreak detection focuses on **deception and manipulation tactics** designed t
 
 - **`model`** (required): Model to use for detection (e.g., "gpt-4.1-mini")
 - **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
+- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
+  - When `false`: The LLM only generates the essential fields (`flagged` and `confidence`), reducing token generation costs
+  - When `true`: Additionally, returns detailed reasoning for its decisions
+  - **Performance**: In our evaluations, disabling reasoning reduces median latency by 40% on average (ranging from 18% to 67% depending on model) while maintaining detection performance
+  - **Use Case**: Keep disabled for production to minimize costs and latency; enable for development and debugging
 - **`max_turns`** (optional): Maximum number of conversation turns to include for multi-turn analysis. Default: 10. Set to 1 for single-turn mode.
 
 ## What It Returns
@@ -61,7 +67,7 @@ Returns a `GuardrailResult` with the following `info` dictionary:
 - **`flagged`**: Whether a jailbreak attempt was detected
 - **`confidence`**: Confidence score (0.0 to 1.0) for the detection
 - **`threshold`**: The confidence threshold that was configured
-- **`reason`**: Explanation of why the input was flagged (or not flagged)
+- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*
 - **`token_usage`**: Token usage statistics from the LLM call
6773

docs/ref/checks/llm_base.md

Lines changed: 7 additions & 1 deletion
@@ -10,7 +10,8 @@ Base configuration for LLM-based guardrails. Provides common configuration optio
   "config": {
     "model": "gpt-5",
     "confidence_threshold": 0.7,
-    "max_turns": 10
+    "max_turns": 10,
+    "include_reasoning": false
   }
 }
 ```
@@ -20,6 +21,11 @@ Base configuration for LLM-based guardrails. Provides common configuration optio
 - **`model`** (required): OpenAI model to use for the check (e.g., "gpt-5")
 - **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
 - **`max_turns`** (optional): Maximum number of conversation turns to include for multi-turn analysis. Default: 10. Set to 1 for single-turn mode.
+- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
+  - When `true`: The LLM generates and returns detailed reasoning for its decisions (e.g., `reason`, `reasoning`, `observation`, `evidence` fields)
+  - When `false`: The LLM only returns the essential fields (`flagged` and `confidence`), reducing token generation costs
+  - **Performance**: In our evaluations, disabling reasoning reduces median latency by 40% on average (ranging from 18% to 67% depending on model) while maintaining detection performance
+  - **Use Case**: Keep disabled for production to minimize costs and latency; enable for development and debugging
 
 ## What It Does

docs/ref/checks/nsfw.md

Lines changed: 6 additions & 0 deletions
@@ -31,6 +31,11 @@ Flags workplace‑inappropriate model outputs: explicit sexual content, profanit
 - **`model`** (required): Model to use for detection (e.g., "gpt-4.1-mini")
 - **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
 - **`max_turns`** (optional): Maximum number of conversation turns to include for multi-turn analysis. Default: 10. Set to 1 for single-turn mode.
+- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
+  - When `false`: The LLM only generates the essential fields (`flagged` and `confidence`), reducing token generation costs
+  - When `true`: Additionally, returns detailed reasoning for its decisions
+  - **Performance**: In our evaluations, disabling reasoning reduces median latency by 40% on average (ranging from 18% to 67% depending on model) while maintaining detection performance
+  - **Use Case**: Keep disabled for production to minimize costs and latency; enable for development and debugging
 
 ### Tuning guidance
 
@@ -59,6 +64,7 @@ Returns a `GuardrailResult` with the following `info` dictionary:
 - **`confidence`**: Confidence score (0.0 to 1.0) for the detection
 - **`threshold`**: The confidence threshold that was configured
 - **`token_usage`**: Token usage statistics from the LLM call
+- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*
 
 ### Examples

docs/ref/checks/off_topic_prompts.md

Lines changed: 6 additions & 0 deletions
@@ -22,6 +22,11 @@ Ensures content stays within defined business scope using LLM analysis. Flags co
 - **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
 - **`system_prompt_details`** (required): Description of your business scope and acceptable topics
 - **`max_turns`** (optional): Maximum number of conversation turns to include for multi-turn analysis. Default: 10. Set to 1 for single-turn mode.
+- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
+  - When `false`: The LLM only generates the essential fields (`flagged` and `confidence`), reducing token generation costs
+  - When `true`: Additionally, returns detailed reasoning for its decisions
+  - **Performance**: In our evaluations, disabling reasoning reduces median latency by 40% on average (ranging from 18% to 67% depending on model) while maintaining detection performance
+  - **Use Case**: Keep disabled for production to minimize costs and latency; enable for development and debugging
 
 ## Implementation Notes
 
@@ -50,3 +55,4 @@ Returns a `GuardrailResult` with the following `info` dictionary:
 - **`confidence`**: Confidence score (0.0 to 1.0) for the assessment
 - **`threshold`**: The confidence threshold that was configured
 - **`token_usage`**: Token usage statistics from the LLM call
+- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*

docs/ref/checks/prompt_injection_detection.md

Lines changed: 11 additions & 2 deletions
@@ -32,7 +32,8 @@ After tool execution, the prompt injection detection check validates that the re
   "config": {
     "model": "gpt-4.1-mini",
     "confidence_threshold": 0.7,
-    "max_turns": 10
+    "max_turns": 10,
+    "include_reasoning": false
   }
 }
 ```
@@ -42,6 +43,11 @@ After tool execution, the prompt injection detection check validates that the re
 - **`model`** (required): Model to use for prompt injection detection analysis (e.g., "gpt-4.1-mini")
 - **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
 - **`max_turns`** (optional): Maximum number of user messages to include for determining user intent. Default: 10. Set to 1 to only use the most recent user message.
+- **`include_reasoning`** (optional): Whether to include the `observation` and `evidence` fields in the output (default: `false`)
+  - When `true`: Returns detailed `observation` explaining what the action is doing and `evidence` with specific quotes/details
+  - When `false`: Omits reasoning fields to save tokens (typically 100-300 tokens per check)
+  - **Performance**: In our evaluations, disabling reasoning reduces median latency by 40% on average (ranging from 18% to 67% depending on model) while maintaining detection performance
+  - **Use Case**: Keep disabled for production to minimize costs and latency; enable for development and debugging
 
 **Flags as MISALIGNED:**
 
@@ -79,13 +85,16 @@ Returns a `GuardrailResult` with the following `info` dictionary:
 }
 ```
 
-- **`observation`**: What the AI action is doing
+- **`observation`**: What the AI action is doing - *only included when `include_reasoning=true`*
 - **`flagged`**: Whether the action is misaligned (boolean)
 - **`confidence`**: Confidence score (0.0 to 1.0) that the action is misaligned
+- **`evidence`**: Specific evidence from conversation supporting the decision - *only included when `include_reasoning=true`*
 - **`threshold`**: The confidence threshold that was configured
 - **`user_goal`**: The tracked user intent from conversation
 - **`action`**: The list of function calls or tool outputs analyzed for alignment
 
+**Note**: When `include_reasoning=false` (the default), the `observation` and `evidence` fields are omitted to reduce token generation costs.
+
 ## Benchmark Results
 
 ### Dataset Description
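To make the note above concrete, an illustrative `info` payload with reasoning disabled follows; every value is invented, and the shape of the `action` entries is an assumption (the docs only say it is the list of analyzed function calls or tool outputs):

```python
# Illustrative only: `observation` and `evidence` are absent when
# include_reasoning=false; the remaining keys follow the field list above.
info_without_reasoning = {
    "flagged": False,
    "confidence": 0.12,
    "threshold": 0.7,
    "user_goal": "Book a one-way flight from SFO to JFK",             # made-up example
    "action": [{"type": "function_call", "name": "search_flights"}],  # assumed shape
}
```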

src/guardrails/checks/text/hallucination_detection.py

Lines changed: 37 additions & 23 deletions
@@ -94,8 +94,8 @@ class HallucinationDetectionOutput(LLMOutput):
     Extends the base LLM output with hallucination-specific details.
 
     Attributes:
-        flagged (bool): Whether the content was flagged as potentially hallucinated.
-        confidence (float): Confidence score (0.0 to 1.0) that the input is hallucinated.
+        flagged (bool): Whether the content was flagged as potentially hallucinated (inherited).
+        confidence (float): Confidence score (0.0 to 1.0) that the input is hallucinated (inherited).
         reasoning (str): Detailed explanation of the analysis.
         hallucination_type (str | None): Type of hallucination detected.
         hallucinated_statements (list[str] | None): Specific statements flagged as
@@ -104,16 +104,6 @@ class HallucinationDetectionOutput(LLMOutput):
             by the documents.
     """
 
-    flagged: bool = Field(
-        ...,
-        description="Indicates whether the content was flagged as potentially hallucinated.",
-    )
-    confidence: float = Field(
-        ...,
-        description="Confidence score (0.0 to 1.0) that the input is hallucinated.",
-        ge=0.0,
-        le=1.0,
-    )
     reasoning: str = Field(
         ...,
         description="Detailed explanation of the hallucination analysis.",
@@ -184,14 +174,6 @@ class HallucinationDetectionOutput(LLMOutput):
     3. **Clearly contradicted by the documents** - Claims that directly contradict the documents → FLAG
     4. **Completely unsupported by the documents** - Claims that cannot be verified from the documents → FLAG
 
-    Respond with a JSON object containing:
-    - "flagged": boolean (true if ANY factual claims are clearly contradicted or completely unsupported)
-    - "confidence": float (0.0 to 1.0, your confidence that the input is hallucinated)
-    - "reasoning": string (detailed explanation of your analysis)
-    - "hallucination_type": string (type of issue, if detected: "factual_error", "unsupported_claim", or "none" if supported)
-    - "hallucinated_statements": array of strings (specific factual statements that may be hallucinated)
-    - "verified_statements": array of strings (specific factual statements that are supported by the documents)
-
     **CRITICAL GUIDELINES**:
     - Flag content if ANY factual claims are unsupported or contradicted (even if some claims are supported)
     - Allow conversational, opinion-based, or general content to pass through
@@ -206,6 +188,30 @@ class HallucinationDetectionOutput(LLMOutput):
 ).strip()
 
 
+# Instruction for output format when reasoning is enabled
+REASONING_OUTPUT_INSTRUCTION = textwrap.dedent(
+    """
+    Respond with a JSON object containing:
+    - "flagged": boolean (true if ANY factual claims are clearly contradicted or completely unsupported)
+    - "confidence": float (0.0 to 1.0, your confidence that the input is hallucinated)
+    - "reasoning": string (detailed explanation of your analysis)
+    - "hallucination_type": string (type of issue, if detected: "factual_error", "unsupported_claim", or "none" if supported)
+    - "hallucinated_statements": array of strings (specific factual statements that may be hallucinated)
+    - "verified_statements": array of strings (specific factual statements that are supported by the documents)
+    """
+).strip()
+
+
+# Instruction for output format when reasoning is disabled
+BASE_OUTPUT_INSTRUCTION = textwrap.dedent(
+    """
+    Respond with a JSON object containing:
+    - "flagged": boolean (true if ANY factual claims are clearly contradicted or completely unsupported)
+    - "confidence": float (0.0 to 1.0, your confidence that the input is hallucinated)
+    """
+).strip()
+
+
 async def hallucination_detection(
     ctx: GuardrailLLMContextProto,
     candidate: str,
@@ -242,15 +248,23 @@ async def hallucination_detection(
     )
 
     try:
-        # Create the validation query
-        validation_query = f"{VALIDATION_PROMPT}\n\nText to validate:\n{candidate}"
+        # Build the prompt based on whether reasoning is requested
+        if config.include_reasoning:
+            output_instruction = REASONING_OUTPUT_INSTRUCTION
+            output_format = HallucinationDetectionOutput
+        else:
+            output_instruction = BASE_OUTPUT_INSTRUCTION
+            output_format = LLMOutput
+
+        # Create the validation query with appropriate output instructions
+        validation_query = f"{VALIDATION_PROMPT}\n\n{output_instruction}\n\nText to validate:\n{candidate}"
 
         # Use the Responses API with file search and structured output
        response = await _invoke_openai_callable(
             ctx.guardrail_llm.responses.parse,
             input=validation_query,
             model=config.model,
-            text_format=HallucinationDetectionOutput,
+            text_format=output_format,
             tools=[{"type": "file_search", "vector_store_ids": [config.knowledge_source]}],
         )
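A small test-style sketch (not part of this commit) of what the new constants promise: the base instruction requests only the two essential fields, while the reasoning instruction and the richer output model cover the detail fields. The import path assumes the `src/` layout shown above, and `model_fields` assumes Pydantic v2.

```python
# Illustrative pytest-style checks against the constants added in this commit.
from guardrails.checks.text.hallucination_detection import (
    BASE_OUTPUT_INSTRUCTION,
    REASONING_OUTPUT_INSTRUCTION,
    HallucinationDetectionOutput,
)


def test_base_instruction_requests_only_essential_fields() -> None:
    assert '"flagged"' in BASE_OUTPUT_INSTRUCTION
    assert '"confidence"' in BASE_OUTPUT_INSTRUCTION
    assert "reasoning" not in BASE_OUTPUT_INSTRUCTION


def test_reasoning_instruction_and_model_cover_detail_fields() -> None:
    for field in ("reasoning", "hallucination_type", "hallucinated_statements", "verified_statements"):
        assert field in REASONING_OUTPUT_INSTRUCTION
        assert field in HallucinationDetectionOutput.model_fields  # Pydantic v2 API
```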
