Dev/steven/agent convo #56
steven10a
commented
Nov 19, 2025
- Correctly pass conversation history to guardrails when using Agents
- Updated JB system prompt with banned content (i.e., system prompt, internal details)
- Updated tests
- Updated async eval execution
Pull Request Overview
This PR improves conversation history handling in guardrails, particularly for Agents, and updates the Jailbreak guardrail's system prompt. The key changes ensure that conversation-aware guardrails (like Jailbreak) receive proper conversation history from agent sessions, while also supporting mixed evaluation scenarios where both conversation-aware and non-conversation-aware guardrails are used together.
- Fixes conversation history passing to guardrails in Agent contexts
- Updates Jailbreak system prompt with explicit banned content categories
- Refactors eval engine to handle mixed guardrail types (conversation-aware and non-conversation-aware)
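The history-passing behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Guardrail` dataclass, the `uses_conversation_history` flag, and `run_guardrails` are hypothetical names chosen for the example, and the real library's classes and signatures may differ.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

Message = dict[str, Any]


@dataclass
class Guardrail:
    """Hypothetical guardrail descriptor (an assumption, not the library's class)."""

    name: str
    # check(latest_text, visible_history) -> True if the tripwire is triggered
    check: Callable[[str, list[Message]], bool]
    uses_conversation_history: bool = False


def run_guardrails(
    guardrails: list[Guardrail],
    latest_text: str,
    history: Optional[list[Message]],
) -> dict[str, bool]:
    """Run every guardrail for one agent turn and return {name: triggered}."""
    history = history or []
    results: dict[str, bool] = {}
    for guardrail in guardrails:
        # Conversation-aware guardrails receive the session history;
        # the others only look at the latest text.
        visible = history if guardrail.uses_conversation_history else []
        results[guardrail.name] = guardrail.check(latest_text, visible)
    return results
```

The point of the sketch is that the decision to pass history is made per guardrail, so a conversation-aware check such as Jailbreak and a single-turn check can run side by side on the same agent turn.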
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/unit/test_agents.py | Added tests verifying conversation history is properly passed to guardrails from agent sessions |
| tests/unit/evals/test_async_engine.py | Added test for mixed conversation-aware and non-conversation-aware guardrail evaluation |
| src/guardrails/evals/core/async_engine.py | Refactored to evaluate both types of guardrails separately and combine the results; renamed the annotation function for clarity |
| src/guardrails/client.py | Simplified logic to always create conversation context when history is present, not just for specific guardrails |
| src/guardrails/checks/text/jailbreak.py | Added explicit banned content categories to system prompt |
| src/guardrails/agents.py | Updated to load and pass conversation history to all guardrails that need it |
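To illustrate the mixed evaluation summarized for `async_engine.py`, here is a rough sketch under stated assumptions. The function `evaluate_sample`, the `Check` type, and the result keys are hypothetical, and the choice to feed single-turn guardrails only the latest message is an assumption made for the example rather than something the PR states.

```python
import asyncio
from typing import Any, Awaitable, Callable

Message = dict[str, Any]
# A check takes a list of messages and reports whether it triggered.
Check = Callable[[list[Message]], Awaitable[bool]]


async def evaluate_sample(
    conversation: list[Message],
    conversation_aware: dict[str, Check],
    single_turn: dict[str, Check],
) -> dict[str, dict[str, Any]]:
    """Evaluate both guardrail kinds separately, then combine the results."""
    results: dict[str, dict[str, Any]] = {}

    # Conversation-aware guardrails: evaluate incrementally on a growing
    # prefix of the conversation and record the first turn that triggers.
    for name, check in conversation_aware.items():
        entry: dict[str, Any] = {"triggered": False, "turn_index": None}
        for i in range(len(conversation)):
            if await check(conversation[: i + 1]):
                entry = {"triggered": True, "turn_index": i}
                break
        results[name] = entry

    # Other guardrails: a single pass over the latest message only (assumption).
    latest = conversation[-1:]
    for name, check in single_turn.items():
        results[name] = {"triggered": await check(latest), "turn_index": None}

    return results
```

A caller would run this with `asyncio.run(evaluate_sample(...))` and get one entry per guardrail, regardless of which kind it is.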
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
@codex review
Pull Request Overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
src/guardrails/evals/core/async_engine.py:93
- The function documentation describes it as being for "prompt injection detection", but the function has been renamed to be more generic (_annotate_incremental_result) and is now used for all conversation-aware guardrails, not just prompt injection detection. The documentation should be updated to reflect this generalization.
Consider updating the docstring description to remove the outdated reference and make it clear this applies to any conversation-aware guardrail being evaluated incrementally.
def _annotate_incremental_result(
    result: Any,
    turn_index: int,
    message: dict[str, Any] | Any,
) -> None:
    """Annotate guardrail result with incremental evaluation metadata.

    Adds turn-by-turn context to results from conversation-aware guardrails
    being evaluated incrementally. This includes the turn index, role, and
    message that triggered the guardrail (if applicable).

    Args:
        result: GuardrailResult to annotate
        turn_index: Index of the conversation turn (0-based)
        message: Message object being evaluated (dict or object format)
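For readers without the diff at hand, the behavior this docstring describes could look roughly like the sketch below. It is an assumption-laden illustration, not the function body from the PR: `ToyResult`, its `info` dict, the `trigger_*` key names, and the function name are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToyResult:
    """Hypothetical stand-in for a guardrail result with a metadata dict."""

    info: dict[str, Any] = field(default_factory=dict)


def annotate_incremental_result_sketch(
    result: ToyResult,
    turn_index: int,
    message: Any,
) -> None:
    """Attach turn index, role, and message content to a result in place."""
    # Messages may arrive as plain dicts or as objects with attributes.
    if isinstance(message, dict):
        role = message.get("role")
        content = message.get("content")
    else:
        role = getattr(message, "role", None)
        content = getattr(message, "content", None)

    # Key names are illustrative assumptions.
    result.info["trigger_turn_index"] = turn_index
    result.info["trigger_role"] = role
    result.info["trigger_message"] = content
```

Handling both dict and attribute access mirrors the "dict or object format" note in the `Args` section above.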
@codex review
Codex Review: Didn't find any major issues. Nice work!