Skip to content

Conversation

@steven10a
Copy link
Collaborator

  • Correctly pass conversation history to guardrails when using Agents
  • Updated JB system prompt with banned content (i.e, system prompt, internal details)
  • Updated tests
  • Updated eval async running

Copilot AI review requested due to automatic review settings November 19, 2025 19:32
Copilot finished reviewing on behalf of steven10a November 19, 2025 19:34
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR improves conversation history handling in guardrails, particularly for Agents, and updates the Jailbreak guardrail's system prompt. The key changes ensure that conversation-aware guardrails (like Jailbreak) receive proper conversation history from agent sessions, while also supporting mixed evaluation scenarios where both conversation-aware and non-conversation-aware guardrails are used together.

  • Fixes conversation history passing to guardrails in Agent contexts
  • Updates Jailbreak system prompt with explicit banned content categories
  • Refactors eval engine to handle mixed guardrail types (conversation-aware and non-conversation-aware)

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/unit/test_agents.py Added tests verifying conversation history is properly passed to guardrails from agent sessions
tests/unit/evals/test_async_engine.py Added test for mixed conversation-aware and non-conversation-aware guardrail evaluation
src/guardrails/evals/core/async_engine.py Refactored to evaluate both types of guardrails separately and combined results; renamed annotation function for clarity
src/guardrails/client.py Simplified logic to always create conversation context when history is present, not just for specific guardrails
src/guardrails/checks/text/jailbreak.py Added explicit banned content categories to system prompt
src/guardrails/agents.py Updated to load and pass conversation history to all guardrails that need it

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@steven10a
Copy link
Collaborator Author

@codex review

@steven10a steven10a requested a review from Copilot November 19, 2025 19:56
Copilot finished reviewing on behalf of steven10a November 19, 2025 20:01
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

src/guardrails/evals/core/async_engine.py:93

  • The function documentation describes it as being for "prompt injection detection", but the function has been renamed to be more generic (_annotate_incremental_result) and is now used for all conversation-aware guardrails, not just prompt injection detection. The documentation should be updated to reflect this generalization.

Consider updating the docstring description to remove the outdated reference and make it clear this applies to any conversation-aware guardrail being evaluated incrementally.

def _annotate_incremental_result(
    result: Any,
    turn_index: int,
    message: dict[str, Any] | Any,
) -> None:
    """Annotate guardrail result with incremental evaluation metadata.

    Adds turn-by-turn context to results from conversation-aware guardrails
    being evaluated incrementally. This includes the turn index, role, and
    message that triggered the guardrail (if applicable).

    Args:
        result: GuardrailResult to annotate
        turn_index: Index of the conversation turn (0-based)
        message: Message object being evaluated (dict or object format)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@steven10a
Copy link
Collaborator Author

@codex review

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@steven10a
Copy link
Collaborator Author

@codex review

@chatgpt-codex-connector
Copy link

Codex Review: Didn't find any major issues. Nice work!

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@gabor-openai gabor-openai merged commit 06c1018 into main Nov 19, 2025
3 checks passed
@gabor-openai gabor-openai deleted the dev/steven/agent_convo branch November 19, 2025 21:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants