[Open Safeguard Hackathon December 2025 Submission] SANDRA: Safety Appeals & Network Dispute Review Agent #34
bentonwong started this conversation in gpt-oss-safeguard Implementation
Replies: 1 comment

Thank you for sharing! This is a super interesting project.
SANDRA: Safety Appeals & Network Dispute Review Agent
Team Members
Project Description and Problem Statement
The Problem
User-generated content (UGC) platforms face significant challenges in handling content moderation appeals:
Our Solution: SANDRA
SANDRA (Safety Appeals & Network Dispute Review Agent) is an open-source-powered appeals copilot that provides a fair, fast, and transparent second look at content moderation decisions. Built on open-weight safeguard models (20B and 120B variants), SANDRA combines:
Research Conducted & Prototype Description
Architecture
SANDRA implements a multi-agent system with two specialized agents:
Analysis Agent:
Evaluator Agent:
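As a rough sketch of how a two-agent appeals pipeline like this could be wired together, here is a hedged TypeScript illustration. All names, fields, and the keyword-based lane heuristic below are invented for illustration; they are not the project's actual `analysisAgent.ts`/`evaluatorAgent.ts` implementations, which call the safeguard models.

```typescript
// Hypothetical sketch of a two-agent appeal pipeline (illustrative only).
type RiskLane = "low" | "medium" | "high";

interface Appeal {
  id: string;
  content: string;
  originalDecision: string;
}

interface Analysis {
  appealId: string;
  summary: string;
  lane: RiskLane;
}

interface Verdict {
  appealId: string;
  outcome: "overturn" | "uphold" | "escalate";
  rationale: string;
}

// Analysis Agent: examines the appeal and proposes a risk lane.
// (A keyword heuristic stands in for a real safeguard-model call.)
function analysisAgent(appeal: Appeal): Analysis {
  const lane: RiskLane = /self[- ]?harm|threat/i.test(appeal.content)
    ? "high"
    : /humor|satire/i.test(appeal.content)
    ? "medium"
    : "low";
  return { appealId: appeal.id, summary: appeal.content.slice(0, 80), lane };
}

// Evaluator Agent: turns the analysis into a verdict; high-risk lanes
// always escalate to a human moderator rather than auto-resolving.
function evaluatorAgent(a: Analysis): Verdict {
  if (a.lane === "high") {
    return {
      appealId: a.appealId,
      outcome: "escalate",
      rationale: "High-risk lane requires human review.",
    };
  }
  const outcome = a.lane === "low" ? "overturn" : "uphold";
  return { appealId: a.appealId, outcome, rationale: `Lane ${a.lane}: automated resolution.` };
}

const verdict = evaluatorAgent(
  analysisAgent({ id: "a-1", content: "My recovery story was removed.", originalDecision: "removed" })
);
console.log(verdict.outcome); // "overturn" under this toy heuristic
```

The key design point the sketch tries to capture is the separation of concerns: the Analysis Agent only classifies, while the Evaluator Agent alone decides outcomes and owns the escalation rule.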
Risk Lanes
Every appeal is categorized into one of three risk lanes:
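A minimal sketch of how lane assignment might work, assuming the agents emit a numeric risk score in [0, 1]. The lane names and thresholds below are assumptions for illustration; the submission does not specify them.

```typescript
// Illustrative three-lane triage; names and thresholds are assumptions.
type Lane = "low" | "medium" | "high";

// Map a hypothetical risk score in [0, 1] to a handling lane.
function triage(riskScore: number): Lane {
  if (riskScore < 0.3) return "low";    // fast automated resolution
  if (riskScore < 0.7) return "medium"; // full two-agent review
  return "high";                        // escalate to a human moderator
}

console.log(triage(0.1)); // "low"
```

Thresholding into a small, fixed set of lanes keeps the routing policy auditable: every appeal's handling path can be explained by a single score and two cutoffs.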
Technical Implementation
Tech Stack:
Key Features:
Working Demo
The prototype is a fully functional single-page application that demonstrates:
Demo Cases Included:
Results from the Experiment
Prototype Capabilities Demonstrated
Successful Multi-Agent Pipeline
Precedent-Aware Decision Making
Risk-Based Triage
Model Variant Comparison
User Experience
Technical Achievements
Limitations & Future Work
Current Limitations:
Future Enhancements:
Key Insights
Repository & Demo
GitHub Repository: [Link to your repository]
Live Demo: https://sandra-app.vercel.app/
You can interact with the full SANDRA prototype at the link above. The demo includes 14 test appeals covering various scenarios including recovery stories, dark humor, educational content, and high-risk cases requiring escalation.
Presentation: Google Slides Presentation
Key Files:
app/page.tsx - Main UI component
lib/agents/analysisAgent.ts - Analysis Agent implementation
lib/agents/evaluatorAgent.ts - Evaluator Agent implementation
data/appeals.ts - Seed appeal cases
data/precedents.ts - Precedent database
docs/ - Comprehensive documentation

Conclusion
SANDRA demonstrates that open-source AI models, when combined with structured reasoning pipelines and precedent-aware systems, can provide effective, transparent, and fair content moderation appeals processing. The prototype successfully handles a range of cases from straightforward recoveries to complex gray-area situations, appropriately escalating high-risk cases while providing fast resolution for low-risk appeals.
The system shows promise for reducing moderator workload, improving consistency, and providing a better user experience in content moderation appeals, all while maintaining safety through multi-agent oversight and risk-based triage.
Additional Resources
docs/SANDRA_PRODUCT_VISION.md
docs/SANDRA_CURSOR_ARCH_PROMPT.md
docs/SANDRA_EXAMPLES.md
docs/PROMPTS_SANDRA.md
docs/POLICY_SELF_HARM.md