[Open Safeguard Hackathon December 2025 Submission] SANDRA: Safety Appeals & Network Dispute Review Agent #34
bentonwong started this conversation in gpt-oss-safeguard Implementation
Replies: 1 comment

Thank you for sharing! This is a super interesting project.
SANDRA: Safety Appeals & Network Dispute Review Agent
Team Members
Project Description and Problem Statement
The Problem
User-generated content (UGC) platforms face significant challenges in handling content moderation appeals:
Our Solution: SANDRA
SANDRA (Safety Appeals & Network Dispute Review Agent) is an open-source-powered appeals copilot that provides a fair, fast, and transparent second look at content moderation decisions. Built on open-weight safeguard models (20B and 120B variants), SANDRA combines:
Research Conducted & Prototype Description
Architecture
SANDRA implements a multi-agent system with two specialized agents:
Analysis Agent:
Evaluator Agent:
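As a rough sketch of how a two-agent appeals pipeline like this could be wired together, here is a hedged TypeScript illustration. All names, fields, and the keyword-based lane heuristic below are invented for illustration; they are not the project's actual `analysisAgent.ts`/`evaluatorAgent.ts` implementations, which call the safeguard models.

```typescript
// Hypothetical sketch of a two-agent appeal pipeline (illustrative only).
type RiskLane = "low" | "medium" | "high";

interface Appeal {
  id: string;
  content: string;
  originalDecision: string;
}

interface Analysis {
  appealId: string;
  summary: string;
  lane: RiskLane;
}

interface Verdict {
  appealId: string;
  outcome: "overturn" | "uphold" | "escalate";
  rationale: string;
}

// Analysis Agent: examines the appeal and proposes a risk lane.
// (A keyword heuristic stands in for a real safeguard-model call.)
function analysisAgent(appeal: Appeal): Analysis {
  const lane: RiskLane = /self[- ]?harm|threat/i.test(appeal.content)
    ? "high"
    : /humor|satire/i.test(appeal.content)
    ? "medium"
    : "low";
  return { appealId: appeal.id, summary: appeal.content.slice(0, 80), lane };
}

// Evaluator Agent: turns the analysis into a verdict; high-risk lanes
// always escalate to a human moderator rather than auto-resolving.
function evaluatorAgent(a: Analysis): Verdict {
  if (a.lane === "high") {
    return {
      appealId: a.appealId,
      outcome: "escalate",
      rationale: "High-risk lane requires human review.",
    };
  }
  const outcome = a.lane === "low" ? "overturn" : "uphold";
  return { appealId: a.appealId, outcome, rationale: `Lane ${a.lane}: automated resolution.` };
}

const verdict = evaluatorAgent(
  analysisAgent({ id: "a-1", content: "My recovery story was removed.", originalDecision: "removed" })
);
console.log(verdict.outcome); // "overturn" under this toy heuristic
```

The key design point the sketch tries to capture is the separation of concerns: the Analysis Agent only classifies, while the Evaluator Agent alone decides outcomes and owns the escalation rule.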
Risk Lanes
Every appeal is categorized into one of three risk lanes:
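A minimal sketch of how lane assignment might work, assuming the agents emit a numeric risk score in [0, 1]. The lane names and thresholds below are assumptions for illustration; the submission does not specify them.

```typescript
// Illustrative three-lane triage; names and thresholds are assumptions.
type Lane = "low" | "medium" | "high";

// Map a hypothetical risk score in [0, 1] to a handling lane.
function triage(riskScore: number): Lane {
  if (riskScore < 0.3) return "low";    // fast automated resolution
  if (riskScore < 0.7) return "medium"; // full two-agent review
  return "high";                        // escalate to a human moderator
}

console.log(triage(0.1)); // "low"
```

Thresholding into a small, fixed set of lanes keeps the routing policy auditable: every appeal's handling path can be explained by a single score and two cutoffs.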
Technical Implementation
Tech Stack:
Key Features:
Working Demo
The prototype is a fully functional single-page application that demonstrates:
Demo Cases Included:
Results from the Experiment
Prototype Capabilities Demonstrated
Successful Multi-Agent Pipeline
Precedent-Aware Decision Making
Risk-Based Triage
Model Variant Comparison
User Experience
Technical Achievements
Limitations & Future Work
Current Limitations:
Future Enhancements:
Key Insights
Repository & Demo
GitHub Repository: [Link to your repository]
Live Demo: https://sandra-app.vercel.app/
You can interact with the full SANDRA prototype at the link above. The demo includes 14 test appeals covering various scenarios including recovery stories, dark humor, educational content, and high-risk cases requiring escalation.
Presentation: Google Slides Presentation
Key Files:
app/page.tsx - Main UI component
lib/agents/analysisAgent.ts - Analysis Agent implementation
lib/agents/evaluatorAgent.ts - Evaluator Agent implementation
data/appeals.ts - Seed appeal cases
data/precedents.ts - Precedent database
docs/ - Comprehensive documentation

Conclusion
SANDRA demonstrates that open-source AI models, when combined with structured reasoning pipelines and precedent-aware systems, can provide effective, transparent, and fair content moderation appeals processing. The prototype successfully handles a range of cases from straightforward recoveries to complex gray-area situations, appropriately escalating high-risk cases while providing fast resolution for low-risk appeals.
The system shows promise for reducing moderator workload, improving consistency, and providing a better user experience in content moderation appeals, all while maintaining safety through multi-agent oversight and risk-based triage.
Additional Resources
docs/SANDRA_PRODUCT_VISION.md
docs/SANDRA_CURSOR_ARCH_PROMPT.md
docs/SANDRA_EXAMPLES.md
docs/PROMPTS_SANDRA.md
docs/POLICY_SELF_HARM.md