Mock Policy (Spam) that works well with gpt-oss-safeguard #4
yallaziad started this conversation in gpt-oss-safeguard Implementation
Hi there! In testing gpt-oss-safeguard, I found that the following format, using a mock spam policy, works well (Accuracy 0.9, Precision 1.0, Recall 0.8). I'm also attaching the golden set of examples I used.
Spam Policy Golden Set.xlsx
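For reference, the metrics above treat INVALID (spam) as the positive class. A small sketch of how they can be computed against a golden set follows; the labels below are made up to illustrate the arithmetic (they happen to reproduce the reported figures) and are not rows from the attached sheet:

```python
# Compute accuracy, precision, and recall over gold vs. predicted
# labels, treating INVALID (spam) as the positive class.
# These label lists are illustrative, not the actual golden set.
gold = ["INVALID"] * 5 + ["VALID"] * 5
pred = ["INVALID"] * 4 + ["VALID"] * 6  # one spam example missed

tp = sum(g == p == "INVALID" for g, p in zip(gold, pred))
fp = sum(g == "VALID" and p == "INVALID" for g, p in zip(gold, pred))
fn = sum(g == "INVALID" and p == "VALID" for g, p in zip(gold, pred))

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
precision = tp / (tp + fp)  # 1.0: nothing legitimate was flagged
recall = tp / (tp + fn)     # 0.8: 4 of 5 spam examples caught
```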
Spam Policy (#SP)
GOAL: Identify spam. Classify each EXAMPLE as VALID (no spam) or INVALID (spam) using this policy.
DEFINITIONS
✅ Allowed Content (SP0 – Non-Spam or very low confidence signals of spam)
Content that is useful, contextual, or non-promotional. It may look spammy but could be legitimate.
✅ Output: VALID (either clearly non-spam, or only very low-confidence signals that the content could be spam).
🚫 Likely Spam (SP2 – Medium Confidence)
Unsolicited promotion without deception.
❌ Output: INVALID
❗ High-Risk Spam (SP3 – Strong Confidence)
Spam showing scaling, automation, or aggressive tactics.
❌ Output: INVALID
🚨 Malicious Spam (SP4 – Maximum Severity)
Spam with fraud, deception, or harmful intent.
❌ Output: INVALID + ESCALATE
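In case it helps anyone reproduce this: the tiered policy above goes into the system message and the example to classify goes into the user message, which is the pattern gpt-oss-safeguard is built around. A minimal sketch of assembling that request payload (the model name and the abridged policy text here are my assumptions, not part of any official API):

```python
# Build a chat-completion request that carries the spam policy as the
# system message and the example to classify as the user message.
# The policy text is abridged and the model name is an assumption.
SPAM_POLICY = """Spam Policy (#SP)
GOAL: Identify spam. Classify each EXAMPLE as VALID (no spam) or
INVALID (spam) using this policy.
...
"""

def build_request(example: str, model: str = "gpt-oss-safeguard-20b") -> dict:
    """Return an OpenAI-style chat payload: policy first, content second."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SPAM_POLICY},
            {"role": "user", "content": example},
        ],
    }

req = build_request("WIN A FREE iPHONE!! Click here now!!!")
```

You would then send `req` to whatever OpenAI-compatible endpoint is serving the model.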
LABEL FORMAT
Each item gets two labels:
Depiction (D-SP#): Presence of spam in content.
Request (R-SP#): User asking to generate spam.
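Assuming the model emits the two labels somewhere in its response text (e.g. `Depiction: D-SP3` / `Request: R-SP0` — the exact response format is an assumption on my part), a hedged sketch of pulling them out:

```python
import re

def parse_labels(response: str) -> dict:
    """Extract the Depiction (D-SP#) and Request (R-SP#) labels from a
    model response; a label missing from the text comes back as None."""
    labels = {}
    for kind, prefix in (("depiction", "D"), ("request", "R")):
        m = re.search(rf"\b{prefix}-SP(\d)\b", response)
        labels[kind] = f"{prefix}-SP{m.group(1)}" if m else None
    return labels

out = parse_labels("Depiction: D-SP3\nRequest: R-SP0\nOutput: INVALID")
```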
AMBIGUITY & ESCALATION