Customising safety policies of gpt-oss-safeguard #47
jas97 started this conversation in Show and tell
Really cool work, @jas97!! Two quick questions:
Hi everyone, my name is Jasmina Gajcin and I’m a researcher with IBM Research.
I wanted to share some of our initial work on customising safety policies in gpt-oss-safeguard.
Our goal is to take the guesswork out of policy definitions and make it easy to update an initial policy with additional conditions, definitions or examples through a human-in-the-loop approach. We start by distilling the initial policy into a set of rules (using the algorithm from https://arxiv.org/pdf/2510.08120) and then use these rules to update the policy. Our first results are encouraging and show an increase in test set accuracy.
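To make the loop above a bit more concrete, here is a minimal sketch, for illustration only. It is not the notebook's code and not the distillation algorithm from the paper: it assumes the rules have already been distilled, and `Rule`, `update_policy`, `accuracy`, and the `approve` and `classify` callables are all hypothetical names. `classify` stands in for whatever call sends a policy and a text to gpt-oss-safeguard and returns a label.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Sequence, Tuple


@dataclass(frozen=True)
class Rule:
    """Hypothetical container for one distilled rule (not the notebook's type)."""
    condition: str  # plain-language description of when the rule applies
    label: str      # label the rule assigns, e.g. "unsafe"


def update_policy(policy: str, rules: Iterable[Rule],
                  approve: Callable[[Rule], bool]) -> str:
    """Append the rules a human reviewer approves to the policy text.

    `approve` models the human-in-the-loop step: it returns True for
    rules the reviewer wants to add to the policy.
    """
    accepted = [r for r in rules if approve(r)]
    if not accepted:
        return policy  # nothing approved, policy stays unchanged
    lines = [policy.rstrip(), "", "Additional conditions:"]
    lines += [f"- If {r.condition}, label the content as {r.label}."
              for r in accepted]
    return "\n".join(lines)


def accuracy(classify: Callable[[str, str], str], policy: str,
             examples: Sequence[Tuple[str, str]]) -> float:
    """Fraction of (text, gold_label) pairs that classify(policy, text) gets right.

    `classify` is an assumed stand-in for a model call, not a real API.
    """
    if not examples:
        return 0.0
    correct = sum(classify(policy, text) == gold for text, gold in examples)
    return correct / len(examples)
```

Calling `accuracy` once with the initial policy and once with the output of `update_policy`, on the same held-out examples, gives the kind of before/after comparison mentioned above.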
Check out our notebook for more info: https://github.com/IBM/ai-atlas-nexus-demos/blob/main/risk-policy-distillation/examples/notebooks/customizing_gpt_policy.ipynb
Let me know if you have any thoughts or questions, I’m happy to discuss!