Customising safety policies of gpt-oss-safeguard #47

jas97 · 2026-01-23T14:24:44Z

Hi everyone, my name is Jasmina Gajcin and I’m a researcher with IBM Research.
I wanted to share some of our initial work on customising safety policies in gpt-oss-safeguard.

Our goal is to take the guesswork out of policy definitions and make it easy to update an initial policy with additional conditions, definitions or examples through a human-in-the-loop approach. We start by distilling the initial policy into a set of rules (using the algorithm from https://arxiv.org/pdf/2510.08120) and use these rules to update the policy. We have some encouraging first results, showing an increase in test set accuracy.

Check out our notebook for more info: https://github.com/IBM/ai-atlas-nexus-demos/blob/main/risk-policy-distillation/examples/notebooks/customizing_gpt_policy.ipynb

Let me know if you have any thoughts or questions. I’m happy to discuss!

andrewmchang · 2026-01-23T19:17:58Z

andrewmchang
Jan 23, 2026
Maintainer

Really cool work, @jas97!! Two quick questions:

1. Did you run any comparisons between the policy distillation algorithm's reasoning and gpt-oss-safeguard's reasoning (which led to an incorrect decision)? I'd be curious if there were any notable gaps in safeguard's reasoning approach.
2. Can you elaborate on why you decided to flip the rules initially extracted by the policy distillation algorithm? Any major changes in performance vs. using the first set of rules?

1 reply

jas97 Jan 26, 2026
Author

Hi @andrewmchang, thanks!

1. We only use policy distillation to interpret the decisions of gpt-oss-safeguard and update the policy. We didn't use the extracted rules as a classifier. The incorrect decisions of gpt-oss-safeguard seem to stem from imprecise harm definitions in the initial policy. In particular, it seems to struggle with nuanced cases where the prompt asks about a harmful topic but doesn't endorse it. We can see that in the first rule extracted from the misclassified instances: "harmful IF involves illegal activity DESPITE seeks information or discussion".
2. The rules are extracted from safeguard's reasoning on the instances it misclassified. That is, the rules represent reasoning that is incorrect with respect to the ground-truth labels provided by the dataset. To correct this reasoning, we flip the rules and then include them in the policy. For example, the rule above, "harmful IF involves illegal activity DESPITE seeks information or discussion", can be flipped into "harmless IF seeks information or discussion DESPITE involves illegal activity". Ultimately, this process should probably not be fully automated. Instead, users should be able to select which rules they disagree with and want to flip.
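
As a rough illustration of the flipping step, here is a minimal Python sketch. It is not the code from the notebook: the `Rule` class, the regex for the "label IF condition DESPITE exception" format, the helper names (`parse_rule`, `flip_rule`, `update_policy`) and the "Additional rules" section heading are all assumptions made for this example. The only thing taken from the thread is the rule format and the flip itself, which inverts the label and swaps the two clauses.

```python
import re
from dataclasses import dataclass

# Assumed label set: the thread only mentions "harmful" and "harmless".
OPPOSITE = {"harmful": "harmless", "harmless": "harmful"}

# Assumed textual form: "<label> IF <condition> DESPITE <exception>".
RULE_PATTERN = re.compile(
    r"^(?P<label>\w+) IF (?P<condition>.+?) DESPITE (?P<exception>.+)$"
)


@dataclass(frozen=True)
class Rule:
    """Hypothetical container for one extracted rule."""
    label: str
    condition: str
    exception: str

    def __str__(self) -> str:
        return f"{self.label} IF {self.condition} DESPITE {self.exception}"


def parse_rule(text: str) -> Rule:
    """Parse a rule string in the assumed format."""
    match = RULE_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised rule format: {text!r}")
    return Rule(match["label"], match["condition"], match["exception"])


def flip_rule(rule: Rule) -> Rule:
    """Invert the label and swap the condition with the exception."""
    return Rule(OPPOSITE[rule.label], rule.exception, rule.condition)


def update_policy(policy: str, rules: list[Rule], selected: set[int]) -> str:
    """Append flipped versions of the rules a reviewer selected (by index).

    Rules that were not selected are left out, which keeps a human in the loop.
    """
    flipped = [str(flip_rule(r)) for i, r in enumerate(rules) if i in selected]
    if not flipped:
        return policy
    bullet_list = "\n".join(f"- {r}" for r in flipped)
    return policy.rstrip() + "\n\nAdditional rules:\n" + bullet_list


extracted = [
    parse_rule(
        "harmful IF involves illegal activity DESPITE seeks information or discussion"
    )
]
print(flip_rule(extracted[0]))
# prints: harmless IF seeks information or discussion DESPITE involves illegal activity
```

Passing the set of reviewer-approved indices to `update_policy` is one simple way to model the selection step described above; how the flipped rules are best worded inside the policy text is left open here.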

Hope that answers your questions!
