Notes Experiments with GPT-OSS-Safeguard: Replicating on ToxicChat #9
Replies: 3 comments
This is great, looking forward to your results!
With some feedback from @julietshen, I investigated whether I am using the Harmony tokens properly. I used the Hugging Face Inference API, so I assumed it would be handled correctly. However, I get different outputs from the HF Inference API and an HF text-generation pipeline; here is a small script that reproduces the difference. I also get different labels. It's unclear whether this is a bug in HF or in Groq, or whether something in the prompt differs (e.g., the default handling of reasoning effort). Some other notes on this are in my "Dev Notes 04".
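One way I've been sanity-checking this is to render the Harmony prompt by hand and compare it against what each backend actually sends. The sketch below uses the special tokens from the published Harmony format (`<|start|>`, `<|message|>`, `<|end|>`); the exact system-message layout and the default reasoning effort are assumptions here, and backends that pick different defaults would produce different prompts (and so different outputs):

```python
# Minimal sketch of rendering a Harmony-format prompt by hand, to diff
# against a backend's rendered prompt. Token names follow the published
# Harmony format; the system-message layout and the "medium" default
# reasoning effort are assumptions, not verified backend behavior.

def render_harmony(system: str, user: str, reasoning: str = "medium") -> str:
    """Render a single-turn Harmony prompt string."""
    sys_block = f"{system}\nReasoning: {reasoning}"
    return (
        f"<|start|>system<|message|>{sys_block}<|end|>"
        f"<|start|>user<|message|>{user}<|end|>"
        f"<|start|>assistant"
    )

policy = "Classify the message as TOXIC or NON-TOXIC per the policy below."
low = render_harmony(policy, "hello", reasoning="low")
high = render_harmony(policy, "hello", reasoning="high")

# If two backends silently apply different default reasoning efforts,
# the rendered prompts diverge even with identical inputs:
print(low != high)  # True
```

Diffing a string like this against `tokenizer.apply_chat_template(..., tokenize=False)` output is a quick way to spot whether the discrepancy is in the prompt or in the generation step.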
Noting here that we have flagged this feedback to the OpenAI and Hugging Face teams!
I'm experimenting with GPT-OSS-Safeguard, hoping to understand how this kind of safety-classification training changes the model relative to the base model. I'm also curious about factors that influence classification (e.g., the natural language of the input).
The OpenAI authors evaluate on datasets like ToxicChat but don't provide the prompt they used. I have tried three prompts ("policies") so far.
Currently, a simple prompt outperforms a more complex prompt written by Claude. However, more iteration is needed to match the authors' reported results. Some notes on these findings are here and WIP code is here.
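For comparing policies I score each run the same way: binary precision/recall/F1 over the model's labels against the ToxicChat gold labels. A minimal sketch of that scoring step, assuming placeholder label names (the paper's exact metric may differ):

```python
# Sketch of the scoring step used to compare policy prompts on ToxicChat:
# binary precision/recall/F1 from gold vs. predicted labels.
# "TOXIC"/"NON-TOXIC" are placeholder label names, not the dataset's schema.

def binary_f1(gold, pred, positive="TOXIC"):
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = ["TOXIC", "NON-TOXIC", "TOXIC", "NON-TOXIC"]
pred = ["TOXIC", "TOXIC", "NON-TOXIC", "NON-TOXIC"]
p, r, f = binary_f1(gold, pred)
print(p, r, f)  # 0.5 0.5 0.5
```

Keeping the metric fixed like this makes the simple-vs-complex policy comparison apples-to-apples across prompt iterations.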
I'm working on this over the next few days, and will share updates in this thread. I'll post the final article in a separate thread.