Notes Experiments with GPT-OSS-Safeguard: Replicating on ToxicChat #9
Replies: 3 comments
This is great, looking forward to your results!
With some feedback from @julietshen, I investigated whether I am using the Harmony tokens properly. I used the Hugging Face Inference API, so I assumed it would be handled correctly. However, I get different outputs from the HF Inference API and an HF text-generation pipeline; here is a small script that reproduces the difference. I also get different labels. It's unclear whether this is a bug in HF or in Groq, or whether something in the prompt differs (e.g., the default handling of reasoning effort). Some other notes on this are in my "Dev Notes 04".
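One way I've been sanity-checking this is to render the Harmony prompt by hand and compare it against what each backend actually sends. The sketch below uses the special tokens from the published Harmony format (`<|start|>`, `<|message|>`, `<|end|>`); the exact system-message layout and the default reasoning effort are assumptions here, and backends that pick different defaults would produce different prompts (and so different outputs):

```python
# Minimal sketch of rendering a Harmony-format prompt by hand, to diff
# against a backend's rendered prompt. Token names follow the published
# Harmony format; the system-message layout and the "medium" default
# reasoning effort are assumptions, not verified backend behavior.

def render_harmony(system: str, user: str, reasoning: str = "medium") -> str:
    """Render a single-turn Harmony prompt string."""
    sys_block = f"{system}\nReasoning: {reasoning}"
    return (
        f"<|start|>system<|message|>{sys_block}<|end|>"
        f"<|start|>user<|message|>{user}<|end|>"
        f"<|start|>assistant"
    )

policy = "Classify the message as TOXIC or NON-TOXIC per the policy below."
low = render_harmony(policy, "hello", reasoning="low")
high = render_harmony(policy, "hello", reasoning="high")

# If two backends silently apply different default reasoning efforts,
# the rendered prompts diverge even with identical inputs:
print(low != high)  # True
```

Diffing a string like this against `tokenizer.apply_chat_template(..., tokenize=False)` output is a quick way to spot whether the discrepancy is in the prompt or in the generation step.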
Noting here that we have flagged this feedback to the OpenAI and Hugging Face teams!
I'm experimenting with GPT-OSS-Safeguard, hoping to understand how this kind of safety-classification training changes the model relative to the base model. I'm also curious about factors that influence classification (e.g., the natural language of the input).
The OpenAI authors evaluate on datasets like ToxicChat but don't provide the prompt they used. I have tried three prompts ("policies") so far.
Currently, a simple prompt outperforms a more complex prompt written by Claude. However, more iteration is needed to match the authors' reported results. Some notes on these findings are here and WIP code is here.
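For comparing policies I score each run the same way: binary precision/recall/F1 over the model's labels against the ToxicChat gold labels. A minimal sketch of that scoring step, assuming placeholder label names (the paper's exact metric may differ):

```python
# Sketch of the scoring step used to compare policy prompts on ToxicChat:
# binary precision/recall/F1 from gold vs. predicted labels.
# "TOXIC"/"NON-TOXIC" are placeholder label names, not the dataset's schema.

def binary_f1(gold, pred, positive="TOXIC"):
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = ["TOXIC", "NON-TOXIC", "TOXIC", "NON-TOXIC"]
pred = ["TOXIC", "TOXIC", "NON-TOXIC", "NON-TOXIC"]
p, r, f = binary_f1(gold, pred)
print(p, r, f)  # 0.5 0.5 0.5
```

Keeping the metric fixed like this makes the simple-vs-complex policy comparison apples-to-apples across prompt iterations.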
I'm working on this over the next few days, and will share updates in this thread. I'll post the final article in a separate thread.