Fail to parse gold solution #503
I checked the grpo_trainer in TRL: for the accuracy reward function, it only passes in the prompt and the answer generated by the model. At the same time, `reward_kwargs` passes in the other columns of the example, such as `answer`, `is_reasoning_complete`, and `source` in the OpenR1-Math-220k dataset. However, open-r1's reward function only uses the prompt and the completion, so it cannot obtain the gold solution. I think the reward function needs to be rewritten to get the gold solution from `reward_kwargs`.
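A minimal sketch of such a rewrite, assuming the extra dataset columns actually survive collation and arrive through `**kwargs` (the column names `answer` and `solution` are taken from the datasets discussed in this thread; this is not open-r1's actual implementation):

```python
def accuracy_reward(completions, **kwargs):
    """Hypothetical rewrite: read the gold answer from reward_kwargs."""
    contents = [completion[0]["content"] for completion in completions]
    # The trainer forwards every extra dataset column per example, so the
    # gold data should arrive here under its column name.
    golds = kwargs.get("answer") or kwargs.get("solution") or []
    rewards = []
    for content, gold in zip(contents, golds):
        # Naive containment check as a placeholder; open-r1 actually uses
        # math-verify to compare parsed LaTeX expressions.
        rewards.append(1.0 if gold and gold in content else 0.0)
    return rewards
```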
I read the GRPOTrainer today; the relevant reward setup is:

```python
keys = [key for key in inputs[0] if key not in ["prompt", "completion"]]
reward_kwargs = {key: [example[key] for example in inputs] for key in keys}
output_reward_func = reward_func(prompts=prompts, completions=completions, **reward_kwargs)
```

It seems like the reward function should then be able to read the solution:

```python
def accuracy_reward(completions, **kwargs):
    # Extract responses
    contents = [completion[0]["content"] for completion in completions]
    rewards = []
    solutions = kwargs.get("solution")
```

But I still get `None` for `solutions` — why?

P.S. The dataset format is:

```python
{'train': Dataset({
    features: ['problem', 'solution', 'prompt'],
    num_rows: 72441
}), 'test': Dataset({
    features: ['problem', 'solution', 'prompt'],
    num_rows: 99
})}
```
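One way to narrow this down is to log which columns actually reach the reward call. This is a hypothetical debugging helper, not part of open-r1:

```python
def debug_reward(completions, **kwargs):
    """Hypothetical helper: log which dataset columns survive to the reward call."""
    print("kwargs keys:", sorted(kwargs.keys()))  # expect e.g. ['problem', 'solution']
    # If 'solution' is missing here, the column was dropped before batching,
    # typically by remove_unused_columns=True (discussed later in this thread).
    return [0.0] * len(completions)
```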
The current implementation for non-conversational data is wrong. I mean the reward model part.
Why, @TimeLovercc?
The current accuracy_reward begins with:

```python
def accuracy_reward(completions, solution, **kwargs):
    """Reward function that checks if the completion is the same as the ground truth."""
    contents = [completion[0]["content"] for completion in completions]
```

However, in the current GRPOTrainer, for non-conversational data, `completions` is just `completions_text` (code here). So when we feed the prompts and completions to `reward_func` (code here), there will be many problems.
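A reward function that tolerates both shapes might look like this sketch (assumption: non-conversational completions arrive as plain decoded strings, per the comment above):

```python
def extract_content(completion):
    # Conversational data: a list of {"role": ..., "content": ...} messages.
    if isinstance(completion, list):
        return completion[0]["content"]
    # Non-conversational data: the trainer passes the decoded string directly.
    return completion

def accuracy_reward(completions, solution, **kwargs):
    contents = [extract_content(c) for c in completions]
    # Placeholder comparison; a real implementation would parse and verify math.
    return [1.0 if sol in content else 0.0 for content, sol in zip(contents, solution)]
```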
@TimeLovercc

```python
# Function to structure the training data
def make_conversation(example):
    """Convert dataset examples into conversation format."""
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["problem"]},
        ],
    }
```

So actually you can make the data conversational with this mapping; a usage sketch follows.
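For context, a typical way to apply this mapping before training (the dataset ID and system prompt are assumptions drawn from this thread, not verified against the original script):

```python
from datasets import load_dataset

SYSTEM_PROMPT = "You are a helpful AI Assistant..."  # placeholder, per the thread

dataset = load_dataset("AI-MO/NuminaMath-TIR")  # dataset ID assumed from this thread
# map() adds the 'prompt' column while keeping 'problem' and 'solution',
# so the gold solution can still reach the reward function (unless it is
# dropped later by remove_unused_columns).
dataset = dataset.map(make_conversation)
```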
@Lynnzake
I found out that the GRPO reward call receives only `answer`, while `solution` is an empty string, so you can access the answer directly instead of parsing the empty solution. The PR with the modified accuracy_reward function can be found here.

```json
{
  "completions": [
    [
      {
        "role": "assistant",
        "content": "<think>\nAlright, so I have this problem here where I need to find the smallest possible value...[content truncated]...</think>\n\nThe smallest possible value of \\( \\|\\mathbf{v}\\| \\) is \\( 10 - 2\\sqrt{5} \\).\n\n**Answer:** \\( \\boxed{10 - 2\\sqrt{5}} \\)"
      }
    ],
    [
      {
        "role": "assistant",
        "content": "<think>\nAlright, so I've got this problem here...[content truncated]...</think>\n\nTo find the smallest possible value of ||**v**||, we recognize that...[content truncated]...\n\n**Answer:** 10 - 2√5"
      }
    ]
  ],
  "kwargs": {
    "prompts": [
      [
        {
          "content": "You are a helpful AI Assistant...",
          "role": "system"
        },
        {
          "content": "Let $\\mathbf{v}$ be a vector such that...",
          "role": "user"
        }
      ],
      [
        {
          "content": "You are a helpful AI Assistant...",
          "role": "system"
        },
        {
          "content": "Let $\\mathbf{v}$ be a vector such that...",
          "role": "user"
        }
      ]
    ],
    "problem": [
      "Let $\\mathbf{v}$ be a vector such that...",
      "Let $\\mathbf{v}$ be a vector such that..."
    ],
    "answer": [
      "10 - 2\\sqrt{5}",
      "10 - 2\\sqrt{5}"
    ],
    "solution": [
      "",
      ""
    ]
  }
}
```
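A sketch of the fallback described above (illustrative only, not the exact code from the linked PR):

```python
def accuracy_reward(completions, **kwargs):
    contents = [completion[0]["content"] for completion in completions]
    solutions = kwargs.get("solution") or []
    answers = kwargs.get("answer") or []
    rewards = []
    for content, sol, ans in zip(contents, solutions, answers):
        # Fall back to the 'answer' column when 'solution' is an empty string.
        gold = sol if sol else ans
        rewards.append(1.0 if gold and gold in content else 0.0)
    return rewards
```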
@Lynnzake By default, the gold solution and any other extra columns get removed because `remove_unused_columns` defaults to `True`. To prevent this, you need to set `remove_unused_columns: False`. For example, if you are using a YAML config:

```yaml
...
per_device_eval_batch_size: 16
per_device_train_batch_size: 16
push_to_hub: true
...
remove_unused_columns: False
...
```
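The equivalent when configuring the trainer directly in Python might look like this sketch (the `output_dir` and batch size are placeholders; `remove_unused_columns` is inherited from `transformers.TrainingArguments`):

```python
from trl import GRPOConfig

# Setting remove_unused_columns=False keeps extra dataset columns
# (e.g. 'solution') so they reach the reward function via **reward_kwargs.
training_args = GRPOConfig(
    output_dir="qwen-grpo",  # hypothetical output path
    per_device_train_batch_size=16,
    remove_unused_columns=False,
)
```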
@FareedKhan-dev Thanks for the answer, it helps a lot.
When I train Qwen-0.5B-Instruct with GRPO using the NuminaMath-TIR dataset, stdout keeps printing `Failed to parse gold solution`. Did my version of latex2sympy2_extended get a recent update?
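One way to check whether the gold solutions themselves are parseable with your installed versions is a standalone test against math-verify, which is what produces this message in open-r1's reward; treat the exact call as an assumption about your installed version:

```python
from math_verify import LatexExtractionConfig, parse

# A sample gold solution string; replace with one from your dataset.
sol = r"The answer is $10 - 2\sqrt{5}$."
gold_parsed = parse(
    sol,
    extraction_mode="first_match",
    extraction_config=[LatexExtractionConfig()],
)
if len(gold_parsed) == 0:
    # This empty-parse condition is what triggers the message in the reward.
    print("Failed to parse gold solution")
else:
    print("Parsed:", gold_parsed)
```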