Fail to parse gold solution #503

Open
Lynnzake opened this issue Mar 12, 2025 · 10 comments
@Lynnzake

  • latex2sympy2_extended: 0.7.0
  • python: 3.10
def accuracy_reward(completions, solution, **kwargs):
    """Reward function that checks if the completion is the same as the ground truth."""
    # ... exactly the same as the repo ...
        else:
            # If the gold solution is not parseable, we reward 1 to skip this example
            reward = 1.0
            print("Failed to parse gold solution: ", sol)
        rewards.append(reward)

    return rewards

When I train Qwen-0.5B-instruct with GRPO on the NuminaMath-TIR dataset, stdout keeps printing "Failed to parse gold solution". Could my latex2sympy2_extended version be the problem?
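For reference, the elided middle of the function follows the repo's parse-and-verify pattern. A minimal sketch, assuming math_verify's parse, verify, and LatexExtractionConfig (check your installed version for the exact signatures):

from math_verify import LatexExtractionConfig, parse, verify

def accuracy_reward(completions, solution, **kwargs):
    """Reward 1.0 if the completion matches the gold solution, else 0.0."""
    contents = [completion[0]["content"] for completion in completions]
    rewards = []
    for content, sol in zip(contents, solution):
        # Try to parse the gold solution from its LaTeX form
        gold_parsed = parse(sol, extraction_mode="first_match",
                            extraction_config=[LatexExtractionConfig()])
        if len(gold_parsed) != 0:
            # Parse the model answer and check equivalence with the gold solution
            answer_parsed = parse(content, extraction_mode="first_match",
                                  extraction_config=[LatexExtractionConfig()])
            reward = float(verify(gold_parsed, answer_parsed))
        else:
            # If the gold solution is not parseable, we reward 1 to skip this example
            reward = 1.0
            print("Failed to parse gold solution: ", sol)
        rewards.append(reward)
    return rewards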

@Jcorners

I checked the GRPOTrainer in TRL, and for the accuracy reward function it only passes in the prompt and the answer generated by the model, while reward_kwargs passes in the other columns of each example, such as answer, is_reasoning_complete, and source in the OpenR1-Math-220k dataset. However, the reward function of open-r1 only uses the prompt and answer, so it cannot obtain the gold solution. I think the reward function needs to be rewritten to get the gold solution from reward_kwargs.

@Lynnzake
Author

Lynnzake commented Mar 13, 2025


I have read the GRPOTrainer today; the relevant reward-passing code is below:

# Every column other than "prompt" and "completion" is forwarded to the reward function
keys = [key for key in inputs[0] if key not in ["prompt", "completion"]]
reward_kwargs = {key: [example[key] for example in inputs] for key in keys}
output_reward_func = reward_func(prompts=prompts, completions=completions, **reward_kwargs)

It seems like the solution column of NuminaMath-TIR should be passed to the reward func correctly.
If I change the accuracy_reward func to:

def accuracy_reward(completions, **kwargs):
    # Extract responses
    contents = [completion[0]["content"] for completion in completions]
    rewards = []
    solutions = kwargs.get("solution")

But I still get None for solutions. Why?
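A quick way to see which columns actually reach the reward function is to log the kwargs keys inside it (a debugging sketch, hypothetical, not part of the repo code):

def accuracy_reward(completions, **kwargs):
    # Log which extra dataset columns the trainer forwarded via reward_kwargs
    print("reward kwargs keys:", sorted(kwargs.keys()))
    solutions = kwargs.get("solution")
    print("first solution:", None if solutions is None else solutions[0])
    # ... rest of the reward logic unchanged ...
    return [0.0] * len(completions)  # placeholder so the sketch runs on its own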

PS: The dataset format is:

{'train': Dataset({
    features: ['problem', 'solution', 'prompt'],
    num_rows: 72441
}), 'test': Dataset({
    features: ['problem', 'solution', 'prompt'],
    num_rows: 99
})}

@TimeLovercc

The current implementation for non-conversational data is wrong. I mean the reward model part.

@qgallouedec
Member

Why @TimeLovercc?

@TimeLovercc

@qgallouedec

The current accuracy_reward begins with

def accuracy_reward(completions, solution, **kwargs):
    """Reward function that checks if the completion is the same as the ground truth."""
    contents = [completion[0]["content"] for completion in completions]

However, in the current GRPOTrainer, for non-conversational data, completions is just completions_text (code here). So when we feed the prompts and completions to reward_func (code here), there are several problems:

  1. reward_func(prompts=prompts, completions=completions, ...) does not match the signature def accuracy_reward(completions, solution, ...).
  2. The input completions are plain text, but accuracy_reward expects a conversational structure (a list of {"role": ..., "content": ...} messages).
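One defensive pattern is to normalize completions before comparing, so the same reward function works for both plain-text and conversational outputs (a minimal sketch, not the repo's actual implementation):

def _completion_to_text(completion):
    """Return the assistant text whether the completion is plain text or a message list."""
    if isinstance(completion, str):
        return completion  # non-conversational: already plain text
    # conversational: a list of {"role": ..., "content": ...} messages
    return completion[0]["content"]

def accuracy_reward(completions, solution, **kwargs):
    contents = [_completion_to_text(c) for c in completions]
    rewards = []
    for content, sol in zip(contents, solution):
        # Placeholder comparison; the real function parses and verifies LaTeX answers
        rewards.append(1.0 if content.strip() == sol.strip() else 0.0)
    return rewards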

@Lynnzake
Author

@TimeLovercc
Before feeding the data into the model, I wrapped the dataset with the following code:

# Function to structure the training data
def make_conversation(example):
    """Convert dataset examples into conversation format."""
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["problem"]},
        ],
    }

So you can actually make is_conversational() return True, which results in structured completions.
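For example, with a Hugging Face datasets dataset (assuming dataset and SYSTEM_PROMPT are already defined elsewhere):

# Apply the wrapper so the trainer sees conversational prompts
dataset = dataset.map(make_conversation)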

@TimeLovercc

@Lynnzake
Sure, thank you for this. I am using a different dataset that is not suitable for the conversational style, so in this case I hit the two problems mentioned above.

@saidineshpola

saidineshpola commented Mar 16, 2025

I found out that the GRPO reward kwargs contain only the answer while solution is an empty string, so you can access the answer directly and drop the gold-solution parsing (the empty solution is what fails to parse). The PR with the modified accuracy_reward function can be found here

{
    "completions": [
        [
            {
                "role": "assistant",
                "content": "<think>\nAlright, so I have this problem here where I need to find the smallest possible value...[content truncated]...</think>\n\nThe smallest possible value of \\( \\|\\mathbf{v}\\| \\) is \\( 10 - 2\\sqrt{5} \\).\n\n**Answer:** \\( \\boxed{10 - 2\\sqrt{5}} \\)"
            }
        ],
        [
            {
                "role": "assistant",
                "content": "<think>\nAlright, so I've got this problem here...[content truncated]...</think>\n\nTo find the smallest possible value of ||**v**||, we recognize that...[content truncated]...\n\n**Answer:** 10 - 2√5"
            }
        ]
    ],
    "kwargs": {
        "prompts": [
            [
                {
                    "content": "You are a helpful AI Assistant...",
                    "role": "system"
                },
                {
                    "content": "Let $\\mathbf{v}$ be a vector such that...",
                    "role": "user"
                }
            ],
            [
                {
                    "content": "You are a helpful AI Assistant...",
                    "role": "system"
                },
                {
                    "content": "Let $\\mathbf{v}$ be a vector such that...",
                    "role": "user"
                }
            ]
        ],
        "problem": [
            "Let $\\mathbf{v}$ be a vector such that...",
            "Let $\\mathbf{v}$ be a vector such that..."
        ],
        "answer": [
            "10 - 2\\sqrt{5}",
            "10 - 2\\sqrt{5}"
        ],
        "solution": [
            "",
            ""
        ]
    }
}
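A minimal sketch of that approach, reading answer from the reward kwargs instead of parsing solution (field names as in the dump above; the actual PR may differ):

from math_verify import parse, verify

def accuracy_reward(completions, **kwargs):
    """Compare each completion against the `answer` column instead of `solution`."""
    contents = [completion[0]["content"] for completion in completions]
    answers = kwargs["answer"]  # provided by the trainer via reward_kwargs
    rewards = []
    for content, answer in zip(contents, answers):
        gold_parsed = parse(answer)
        answer_parsed = parse(content)
        rewards.append(float(verify(gold_parsed, answer_parsed)))
    return rewards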

@FareedKhan-dev

FareedKhan-dev commented Mar 21, 2025

@Lynnzake By default, the gold solution and other unused columns get removed because remove_unused_columns is set to True, as shown in the GRPO Trainer script.

To prevent this, set remove_unused_columns to False in the training arguments for the GRPO trainer. Then the accuracy reward function will work correctly, as shown in the scripts.

For example, if you are using Qwen2.5-Math-7B, navigate to its recipe and add a new training parameter in the configuration file:
Qwen2.5-Math-7B GRPO Config

...
per_device_eval_batch_size: 16
per_device_train_batch_size: 16
push_to_hub: true
...
remove_unused_columns: False
...
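Equivalently, if you build the config in Python rather than via a recipe YAML, the same flag can be passed to TRL's GRPOConfig (a sketch; other arguments omitted, output path hypothetical):

from trl import GRPOConfig

# remove_unused_columns=False keeps extra dataset columns (e.g. "solution")
# so they reach the reward functions via reward_kwargs
training_args = GRPOConfig(
    output_dir="Qwen2.5-Math-7B-GRPO",
    per_device_train_batch_size=16,
    remove_unused_columns=False,
)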

@Lynnzake
Author

@FareedKhan-dev Thanks for the answer, it helps a lot.
