Fail to parse gold solution #503

Open
Lynnzake opened this issue Mar 12, 2025 · 10 comments
@Lynnzake

  • latex2sympy2_extended: 0.7.0
  • python: 3.10
def accuracy_reward(completions, solution, **kwargs):
    """Reward function that checks if the completion is the same as the ground truth."""
    # ... exactly the same as the repo ...
        else:
            # If the gold solution is not parseable, we reward 1 to skip this example
            reward = 1.0
            print("Failed to parse gold solution: ", sol)
        rewards.append(reward)

    return rewards

When I train Qwen-0.5B-instruct with GRPO on the NuminaMath-TIR dataset, stdout keeps printing "Failed to parse gold solution". Could my latex2sympy2_extended version be the problem?
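For reference, the elided middle of the function follows the repo's parse-and-verify pattern. A minimal sketch, assuming math_verify's parse, verify, and LatexExtractionConfig (check your installed version for the exact signatures):

from math_verify import LatexExtractionConfig, parse, verify

def accuracy_reward(completions, solution, **kwargs):
    """Reward 1.0 if the completion matches the gold solution, else 0.0."""
    contents = [completion[0]["content"] for completion in completions]
    rewards = []
    for content, sol in zip(contents, solution):
        # Try to parse the gold solution from its LaTeX form
        gold_parsed = parse(sol, extraction_mode="first_match",
                            extraction_config=[LatexExtractionConfig()])
        if len(gold_parsed) != 0:
            # Parse the model answer and check equivalence with the gold solution
            answer_parsed = parse(content, extraction_mode="first_match",
                                  extraction_config=[LatexExtractionConfig()])
            reward = float(verify(gold_parsed, answer_parsed))
        else:
            # If the gold solution is not parseable, we reward 1 to skip this example
            reward = 1.0
            print("Failed to parse gold solution: ", sol)
        rewards.append(reward)
    return rewards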

@Jcorners

I checked the GRPOTrainer in TRL, and for the accuracy reward function it only passes in the prompt and the answer generated by the model, while reward_kwargs passes in the other columns of each example, such as answer, is_reasoning_complete, and source in the OpenR1-Math-220k dataset. However, the reward function of open-r1 only uses the prompt and answer, so it cannot obtain the gold solution. I think the reward function needs to be rewritten to get the gold solution from reward_kwargs.

@Lynnzake
Author

Lynnzake commented Mar 13, 2025


I have read the GRPOTrainer today; the relevant reward-passing code is below:

# Every column other than "prompt" and "completion" is forwarded to the reward function
keys = [key for key in inputs[0] if key not in ["prompt", "completion"]]
reward_kwargs = {key: [example[key] for example in inputs] for key in keys}
output_reward_func = reward_func(prompts=prompts, completions=completions, **reward_kwargs)

It seems like the solution column of NuminaMath-TIR should be passed to the reward func correctly.
If I change the accuracy_reward func to:

def accuracy_reward(completions, **kwargs):
    # Extract responses
    contents = [completion[0]["content"] for completion in completions]
    rewards = []
    solutions = kwargs.get("solution")

But I still get None for solutions. Why?
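A quick way to see which columns actually reach the reward function is to log the kwargs keys inside it (a debugging sketch, hypothetical, not part of the repo code):

def accuracy_reward(completions, **kwargs):
    # Log which extra dataset columns the trainer forwarded via reward_kwargs
    print("reward kwargs keys:", sorted(kwargs.keys()))
    solutions = kwargs.get("solution")
    print("first solution:", None if solutions is None else solutions[0])
    # ... rest of the reward logic unchanged ...
    return [0.0] * len(completions)  # placeholder so the sketch runs on its own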

PS: The dataset format is:

{'train': Dataset({
    features: ['problem', 'solution', 'prompt'],
    num_rows: 72441
}), 'test': Dataset({
    features: ['problem', 'solution', 'prompt'],
    num_rows: 99
})}

@TimeLovercc

The current implementation for non-conversational data is wrong. I mean the reward model part.

@qgallouedec
Member

Why @TimeLovercc?

@TimeLovercc

@qgallouedec

The current accuracy_reward begins with

def accuracy_reward(completions, solution, **kwargs):
    """Reward function that checks if the completion is the same as the ground truth."""
    contents = [completion[0]["content"] for completion in completions]

However, in the current GRPOTrainer, for non-conversational data, completions is just completions_text (code here). So when we feed the prompts and completions to reward_func (code here), there are several problems:

  1. reward_func(prompts=prompts, completions=completions, ...) does not match the signature def accuracy_reward(completions, solution, ...).
  2. The input completions are plain text, but accuracy_reward expects a conversational structure (a list of {"role": ..., "content": ...} messages).
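One defensive pattern is to normalize completions before comparing, so the same reward function works for both plain-text and conversational outputs (a minimal sketch, not the repo's actual implementation):

def _completion_to_text(completion):
    """Return the assistant text whether the completion is plain text or a message list."""
    if isinstance(completion, str):
        return completion  # non-conversational: already plain text
    # conversational: a list of {"role": ..., "content": ...} messages
    return completion[0]["content"]

def accuracy_reward(completions, solution, **kwargs):
    contents = [_completion_to_text(c) for c in completions]
    rewards = []
    for content, sol in zip(contents, solution):
        # Placeholder comparison; the real function parses and verifies LaTeX answers
        rewards.append(1.0 if content.strip() == sol.strip() else 0.0)
    return rewards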

@Lynnzake
Author

@TimeLovercc
Before feeding the data into the model, I wrapped the dataset with the following code:

# Function to structure the training data
def make_conversation(example):
    """Convert dataset examples into conversation format."""
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["problem"]},
        ],
    }

So you can actually make is_conversational() return True, which results in structured completions.
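For example, with a Hugging Face datasets dataset (assuming dataset and SYSTEM_PROMPT are already defined elsewhere):

# Apply the wrapper so the trainer sees conversational prompts
dataset = dataset.map(make_conversation)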

@TimeLovercc

@Lynnzake
Sure, thank you for this. I am using a different dataset that is not suitable for the conversational style, so in this case I hit the two problems mentioned above.

@saidineshpola

saidineshpola commented Mar 16, 2025

I found out that the GRPO reward kwargs contain only the answer while solution is an empty string, so you can access the answer directly and drop the gold-solution parsing (the empty solution is what fails to parse). The PR with the modified accuracy_reward function can be found here

{
    "completions": [
        [
            {
                "role": "assistant",
                "content": "<think>\nAlright, so I have this problem here where I need to find the smallest possible value...[content truncated]...</think>\n\nThe smallest possible value of \\( \\|\\mathbf{v}\\| \\) is \\( 10 - 2\\sqrt{5} \\).\n\n**Answer:** \\( \\boxed{10 - 2\\sqrt{5}} \\)"
            }
        ],
        [
            {
                "role": "assistant",
                "content": "<think>\nAlright, so I've got this problem here...[content truncated]...</think>\n\nTo find the smallest possible value of ||**v**||, we recognize that...[content truncated]...\n\n**Answer:** 10 - 2√5"
            }
        ]
    ],
    "kwargs": {
        "prompts": [
            [
                {
                    "content": "You are a helpful AI Assistant...",
                    "role": "system"
                },
                {
                    "content": "Let $\\mathbf{v}$ be a vector such that...",
                    "role": "user"
                }
            ],
            [
                {
                    "content": "You are a helpful AI Assistant...",
                    "role": "system"
                },
                {
                    "content": "Let $\\mathbf{v}$ be a vector such that...",
                    "role": "user"
                }
            ]
        ],
        "problem": [
            "Let $\\mathbf{v}$ be a vector such that...",
            "Let $\\mathbf{v}$ be a vector such that..."
        ],
        "answer": [
            "10 - 2\\sqrt{5}",
            "10 - 2\\sqrt{5}"
        ],
        "solution": [
            "",
            ""
        ]
    }
}
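A minimal sketch of that approach, reading answer from the reward kwargs instead of parsing solution (field names as in the dump above; the actual PR may differ):

from math_verify import parse, verify

def accuracy_reward(completions, **kwargs):
    """Compare each completion against the `answer` column instead of `solution`."""
    contents = [completion[0]["content"] for completion in completions]
    answers = kwargs["answer"]  # provided by the trainer via reward_kwargs
    rewards = []
    for content, answer in zip(contents, answers):
        gold_parsed = parse(answer)
        answer_parsed = parse(content)
        rewards.append(float(verify(gold_parsed, answer_parsed)))
    return rewards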

@FareedKhan-dev

FareedKhan-dev commented Mar 21, 2025

@Lynnzake By default, the gold solution and other unused columns get removed because remove_unused_columns is set to True, as shown in the GRPO Trainer script.

To prevent this, set remove_unused_columns to False in the training arguments for the GRPO trainer. Then the accuracy reward function will work correctly, as shown in the scripts.

For example, if you are using Qwen2.5-Math-7B, navigate to its recipe and add a new training parameter in the configuration file:
Qwen2.5-Math-7B GRPO Config

...
per_device_eval_batch_size: 16
per_device_train_batch_size: 16
push_to_hub: true
...
remove_unused_columns: False
...
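Equivalently, if you build the config in Python rather than via a recipe YAML, the same flag can be passed to TRL's GRPOConfig (a sketch; other arguments omitted, output path hypothetical):

from trl import GRPOConfig

# remove_unused_columns=False keeps extra dataset columns (e.g. "solution")
# so they reach the reward functions via reward_kwargs
training_args = GRPOConfig(
    output_dir="Qwen2.5-Math-7B-GRPO",
    per_device_train_batch_size=16,
    remove_unused_columns=False,
)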

@Lynnzake
Author

@FareedKhan-dev Thanks for the answer, it helps a lot.
