New GRPO dataset and tasks: formally-verified program correctness #379

Open · wants to merge 37 commits into main from feature/htgen-dataset

Conversation

@ocramz (Contributor) commented on Feb 20, 2025

A new dataset and family of tasks for verifying programs and generating correct programs according to a specification [1][2]. Verification is in the "weakest preconditions" deductive sense and is backed by an SMT solver.
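
To make the verification idea concrete, here is a minimal sketch (not the PR's actual backend) of checking a triple via weakest preconditions with the z3 SMT solver, using the program from the TOTALITY_CHECK example prompt further down; for straight-line assignments, the weakest precondition is obtained by backwards substitution:

```python
# Sketch only: weakest-precondition check of a straight-line triple with z3,
# assuming the `z3-solver` package. The HTGen backend may do this differently.
from z3 import Ints, IntVal, Implies, Not, Solver, substitute, unsat

v3, v4, v5 = Ints("v3 v4 v5")

# Program from the TOTALITY_CHECK example, as (target, right-hand side) pairs.
program = [
    (v3, v5),
    (v4, 4 - (4 - (v5 - 4))),
    (v5, v4),
    (v4, v5 - v3),
    (v3, IntVal(4)),
]
pre = v3 > 2 + v4
post = v3 < 1

# wp(x := e, Q) = Q[e/x]: substitute backwards through the statement list.
wp = post
for target, rhs in reversed(program):
    wp = substitute(wp, (target, rhs))

# The triple is total iff `pre -> wp` is valid, i.e. its negation is unsatisfiable.
solver = Solver()
solver.add(Not(Implies(pre, wp)))
print("total" if solver.check() == unsat else "not total")  # -> "not total"
```

Here the last assignment sets v3 = 4, so the postcondition v3 < 1 can never hold and the solver finds a counterexample to validity; this is the kind of ground-truth signal the reward functions can compare completions against.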

The purpose of these tasks is to measure code understanding capabilities in LLMs as a function of program complexity and size.

  • ✅ add API bindings for generating prompts and scoring model completions
  • ✅ add reward functions for the TOTALITY_CHECK and FIX_TRIPLE tasks
  • ✅ add unit/integration tests for the new API bindings and reward functions
  • ✅ small change to Makefile to install uv and dev dependencies

Outstanding questions
❓ Where do we need to construct the <think>..</think><answer>..</answer> template? (one possible shape is sketched below)
❓ What is missing for integrating this into open-r1 training (apart from the GRPOTrainer, that is)?
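
On the first question, one possible shape (illustrative only; the `SYSTEM_PROMPT` text and `make_conversation` helper below are not part of this PR or of open-r1's recipes) is to apply the template when mapping raw task prompts into chat-formatted examples, before the dataset reaches the trainer:

```python
# Illustrative sketch: wrap each generated task prompt into a conversational
# example whose system prompt requests the <think>/<answer> tag format.
SYSTEM_PROMPT = (
    "You are a helpful assistant. First reason about the problem inside "
    "<think>...</think>, then give the final answer inside <answer>...</answer>."
)

def make_conversation(example: dict) -> dict:
    """Turn a row with a raw 'prompt' string into a chat-formatted prompt."""
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["prompt"]},
        ]
    }

# e.g. with a Hugging Face datasets.Dataset of generated triples:
# dataset = dataset.map(make_conversation)
```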

Tasks:

  • TOTALITY_CHECK: Given a program triple, the model should judge whether it is correct (= satisfiable) or not.
  • FIX_TRIPLE: Given a program triple and its label, the model should fix the program so that it satisfies the pre- and postconditions.

ℹ️ Example prompts:

  • TOTALITY_CHECK task:
    Below you are given a Python program triple, made of a precondition predicate, a sequence of program statements, and a postcondition predicate. The precondition returns True if the variable environment before beginning the program execution satisfies the predicate, and False otherwise. Similarly, the postcondition returns True if the program environment after the last statement satisfies the predicate, and False otherwise. We say that a triple is correct if, whenever the precondition holds for a given variable assignment, executing the program will produce a variable assignment that satisfies the postcondition. Note that there might be unsatisfiable or contradictory predicates such as 'v1 < v1' or 'v3 > 5 + v3' that make the solution False by definition. You should judge whether the program is 'total', i.e. whether the post-condition evaluates to True for all possible variable assignments that satisfy the precondition. Given a program triple made of program ```v3 = v5 v4 = (4 - (4 - (v5 - 4))) v5 = v4 v4 = (v5 - v3) v3 = 4```, precondition ```v3 > (2 + v4)``` and postcondition ```v3 < 1```, is the postcondition always True at the end of the program? Please only return 'True' or 'False'.

  • FIX_TRIPLE task:
    Below you are given a Python program triple, made of a precondition predicate, a sequence of program statements, and a postcondition predicate. The precondition returns True if the variable environment before beginning the program execution satisfies the predicate, and False otherwise. Similarly, the postcondition returns True if the program environment after the last statement satisfies the predicate, and False otherwise. We say that a triple is correct if, whenever the precondition holds for a given variable assignment, executing the program will produce a variable assignment that satisfies the postcondition. Note that there might be unsatisfiable or contradictory predicates such as 'v1 < v1' or 'v3 > 5 + v3' that make the solution False by definition. Given a program triple made of program ```v5 = 2\nv3 = v5\nv4 = ((5 + (3 + v3)) + (v4 + v5))\nv4 = 9\nv4 = (v3 - 7)```, precondition ```v3 > v4``` and postcondition ```v5 > 6```, you should modify the program such that the resulting triple is total. Currently, the program triple fails 1 verification condition: if the program starts in state v0 = -42, v1 = -98, v2 = -43, v3 = 34, v4 = 31, v5 = -80, the final environment v5 = -80, v4 = 31, v3 = 34 does not satisfy the postcondition. With this information, the correct program that satisfies the given precondition and postcondition is:
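
Given completions to prompts like the above, the TOTALITY_CHECK reward boils down to comparing the extracted answer against the verifier's ground-truth label. A rough sketch of that shape (the actual reward functions live in this PR's diff; the signature below only mirrors the usual convention of scoring a batch of completions against reference labels):

```python
import re

def totality_check_reward(completions, labels, **kwargs):
    """Illustrative sketch: 1.0 if the <answer> tag matches the ground-truth
    True/False label, 0.0 otherwise. Not the exact implementation in the PR."""
    rewards = []
    for completion, label in zip(completions, labels):
        match = re.search(r"<answer>\s*(True|False)\s*</answer>", completion)
        predicted = match.group(1) if match else None
        rewards.append(1.0 if predicted == str(label) else 0.0)
    return rewards

# totality_check_reward(["<think>...</think><answer>False</answer>"], [False])  # -> [1.0]
```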

🎁 UnfoldML provides two API endpoints:

  • program triple data generation: random program generation in a toy subset of Python, together with preconditions and postconditions. Additionally, the API endpoint returns the execution "trace" (how the variable environment is mutated after each statement).
  • program triple verification: parse model completions and return the verification answer.

ℹ️ The API endpoints are parametric, so it is possible to train the model on small ASTs and evaluate on larger ones, or on longer programs, etc., for small-to-large generalization trials.
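
A sketch of how a client might drive these endpoints for such a trial; the base URL, path, and parameter names here are hypothetical placeholders, since the real ones are defined by the PR's API bindings:

```python
import requests

# Hypothetical endpoint and parameter names, for illustration only; the real
# URLs and fields are defined in the PR's API bindings.
BASE_URL = "https://unfoldml.example.com"

def generate_triples(n_statements: int, n_examples: int) -> list[dict]:
    """Request randomly generated program triples of a given program length."""
    resp = requests.post(
        f"{BASE_URL}/generate",
        json={"n_statements": n_statements, "n_examples": n_examples},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

# Small-to-large generalization trial: train on short programs, evaluate on longer ones.
# train_rows = generate_triples(n_statements=5, n_examples=1000)
# eval_rows = generate_triples(n_statements=20, n_examples=200)
```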

⚠️ Limitations: The dataset uses a very restricted fragment of Python:

  • Only straight-line programs: no class declarations, decorators, datatypes, etc.
  • Identifiers are restricted to v0, v1, etc.
  • Only arithmetic operations on the right-hand side of program statements

[1] https://en.wikipedia.org/wiki/Correctness_(computer_science)
[2] https://en.wikipedia.org/wiki/Hoare_logic

cc @Muhtasham @vumichien and @lewtun @qgallouedec ^^

@ocramz ocramz mentioned this pull request Feb 20, 2025
@ocramz ocramz changed the title WIP Feature/htgen dataset WIP Feature/htgen dataset and task Feb 20, 2025
@ocramz ocramz changed the title WIP Feature/htgen dataset and task WIP new GRPO dataset and task: formally-verified program correctness Feb 20, 2025
@ocramz ocramz force-pushed the feature/htgen-dataset branch from 69b44ca to 41af428 on February 21, 2025 07:12
@ocramz ocramz changed the title WIP new GRPO dataset and task: formally-verified program correctness New GRPO dataset and task: formally-verified program correctness Feb 25, 2025
@ocramz (Contributor, Author) commented on Feb 25, 2025

Hi @qgallouedec @lewtun, the PR is ready. I would love your feedback on a couple of things, and it would help if you could start CI so I can fix any outstanding errors. Thank you!

@ocramz (Contributor, Author) commented on Mar 2, 2025

Hi @lewtun @qgallouedec! I've added a second task and more extensive information in the prompt, as well as its respective reward function and unit tests. Tests and linting are green.

Please let me know what else I can do to land this PR :) Thank you!

@ocramz ocramz changed the title New GRPO dataset and task: formally-verified program correctness New GRPO dataset and tasks: formally-verified program correctness Mar 2, 2025