New GRPO dataset and tasks: formally-verified program correctness #379
Open
ocramz wants to merge 37 commits into huggingface:main from unfoldml:feature/htgen-dataset
+585
−4
Conversation
Muhtasham approved these changes on Feb 22, 2025
Hi @qgallouedec @lewtun, the PR is ready. I would love your feedback on a couple of things, and it would help if you could start CI so I can fix any outstanding errors. Thank you!
Hi @lewtun @qgallouedec! I've added a second task and more extensive information in the prompt, as well as its respective reward and unit tests. Tests and linting are green. Please let me know what else I can do to land this PR :) Thank you!
A new dataset and family of tasks to verify programs and to generate correct programs according to a specification [1][2]. Verification is deductive, in the "weakest preconditions" sense, and backed by an SMT solver.
The purpose of these tasks is to measure code-understanding capabilities of LLMs as a function of program complexity and size.
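To make the "weakest preconditions" idea concrete, here is a minimal sketch (not the dataset's actual verifier, which is SMT-backed) for the toy straight-line assignment language used in the example prompts below: the weakest precondition of an assignment `x = e` with respect to a postcondition `Q` is `Q` with `x` substituted by `e`, applied backwards through the program. The helper names are illustrative.

```python
import ast

def substitute(expr: str, var: str, replacement: str) -> str:
    """Replace every occurrence of variable `var` in `expr` by `replacement`."""
    class Sub(ast.NodeTransformer):
        def visit_Name(self, node):
            if node.id == var:
                # Parenthesize the replacement so operator precedence survives.
                return ast.parse(f"({replacement})", mode="eval").body
            return node
    return ast.unparse(Sub().visit(ast.parse(expr, mode="eval")))

def weakest_precondition(program: str, post: str) -> str:
    """Compute wp(program, post) for a sequence of assignments, walking
    backwards: wp(x = e, Q) = Q[x := e]."""
    wp = post
    for stmt in reversed([s for s in program.strip().splitlines() if s.strip()]):
        target, expr = (part.strip() for part in stmt.split("=", 1))
        wp = substitute(wp, target, expr)
    return wp

program = """
v3 = v5
v4 = (4 - (4 - (v5 - 4)))
v5 = v4
v4 = (v5 - v3)
v3 = 4
"""
# The last statement forces v3 = 4, so the postcondition 'v3 < 1' reduces to
# the unsatisfiable formula '4 < 1': the triple can only be total if the
# precondition is itself unsatisfiable.
print(weakest_precondition(program, "v3 < 1"))  # 4 < 1
```

An SMT solver would then be asked whether the precondition implies this computed formula for all variable assignments.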
This PR adds the `TOTALITY_CHECK` and `FIX_TRIPLE` tasks.

Outstanding questions:

- ❓ where do we need to construct the `<think>..</think><answer>..</answer>` template?
- ❓ what is missing for integrating this into open-r1 training? (apart from the GRPOTrainer, that is)
Tasks:

- `TOTALITY_CHECK`: given a program triple, the model should judge whether it is correct (= satisfiable) or not.
- `FIX_TRIPLE`: given a program triple with a given label, the model should fix the program to satisfy the pre- and postcondition.

ℹ️ Example prompts:
`TOTALITY_CHECK` task:

> Below you are given a Python program triple, made of a precondition predicate, a sequence of program statements, and a postcondition predicate. The precondition returns True if the variable environment before beginning the program execution satisfies the predicate, and False otherwise. Similarly, the postcondition returns True if the program environment after the last statement satisfies the predicate, and False otherwise. We say that a triple is correct if, whenever the precondition holds for a given variable assignment, executing the program will produce a variable assignment that satisfies the postcondition. Note that there might be unsatisfiable or contradictory predicates such as 'v1 < v1' or 'v3 > 5 + v3' that make the solution False by definition. You should judge whether the program is 'total', i.e. whether the post-condition evaluates to True for all possible variable assignments that satisfy the precondition. Given a program triple made of program ```v3 = v5 v4 = (4 - (4 - (v5 - 4))) v5 = v4 v4 = (v5 - v3) v3 = 4```, precondition ```v3 > (2 + v4)``` and postcondition ```v3 < 1```, is the postcondition always True at the end of the program? Please only return 'True' or 'False'.
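As a sanity check on this example, a brute-force sketch can refute totality by enumerating small initial environments (the dataset's verifier uses an SMT solver over weakest preconditions instead; finite enumeration can only refute totality, never prove it). The function name and search range are illustrative, and `exec`/`eval` are only safe here because the program text is trusted.

```python
PROGRAM = """
v3 = v5
v4 = (4 - (4 - (v5 - 4)))
v5 = v4
v4 = (v5 - v3)
v3 = 4
"""

def is_total(pre: str, program: str, post: str, lo: int = -10, hi: int = 10):
    """Return (False, counterexample_env) if some initial environment that
    satisfies `pre` ends in a state violating `post`; (True, None) if no
    counterexample is found in the searched range."""
    for v3 in range(lo, hi + 1):
        for v4 in range(lo, hi + 1):
            for v5 in range(lo, hi + 1):
                env = {"v3": v3, "v4": v4, "v5": v5}
                if not eval(pre, {}, dict(env)):
                    continue
                exec(program, {}, env)  # run the triple's program on env
                if not eval(post, {}, env):
                    return False, env
    return True, None

total, witness = is_total("v3 > (2 + v4)", PROGRAM, "v3 < 1")
print(total)  # False: the last statement sets v3 = 4, so 'v3 < 1' never holds
```

So the expected answer for this prompt is 'False'.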
`FIX_TRIPLE` task:

> Below you are given a Python program triple, made of a precondition predicate, a sequence of program statements, and a postcondition predicate. The precondition returns True if the variable environment before beginning the program execution satisfies the predicate, and False otherwise. Similarly, the postcondition returns True if the program environment after the last statement satisfies the predicate, and False otherwise. We say that a triple is correct if, whenever the precondition holds for a given variable assignment, executing the program will produce a variable assignment that satisfies the postcondition. Note that there might be unsatisfiable or contradictory predicates such as 'v1 < v1' or 'v3 > 5 + v3' that make the solution False by definition. Given a program triple made of program ```v5 = 2\nv3 = v5\nv4 = ((5 + (3 + v3)) + (v4 + v5))\nv4 = 9\nv4 = (v3 - 7)```, precondition ```v3 > v4``` and postcondition ```v5 > 6```, you should modify the program such that the resulting triple is total. Currently, the program triple fails 1 verification condition: if the program starts in state v0 = -42, v1 = -98, v2 = -43, v3 = 34, v4 = 31, v5 = -80, the final environment v5 = -80, v4 = 31, v3 = 34 does not satisfy the postcondition. With this information, the correct program that satisfies the given precondition and postcondition is:
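A binary GRPO reward for `FIX_TRIPLE` could treat the model's completion as the repaired program and test it on sampled initial environments satisfying the precondition. This is a hedged sketch: the function name, sampling scheme, and reward values are illustrative, not the PR's actual reward implementation, and random sampling only approximates the SMT-backed check.

```python
import random

def fix_triple_reward(completion: str, pre: str, post: str,
                      variables=("v3", "v4", "v5"),
                      n_samples: int = 200, seed: int = 0) -> float:
    """Reward 1.0 if every sampled pre-satisfying environment ends in a state
    satisfying `post` after running `completion`, else 0.0."""
    rng = random.Random(seed)
    tested = 0
    for _ in range(n_samples):
        env = {v: rng.randint(-100, 100) for v in variables}
        if not eval(pre, {}, dict(env)):
            continue  # skip environments that do not satisfy the precondition
        tested += 1
        try:
            exec(completion, {}, env)
            if not eval(post, {}, env):
                return 0.0
        except Exception:
            return 0.0  # malformed completions earn no reward
    return 1.0 if tested > 0 else 0.0

# The buggy program from the FIX_TRIPLE prompt above always ends with v5 = 2;
# one valid repair is to set v5 = 7 in the first statement instead.
buggy = "v5 = 2\nv3 = v5\nv4 = ((5 + (3 + v3)) + (v4 + v5))\nv4 = 9\nv4 = (v3 - 7)"
fixed = buggy.replace("v5 = 2", "v5 = 7", 1)
print(fix_triple_reward(buggy, "v3 > v4", "v5 > 6"))  # 0.0
print(fix_triple_reward(fixed, "v3 > v4", "v5 > 6"))  # 1.0
```

In a real training loop the solver-backed verifier would replace the sampling, so correct repairs outside the sampled range are not penalized.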
🎁 UnfoldML provides two API endpoints:
ℹ️ The API endpoints are parametric, so it's possible to train the model on small ASTs and test on larger ones, or longer programs, etc. to do small-to-large generalization trials.
ℹ️ Programs use generic variable names `v0`, `v1`, etc.

[1] https://en.wikipedia.org/wiki/Correctness_(computer_science)
[2] https://en.wikipedia.org/wiki/Hoare_logic
cc @Muhtasham @vumichien and @lewtun @qgallouedec ^^