dynamic sparse rewards #528
base: main
Conversation
mikasenghaas
left a comment
great! thanks for taking a first stab at this. initial vibes: we shouldn't try to be as "backwards compatible" as possible here, but rather support "a reward function might not be defined for this task type" as something more first-class. i think this will be cleaner and result in a much smaller diff than currently.
in terms of timing: we have a pretty big refactor upcoming, so we will likely only tackle this problem afterwards, but will defo work off the concepts you introduced here!
| "tomli; python_version < '3.11'", | ||
| "prime-sandboxes>=0.1.0", | ||
| "wget>=3.2", | ||
| "torch>=2.8.0", |
prob leftover from smth else?
    reward: list[float]
    metrics: dict[str, list[float]] = Field(default_factory=dict)
    sparse_metrics: dict[str, list[bool]] | None = Field(default=None)
i don't like the name sparse_metrics here. to me, this implies that this is the actual float metrics after filtering. would prefer a name that is indicative of the fact that these are boolean flags, maybe smth like has_reward_fn (not sure)
    # only average over relevant (non-sparse) values
    # instead of including misleading zeros in the calculation
    if (
        hasattr(results, "sparse_metrics")
this is always true?
    ):
        # filter out sparse values from averaging calculation
        # sparse_flags[i] = True means exclude rollout i from averaging
        sparse_flags = results.sparse_metrics[k]
ah yea look here you call it sparse_flags as well haha, this is alr better than sparse_metrics
Description
Add sparse metrics support for mathematically correct domain averaging in multi-domain environments. This feature enables selective averaging that excludes irrelevant zero values, solving the domain dilution problem in composite evaluation environments like ProfBench.
Key improvements:
- `avg - 72.9 (relevant: 2/12)` instead of diluted `avg - 12.3`
- `avg - 66.2 (relevant: 10/12)` instead of diluted `avg - 56.2`
- `-` for sparse values instead of misleading `0.0`

Type of Change
Testing
- `uv run pytest` locally.

Manual Testing:
Tested with the ProfBench environment, showing correct sparse metrics behavior (see bottom of this file for details):
- `-` shown in output for sparse values
- `(relevant: X/Y)` info shows sparsity clearly

Checklist
Additional Notes
In environments like `ProfBench`, domain-specific scores get mixed with irrelevant zeros, making the averages misleading.

Example Issue:
Evaluating GPT-4 on 12 tasks: 10 physics + 2 chemistry tasks

Before:
- `physics_reward: avg - 56.2` (diluted by irrelevant zeros)
- `chemistry_reward: avg - 13.7` (misleading!)

After:
- `chemistry_reward: avg - 82.0 (relevant: 2/12)` (actual chemistry skill)
- `physics_reward: avg - 66.2 (relevant: 10/12)` (pure physics performance)

Which can all be done now within an `EnvGroup` with `enable_sparse_metrics=True`. We can now show `-` instead of `0.0` for sparse values.

Core
1. Type Extensions @ `types.py`

New Fields Added:
This tracks which metric values should be excluded from averaging calculations.
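As a sketch of the shape these fields take (the class name and surrounding imports here are illustrative; the field definitions mirror the diff shown in the review above):

```python
from pydantic import BaseModel, Field

class EvalResults(BaseModel):  # illustrative name; the actual model lives in types.py
    reward: list[float]
    metrics: dict[str, list[float]] = Field(default_factory=dict)
    # New: per-metric boolean flags. True at index i means the metric's reward
    # function was not defined for rollout i, so that value should be excluded
    # from averaging instead of being counted as 0.0.
    sparse_metrics: dict[str, list[bool]] | None = Field(default=None)
```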
2. Environment Sparse Tracking @ `envs/environment.py`

Key Changes:
This collects and aggregates sparse metadata during evaluation execution.
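A minimal sketch of the kind of aggregation this step performs (the function name and data layout are assumptions for illustration, not the actual code in `envs/environment.py`):

```python
def aggregate_sparse_flags(per_rollout_flags: list[dict[str, bool]]) -> dict[str, list[bool]]:
    """Merge per-rollout sparsity flags into the dict-of-lists layout used by
    sparse_metrics. A metric missing from a rollout's flags defaults to sparse."""
    all_metrics = {name for flags in per_rollout_flags for name in flags}
    return {
        name: [flags.get(name, True) for flags in per_rollout_flags]
        for name in all_metrics
    }
```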
3. Batch Scoring with Sparse Handling @ `rubrics/rubric.py`

Key Changes:
Ensures a consistent metric structure while preserving sparsity information.
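Roughly, the idea is that every reward function still produces a value for every rollout (so the metric keys stay uniform), but values from functions that don't apply to a rollout's task are flagged. A hedged sketch with hypothetical names:

```python
from typing import Callable

def score_with_sparsity(
    reward_funcs: dict[str, Callable[[str], float]],
    applicable_tasks: dict[str, set[str]],
    task: str,
    completion: str,
) -> tuple[dict[str, float], dict[str, bool]]:
    """Score one rollout with every reward function. Functions that don't
    apply to this rollout's task contribute a 0.0 placeholder but are
    flagged sparse so downstream averaging can skip them."""
    scores: dict[str, float] = {}
    sparse: dict[str, bool] = {}
    for name, fn in reward_funcs.items():
        if task in applicable_tasks.get(name, set()):
            scores[name] = fn(completion)
            sparse[name] = False
        else:
            scores[name] = 0.0   # keeps metric keys consistent across rollouts
            sparse[name] = True  # ...but excluded from averages later
    return scores, sparse
```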
4. EnvGroup Sparse Architecture @ `envs/env_group.py`

New Class: `EnvGroupSparseRubric`

Extends the standard `EnvGroupRubric` with domain-specific sparse marking.

Activation Logic:
Automatically marks domain-specific metrics as sparse when irrelevant.
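The core of the marking logic can be illustrated as a small standalone helper (the real logic lives inside `EnvGroupSparseRubric`; the sub-environment-to-metrics mapping used here is an assumption for illustration):

```python
def mark_domain_sparse(
    rollout_env: str,
    env_metrics: dict[str, list[str]],
) -> dict[str, bool]:
    """Given the sub-environment a rollout came from and a mapping of
    sub-environment name -> its metric names, flag every metric owned by a
    different sub-environment as sparse for this rollout."""
    return {
        metric: env_name != rollout_env
        for env_name, metrics in env_metrics.items()
        for metric in metrics
    }

# mark_domain_sparse("chemistry", {"physics": ["physics_reward"],
#                                  "chemistry": ["chemistry_reward"]})
# -> {"physics_reward": True, "chemistry_reward": False}
```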
5. Sparse-Aware Display @ `utils/eval_utils.py`

Selective Averaging:
Enhanced Display:
Provides mathematically correct averages and a clear visual distinction of sparsity.
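A sketch of the selective averaging and display, following the `sparse_flags` filtering visible in the diff above (the helper name is illustrative; the output format mirrors the `avg - X (relevant: X/Y)` strings described in this PR):

```python
def format_metric_avg(values: list[float], sparse_flags: list[bool] | None) -> str:
    """Average only over non-sparse values and show how many rollouts were
    relevant, e.g. 'avg - 72.9 (relevant: 2/12)'."""
    if sparse_flags is None:
        sparse_flags = [False] * len(values)
    relevant = [v for v, is_sparse in zip(values, sparse_flags) if not is_sparse]
    if not relevant:
        return f"avg - - (relevant: 0/{len(values)})"  # no relevant rollouts at all
    avg = sum(relevant) / len(relevant)
    return f"avg - {avg:.1f} (relevant: {len(relevant)}/{len(values)})"
```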
Usage
To test sparse metrics with ProfBench:
1. Pull the ProfBench environment changes (this env is a bounty in progress, fyi):

   git clone https://github.com/vxnuaj/prime-environments.git -b vxnuaj/profbench
   cd prime-environments

2. Pull this verifiers fork / PR with the sparse metrics implementation:
3. Install verifiers in editable mode:
4. Run the evaluation to see sparse metrics in action:
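For context (this is not the missing command from step 4), a minimal sketch of what the `EnvGroup` setup described above might look like; `enable_sparse_metrics` comes from this PR's description, and the sub-environment names are placeholders:

```python
import verifiers as vf

# Hypothetical setup: the environment ids below are placeholders, and
# enable_sparse_metrics is the new flag proposed by this PR.
physics_env = vf.load_environment("profbench-physics")
chemistry_env = vf.load_environment("profbench-chemistry")

group = vf.EnvGroup(
    envs=[physics_env, chemistry_env],
    env_names=["physics", "chemistry"],
    enable_sparse_metrics=True,  # new in this PR
)
```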