@vxnuaj vxnuaj commented Nov 4, 2025

Description

Add sparse metrics support for mathematically correct domain averaging in multi-domain environments. This feature enables selective averaging that excludes irrelevant zero values, solving the domain dilution problem in composite evaluation environments like ProfBench.

Key improvements:

  • Chemistry domain: avg - 72.9 (relevant: 2/12) instead of diluted avg - 12.3
  • Physics domain: avg - 66.2 (relevant: 10/12) instead of diluted avg - 56.2
  • Visual distinction: Shows - for sparse values instead of misleading 0.0

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Manual Testing:
Tested with the ProfBench environment, showing correct sparse metrics behavior (see the bottom of this description for details):

  • Domain-specific averages exclude irrelevant metrics
  • Sparse values display as - in output
  • (relevant: X/Y) info shows sparsity clearly

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes

In environments like ProfBench, domain-specific scores get mixed with irrelevant zeros, making the averages misleading.

Example Issue:
Evaluating GPT-4 on 12 tasks: 10 physics + 2 chemistry

physics_reward: [65, 72, 58, 81, 45, 67, 73, 59, 68, 74, 0, 0]  # zeros for chemistry tasks
chemistry_reward: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 88, 76]        # zeros for physics tasks
  • Before: physics_reward: avg - 55.2 (diluted by irrelevant zeros)
  • Before: chemistry_reward: avg - 13.7 (misleading!)

After:

physics_reward: [65, 72, 58, 81, 45, 67, 73, 59, 68, 74, -, -]  # sparse for chemistry tasks
chemistry_reward: [-, -, -, -, -, -, -, -, -, -, 88, 76]        # sparse for physics tasks
  • After: chemistry_reward: avg - 82.0 (relevant: 2/12) (actual chemistry skill)
  • After: physics_reward: avg - 66.2 (relevant: 10/12) (pure physics performance)

All of this is now available within an EnvGroup by passing enable_sparse_metrics=True.

With this change, we can now (see the sketch after this list):

  1. mark irrelevant values as sparse during scoring
  2. exclude sparse values from averaging calculations
  3. display sparsity clearly with - instead of 0.0
  4. maintain backwards compatibility with existing environments
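
Here is a minimal, standalone sketch of the selective averaging in question, using the chemistry scores from the example above (plain Python, independent of the verifiers API):

# Standalone sketch of selective averaging (chemistry scores from the example above)
values = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 88, 76]
sparse = [True] * 10 + [False] * 2   # True = metric not defined for this rollout

relevant = [v for v, s in zip(values, sparse) if not s]
avg = sum(relevant) / len(relevant)  # 82.0 rather than the diluted 13.7
print(f"chemistry_reward: avg - {avg:.1f} (relevant: {len(relevant)}/{len(values)})")
# -> chemistry_reward: avg - 82.0 (relevant: 2/12)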

Core

1. Type Extensions @ types.py

New Fields Added:

class RolloutScore(BaseModel):
    sparse_metrics: set[str] | None = Field(default=None)
    # set of metric names to exclude from averaging for this rollout

class RolloutScores(BaseModel):
    sparse_metrics: dict[str, list[bool]] | None = Field(default=None)
    # per-rollout exclusion flags for batch scoring

class GenerateOutputs(BaseModel):
    sparse_metrics: dict[str, list[bool]] | None = Field(default=None)
    # final sparse tracking for evaluation results

These fields track which metric values should be excluded from averaging calculations.
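
For illustration, a single-rollout score for a chemistry task could then be populated along these lines (the reward and metrics fields are assumed from the existing models; only sparse_metrics is new):

# Hypothetical single-rollout score for a chemistry task
score = RolloutScore(
    reward=88.0,
    metrics={"chemistry_reward": 88.0, "physics_reward": 0.0},
    sparse_metrics={"physics_reward"},  # physics_reward is not defined for this task
)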

2. Environment Sparse Tracking @ envs/environment.py

Key Changes:

  • Initialize sparse flags for all metrics during interleaved scoring
  • Track sparse metrics from rubric scoring results
  • Conditionally assign sparse_metrics only if sparsity detected (backwards compatible)
# Initialize sparse tracking (n = number of rollouts)
sparse_flags: dict[str, list[bool]] = {name: [False] * n for name in reward_func_names}

# Process sparse flags from scoring (rs = score for rollout i)
if rs.sparse_metrics:
    for sparse_key in rs.sparse_metrics:
        sparse_flags[sparse_key][i] = True

# Only attach sparse metadata if any sparsity was detected (backwards compatible)
if any(any(flags) for flags in sparse_flags.values()):
    results.sparse_metrics = sparse_flags

This collects and aggregates sparse metadata during evaluation execution.
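
For the 12-task example above (10 physics + 2 chemistry), the aggregated flags would end up looking roughly like this:

# Rough shape of the aggregated flags for the 12-task example
sparse_flags = {
    "physics_reward":   [False] * 10 + [True] * 2,   # last two rollouts are chemistry
    "chemistry_reward": [True] * 10 + [False] * 2,
}
# results.sparse_metrics is only assigned because at least one flag is True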

3. Batch Scoring with Sparse Handling @ rubrics/rubric.py

Key Changes:

  • Collect all metric keys across rollouts (handles mixed metrics)
  • Fill missing metrics with 0.0 and mark as sparse
  • Track sparsity flags from individual rollout scores
  • Return sparse metadata only if sparsity detected
# Handle missing metrics as sparse
if k in reward.metrics:
    metrics[k].append(reward.metrics[k])
    is_sparse = bool(reward.sparse_metrics and k in reward.sparse_metrics)
    sparse_flags[k].append(is_sparse)
else:
    # Missing metric -> placeholder 0.0, marked sparse
    metrics[k].append(0.0)
    sparse_flags[k].append(True)

This ensures a consistent metric structure while preserving sparsity information.
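
For context, here is a sketch of the aggregation loop this snippet sits in (variable names are assumed; rewards stands for the list of per-rollout RolloutScore objects):

# Assumed surrounding loop: first collect every metric key seen across rollouts
all_keys: set[str] = set()
for reward in rewards:
    all_keys.update(reward.metrics.keys())

metrics: dict[str, list[float]] = {k: [] for k in all_keys}
sparse_flags: dict[str, list[bool]] = {k: [] for k in all_keys}

for reward in rewards:
    for k in all_keys:
        if k in reward.metrics:
            metrics[k].append(reward.metrics[k])
            sparse_flags[k].append(bool(reward.sparse_metrics and k in reward.sparse_metrics))
        else:
            metrics[k].append(0.0)        # placeholder value...
            sparse_flags[k].append(True)  # ...marked sparse so it is excluded from averaging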

4. EnvGroup Sparse Architecture @ envs/env_group.py

New Class: EnvGroupSparseRubric

Extends standard EnvGroupRubric with domain-specific sparse marking:

class EnvGroupSparseRubric(EnvGroupRubric):
    async def score_rollout(self, ...):
        # Route to domain-specific environment
        env_results = await env.rubric.score_rollout(...)
        
        # Mark uncomputed metrics as sparse
        uncomputed_metrics = set(all_rewards) - set(env_results.metrics.keys())
        sparse_metrics = uncomputed_metrics if uncomputed_metrics else None
        
        return RolloutScore(sparse_metrics=sparse_metrics, ...)

Activation Logic:

# Key decision point for sparse metrics
if enable_sparse_metrics:
    rubric = EnvGroupSparseRubric(self.env_map)  # Sparse-aware
else:
    rubric = EnvGroupRubric(self.env_map)       # Standard (backwards compatible)

This automatically marks domain-specific metrics as sparse when they are not relevant to the rollout's environment.
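
To make the routing concrete, here is a hypothetical trace of what the sparse rubric would produce for a chemistry rollout in a physics + chemistry group (names mirror the example above):

# Hypothetical trace for a chemistry rollout
all_rewards = {"physics_reward", "chemistry_reward"}  # every reward fn across the group
env_metrics = {"chemistry_reward": 88.0}              # only chemistry's rubric ran

uncomputed_metrics = all_rewards - set(env_metrics)   # {"physics_reward"}
# The returned RolloutScore carries sparse_metrics={"physics_reward"},
# so physics_reward is excluded from this rollout's averages downstream.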

5. Sparse-Aware Display @ utils/eval_utils.py

Selective Averaging:

# Filter out sparse values before averaging (v = per-rollout values for metric k)
sparse_flags = results.sparse_metrics[k]
relevant_values = [val for val, is_sparse in zip(v, sparse_flags) if not is_sparse]

if relevant_values:
    avg = sum(relevant_values) / len(relevant_values)
    sparsity_info = f" (relevant: {len(relevant_values)}/{len(v)})"
    print(f"{k}: avg - {avg:.3f}{sparsity_info}")
else:
    print(f"{k}: no relevant data (all values sparse)")

Enhanced Display:

# Show "-" for sparse values instead of misleading 0.0
if sparse_flags[idx]:
    trials.append("-")        # Sparse (excluded from averaging)
else:
    trials.append(round(v[idx], 3))  # Actual computed value

This provides mathematically correct averages and a clear visual distinction for sparse values.
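
With the example data from above, the eval output would then read along these lines (exact layout depends on the rest of eval_utils, which is not shown here):

# physics_reward: avg - 66.200 (relevant: 10/12)
# chemistry_reward: avg - 82.000 (relevant: 2/12)
# per-rollout chemistry_reward values: -, -, -, -, -, -, -, -, -, -, 88.0, 76.0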

Usage

# Standard behavior (backwards compatible)
env = vf.EnvGroup(envs, names)                              # Standard averaging

# Sparse metrics enabled
env = vf.EnvGroup(envs, names, enable_sparse_metrics=True)  # Selective averaging

# Exposing the flag from an environment loader
def load_environment(enable_sparse_metrics: bool = True):
    return vf.EnvGroup(
        envs=domain_envs,
        env_names=domain_names,
        enable_sparse_metrics=enable_sparse_metrics,
    )

To Test

To test sparse metrics with ProfBench:

  1. Pull the ProfBench environment changes (this env is a bounty in progress, FYI):

    git clone https://github.com/vxnuaj/prime-environments.git -b vxnuaj/profbench
    cd prime-environments
  2. Pull this verifiers fork / PR with the sparse metrics implementation:

    git clone https://github.com/vxnuaj/verifiers.git -b vxnuaj/dynamic-sparse-rewards
  3. Install verifiers in editable mode:

    cd verifiers
    uv pip install -e .
  4. Run evaluation to see sparse metrics in action:

    vf-eval -s profbench -m gpt-4.1-mini --env-args '{"judge_model": "openai/gpt-4.1-mini"}' -n 12 -r 1 
    # -n must be >= 10 for sparsity to be detected; with fewer examples, ProfBench only loads tasks from the first domain (physics or chemistry, I believe)
    # feel free to use any value for -r


CLAassistant commented Nov 4, 2025

CLA assistant check
All committers have signed the CLA.

@vxnuaj changed the title from "Vxnuaj/dynamic sparse rewards" to "dynamic sparse rewards" on Nov 4, 2025
@mikasenghaas mikasenghaas left a comment


great! thanks for taking a first stab at this. initial vibes is that we shouldn't try to be as "backwards compatible" as possible here, but rather support "a reward function might not be defined for this task type" more first-class. i think this will be cleaner and result in a much smaller diff than currently.

in terms of timing: we have a pretty big refactor upcoming, so we will likely only tackle this problem afterwards, but will defo work off the concepts you introduced here!

"tomli; python_version < '3.11'",
"prime-sandboxes>=0.1.0",
"wget>=3.2",
"torch>=2.8.0",

prob leftover from smth else?


reward: list[float]
metrics: dict[str, list[float]] = Field(default_factory=dict)
sparse_metrics: dict[str, list[bool]] | None = Field(default=None)

i don't like the name sparse metrics here. to me, this implies that these are the actual float metrics after filtering. would prefer a name that is indicative of the fact that these are boolean flags, maybe something like has_reward_fn (not sure)

# only average over relevant (non-sparse) values
# instead of including misleading zeros in the calculation
if (
hasattr(results, "sparse_metrics")

this is always true?

):
# filter out sparse values from averaging calculation
# sparse_flags[i] = True means exclude rollout i from averaging
sparse_flags = results.sparse_metrics[k]

ah yea, look, here you call it sparse_flags as well haha, this is already better than sparse_metrics
