Bug in the best run selection in calculate_metrics.py #9

Open
piojanu opened this issue Feb 25, 2025 · 0 comments

piojanu commented Feb 25, 2025

Hi!

In the paper you write:

To improve evaluation stability, we repeat each task with three independent runs in all experiments. Then we select the best run according to the metrics in the following order: maximum SR, maximum VER, maximum CBS, and minimum Cost. We refer to the next metric in this order to break ties. For example, if two programs generated for a task both have SR = 0, we pick the one with higher VER.

However, this is not how the code currently behaves. The bug is in the tie-breaking mechanism of calculate_metrics.py: when breaking ties, the code searches through all trajectories instead of only within the filtered candidates that passed the previous criteria. As a result, it can select the trajectory with the minimum cost even if that trajectory does not have the best success_rate, valid_program, and codebert_score values from the earlier filtering steps. The same issue exists in every tie-breaking section.
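For illustration, here is a minimal sketch of how the progressive filtering described in the paper could look. The field names (`success_rate`, `valid_program`, `codebert_score`, `cost`) and the dict-based trajectory representation are assumptions for this example, not the actual data structures in calculate_metrics.py:

```python
def select_best_run(trajectories):
    """Pick the best run by max SR, max VER, max CBS, then min Cost.

    Each tie-break must search only within the candidates that survived the
    previous criteria, not the full list of trajectories.
    NOTE: field names here are illustrative assumptions.
    """
    candidates = list(trajectories)

    # Primary criterion: maximum success rate (SR).
    best_sr = max(t["success_rate"] for t in candidates)
    candidates = [t for t in candidates if t["success_rate"] == best_sr]

    # Tie-break 1: maximum valid-program rate (VER), among remaining candidates only.
    best_ver = max(t["valid_program"] for t in candidates)
    candidates = [t for t in candidates if t["valid_program"] == best_ver]

    # Tie-break 2: maximum CodeBERTScore (CBS), among remaining candidates only.
    best_cbs = max(t["codebert_score"] for t in candidates)
    candidates = [t for t in candidates if t["codebert_score"] == best_cbs]

    # Tie-break 3: minimum cost, again only among the remaining candidates.
    return min(candidates, key=lambda t: t["cost"])
```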

I propose a fix in #8.

ronch99 self-assigned this Feb 25, 2025