Hi!

In the paper you write:

> To improve evaluation stability, we repeat each task with three independent runs in all experiments. Then we select the best run according to the metrics in the following order: maximum SR, maximum VER, maximum CBS, and minimum Cost. We refer to the next metric in this order to break ties. For example, if two programs generated for a task both have SR = 0, we pick the one with higher VER.
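As I read it, this describes a sequential filter over the runs, roughly like the sketch below (the dictionary keys are my own guesses based on the metric names, not necessarily the ones used in the repo):

```python
# Rough sketch of the selection rule quoted above; the keys
# ("success_rate", "valid_program", "codebert_score", "cost") are assumptions.
def select_best_run(runs):
    candidates = list(runs)
    # Keep only the runs tied on the best value of each metric, in order:
    # max SR, then max VER, then max CBS.
    for key in ("success_rate", "valid_program", "codebert_score"):
        best_value = max(r[key] for r in candidates)
        candidates = [r for r in candidates if r[key] == best_value]
        if len(candidates) == 1:
            return candidates[0]
    # Final tie-break: minimum cost, taken only among the remaining candidates.
    return min(candidates, key=lambda r: r["cost"])
```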
However, this is not how the code works right now. The bug is in the tie-breaking mechanism of `calculate_metrics.py`. When breaking ties, the code incorrectly searches through all trajectories instead of searching only within the filtered candidates that passed all previous criteria. This means it could select a trajectory with the minimum cost that doesn't actually have the best `success_rate`, `valid_program`, and `codebert_score` values from the earlier filtering steps. The same issue exists in all tie-breaking sections.

I propose a fix here: #8.