Hi!

In the paper you write:

> To improve evaluation stability, we repeat each task with three independent runs in all experiments. Then we select the best run according to the metrics in the following order: maximum SR, maximum VER, maximum CBS, and minimum Cost. We refer to the next metric in this order to break ties. For example, if two programs generated for a task both have SR = 0, we pick the one with higher VER.
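As I read it, this describes a sequential filter over the runs, roughly like the sketch below (the dictionary keys are my own guesses based on the metric names, not necessarily the ones used in the repo):

```python
# Rough sketch of the selection rule quoted above; the keys
# ("success_rate", "valid_program", "codebert_score", "cost") are assumptions.
def select_best_run(runs):
    candidates = list(runs)
    # Keep only the runs tied on the best value of each metric, in order:
    # max SR, then max VER, then max CBS.
    for key in ("success_rate", "valid_program", "codebert_score"):
        best_value = max(r[key] for r in candidates)
        candidates = [r for r in candidates if r[key] == best_value]
        if len(candidates) == 1:
            return candidates[0]
    # Final tie-break: minimum cost, taken only among the remaining candidates.
    return min(candidates, key=lambda r: r["cost"])
```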
However, this is not how the code works right now. The bug is in the tie-breaking mechanism of `calculate_metrics.py`. When breaking ties, the code incorrectly searches through all trajectories instead of searching only within the filtered candidates that passed all previous criteria. This means it could select a trajectory with the minimum cost that doesn't actually have the best `success_rate`, `valid_program`, and `codebert_score` values from the earlier filtering steps. The same issue exists in all tie-breaking sections.

I propose a fix here: #8.