[worker, training_utils] fix: Metric Aggregation Across DP Ranks #5203
wuxibin89 merged 5 commits into verl-project:main from
Conversation
Code Review
This pull request introduces a fix for metric aggregation across Data Parallel (DP) ranks, addressing an issue where sum-aggregated metrics were scaling incorrectly with the number of ranks. The new Metric.aggregate_dp method correctly handles different aggregation strategies for distributed scenarios, and this change is well-supported by new unit tests. However, I've identified a critical bug in the input validation logic of the new aggregate_dp function that fails to detect mismatched metric lengths and doesn't handle empty inputs, which could lead to incorrect calculations or crashes. I've provided a code suggestion to resolve this issue. Apart from this, the changes look good and effectively solve the problem.
/gemini review
Code Review
This pull request correctly fixes an issue with metric aggregation across data parallel ranks, preventing sum-aggregated metrics from scaling with the number of ranks. The new aggregate_dp method properly handles different aggregation types by averaging over DP ranks for SUM/MEAN and flattening for MIN/MAX. The accompanying tests are thorough and cover the new logic well. I've included one suggestion to refactor a small part of the new aggregate_dp function to improve clarity and remove redundant code.
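To make the behavior described above concrete, here is a minimal sketch of that aggregation logic. The enum name `AggregationType`, the function name `aggregate_dp`, and its signature are assumptions for illustration only, not verl's actual API; the real method lives on the `Metric` class.

```python
# Minimal sketch of the DP-rank aggregation behavior described above.
# Names and signature are illustrative assumptions, not verl's actual API.
from enum import Enum


class AggregationType(Enum):
    SUM = "sum"
    MEAN = "mean"
    MIN = "min"
    MAX = "max"


def aggregate_dp(values_per_rank: list[list[float]], agg: AggregationType) -> float:
    """Aggregate one metric collected from all DP ranks.

    values_per_rank[i] holds the values recorded on DP rank i.
    """
    # Validate input: reject empty input and mismatched per-rank lengths.
    if not values_per_rank or not values_per_rank[0]:
        raise ValueError("expected non-empty values from every DP rank")
    if any(len(v) != len(values_per_rank[0]) for v in values_per_rank):
        raise ValueError("expected equal-length value lists from every DP rank")

    if agg in (AggregationType.SUM, AggregationType.MEAN):
        # Reduce within each rank first, then average across ranks so the
        # result does not scale with the number of DP ranks.
        reduce_fn = sum if agg is AggregationType.SUM else (lambda v: sum(v) / len(v))
        per_rank = [reduce_fn(v) for v in values_per_rank]
        return sum(per_rank) / len(per_rank)

    # MIN/MAX: flatten all ranks' values and take the global extremum.
    flat = [x for v in values_per_rank for x in v]
    return min(flat) if agg is AggregationType.MIN else max(flat)
```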
What does this PR do?
Fixes an issue (#4778 (comment)) where metric aggregation across DP ranks caused sum-aggregated metrics to scale in magnitude with the number of DP ranks.
Thanks to @pengwu22 for pointing this out.
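As a hypothetical illustration of the scaling issue (using the sketch above, whose names are assumptions): if four DP ranks each report a summed metric of 10.0, naive cross-rank summation yields 40.0, while averaging across ranks keeps the reported value at 10.0.

```python
# Hypothetical numbers: each of 4 DP ranks reports the same summed metric.
values_per_rank = [[10.0], [10.0], [10.0], [10.0]]

# Naive cross-rank summation scales with the number of ranks: 40.0.
naive = sum(sum(v) for v in values_per_rank)

# Averaging across ranks (as in the sketch above) stays at 10.0.
fixed = aggregate_dp(values_per_rank, AggregationType.SUM)
print(naive, fixed)  # 40.0 10.0
```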