Skip to content

[worker, training_utils] fix: Metric Aggregation Across DP Ranks#5203

Merged
wuxibin89 merged 5 commits intoverl-project:mainfrom
JacobHelwig:jhelwig/fixDPMetrics
Feb 5, 2026
Merged

[worker, training_utils] fix: Metric Aggregation Across DP Ranks#5203
wuxibin89 merged 5 commits intoverl-project:mainfrom
JacobHelwig:jhelwig/fixDPMetrics

Conversation

@JacobHelwig
Copy link
Collaborator

@JacobHelwig JacobHelwig commented Feb 5, 2026

What does this PR do?

Fixes issue (#4778 (comment)) with metric aggregation across DP ranks resulting in sum-aggregated metrics scaling in magnitude with number of DP ranks

Thanks to @pengwu22 for pointing this out.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a fix for metric aggregation across Data Parallel (DP) ranks, addressing an issue where sum-aggregated metrics were scaling incorrectly with the number of ranks. The new Metric.aggregate_dp method correctly handles different aggregation strategies for distributed scenarios, and this change is well-supported by new unit tests. However, I've identified a critical bug in the input validation logic of the new aggregate_dp function that fails to detect mismatched metric lengths and doesn't handle empty inputs, which could lead to incorrect calculations or crashes. I've provided a code suggestion to resolve this issue. Apart from this, the changes look good and effectively solve the problem.

@JacobHelwig
Copy link
Collaborator Author

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly fixes an issue with metric aggregation across data parallel ranks, preventing sum-aggregated metrics from scaling with the number of ranks. The new aggregate_dp method properly handles different aggregation types by averaging over DP ranks for SUM/MEAN and flattening for MIN/MAX. The accompanying tests are thorough and cover the new logic well. I've included one suggestion to refactor a small part of the new aggregate_dp function to improve clarity and remove redundant code.

@wuxibin89 wuxibin89 merged commit 2320603 into verl-project:main Feb 5, 2026
71 of 74 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants