[MRG] FIX geometric_mean_score: macro/weighted average is mean of per-class G-means (#1096)#1173
Conversation
… G-means (scikit-learn-contrib#1096) The macro and weighted branches previously delegated averaging to sensitivity_specificity_support and then computed sqrt(sen * spe) on the macro/weighted-averaged scalars. That is the G-mean of mean-sensitivity and mean-specificity, NOT the (weighted) mean of per-class G-means — and the two disagree on binary inputs (the original report observed 0.745 vs 0.577). The fix computes the per-class (sen, spe) array first via average=None, derives the per-class G-mean array, and aggregates with np.mean (macro) or np.average weighted by class support (weighted). This makes geometric_mean_score(y_true, y_pred, average='macro') equal to np.mean(geometric_mean_score(y_true, y_pred, average=None)). binary / micro / None branches are unchanged. Existing tests with stale expected values (computed against the buggy code) are updated with comments explaining the math; a new parametrised non-regression test asserts the macro-equals-mean invariant on three input shapes. Co-Authored-By: Claude Code <noreply@anthropic.com>
| # For macro/weighted averaging the previous implementation passed | ||
| # ``average`` directly to sensitivity_specificity_support, which | ||
| # returns the macro/weighted average of sensitivity and | ||
| # specificity, and then took ``sqrt(sen * spe)`` of those scalars. | ||
| # That is the geometric mean of mean-sensitivity and | ||
| # mean-specificity, NOT the mean of per-class G-means — and the two | ||
| # disagree on binary inputs (see #1096). Compute the per-class | ||
| # G-mean first and aggregate, so that | ||
| # ``geometric_mean_score(..., average='macro')`` equals | ||
| # ``np.mean(geometric_mean_score(..., average=None))`` as users | ||
| # rightly expect. |
| # weighted: mean weighted by support (true samples per class). | ||
| # Mirror sklearn's behaviour and return 0 when the total | ||
| # support is zero rather than producing a NaN from 0/0. |
| # Non-regression test for #1096: the macro G-mean MUST equal the | ||
| # arithmetic mean of the per-class G-mean array. Before #1096 the | ||
| # macro path returned sqrt(macro_sen * macro_spe), which generally | ||
| # disagrees with mean(per_class_gmean) — most visibly on binary | ||
| # inputs, where the original report observed 0.745 (correct ~0.577) | ||
| # for y_true=[0,0,1,0,1,1], y_pred=[0,0,0,0,0,1]. |
There was a problem hiding this comment.
You don't need all this text. You only need to mention that it is a non-regression test and link the PR (because we did not have an issue)
| # macro / weighted updated in #1096: now mean(per-class gmean) | ||
| # rather than sqrt(macro_sen * macro_spe). Per-class G-means | ||
| # for this fixture are [0.82, 0.24, 0.72]; macro = mean = 0.59; | ||
| # weighted mean over class supports (which differ slightly due | ||
| # to the 50/50 train/test split) = 0.55. |
| # weighted G-mean with labels=[0, 1] and the given sample | ||
| # weights: per-class (sen, spe) restricted to [0, 1] are | ||
| # [(1.0, 0.5), (0.5, 0.0)] with supports [2, 4]; per-class | ||
| # G-means = [sqrt(0.5), 0] = [0.707, 0]; weighted average = | ||
| # (2*0.707 + 4*0)/6 = 0.236. Old expected 0.333 came from | ||
| # sqrt(weighted_sen * weighted_spe), the bug fixed in #1096. |
| # Macro / weighted G-mean is the (weighted) mean of the per-class | ||
| # G-mean array — which for this input is [0.866, 0, 0]. Their | ||
| # value is therefore 0.866/3 = 0.2887, NOT sqrt(macro_sen * | ||
| # macro_spe) = 0.471 as the metric used to return before #1096. |
|
I reviewed this against #1096 and the alternative PR #1164, and ran the suite locally — this is the correct and complete fix. A few notes for the maintainers: The bug is broader than the original report suggested. #1096 framed it as binary-only and said multiclass "works correctly", but the macro path is actually wrong for multiclass too — just by a smaller margin that's easy to miss. On the report's own multiclass fixture: y_true = [0, 1, 2, 0, 1, 2]; y_pred = [0, 2, 1, 0, 1, 2]
geometric_mean_score(y_true, y_pred, average=None) # [1. , 0.612, 0.612]
np.mean(...) # 0.7416 <- expected
geometric_mean_score(y_true, y_pred, average="macro") # 0.7454 <- master (wrong)This PR's approach — compute the per-class G-means and aggregate ( Things I checked and that look right:
One small suggestion: the Nice, thorough fix with a genuine invariant test. 👍 |
Reference Issue
Fixes #1096
What does this implement/fix? Explain your changes.
geometric_mean_score(y_true, y_pred, average='macro')did not returnthe (arithmetic) mean of the per-class G-means as users expect.
Concretely, on the binary example from #1096 it returned 0.745 even
though the per-class array was
[0.577, 0.577](mean = 0.577).The cause is in
imblearn/metrics/_classification.py. Foraverage in (None, 'binary', 'micro', 'macro', 'weighted', 'samples')the previous code did:
For
average='macro',sensitivity_specificity_supportreturns themacro-averaged sensitivity and specificity as scalars — taking
sqrt(sen * spe)of those gives the geometric mean of mean-sensitivityand mean-specificity, NOT the mean of per-class G-means. The two
disagree whenever the per-class (sen, spe) pairs are not all equal.
This is most visible on binary inputs where the macro path collapses to
a single (sen, spe) pair, but it also affects multi-class inputs.
The fix special-cases
macroandweighted: it computes the per-classsensitivity and specificity (
average=None), derives the per-classG-mean array, then aggregates with
np.mean(macro) ornp.average(weights=support)(weighted). The resulting macro G-meanexactly equals
np.mean(geometric_mean_score(..., average=None)). Thebinary,micro,samples,None, andmulticlassbranches areunchanged.
Reproduce BEFORE/AFTER yourself (copy-paste)
What I ran locally
pytest imblearn/metrics/→ 210 passed (208 existing + 2new parametrised case in
test_geometric_mean_macro_equals_mean_of_per_class).test_geometric_mean_macro_equals_mean_of_per_classfails onorigin/masterfor two of the three input shapes (the binary casefrom [BUG] geometric_mean_score with average='macro' #1096 plus a 3-class asymmetric case), confirming the bug
reaches beyond binary.
test_geometric_mean_average(macro0.471 → 0.2887,weighted0.471 → 0.2887),test_geometric_mean_sample_weight(
weighted0.333 → 0.236), andtest_geometric_mean_score_prediction(
macro0.67 → 0.59,weighted0.64 → 0.55), each with aninline comment explaining the math so a reviewer can re-derive
without trusting the diff.
geometric_mean_scoredocstring updated:macro0.471 → 0.288,weighted0.471 → 0.288.Edge cases tested
[0,0,1,0,1,1], [0,0,0,0,0,1]test_geometric_mean_macro_equals_mean_of_per_class[case-0][0,1,2,0,1,2], [0,2,1,0,1,2]test_geometric_mean_macro_equals_mean_of_per_class[case-2][0,1,0,1], [0,1,1,0]test_geometric_mean_macro_equals_mean_of_per_class[case-1]average='micro','binary',None,'multiclass'Risk / blast radius
Behavioural change for callers using
average='macro'or'weighted'. The previous numbers were not a documented invariant —they were buggy — but downstream code that asserted on the old values
will need to update. The doctest and existing parametrised tests in the
suite have been updated in this PR to reflect the correct values, with
inline comments explaining the math.
average='binary','micro',None, and'multiclass'areunchanged.
Any other comments?
doc/whats_new/v0.15.rstwith
:pr:\0`; I'll push a follow-up commit with the real PR number once GitHub assigns one (check-changelog.yml` requires itwhenever tests are modified).
[MRG]prefix on the title.PR drafted with assistance from Claude Code. The change was reviewed
manually against scikit-learn-contrib/imbalanced-learn's source and the
upstream spec/docs cited above. The reproducer block above was used
during development; it is the same one a reviewer can paste verbatim.