Skip to content

[MRG] FIX geometric_mean_score: macro/weighted average is mean of per-class G-means (#1096) #1173

Open
jbbqqf wants to merge 2 commits into scikit-learn-contrib:master from jbbqqf:feat/1096-fix-gmean-macro-binary

jbbqqf commented May 9, 2026

Reference Issue

Fixes #1096

What does this implement/fix? Explain your changes.

geometric_mean_score(y_true, y_pred, average='macro') did not return
the (arithmetic) mean of the per-class G-means as users expect.
Concretely, on the binary example from #1096 it returned 0.745 even
though the per-class array was [0.577, 0.577] (mean = 0.577).

The cause is in imblearn/metrics/_classification.py. For
average in (None, 'binary', 'micro', 'macro', 'weighted', 'samples')
the previous code did:

sen, spe, _ = sensitivity_specificity_support(..., average=average)
return np.sqrt(sen * spe)

For average='macro', sensitivity_specificity_support returns the
macro-averaged sensitivity and specificity as scalars — taking
sqrt(sen * spe) of those gives the geometric mean of mean-sensitivity
and mean-specificity, NOT the mean of per-class G-means. The two
disagree unless the per-class sensitivities are proportional to the
per-class specificities.
This is most visible on binary inputs where the macro path collapses to
a single (sen, spe) pair, but it also affects multi-class inputs.
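
For reference, the two quantities can be written side by side. The notation here is introduced only for illustration and is not taken from the PR: K is the number of classes, and s_k and p_k are the per-class sensitivity and specificity.

```latex
\[
G_{\text{old}}
  = \sqrt{\Bigl(\tfrac{1}{K}\sum_{k=1}^{K} s_k\Bigr)
          \Bigl(\tfrac{1}{K}\sum_{k=1}^{K} p_k\Bigr)},
\qquad
G_{\text{macro}}
  = \frac{1}{K}\sum_{k=1}^{K}\sqrt{s_k\,p_k}.
\]
% Cauchy--Schwarz applied to the vectors (\sqrt{s_k}) and (\sqrt{p_k}):
\[
\sum_{k=1}^{K}\sqrt{s_k}\,\sqrt{p_k}
  \le \sqrt{\sum_{k=1}^{K} s_k}\;\sqrt{\sum_{k=1}^{K} p_k}
\quad\Longrightarrow\quad
G_{\text{macro}} \le G_{\text{old}},
\]
% with equality when s_k = \lambda p_k for every k
% (or when one of the two vectors is identically zero).
```

Under this reading the old value can only overestimate the mean of the per-class G-means, which is consistent with the gaps reported in this thread (0.745 versus 0.577, and 0.7454 versus 0.7416).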

The fix special-cases macro and weighted: it computes the per-class
sensitivity and specificity (average=None), derives the per-class
G-mean array, then aggregates with np.mean (macro) or
np.average(weights=support) (weighted). The resulting macro G-mean
exactly equals np.mean(geometric_mean_score(..., average=None)). The
binary, micro, samples, None, and multiclass branches are
unchanged.
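
The aggregation step can be sketched in plain Python. This is not the code from the diff: the function names are hypothetical, and the inputs are per-class sensitivity, specificity, and support values that would come from sensitivity_specificity_support(..., average=None). The numbers are the one-vs-rest values for the 3-class fixture y_true=[0,1,2,0,1,2], y_pred=[0,2,1,0,1,2].

```python
import math


def gmean_old_macro(sen, spe):
    """Pre-fix behaviour as described above: sqrt(mean(sen) * mean(spe))."""
    return math.sqrt((sum(sen) / len(sen)) * (sum(spe) / len(spe)))


def gmean_per_class(sen, spe):
    """Per-class G-mean array: sqrt(sen_k * spe_k) for each class k."""
    return [math.sqrt(s * p) for s, p in zip(sen, spe)]


def gmean_macro(sen, spe):
    """Macro average: unweighted mean of the per-class G-means."""
    g = gmean_per_class(sen, spe)
    return sum(g) / len(g)


def gmean_weighted(sen, spe, support):
    """Weighted average: per-class G-means weighted by true-class support."""
    total = sum(support)
    if total == 0:
        # Assumed convention: return 0.0 rather than dividing by zero.
        return 0.0
    g = gmean_per_class(sen, spe)
    return sum(gk * w for gk, w in zip(g, support)) / total


# One-vs-rest values for the 3-class fixture (hand-derived for this sketch).
sen = [1.0, 0.5, 0.5]
spe = [1.0, 0.75, 0.75]
support = [2, 2, 2]

print(round(gmean_old_macro(sen, spe), 4))           # 0.7454
print(round(gmean_macro(sen, spe), 4))               # 0.7416
print(round(gmean_weighted(sen, spe, support), 4))   # 0.7416
```

With equal supports the weighted and macro values coincide; they differ only when the class supports differ.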

Reproduce BEFORE/AFTER yourself (copy-paste)
# --- one-time setup ---
git clone https://github.com/scikit-learn-contrib/imbalanced-learn.git /tmp/repro-1096 && cd /tmp/repro-1096
python -m venv .venv && source .venv/bin/activate
pip install -q -e '.[tests]'

# --- BEFORE (origin/master) ---
git checkout origin/master
python -c "
import numpy as np
from imblearn.metrics import geometric_mean_score
y_true = [0, 0, 1, 0, 1, 1]; y_pred = [0, 0, 0, 0, 0, 1]
per_class = geometric_mean_score(y_true, y_pred, average=None)
macro = geometric_mean_score(y_true, y_pred, average='macro')
print(f'per-class: {per_class}, macro: {macro:.4f}, mean: {np.mean(per_class):.4f}')
"
# Expected: per-class: [0.5774 0.5774], macro: 0.7454, mean: 0.5774
#   (macro disagrees with mean of per-class)

# --- AFTER (this PR) ---
git fetch https://github.com/jbbqqf/imbalanced-learn.git feat/1096-fix-gmean-macro-binary
git checkout FETCH_HEAD
python -c "
import numpy as np
from imblearn.metrics import geometric_mean_score
y_true = [0, 0, 1, 0, 1, 1]; y_pred = [0, 0, 0, 0, 0, 1]
per_class = geometric_mean_score(y_true, y_pred, average=None)
macro = geometric_mean_score(y_true, y_pred, average='macro')
print(f'per-class: {per_class}, macro: {macro:.4f}, mean: {np.mean(per_class):.4f}')
"
# Expected: per-class: [0.5774 0.5774], macro: 0.5774, mean: 0.5774
#   (macro now equals mean of per-class)

What I ran locally

  • pytest imblearn/metrics/ → 210 passed (208 existing + 2
    new parametrised cases in
    test_geometric_mean_macro_equals_mean_of_per_class).
  • The new parametrised non-regression test
    test_geometric_mean_macro_equals_mean_of_per_class fails on
    origin/master for two of the three input shapes (the binary case
    from #1096 plus a 3-class asymmetric case), confirming the bug
    reaches beyond binary.
  • Updated stale expected values in
    test_geometric_mean_average (macro 0.471 → 0.2887,
    weighted 0.471 → 0.2887), test_geometric_mean_sample_weight
    (weighted 0.333 → 0.236), and test_geometric_mean_score_prediction
    (macro 0.67 → 0.59, weighted 0.64 → 0.55), each with an
    inline comment explaining the math so a reviewer can re-derive
    without trusting the diff.
  • Doctest in the geometric_mean_score docstring updated:
    macro 0.471 → 0.288, weighted 0.471 → 0.288.
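
Two of the updated expected values can be re-derived by hand from the per-class G-means quoted in the inline test comments: [0.866, 0, 0] for test_geometric_mean_average, and [sqrt(0.5), 0] with supports [2, 4] for test_geometric_mean_sample_weight. A quick arithmetic check that uses only those quoted numbers (it does not call imblearn):

```python
import math

# test_geometric_mean_average: per-class G-means quoted as [0.866, 0, 0].
per_class = [0.866, 0.0, 0.0]
macro = sum(per_class) / len(per_class)

# test_geometric_mean_sample_weight: per-class G-means [sqrt(0.5), 0]
# with class supports [2, 4], as quoted in the inline comment.
gmeans = [math.sqrt(0.5), 0.0]
supports = [2, 4]
weighted = sum(g * w for g, w in zip(gmeans, supports)) / sum(supports)

print(f"{macro:.4f}")     # 0.2887
print(f"{weighted:.3f}")  # 0.236
```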

Edge cases tested

| # | Scenario | Input | Expected | Verified by |
|---|----------|-------|----------|-------------|
| 1 | Binary, asymmetric per-class (#1096 case) | [0,0,1,0,1,1], [0,0,0,0,0,1] | macro = mean = 0.577 | test_geometric_mean_macro_equals_mean_of_per_class[case-0] |
| 2 | 3-class, asymmetric errors | [0,1,2,0,1,2], [0,2,1,0,1,2] | macro = mean(per-class) = 0.742 | test_geometric_mean_macro_equals_mean_of_per_class[case-2] |
| 3 | Binary, fully-balanced | [0,1,0,1], [0,1,1,0] | macro = mean = 0.5 (sanity) | test_geometric_mean_macro_equals_mean_of_per_class[case-1] |
| 4 | average='micro', 'binary', None, 'multiclass' | various | unchanged from previous behaviour | existing tests still green |

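
The expected macro values for the first three scenarios can be cross-checked without imblearn. The following is a minimal pure-Python reference, not the test added in this PR: per_class_gmean is a hypothetical helper, and treating an empty denominator as 0.0 is an assumption made for the sketch.

```python
import math


def per_class_gmean(y_true, y_pred):
    """One-vs-rest G-mean, sqrt(sensitivity * specificity), for each label."""
    pairs = list(zip(y_true, y_pred))
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        tn = sum(t != c and p != c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        # Assumed convention: an undefined ratio counts as 0.0.
        sen = tp / (tp + fn) if tp + fn else 0.0
        spe = tn / (tn + fp) if tn + fp else 0.0
        scores.append(math.sqrt(sen * spe))
    return scores


# (y_true, y_pred, expected macro) taken from rows 1 to 3 of the table.
CASES = [
    ([0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, 1], 0.577),
    ([0, 1, 2, 0, 1, 2], [0, 2, 1, 0, 1, 2], 0.742),
    ([0, 1, 0, 1], [0, 1, 1, 0], 0.5),
]

for y_true, y_pred, expected in CASES:
    g = per_class_gmean(y_true, y_pred)
    macro = sum(g) / len(g)
    # The macro value is by construction the mean of the per-class array.
    assert abs(macro - expected) < 1e-3, (macro, expected)
    print([round(x, 3) for x in g], round(macro, 3))
```
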
Risk / blast radius

Behavioural change for callers using average='macro' or
'weighted'. The previous numbers were not a documented invariant —
they were buggy — but downstream code that asserted on the old values
will need to update. The doctest and existing parametrised tests in the
suite have been updated in this PR to reflect the correct values, with
inline comments explaining the math.

average='binary', 'micro', None, and 'multiclass' are
unchanged.

Any other comments?

  • Changelog entry added under "Bug fixes" in doc/whats_new/v0.15.rst
    with :pr:`0`; I'll push a follow-up commit with the real PR number
    once GitHub assigns one (check-changelog.yml requires it
    whenever tests are modified).
  • [MRG] prefix on the title.
FIX :func:`imblearn.metrics.geometric_mean_score` with
``average='macro'`` (and ``'weighted'``) now returns the (weighted)
mean of the per-class G-mean array, consistent with
``np.mean(geometric_mean_score(..., average=None))``. The previous
behaviour returned ``sqrt(macro_sen * macro_spe)``.

PR drafted with assistance from Claude Code. The change was reviewed
manually against scikit-learn-contrib/imbalanced-learn's source and the
upstream spec/docs cited above. The reproducer block above was used
during development; it is the same one a reviewer can paste verbatim.

jbbqqf and others added 2 commits May 9, 2026 20:12
… G-means (scikit-learn-contrib#1096)

The macro and weighted branches previously delegated averaging to
sensitivity_specificity_support and then computed sqrt(sen * spe) on
the macro/weighted-averaged scalars. That is the G-mean of
mean-sensitivity and mean-specificity, NOT the (weighted) mean of
per-class G-means — and the two disagree on binary inputs (the
original report observed 0.745 vs 0.577).

The fix computes the per-class (sen, spe) array first via
average=None, derives the per-class G-mean array, and aggregates with
np.mean (macro) or np.average weighted by class support (weighted).
This makes geometric_mean_score(y_true, y_pred, average='macro')
equal to np.mean(geometric_mean_score(y_true, y_pred, average=None)).

binary / micro / None branches are unchanged. Existing tests with
stale expected values (computed against the buggy code) are updated
with comments explaining the math; a new parametrised non-regression
test asserts the macro-equals-mean invariant on three input shapes.

Co-Authored-By: Claude Code <noreply@anthropic.com>
Comment on lines +674 to +684
# For macro/weighted averaging the previous implementation passed
# ``average`` directly to sensitivity_specificity_support, which
# returns the macro/weighted average of sensitivity and
# specificity, and then took ``sqrt(sen * spe)`` of those scalars.
# That is the geometric mean of mean-sensitivity and
# mean-specificity, NOT the mean of per-class G-means — and the two
# disagree on binary inputs (see #1096). Compute the per-class
# G-mean first and aggregate, so that
# ``geometric_mean_score(..., average='macro')`` equals
# ``np.mean(geometric_mean_score(..., average=None))`` as users
# rightly expect.

Member

Can you remove this comment?

Comment on lines +698 to +700
# weighted: mean weighted by support (true samples per class).
# Mirror sklearn's behaviour and return 0 when the total
# support is zero rather than producing a NaN from 0/0.

Member

Also no need for this.

Comment on lines +319 to +324
# Non-regression test for #1096: the macro G-mean MUST equal the
# arithmetic mean of the per-class G-mean array. Before #1096 the
# macro path returned sqrt(macro_sen * macro_spe), which generally
# disagrees with mean(per_class_gmean) — most visibly on binary
# inputs, where the original report observed 0.745 (correct ~0.577)
# for y_true=[0,0,1,0,1,1], y_pred=[0,0,0,0,0,1].

Member

You don't need all this text. You only need to mention that it is a non-regression test and link the PR (because we did not have an issue)

Comment on lines +291 to +295
# macro / weighted updated in #1096: now mean(per-class gmean)
# rather than sqrt(macro_sen * macro_spe). Per-class G-means
# for this fixture are [0.82, 0.24, 0.72]; macro = mean = 0.59;
# weighted mean over class supports (which differ slightly due
# to the 50/50 train/test split) = 0.55.

Member

remove this comment

Comment on lines +258 to +263
# weighted G-mean with labels=[0, 1] and the given sample
# weights: per-class (sen, spe) restricted to [0, 1] are
# [(1.0, 0.5), (0.5, 0.0)] with supports [2, 4]; per-class
# G-means = [sqrt(0.5), 0] = [0.707, 0]; weighted average =
# (2*0.707 + 4*0)/6 = 0.236. Old expected 0.333 came from
# sqrt(weighted_sen * weighted_spe), the bug fixed in #1096.

Member

Remove this comment

Comment on lines +232 to +235
# Macro / weighted G-mean is the (weighted) mean of the per-class
# G-mean array — which for this input is [0.866, 0, 0]. Their
# value is therefore 0.866/3 = 0.2887, NOT sqrt(macro_sen *
# macro_spe) = 0.471 as the metric used to return before #1096.

Member

Remove this comment.

immu4989 commented Jun 7, 2026

Contributor

I reviewed this against #1096 and the alternative PR #1164, and ran the suite locally — this is the correct and complete fix. A few notes for the maintainers:

The bug is broader than the original report suggested. #1096 framed it as binary-only and said multiclass "works correctly", but the macro path is actually wrong for multiclass too — just by a smaller margin that's easy to miss. On the report's own multiclass fixture:

import numpy as np
from imblearn.metrics import geometric_mean_score

y_true = [0, 1, 2, 0, 1, 2]; y_pred = [0, 2, 1, 0, 1, 2]
per_class = geometric_mean_score(y_true, y_pred, average=None)  # [1. , 0.612, 0.612]
np.mean(per_class)                                              # 0.7416   <- expected
geometric_mean_score(y_true, y_pred, average="macro")           # 0.7454   <- master (wrong)

This PR's approach (compute the per-class G-means and aggregate: np.mean for macro, support-weighted for weighted) fixes both binary and multiclass in one path, because it removes the root cause (sqrt(macro_sen * macro_spe) ≠ mean(sqrt(sen * spe))) rather than special-casing it. That's the right call; PR #1164 only branches on is_binary and leaves the multiclass macro value wrong.

Things I checked and that look right:

  • average="micro" is untouched and still routes through the old path — correct, since the issue is specific to macro/weighted.
  • The weighted branch guards sup.sum() == 0 -> 0.0, avoiding a 0/0 NaN; matches sklearn's convention.
  • The updated docstring/expected values and the test_geometric_mean_macro_equals_mean_of_per_class invariant test are the right property to lock in.
  • Locally, pytest imblearn/metrics/tests/test_classification.py -k geometric → 23 passed.

One small suggestion: the weighted average is weighted by sup (true samples per class). Worth a one-line code comment stating that explicitly, since "weighted" could otherwise be read as weighting by predicted frequency — but this matches the per-class semantics, so it's correct as written.

Nice, thorough fix with a genuine invariant test. 👍
