[MRG] FIX geometric_mean_score: macro/weighted average is mean of per-class G-means (#1096) by jbbqqf · Pull Request #1173 · scikit-learn-contrib/imbalanced-learn

jbbqqf · 2026-05-09T18:20:22Z

Reference Issue

What does this implement/fix? Explain your changes.

geometric_mean_score(y_true, y_pred, average='macro') did not return
the (arithmetic) mean of the per-class G-means as users expect.
Concretely, on the binary example from #1096 it returned 0.745 even
though the per-class array was [0.577, 0.577] (mean = 0.577).

The cause is in imblearn/metrics/_classification.py. For
average in (None, 'binary', 'micro', 'macro', 'weighted', 'samples')
the previous code did:

sen, spe, _ = sensitivity_specificity_support(..., average=average)
return np.sqrt(sen * spe)

For average='macro', sensitivity_specificity_support returns the
macro-averaged sensitivity and specificity as scalars. Taking
sqrt(sen * spe) of those gives the geometric mean of mean-sensitivity
and mean-specificity, NOT the mean of per-class G-means. By the
Cauchy-Schwarz inequality the former is never smaller than the latter,
and the two agree only when the per-class sensitivities are
proportional to the per-class specificities (for example, when every
class has the same (sen, spe) pair). This is most visible on binary
inputs where the macro path collapses to a single (sen, spe) pair, but
it also affects multi-class inputs.
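To make the discrepancy concrete, here is a minimal pure-Python sketch that computes both quantities by hand on the 3-class fixture from the edge-case table in this description. It does not call imbalanced-learn; the helper name `per_class_sen_spe` is hypothetical and only illustrates the one-vs-rest definitions of sensitivity and specificity.

```python
from math import sqrt


def per_class_sen_spe(y_true, y_pred):
    """Return one-vs-rest (sensitivity, specificity) pairs in sorted label order.

    Hypothetical helper for illustration; not part of imbalanced-learn.
    """
    labels = sorted(set(y_true) | set(y_pred))
    pairs = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        tn = sum(t != c and p != c for t, p in zip(y_true, y_pred))
        sen = tp / (tp + fn) if tp + fn else 0.0
        spe = tn / (tn + fp) if tn + fp else 0.0
        pairs.append((sen, spe))
    return pairs


y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 1, 2]
pairs = per_class_sen_spe(y_true, y_pred)
n = len(pairs)

# Old macro path: average sensitivity and specificity first, then sqrt.
macro_sen = sum(s for s, _ in pairs) / n
macro_spe = sum(p for _, p in pairs) / n
gmean_of_means = sqrt(macro_sen * macro_spe)

# Fixed macro path: G-mean per class first, then the arithmetic mean.
per_class_gmean = [sqrt(s * p) for s, p in pairs]
mean_of_gmeans = sum(per_class_gmean) / n

print(f"{gmean_of_means:.4f} {mean_of_gmeans:.4f}")  # 0.7454 0.7416
```

The two numbers differ only in the third decimal place here, which is why the multi-class case is easy to overlook.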

The fix special-cases macro and weighted: it computes the per-class
sensitivity and specificity (average=None), derives the per-class
G-mean array, then aggregates with np.mean (macro) or
np.average(weights=support) (weighted). The resulting macro G-mean
exactly equals np.mean(geometric_mean_score(..., average=None)). The
binary, micro, samples, None, and multiclass branches are
unchanged.
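The aggregation step described above can be sketched as follows. This is an illustrative stand-in under stated assumptions, not the code in the diff: the function name `aggregate_gmean` and its signature are hypothetical, and it takes the per-class G-mean array and class supports as plain lists rather than computing them from labels. The zero-support guard follows the behaviour mentioned in the review thread (return 0 instead of a 0/0 NaN).

```python
from math import sqrt


def aggregate_gmean(per_class_gmean, support, average):
    """Aggregate per-class G-means (hypothetical helper, not the PR's code).

    average='macro'    -> unweighted arithmetic mean
    average='weighted' -> mean weighted by support (true samples per class)
    """
    if average == "macro":
        return sum(per_class_gmean) / len(per_class_gmean)
    if average == "weighted":
        total = sum(support)
        if total == 0:
            # Avoid a 0/0 division when no class has any true samples.
            return 0.0
        return sum(g * w for g, w in zip(per_class_gmean, support)) / total
    raise ValueError(f"unsupported average: {average!r}")


# Per-class values quoted in the updated test comments of this PR.
macro = aggregate_gmean([0.866, 0.0, 0.0], [1, 1, 1], "macro")
weighted = aggregate_gmean([sqrt(0.5), 0.0], [2, 4], "weighted")
print(f"{macro:.4f} {weighted:.3f}")  # 0.2887 0.236
```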

Reproduce BEFORE/AFTER yourself (copy-paste)

# --- one-time setup ---
git clone https://github.com/scikit-learn-contrib/imbalanced-learn.git /tmp/repro-1096 && cd /tmp/repro-1096
python -m venv .venv && source .venv/bin/activate
pip install -q -e '.[tests]'

# --- BEFORE (origin/master) ---
git checkout origin/master
python -c "
import numpy as np
from imblearn.metrics import geometric_mean_score
y_true = [0, 0, 1, 0, 1, 1]; y_pred = [0, 0, 0, 0, 0, 1]
per_class = geometric_mean_score(y_true, y_pred, average=None)
macro = geometric_mean_score(y_true, y_pred, average='macro')
print(f'per-class: {per_class}, macro: {macro:.4f}, mean: {np.mean(per_class):.4f}')
"
# Expected: per-class: [0.5774 0.5774], macro: 0.7454, mean: 0.5774
#   (macro disagrees with mean of per-class)

# --- AFTER (this PR) ---
git fetch https://github.com/jbbqqf/imbalanced-learn.git feat/1096-fix-gmean-macro-binary
git checkout FETCH_HEAD
python -c "
import numpy as np
from imblearn.metrics import geometric_mean_score
y_true = [0, 0, 1, 0, 1, 1]; y_pred = [0, 0, 0, 0, 0, 1]
per_class = geometric_mean_score(y_true, y_pred, average=None)
macro = geometric_mean_score(y_true, y_pred, average='macro')
print(f'per-class: {per_class}, macro: {macro:.4f}, mean: {np.mean(per_class):.4f}')
"
# Expected: per-class: [0.5774 0.5774], macro: 0.5774, mean: 0.5774
#   (macro now equals mean of per-class)

What I ran locally

- pytest imblearn/metrics/ → 210 passed (208 existing + 2
  new parametrised cases in
  test_geometric_mean_macro_equals_mean_of_per_class).
- The new parametrised non-regression test
  test_geometric_mean_macro_equals_mean_of_per_class fails on
  origin/master for two of the three input shapes (the binary case
  from #1096 plus a 3-class asymmetric case), confirming the bug
  reaches beyond binary.
- Updated stale expected values in
  test_geometric_mean_average (macro 0.471 → 0.2887,
  weighted 0.471 → 0.2887), test_geometric_mean_sample_weight
  (weighted 0.333 → 0.236), and test_geometric_mean_score_prediction
  (macro 0.67 → 0.59, weighted 0.64 → 0.55), each with an
  inline comment explaining the math so a reviewer can re-derive
  without trusting the diff.
- Doctest in the geometric_mean_score docstring updated:
  macro 0.471 → 0.288, weighted 0.471 → 0.288.

Edge cases tested

| # | Scenario | Input (`y_true`, `y_pred`) | Expected | Verified by |
|---|---|---|---|---|
| 1 | Binary, asymmetric per-class (#1096 case) | `[0,0,1,0,1,1], [0,0,0,0,0,1]` | macro = mean = 0.577 | `test_geometric_mean_macro_equals_mean_of_per_class[case-0]` |
| 2 | 3-class, asymmetric errors | `[0,1,2,0,1,2], [0,2,1,0,1,2]` | macro = mean(per-class) = 0.742 | `test_geometric_mean_macro_equals_mean_of_per_class[case-2]` |
| 3 | Binary, fully-balanced | `[0,1,0,1], [0,1,1,0]` | macro = mean = 0.5 (sanity) | `test_geometric_mean_macro_equals_mean_of_per_class[case-1]` |
| 4 | `average='micro'`, `'binary'`, `None`, `'multiclass'` | various | unchanged from previous behaviour | existing tests still green |
Risk / blast radius

Behavioural change for callers using average='macro' or
'weighted'. The previous numbers were not a documented invariant —
they were buggy — but downstream code that asserted on the old values
will need to update. The doctest and existing parametrised tests in the
suite have been updated in this PR to reflect the correct values, with
inline comments explaining the math.

average='binary', 'micro', None, and 'multiclass' are
unchanged.

Any other comments?

Changelog entry added under "Bug fixes" in doc/whats_new/v0.15.rst
with a placeholder `:pr:` reference; I'll push a follow-up commit with the real PR number once GitHub assigns one (`check-changelog.yml` requires it
whenever tests are modified).
[MRG] prefix on the title.

FIX :func:`imblearn.metrics.geometric_mean_score` with
``average='macro'`` (and ``'weighted'``) now returns the (weighted)
mean of the per-class G-mean array, consistent with
``np.mean(geometric_mean_score(..., average=None))``. The previous
behaviour returned ``sqrt(macro_sen * macro_spe)``.

PR drafted with assistance from Claude Code. The change was reviewed
manually against scikit-learn-contrib/imbalanced-learn's source and the
upstream spec/docs cited above. The reproducer block above was used
during development; it is the same one a reviewer can paste verbatim.

… G-means (scikit-learn-contrib#1096)

The macro and weighted branches previously delegated averaging to sensitivity_specificity_support and then computed sqrt(sen * spe) on the macro/weighted-averaged scalars. That is the G-mean of mean-sensitivity and mean-specificity, NOT the (weighted) mean of per-class G-means, and the two disagree on binary inputs (the original report observed 0.745 vs 0.577).

The fix computes the per-class (sen, spe) array first via average=None, derives the per-class G-mean array, and aggregates with np.mean (macro) or np.average weighted by class support (weighted). This makes geometric_mean_score(y_true, y_pred, average='macro') equal to np.mean(geometric_mean_score(y_true, y_pred, average=None)).

binary / micro / None branches are unchanged. Existing tests with stale expected values (computed against the buggy code) are updated with comments explaining the math; a new parametrised non-regression test asserts the macro-equals-mean invariant on three input shapes.

Co-Authored-By: Claude Code <noreply@anthropic.com>

glemaitre · 2026-06-07T19:14:21Z

+        # For macro/weighted averaging the previous implementation passed
+        # ``average`` directly to sensitivity_specificity_support, which
+        # returns the macro/weighted average of sensitivity and
+        # specificity, and then took ``sqrt(sen * spe)`` of those scalars.
+        # That is the geometric mean of mean-sensitivity and
+        # mean-specificity, NOT the mean of per-class G-means — and the two
+        # disagree on binary inputs (see #1096). Compute the per-class
+        # G-mean first and aggregate, so that
+        # ``geometric_mean_score(..., average='macro')`` equals
+        # ``np.mean(geometric_mean_score(..., average=None))`` as users
+        # rightly expect.


Can you remove this comment?

glemaitre · 2026-06-07T19:14:55Z

+            # weighted: mean weighted by support (true samples per class).
+            # Mirror sklearn's behaviour and return 0 when the total
+            # support is zero rather than producing a NaN from 0/0.


Also no need for this.

glemaitre · 2026-06-07T19:16:08Z

+    # Non-regression test for #1096: the macro G-mean MUST equal the
+    # arithmetic mean of the per-class G-mean array. Before #1096 the
+    # macro path returned sqrt(macro_sen * macro_spe), which generally
+    # disagrees with mean(per_class_gmean) — most visibly on binary
+    # inputs, where the original report observed 0.745 (correct ~0.577)
+    # for y_true=[0,0,1,0,1,1], y_pred=[0,0,0,0,0,1].


You don't need all this text. You only need to mention that it is a non-regression test and link the PR (because we did not have an issue).

glemaitre · 2026-06-07T19:16:20Z

+        # macro / weighted updated in #1096: now mean(per-class gmean)
+        # rather than sqrt(macro_sen * macro_spe). Per-class G-means
+        # for this fixture are [0.82, 0.24, 0.72]; macro = mean = 0.59;
+        # weighted mean over class supports (which differ slightly due
+        # to the 50/50 train/test split) = 0.55.


Remove this comment.

glemaitre · 2026-06-07T19:16:39Z

+        # weighted G-mean with labels=[0, 1] and the given sample
+        # weights: per-class (sen, spe) restricted to [0, 1] are
+        # [(1.0, 0.5), (0.5, 0.0)] with supports [2, 4]; per-class
+        # G-means = [sqrt(0.5), 0] = [0.707, 0]; weighted average =
+        # (2*0.707 + 4*0)/6 = 0.236. Old expected 0.333 came from
+        # sqrt(weighted_sen * weighted_spe), the bug fixed in #1096.


Remove this comment.

glemaitre · 2026-06-07T19:16:48Z

+        # Macro / weighted G-mean is the (weighted) mean of the per-class
+        # G-mean array — which for this input is [0.866, 0, 0]. Their
+        # value is therefore 0.866/3 = 0.2887, NOT sqrt(macro_sen *
+        # macro_spe) = 0.471 as the metric used to return before #1096.


Remove this comment.

immu4989 · 2026-06-07T21:35:54Z

I reviewed this against #1096 and the alternative PR #1164, and ran the suite locally — this is the correct and complete fix. A few notes for the maintainers:

The bug is broader than the original report suggested. #1096 framed it as binary-only and said multiclass "works correctly", but the macro path is actually wrong for multiclass too — just by a smaller margin that's easy to miss. On the report's own multiclass fixture:

y_true = [0, 1, 2, 0, 1, 2]; y_pred = [0, 2, 1, 0, 1, 2]
geometric_mean_score(y_true, y_pred, average=None)    # [1. , 0.612, 0.612]
np.mean(...)                                           # 0.7416   <- expected
geometric_mean_score(y_true, y_pred, average="macro")  # 0.7454   <- master (wrong)

This PR's approach — compute the per-class G-means and aggregate (np.mean for macro, support-weighted for weighted) — fixes both binary and multiclass in one path, because it removes the root cause (sqrt(macro_sen * macro_spe) ≠ mean(sqrt(sen*spe))) rather than special-casing it. That's the right call; PR #1164 only branches on is_binary and leaves the multiclass macro value wrong.
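The inequality behind that root cause can be written out explicitly. In the block below, $s_k$ and $p_k$ denote the sensitivity and specificity of class $k$ out of $K$ classes, and $\bar{s}$, $\bar{p}$ their macro averages; this notation is introduced here for illustration and does not appear in the PR. The bound is the Cauchy-Schwarz inequality applied to the vectors $(\sqrt{s_k})$ and $(\sqrt{p_k})$.

```latex
\[
\frac{1}{K}\sum_{k=1}^{K}\sqrt{s_k\,p_k}
\;\le\;
\frac{1}{K}\sqrt{\Big(\sum_{k=1}^{K} s_k\Big)\Big(\sum_{k=1}^{K} p_k\Big)}
\;=\;
\sqrt{\bar{s}\,\bar{p}}
\]
```

Equality holds only when the $s_k$ are proportional to the $p_k$, so the old macro value can overstate the mean of per-class G-means but never understate it.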

Things I checked and that look right:

average="micro" is untouched and still routes through the old path — correct, since the issue is specific to macro/weighted.
The weighted branch guards sup.sum() == 0 -> 0.0, avoiding a 0/0 NaN; matches sklearn's convention.
The updated docstring/expected values and the test_geometric_mean_macro_equals_mean_of_per_class invariant test are the right property to lock in.
Locally, pytest imblearn/metrics/tests/test_classification.py -k geometric → 23 passed.

One small suggestion: the weighted average is weighted by sup (true samples per class). Worth a one-line code comment stating that explicitly, since "weighted" could otherwise be read as weighting by predicted frequency — but this matches the per-class semantics, so it's correct as written.

Nice, thorough fix with a genuine invariant test. 👍

jbbqqf and others added 2 commits May 9, 2026 20:12

DOC update changelog with assigned PR number

1ff9523

glemaitre reviewed Jun 7, 2026

immu4989 mentioned this pull request Jun 7, 2026

Fix macro average calculation in geometric_mean_score for binary classification #1164

Open

jbbqqf wants to merge 2 commits into scikit-learn-contrib:master from jbbqqf:feat/1096-fix-gmean-macro-binary
