
Commit cf4758f

sebasv, mroeschke, and JMBurley authored
opt out of bottleneck for nanmean (#47716)
* opt out of bottleneck for nanmean
* remove trailing whitespace
* make error bound explicit
* unittest only _bn_ok_dtype
* link issue to test function
* Update doc/source/whatsnew/v1.5.0.rst

  clarify that there might be a performance decrease experienced from disabling `mean` for bottleneck

  Co-authored-by: Matthew Roeschke <[email protected]>
* extend unit tests with (u)int dtypes
* Update pandas/core/nanops.py

  Co-authored-by: JMBurley <[email protected]>

Co-authored-by: Matthew Roeschke <[email protected]>
Co-authored-by: JMBurley <[email protected]>
1 parent 250d971 commit cf4758f

3 files changed: +28 -2 lines changed

doc/source/whatsnew/v1.5.0.rst

+1 -1
@@ -846,7 +846,7 @@ Numeric
 - Bug in operations with array-likes with ``dtype="boolean"`` and :attr:`NA` incorrectly altering the array in-place (:issue:`45421`)
 - Bug in division, ``pow`` and ``mod`` operations on array-likes with ``dtype="boolean"`` not being like their ``np.bool_`` counterparts (:issue:`46063`)
 - Bug in multiplying a :class:`Series` with ``IntegerDtype`` or ``FloatingDtype`` by an array-like with ``timedelta64[ns]`` dtype incorrectly raising (:issue:`45622`)
--
+- Bug in :meth:`mean` where the optional dependency ``bottleneck`` causes precision loss linear in the length of the array. ``bottleneck`` has been disabled for :meth:`mean` improving the loss to log-linear but may result in a performance decrease. (:issue:`42878`)
 
 Conversion
 ^^^^^^^^^^
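
The whatsnew entry above contrasts error that grows linearly with array length against the slower growth of pairwise summation. A minimal sketch of that difference follows, assuming float32 input of 65536 identical values. The helpers `naive_sum` and `pairwise_sum` and the block size of 8 are hypothetical, chosen for illustration; they are not the bottleneck or NumPy implementations.

```python
import numpy as np


def naive_sum(values):
    """Left-to-right accumulation in the input's own precision."""
    total = values.dtype.type(0)
    for v in values:
        total = total + v
    return total


def pairwise_sum(values, block=8):
    """Recursive halving; short blocks fall back to the naive loop."""
    n = len(values)
    if n <= block:
        return naive_sum(values)
    mid = n // 2
    return pairwise_sum(values[:mid], block) + pairwise_sum(values[mid:], block)


n = 1 << 16
data = np.full(n, 0.1, dtype=np.float32)

# Reference value computed in float64 from the stored float32 element.
exact = float(data[0]) * n

naive_err = abs(float(naive_sum(data)) - exact)
pairwise_err = abs(float(pairwise_sum(data)) - exact)
print("naive error:", naive_err)
print("pairwise error:", pairwise_err)
```

On this input the naive error should come out noticeably larger than the pairwise error, because every addition into the long running total rounds again, while the pairwise scheme only combines partial sums of similar size.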

pandas/core/nanops.py

+5 -1
@@ -162,6 +162,10 @@ def f(
 def _bn_ok_dtype(dtype: DtypeObj, name: str) -> bool:
     # Bottleneck chokes on datetime64, PeriodDtype (or and EA)
     if not is_object_dtype(dtype) and not needs_i8_conversion(dtype):
+        # GH 42878
+        # Bottleneck uses naive summation leading to O(n) loss of precision
+        # unlike numpy which implements pairwise summation, which has O(log(n)) loss
+        # crossref: https://github.com/pydata/bottleneck/issues/379
 
         # GH 15507
         # bottleneck does not properly upcast during the sum
@@ -171,7 +175,7 @@ def _bn_ok_dtype(dtype: DtypeObj, name: str) -> bool:
         # further we also want to preserve NaN when all elements
         # are NaN, unlike bottleneck/numpy which consider this
         # to be 0
-        return name not in ["nansum", "nanprod"]
+        return name not in ["nansum", "nanprod", "nanmean"]
     return False
 
 

pandas/tests/test_nanops.py

+22
@@ -1120,3 +1120,25 @@ def test_check_below_min_count__large_shape(min_count, expected_result):
     shape = (2244367, 1253)
     result = nanops.check_below_min_count(shape, mask=None, min_count=min_count)
     assert result == expected_result
+
+
+@pytest.mark.parametrize("func", ["nanmean", "nansum"])
+@pytest.mark.parametrize(
+    "dtype",
+    [
+        np.uint8,
+        np.uint16,
+        np.uint32,
+        np.uint64,
+        np.int8,
+        np.int16,
+        np.int32,
+        np.int64,
+        np.float16,
+        np.float32,
+        np.float64,
+    ],
+)
+def test_check_bottleneck_disallow(dtype, func):
+    # GH 42878 bottleneck sometimes produces unreliable results for mean and sum
+    assert not nanops._bn_ok_dtype(dtype, func)
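
The test above exercises only the gate that decides whether a reduction may be handed to bottleneck. A standalone sketch of that gate, using plain NumPy dtypes, might look like the following. The name `bn_ok_dtype_sketch` and the dtype-kind check are assumptions made for illustration: the kind check only approximates `is_object_dtype` and `needs_i8_conversion` and does not cover pandas extension dtypes.

```python
import numpy as np

# Reductions the commit routes away from bottleneck (taken from the diff).
BLOCKED_REDUCTIONS = ("nansum", "nanprod", "nanmean")


def bn_ok_dtype_sketch(dtype, name):
    """Hypothetical stand-in for nanops._bn_ok_dtype; not the pandas code."""
    kind = np.dtype(dtype).kind
    # Assumption: object ("O"), datetime64 ("M") and timedelta64 ("m") kinds
    # approximate the dtypes that the real checks reject.
    if kind in ("O", "M", "m"):
        return False
    return name not in BLOCKED_REDUCTIONS


for reduction in ("nanmean", "nansum", "nanmax"):
    print(reduction, bn_ok_dtype_sketch(np.float64, reduction))
```

With this gate, `nanmean` joins `nansum` and `nanprod` on the non-bottleneck path for every numeric dtype, while reductions not in the blocked list remain eligible.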
