Conversation

quentinblampey
Hello @flying-sheep @Zethson,

As mentioned in #112, I added support for min/max, but I'm unsure about my implementation, notably regarding code quality:

  • The code is very similar to the sum function, but with minor differences because the min/max operations don't take a dtype argument. I could probably write a function generic enough to handle all these "simple" operations at once, but I'm worried it would become too abstract and hard to read/maintain. What do you think?
  • Two functions (normalize_axis and get_shape) are shared between sum and min/max, so I wanted to check with you where we should move them. E.g., should I create stats/_utils.py?
  • The docs and tests are still missing, but I want to check the points above with you first, and then I'll add them.
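For illustration, a generic builder along these lines (a minimal sketch with made-up names, not this PR's actual code) needn't be too abstract, since the only variation is whether the underlying reduction accepts dtype:

```python
import numpy as np


def _make_reduction(name: str, *, takes_dtype: bool):
    """Build a simple reduction (sum/min/max). Hypothetical helper, not the PR's code."""
    np_func = getattr(np, name)

    def reduction(x, *, axis=None, dtype=None):
        # Only forward dtype to reductions that actually accept it (sum does,
        # min/max don't).
        kwargs = {"dtype": dtype} if takes_dtype else {}
        return np_func(x, axis=axis, **kwargs)

    reduction.__name__ = name
    return reduction


sum_ = _make_reduction("sum", takes_dtype=True)
min_ = _make_reduction("min", takes_dtype=False)
max_ = _make_reduction("max", takes_dtype=False)
```

The per-backend dispatch would still live elsewhere; the builder only removes the duplicated boilerplate between the nearly identical function bodies.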

Closes #112

codecov bot commented Aug 27, 2025

Codecov Report

❌ Patch coverage is 94.73684% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 96.10%. Comparing base (940727f) to head (2aa8bd8).

Files with missing lines Patch % Lines
src/fast_array_utils/stats/_utils.py 92.00% 4 Missing ⚠️
src/fast_array_utils/stats/_generic_ops.py 93.75% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #116      +/-   ##
==========================================
- Coverage   98.33%   96.10%   -2.24%     
==========================================
  Files          17       19       +2     
  Lines         421      462      +41     
==========================================
+ Hits          414      444      +30     
- Misses          7       18      +11     


codspeed-hq bot commented Aug 27, 2025

CodSpeed Performance Report

Merging #116 will not alter performance

Comparing quentinblampey:minmax (2aa8bd8) with main (940727f)

Summary

✅ 160 untouched
🆕 72 new

Benchmarks breakdown

Benchmark BASE HEAD Change
🆕 test_stats_benchmark[numpy.ndarray-1d-all-float32-max] N/A 2.1 ms N/A
🆕 test_stats_benchmark[numpy.ndarray-1d-all-float32-min] N/A 2.1 ms N/A
🆕 test_stats_benchmark[numpy.ndarray-1d-all-float64-max] N/A 4.1 ms N/A
🆕 test_stats_benchmark[numpy.ndarray-1d-all-float64-min] N/A 4.1 ms N/A
🆕 test_stats_benchmark[numpy.ndarray-1d-all-int32-max] N/A 1.6 ms N/A
🆕 test_stats_benchmark[numpy.ndarray-1d-all-int32-min] N/A 1.6 ms N/A
🆕 test_stats_benchmark[numpy.ndarray-2d-all-float32-max] N/A 2.1 ms N/A
🆕 test_stats_benchmark[numpy.ndarray-2d-all-float32-min] N/A 2.1 ms N/A
🆕 test_stats_benchmark[numpy.ndarray-2d-all-float64-max] N/A 4.1 ms N/A
🆕 test_stats_benchmark[numpy.ndarray-2d-all-float64-min] N/A 4.1 ms N/A
🆕 test_stats_benchmark[numpy.ndarray-2d-all-int32-max] N/A 1.6 ms N/A
🆕 test_stats_benchmark[numpy.ndarray-2d-all-int32-min] N/A 1.6 ms N/A
🆕 test_stats_benchmark[numpy.ndarray-2d-ax0-float32-max] N/A 2.3 ms N/A
🆕 test_stats_benchmark[numpy.ndarray-2d-ax0-float32-min] N/A 2.3 ms N/A
🆕 test_stats_benchmark[numpy.ndarray-2d-ax0-float64-max] N/A 4.5 ms N/A
🆕 test_stats_benchmark[numpy.ndarray-2d-ax0-float64-min] N/A 4.5 ms N/A
🆕 test_stats_benchmark[numpy.ndarray-2d-ax0-int32-max] N/A 2.2 ms N/A
🆕 test_stats_benchmark[numpy.ndarray-2d-ax0-int32-min] N/A 2.2 ms N/A
🆕 test_stats_benchmark[numpy.ndarray-2d-ax1-float32-max] N/A 2.2 ms N/A
🆕 test_stats_benchmark[numpy.ndarray-2d-ax1-float32-min] N/A 2.2 ms N/A
... ... ... ... ...

ℹ️ Only the first 20 benchmarks are displayed.

@flying-sheep
Member

Hi! Thanks for the PR!

I did some of the deduplication, but all function bodies are identical except for the presence or absence of dtype and the actual numpy function being called. I'd therefore implement it similarly to here:

https://github.com/quentinblampey/fast-array-utils/blob/15347f7c1d741de4e867790c430b8c7058f049d9/src/fast_array_utils/stats/__init__.py#L273-L291

but I can do that later.

@quentinblampey
Author

quentinblampey commented Oct 6, 2025

Hi @flying-sheep,
Sorry for the delay; I finished the deduplication so that dtype is handled correctly.

A few comments:

  • I saw that for the sum on sparse arrays, you were using x.data.sum instead of x.sum. Is there a reason for this? I assume scipy already uses x.data internally for the sum, but I wanted to check with you, because the new implementation uses x directly (otherwise we'd get wrong results: e.g. min over x.data would never return 0 when all stored values are positive, even though the sparse array contains implicit zeros).
  • How should we document these new functions now that we have a generic builder? By updating .__doc__ directly?
  • For the sum with dtype=int, since the cast is applied per chunk, the results were a bit unexpected to me at first. I understand what it does and why, but it's unintuitive: it's consistent neither with direct dask usage nor with rounding after a float sum.
>>> import dask.array as da
>>> from fast_array_utils import stats
>>> x = da.random.random((1000, 1000), chunks=(100, 100))
>>> stats.sum(x).compute(), stats.sum(x, dtype=int).compute()
(np.float64(500227.801967892), np.int64(500173))  # large difference
>>> x.sum(dtype=int).compute()
0  # casts to int before summing

@flying-sheep
Member

flying-sheep commented Oct 13, 2025

Hi! I’m also responding late, since I was on a hackathon last week!

  • I saw that for the sum on sparse arrays, you were using x.data.sum instead of x.sum. Is there a reason for this?

Probably not, I assume that x.sum will do the right thing.
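As a toy check (not this PR's test suite), scipy's sparse min does account for implicit zeros, while reducing over x.data only sees explicitly stored values:

```python
import numpy as np
from scipy import sparse

# A sparse matrix whose stored values are all positive; the zeros are implicit.
x = sparse.csr_matrix(np.array([[0.0, 2.0], [3.0, 0.0]]))

print(x.data.min())  # 2.0 -- only looks at explicitly stored values
print(x.min())       # 0.0 -- accounts for the implicit zeros
```

This is why min/max must operate on x rather than x.data, even if sum over x.data happens to give the same answer (implicit zeros don't change a sum).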

  • How to document these new functions now that we have made a generic builder?

Maybe there's no better way than going back to manual @overloads (perhaps in an if TYPE_CHECKING block).
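A sketch of what that might look like, with hypothetical signatures (the real ones would depend on the array types the builder supports):

```python
from __future__ import annotations

from typing import TYPE_CHECKING

import numpy as np

if TYPE_CHECKING:
    from typing import overload

    # Hypothetical overloads for a generated `min`: visible to type checkers
    # and doc tooling, never executed at runtime.
    @overload
    def min(x: np.ndarray, /, *, axis: None = None) -> np.number: ...
    @overload
    def min(x: np.ndarray, /, *, axis: int) -> np.ndarray: ...
```

The runtime implementation would still come from the generic builder; the TYPE_CHECKING block only restores precise signatures and docstring anchors.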

  • For the sum with dtype=int, since it applies it on chunks, the results were a little bit unexpected for me at first.

What numpy does is this:

dtype: The type of the returned array and of the accumulator in which the elements are summed.

Ah, looks like we don't test stats.sum(<array dtype=np.float*>, dtype=np.int*) properly. For other combinations we test that things behave like numpy does.

I guess that’s a bug then! #124
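NumPy's documented accumulator semantics can be seen directly: with dtype=int the elements are cast before summing, which differs from summing in float and converting afterwards.

```python
import numpy as np

x = np.full(3, 0.6)

# dtype sets the accumulator type: elements are cast to int *before* summing.
print(np.sum(x, dtype=int))  # 0 (each 0.6 truncates to 0)

# Summing in float and converting afterwards gives a different answer.
print(int(np.sum(x)))        # 1 (0.6 * 3 = 1.8 -> 1)
```

The chunked dask path lands somewhere in between, which is what makes the result above surprising.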

@flying-sheep
Member

OK, new info here:

So we're also circumventing the bug for scipy, which makes the x.data.sum necessary for performance reasons.

@quentinblampey
Author

Thanks for the in-depth analysis @flying-sheep, and good catch for the scipy bug!

@flying-sheep
Member

flying-sheep commented Oct 14, 2025

Once the other PR is merged, I think we can progress here!

Except for the overloads: I still don’t have a good idea on how to get them both into the docs and the typing without repeating ourselves.

@flying-sheep flying-sheep added the run-gpu-ci Apply this label to run GPU CI once label Oct 15, 2025
@flying-sheep
Member

OK, I think I merged it all in, now we just need to fix mypy and the docs and this should be ready!


Development

Successfully merging this pull request may close these issues.

Support min/max