
ENH: spatial: faster Chebyshev distance #20570

Open · wants to merge 1 commit into main
2 changes: 1 addition & 1 deletion scipy/spatial/distance.py
Original file line number Diff line number Diff line change
Expand Up @@ -1077,7 +1077,7 @@ def chebyshev(u, v, w=None):
if has_weight.sum() < w.size:
u = u[has_weight]
v = v[has_weight]
return max(abs(u - v))
return np.max(abs(u - v))
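For context (not part of the PR itself), a minimal sketch of the difference between the two calls; the vector length here is arbitrary:

```python
import numpy as np

u = np.random.rand(1000)
v = np.random.rand(1000)

# The builtin max() iterates over the array element by element at the Python
# level, while np.max() performs a single C-level reduction. For long vectors
# the reduction wins; for very short ones the extra call overhead of np.max()
# can make it slower, which is the trade-off discussed in the review below.
d_builtin = max(abs(u - v))
d_numpy = np.max(abs(u - v))
assert np.isclose(d_builtin, d_numpy)
```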
Review comment (Contributor):
This really only makes things faster for high-dimensional data, which may not be the primary use case (3 dimensions are pretty common, obviously). Most of the workhorse usage of the distance metrics probably happens through pdist and cdist, where multiple points can be compared at once.

In fact, our formal asv benchmarks probably show this PR as a step back for the common low-dimension scenario. That needs checking with asv continuous -e -b "SingleDist.*" main enh_20561 or similar (it wasn't working for me locally; I need to open an issue for that).

If the argument is performance, then perhaps the middle-ground approach, combined with adjusting our asv benchmarks to probe performance in both dimensionality regimes (a sketch of what that could look like follows this comment), would make sense.

There may be some array API argument for the xp.max approach. But then I'm not sure we should really be framing this as performance-focused, since > 40 dimensions is quite a lot for practical use, I think. If other devs are happy with the array API argument, then I care less about benchmarking with regular NumPy arrays, but I do want us to be clear on the purpose.
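A minimal sketch of an asv benchmark that probes both dimensionality regimes (the class name, parameters, and seed here are hypothetical, not the actual entries in scipy's benchmark suite):

```python
import numpy as np
from scipy.spatial import distance

# Hypothetical asv benchmark covering low- and high-dimensional vectors;
# names are illustrative, not scipy's existing SingleDist benchmarks.
class ChebyshevDims:
    params = [3, 10, 100, 1000]
    param_names = ['ndim']

    def setup(self, ndim):
        rng = np.random.default_rng(1234)
        self.u = rng.random(ndim)
        self.v = rng.random(ndim)

    def time_chebyshev(self, ndim):
        distance.chebyshev(self.u, self.v)
```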

Review comment (Contributor):

Not surprising, but here is the current output of asv continuous -e -v -b "SingleDist.*" main enh_20561, showing slower performance. This probably isn't worth too much debate for a single-line change, but folks should probably pick a reason to go in one direction or the other.

| Change   | Before [85736e16] <main>   | After [91770f5c] <enh_20561>   |   Ratio | Benchmark (Parameter)                                      |
|----------|----------------------------|--------------------------------|---------|------------------------------------------------------------|
| +        | 1.47±0.01μs                | 2.22±0.01μs                    |    1.51 | spatial.SingleDist.time_dist('chebyshev')                  |
| +        | 5.27±0.02μs                | 6.06±0.01μs                    |    1.15 | spatial.SingleDistWeighted.time_dist_weighted('chebyshev') |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

Review comment (Contributor):

To be more precise: let's avoid merging this and then, in 6 months, having someone come along and revert it, citing this benchmark in our suite. So maybe add a code comment explaining the trade-offs, if folks agree on a reason for the change.
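One possible shape for such a comment, assuming the np.max version is kept (the wording is only a sketch, not the author's):

```python
    # np.max is used instead of the builtin max: it is slower for very
    # low-dimensional vectors because of call overhead, but faster for
    # high-dimensional ones and closer to the array API spelling (xp.max).
    return np.max(abs(u - v))
```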



def braycurtis(u, v, w=None):