np.histogram for cubes #5902

Open
schlunma opened this issue Apr 3, 2024 · 8 comments
Labels
Feature: ESMValTool, Feature: Statistics (reduction-like operations, e.g., collapsing, aggregating, rolling-window)

Comments

@schlunma
Contributor

schlunma commented Apr 3, 2024

✨ Feature Request

I am currently working on an ESMValTool preprocessor that calculates histograms from cubes along given coordinates similar to np.histogram. I think this would also be a nice fit to iris in the iris.analysis.stats module. Here is a possible call signature:

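from collections.abc import Iterable, Sequence
from typing import Literal

import dask.array as da
import numpy as np
from iris.coords import Coord
from iris.cube import Cube

def histogram(
    cube: Cube,
    coords: Iterable[Coord] | Iterable[str] | None = None,
    bins: int | Sequence[float] = 10,
    bin_range: tuple[float, float] | None = None,
    weights: np.ndarray | da.Array | None = None,
    normalization: Literal['sum', 'integral'] | None = None,
) -> Cube: ...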

This function should fully support lazy and/or masked data. If this is considered relevant for iris, I can open a PR (already have some code for this).

Motivation

Calculating histograms is a common task in geosciences.

schlunma added the Feature: ESMValTool and Feature: Statistics labels Apr 3, 2024
@HGWright
Contributor

@SciTools/peloton: Thanks @schlunma for this suggestion. We are curious about your use case. Could you share it with us, and explain why you need this additional Iris-specific functionality rather than just using numpy?

@schlunma
Contributor Author

The numpy and dask versions will always collapse the entire array; there is no way of calculating the histogram along one or more specified axes.
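
To illustrate the limitation, here is a minimal sketch. The array shape, the bin count and the value range are arbitrary assumptions, not taken from the ESMValCore code. np.histogram flattens its input, so a per-axis histogram needs an explicit workaround such as np.apply_along_axis:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((3, 100))  # hypothetical (lat, time) array, values in [0, 1)

# np.histogram flattens its input: one histogram for all 300 values.
counts_all, edges = np.histogram(data, bins=5, range=(0.0, 1.0))

# Workaround: apply the 1D histogram separately along the last axis.
counts_per_row = np.apply_along_axis(
    lambda x: np.histogram(x, bins=5, range=(0.0, 1.0))[0], -1, data
)
```

Here counts_all has one entry per bin, while counts_per_row keeps the leading axis and adds a bin axis at the end.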

However, this is exactly what I need for my particular use case: I want to calculate a metric called Earth mover's distance across different coordinates. The default numpy and dask histograms would only allow me to calculate that metric across the entire dataset. More details can be found in the corresponding ESMValCore PR, which also contains working code.
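
As a rough sketch of why per-axis histograms matter for this metric: for two 1D histograms on the same equally spaced bins, the Earth mover's distance equals the summed absolute difference of the two cumulative distributions, times the bin width. The function name emd_from_histograms and the sample counts below are hypothetical; the implementation in the ESMValCore PR may differ.

```python
import numpy as np

def emd_from_histograms(counts_a, counts_b, bin_width=1.0):
    """EMD between histograms sharing equally spaced bins (last axis)."""
    # Normalize counts to probabilities so both histograms have unit mass.
    p = counts_a / counts_a.sum(axis=-1, keepdims=True)
    q = counts_b / counts_b.sum(axis=-1, keepdims=True)
    # In 1D the EMD is the integrated absolute difference of the CDFs.
    cdf_diff = np.cumsum(p, axis=-1) - np.cumsum(q, axis=-1)
    return bin_width * np.abs(cdf_diff).sum(axis=-1)

# Two sets of per-row histograms (rows could be, e.g., grid cells).
counts_a = np.array([[10, 0, 0], [5, 5, 0]])
counts_b = np.array([[0, 0, 10], [5, 5, 0]])
emd = emd_from_histograms(counts_a, counts_b)
```

Because the bin axis is the last one, the function returns one distance per leading index, which is only possible if the histograms were computed along selected axes in the first place.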

@rcomer
Member

rcomer commented Apr 10, 2024

There is an open issue in numpy about adding the axis keyword to histogram. Would it be worth trying to push that forward first?
numpy/numpy#13166

@schlunma
Contributor Author

Thanks for the link @rcomer! Especially the xhistogram package looks super relevant; unfortunately, it looks like it's not maintained anymore. Getting the function into numpy would not be enough for me, since I also need a dask version of it (and from the comments in the linked issue, it also seems that this is not trivial at all if one wants to do that properly). On the other hand, my current solution is quite simple and just relies on np.vectorize (which makes it slower, but the performance is ok).
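
For readers unfamiliar with the np.vectorize approach, a minimal sketch could look as follows. The helper name histogram_last_axis and its defaults are assumptions made for illustration; this is not the code from the ESMValCore PR, and it ignores masks, weights and dask arrays.

```python
import numpy as np

def histogram_last_axis(data, bins=10, bin_range=(0.0, 1.0)):
    """Histogram along the last axis; all leading axes are kept."""
    def hist1d(x):
        return np.histogram(x, bins=bins, range=bin_range)[0]
    # The signature tells np.vectorize to pass 1D slices (n) and expect
    # 1D outputs (m); it loops in Python, hence the slowdown.
    return np.vectorize(hist1d, signature="(n)->(m)")(data)

data = np.random.default_rng(1).random((2, 3, 50))
counts = histogram_last_axis(data, bins=4)
```

The result has shape (2, 3, 4): the last axis of the input is replaced by the bin axis.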

I am also completely fine to include this into ESMValCore, so we can close this if this is not relevant for you.

@trexfeathers
Contributor

I am also completely fine to include this into ESMValCore, so we can close this if this is not relevant for you.

I reckon we should have further discussion first, now that we have a more detailed use case.

@pp-mo
Member

pp-mo commented Apr 11, 2024

ESMValCore ... if this is not relevant for you.

For me the key question here is: what is the point of making this a function of a Cube, rather than just an operation on an array, e.g. calc(array, over_axes=None, n_bins=10, bins=None)?

It could be that the coords add some validity to the operation, or that a Cube with a 'value_bins' dimension is itself useful. Perhaps iris.plot has a role.
But so far I haven't seen the killer need: why isn't this just a piece of maths?
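
One possible reading of the array-only calc(...) proposed above, as a sketch: the behaviour chosen here (shared bin edges derived from the global minimum and maximum, bin axis appended last, no mask or weight support) is an assumption, since the comment only gives the call signature.

```python
import numpy as np

def calc(array, over_axes=None, n_bins=10, bins=None):
    """Histogram counts over `over_axes`; the bin axis is appended last."""
    array = np.asarray(array)
    if over_axes is None:
        over_axes = tuple(range(array.ndim))  # collapse everything
    elif isinstance(over_axes, int):
        over_axes = (over_axes,)
    if bins is None:
        # Shared edges for all slices, so that the results are comparable.
        bins = np.linspace(array.min(), array.max(), n_bins + 1)
    # Move the collapsed axes to the end and merge them into a single axis.
    n_kept = array.ndim - len(over_axes)
    destination = tuple(range(n_kept, array.ndim))
    moved = np.moveaxis(array, over_axes, destination)
    flat = moved.reshape(moved.shape[:n_kept] + (-1,))
    return np.apply_along_axis(
        lambda x: np.histogram(x, bins=bins)[0], -1, flat
    )

arr = np.arange(24).reshape(2, 3, 4)
result = calc(arr, over_axes=(0, 2), n_bins=3)
```

In this example only axis 1 is kept, so result has one row of three bin counts for each of its three indices. A Cube-based version would additionally have to decide which coordinates, name and units the new bin dimension carries.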

@schlunma
Contributor Author

I don't have the killer argument for this; I guess it's just nicer to have this work with labeled dimensions instead of axes and include proper metadata handling. For my specific use case, it would also be totally fine to have this work with arrays.

However, your argumentation could also be applied to most mathematical operations in iris, right? For example, why do you have cube.collapsed(coords, iris.analysis.MEAN) when you could do array.mean(axis=...)? Why do you support cube1 + cube2 when you could simply do array1 + array2?

@pp-mo
Member

pp-mo commented Apr 12, 2024

I don't have the killer argument for this; I guess it's just nicer to have this work with labeled dimensions instead of axes and include proper metadata handling. For my specific use case, it would also be totally fine to have this work with arrays.
However, your argumentation could also be applied to most mathematical operations in iris, right? For example, why do you have cube.collapsed(coords, iris.analysis.MEAN) when you could do array.mean(axis=...)? Why do you support cube1 + cube2 when you could simply do array1 + array2?

Totally, it's a judgement thing.
But to your specific examples, statistics and arithmetic do both contain useful metadata handling, to modify cell-methods and units.

In this case, I guess the result cube would always have a count or frequency identity, so probably a long-name and units of '1'.
AFAICT there aren't really any useful CF concepts that we could apply here, though.
I guess we'd like to be able to have a cube like "frequency of air_temperature" with units like "frequency", "fraction" or "count", but such things are currently out-of-scope -- there isn't even a standard "extension" attribute for non-standard units, in the way that 'long_name' is so often used for non-standard names.
Likewise, a cell-method might make sense to describe the dimensions over which the operation was applied. But again, it would need an extension to the standardised forms, e.g. "histogram over time".


5 participants