Micro optimizations to improve indexing #9002
base: main
Conversation
Looks great!
xarray/core/indexing.py
Outdated
@@ -383,11 +383,14 @@ class BasicIndexer(ExplicitIndexer):

     __slots__ = ()

-    def __init__(self, key: tuple[int | np.integer | slice, ...]):
+    def __init__(self, key: tuple[int | np.integer | slice, ...], *, _fastpath=False):
nit: I think we generally use fastpath rather than _fastpath, though no strong view. We could also have a different method from_key, given there isn't much overlap.
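For illustration only, a minimal sketch of what a from_key-style alternative could look like (the class body and the validation shown here are assumptions, not the actual BasicIndexer code):

class BasicIndexerSketch:
    __slots__ = ("_key",)

    def __init__(self, key):
        # normal path: validate every entry (stand-in for the real checks)
        for k in key:
            if not isinstance(k, (int, slice)):
                raise TypeError(f"unexpected indexer type: {type(k)}")
        self._key = tuple(key)

    @classmethod
    def from_key(cls, key):
        # fast path: the caller guarantees `key` is already a valid tuple
        obj = object.__new__(cls)
        obj._key = key
        return obj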
sure, happy to change these variable names.
Yeah, I'm thinking through some of the implications here; let's review at the end of the day.
Force-pushed from d975fba to c2a065d.
xarray/core/indexes.py
Outdated
        return self.__id_coord_names

    def _create_id_coord_names(self) -> None:
It's strange that the function _create_id_coord_names gets hit all the time by the profiler. I would have thought that once the dataset is created this function would only be called once.
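As a hedged illustration of the caching pattern being discussed (simplified names, not the actual xarray implementation), the mapping can be built once and reused so that repeated isel calls do not pay the cost again:

from collections import defaultdict

class IndexesSketch:
    def __init__(self, indexes):
        self._indexes = indexes  # mapping of coordinate name -> index object
        self._id_coord_names_cache = None

    @property
    def _id_coord_names(self):
        # build the id(index) -> coordinate-names mapping only on first access
        if self._id_coord_names_cache is None:
            grouped = defaultdict(list)
            for name, index in self._indexes.items():
                grouped[id(index)].append(name)
            self._id_coord_names_cache = {k: tuple(v) for k, v in grouped.items()}
        return self._id_coord_names_cache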
Force-pushed from 8c49aba to 2b5e936.
@hmaarrfk Sorry Marc, moving to ready for review was by accident. I still have problems using the GitHub Android app properly.
The GitHub app is terrible... I keep making similar mistakes on my phone.
Force-pushed from 5d4e9b2 to f958953.
Ok, enough for today. But generally speaking, I think the main problem, broad compatibility with Pandas, is a known issue. However, I think that some things are simply strange in the sense that certain … You can see some of the numbers that I use. I use a … The function … drops the performance on my Threadripper 2950x from 14k its/sec to 12k its/sec, still an improvement over the 9k its/sec where pull request #9001 leaves off.
Force-pushed from 2c202ff to 1a659bc.
Feel free to push any cleanups.
@@ -1800,7 +1844,10 @@ def isel_indexes(
     indexes: Indexes[Index],
     indexers: Mapping[Any, Any],
 ) -> tuple[dict[Hashable, Index], dict[Hashable, Variable]]:
-    return _apply_indexes(indexes, indexers, "isel")
+    if any(isinstance(v, PandasMultiIndex) for v in indexes._indexes.values()):
FYI: this is the error that shows up without this if statement:
def _assert_indexes_invariants_checks(
indexes, possible_coord_variables, dims, check_default=True
):
assert isinstance(indexes, dict), indexes
assert all(isinstance(v, Index) for v in indexes.values()), {
k: type(v) for k, v in indexes.items()
}
index_vars = {
k for k, v in possible_coord_variables.items() if isinstance(v, IndexVariable)
}
assert indexes.keys() <= index_vars, (set(indexes), index_vars)
# check pandas index wrappers vs. coordinate data adapters
for k, index in indexes.items():
if isinstance(index, PandasIndex):
pd_index = index.index
var = possible_coord_variables[k]
assert (index.dim,) == var.dims, (pd_index, var)
if k == index.dim:
# skip multi-index levels here (checked below)
assert index.coord_dtype == var.dtype, (index.coord_dtype, var.dtype)
assert isinstance(var._data.array, pd.Index), var._data.array
# TODO: check identity instead of equality?
assert pd_index.equals(var._data.array), (pd_index, var)
if isinstance(index, PandasMultiIndex):
pd_index = index.index
for name in index.index.names:
assert name in possible_coord_variables, (pd_index, index_vars)
var = possible_coord_variables[name]
assert (index.dim,) == var.dims, (pd_index, var)
assert index.level_coords_dtype[name] == var.dtype, (
index.level_coords_dtype[name],
var.dtype,
)
assert isinstance(var._data.array, pd.MultiIndex), var._data.array
assert pd_index.equals(var._data.array), (pd_index, var)
# check all all levels are in `indexes`
assert name in indexes, (name, set(indexes))
# index identity is used to find unique indexes in `indexes`
> assert index is indexes[name], (pd_index, indexes[name].index)
E AssertionError: (MultiIndex([('a', 1, -1),
E ('b', 2, -2)],
E name='x'), MultiIndex([('a', 1, -1),
E ('b', 2, -2)],
E name='x'))
/home/mark/git/xarray/xarray/testing/assertions.py:411: AssertionError
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> entering PDB >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> PDB post_mortem >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> /home/mark/git/xarray/xarray/testing/assertions.py(411)_assert_indexes_invariants_checks()
-> assert index is indexes[name], (pd_index, indexes[name].index)
(Pdb) c
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> PDB continue >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
======================================== short test summary info =========================================
FAILED xarray/tests/test_dataarray.py::TestDataArray::test_selection_multiindex - AssertionError: (MultiIndex([('a', 1, -1),
looks like a copy was made?
Yeah, I just don't know enough about the internal data structures of xarray and ran out of time hunting for it.
Maybe there is a different way of iterating through things.
This targets optimization for datasets with many "scalar" variables (that is, variables without any dimensions). This can happen when you have many pieces of small metadata that relate to various facts about an experimental condition. For example, we have about 80 of these in our datasets (and I want to increase this number). Our datasets are quite large (on the order of 1 TB uncompressed), so we often have one dimension that is in the tens of thousands. However, it has become quite slow to index into the dataset. We therefore often carefully slice out the metadata we need prior to doing anything with our dataset, but that isn't quite possible when you want to orchestrate things with a parent application. These optimizations are likely "minor", but considering the results of the benchmark, I think they are quite worthwhile:
* main (as of pydata#9001): 2.5k its/s
* With pydata#9002: 4.2k its/s
* With this pull request (on top of pydata#9002): 6.1k its/s
Thanks for considering.
Finally, on my machine this code shows a 30 µs/it -> 25 µs/it improvement on my little benchmark from 5 years ago (#2799 (comment)):

import xarray as xr
import numpy as np

n = np.zeros(shape=(1024, 1024))
x = xr.DataArray(n, dims=('y', 'x'))
the_slice = np.s_[256:512, 256:512]
%timeit n[the_slice]
%timeit x[the_slice]
        index = safe_cast_to_index(array).copy()

    def __init__(
        self, array: Any, dim: Hashable, coord_dtype: Any = None, *, fastpath=False
    ):
I'm a bit wary of adding fastpath everywhere. It makes things a bit hard to reason about. Is this one that consequential? We do first check for isinstance(pd.Index) in safe_cast_to_index...
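For context, a rough sketch of what a constructor fastpath of this kind typically skips (illustrative only; the real PandasIndex and safe_cast_to_index logic do more than this):

import pandas as pd

class PandasIndexSketch:
    def __init__(self, array, dim, *, fastpath=False):
        if fastpath:
            # the caller guarantees `array` is already a pd.Index; skip cast and copy
            self.index = array
        else:
            # normal path: coerce to a pandas Index and copy it defensively
            self.index = pd.Index(array).copy()
        self.dim = dim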
Anecdotally, I remember thinking that indexes are created and copied a few too many times. I believe this was done for correctness in a big refactor, but reducing any useless copies would be a big win since we would call _replace less.
"Anecdotally, I remember thinking that indexes are created and copied a few too many times."
This, in general, is my feeling after my repeated deep dives into the codebase.
@@ -1772,6 +1785,37 @@ def check_variables():
     return not not_equal


+def _apply_indexes_fast(indexes: Indexes[Index], args: Mapping[Any, Any], func: str):
+10 for this kind of refactor ;)
Thank you! but what aspect? 🤔
Rewriting the core logic to be faster :) There's definitely a bunch of low-hanging fruit.
xarray/core/dataarray.py
Outdated
        # Variables can have undefined data type according to MyPy due to the fastpath
        self._coords = cast(dict, coords)
I think mypy is correct to complain here. Setting fastpath=True, it becomes strange to not have the assert. Remember that coords: Sequence[Sequence | pd.Index | DataArray] | Mapping | None = None at this point, and self._coords must be a dict.
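A small self-contained illustration of the point (hypothetical helper names, not the DataArray constructor itself): cast only silences mypy, while an assert both checks at runtime and narrows the type:

from __future__ import annotations
from typing import Any, Mapping, cast

def coords_with_cast(coords: Mapping[Any, Any] | None) -> dict:
    # mypy accepts this, but nothing is verified at runtime
    return cast(dict, coords)

def coords_with_assert(coords: Mapping[Any, Any] | None) -> dict:
    # the assert narrows the declared type and fails loudly on bad input
    assert isinstance(coords, dict)
    return coords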
Ok, I added back the asserts. They did not have a significant effect on the benchmark I was running.
xarray/core/dataarray.py
Outdated
        assert dims is None
        assert attrs is None
        assert indexes is not None
What happens when you set values other than None here now? Is there a test for this?
I understood that fastpath indicated: "I know what I'm doing, please avoid expensive type checking." Do we need more safety measures?
I feel like runtime checks should not need to be added to appease a linter. I understand there is an environment variable that makes assert into a no-op, but I feel like we can avoid these lines of code if we can.
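For reference, asserts are indeed stripped when Python runs in optimized mode (python -O, or the PYTHONOPTIMIZE=1 environment variable), which is the mechanism referred to above:

# check_asserts.py
# `python check_asserts.py`    -> raises AssertionError
# `python -O check_asserts.py` -> prints the message; the assert is compiled away
# `PYTHONOPTIMIZE=1 python check_asserts.py` behaves like -O
assert False, "asserts are active"
print("asserts were stripped")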
Ok, I added back the asserts. They did not have a significant effect on the benchmark I was running.
I ended up reverting the changes to this method. Thanks for the review.
(sorry I made the changes on a local branch and forgot to push)
I agree with you on the performance cost of isinstance. But if you want fast and correct code, it's mypy that you should heavily rely on. Mypy won't be useful, however, if you use overrides like cast and type: ignore.
While reading up on the array API standard, I've come to prefer the approach where Array classes have the exact required type for the input arguments and use xp.asarray to handle all the other, wider types. This way we separate the concerns in a familiar way, and it significantly reduces the number of required isinstance checks internally, because we can trust that the arguments are always correct.
DataArray unfortunately took the other route when these kinds of performance bottlenecks weren't as big of a priority, and now we have to deal with the backwards-compatibility concerns.
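A hedged sketch of that style (hypothetical names, loosely following the array API convention): the class accepts exactly one type, and a single conversion function handles everything wider at the boundary:

import numpy as np

class StrictArray:
    # the constructor takes exactly one type, so no isinstance checks are
    # needed in internal code paths that already hold a StrictArray
    def __init__(self, data: np.ndarray):
        self._data = data

def asarray(obj) -> StrictArray:
    # lists, scalars, and other array-likes are converted once, here
    if isinstance(obj, StrictArray):
        return obj
    return StrictArray(np.asarray(obj))

Public entry points call asarray once; everything internal can then trust the type.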
I'm converting to draft since I need to re-assess where things land with the changes in … I think I incorporated many of the … from …
Force-pushed from 002c7e8 to 4081568.
Timing:
Benchmarking code
If you all think that https://github.com/pydata/xarray/pull/9002/files#r1590443756 should hold up this PR, then I'm happy to review, but I think that this is a step in the right direction.
    if isinstance(applied_indexer, slice) and applied_indexer == slice(None):
        # shortcut for the usual case
        return old_indexer
    if isinstance(old_indexer, slice):
        if isinstance(applied_indexer, slice):
            indexer = slice_slice(old_indexer, applied_indexer, size)
        elif isinstance(applied_indexer, integer_types):
            indexer = range(*old_indexer.indices(size))[applied_indexer]  # type: ignore[assignment]
These kinds of # type: ignore[assignment] errors usually mean indexer has changed type, which usually leads to requiring another costly isinstance again somewhere downstream.
This one is due to the ambiguity between slice and int for these indexers. I can look into it if you want.
However, it starts to touch some tuple-to-list conversion that was identified as something I shouldn't adjust too much, due to potential conflicts with an actively developed branch.
The following patch:
diff --git a/xarray/core/indexing.py b/xarray/core/indexing.py
index e29d77da..44560dbd 100644
--- a/xarray/core/indexing.py
+++ b/xarray/core/indexing.py
@@ -9,7 +9,7 @@ from contextlib import suppress
from dataclasses import dataclass, field
from datetime import timedelta
from html import escape
-from typing import TYPE_CHECKING, Any, Callable, overload
+from typing import TYPE_CHECKING, Any, Callable, overload, List
import numpy as np
import pandas as pd
@@ -295,20 +295,24 @@ def slice_slice(old_slice: slice, applied_slice: slice, size: int) -> slice:
return slice(start, stop, step)
-def _index_indexer_1d(old_indexer, applied_indexer, size: int):
+def _index_indexer_1d(
+ old_indexer: slice | np.ndarray[Any, np.dtype[np.generic]],
+ applied_indexer: slice | int | np.integer | np.ndarray[Any, np.dtype[np.generic]],
+ size: int
+) -> slice | int | np.integer | np.ndarray[Any, np.dtype[np.generic]]:
+ """Helper routine to slice non integers into an entry for OuterIndexer"""
if isinstance(applied_indexer, slice) and applied_indexer == slice(None):
# shortcut for the usual case
return old_indexer
if isinstance(old_indexer, slice):
if isinstance(applied_indexer, slice):
- indexer = slice_slice(old_indexer, applied_indexer, size)
+ return slice_slice(old_indexer, applied_indexer, size)
elif isinstance(applied_indexer, integer_types):
- indexer = range(*old_indexer.indices(size))[applied_indexer] # type: ignore[assignment]
+ return range(*old_indexer.indices(size))[applied_indexer]
else:
- indexer = _expand_slice(old_indexer, size)[applied_indexer]
+ return _expand_slice(old_indexer, size)[applied_indexer]
else:
- indexer = old_indexer[applied_indexer]
- return indexer
+ return old_indexer[applied_indexer]
class ExplicitIndexer:
@@ -630,12 +634,14 @@ class LazilyIndexedArray(ExplicitlyIndexedNDArrayMixin):
def _updated_key(self, new_key: ExplicitIndexer) -> BasicIndexer | OuterIndexer:
iter_new_key = iter(expanded_indexer(new_key.tuple, self.ndim))
- full_key = []
+ full_key : List[slice | int | np.integer | np.ndarray[Any, np.dtype[np.generic]]] = []
+ outer_indexer = True
for size, k in zip(self.array.shape, self.key.tuple):
if isinstance(k, integer_types):
full_key.append(k)
else:
full_key.append(_index_indexer_1d(k, next(iter_new_key), size))
+
full_key_tuple = tuple(full_key)
if all(isinstance(k, integer_types + (slice,)) for k in full_key_tuple):
would move the error to:
xarray/core/indexing.py:648: error: Argument 1 to "BasicIndexer" has incompatible type "tuple[slice | int | integer[Any] | ndarray[Any, dtype[generic]], ...]"; expected "tuple[int | integer[Any] | slice, ...]" [arg-type]
While I don't disagree that this "usually leads to requiring another costly isinstance again somewhere downstream", I would rather keep this for a future change once the backend-indexing branch has been merged in.
Sigh. As much as I hate giving myself more work, I want to run the xarray benchmarks before and after and ensure that there is a measurable difference with the existing benchmarks; if there isn't, I would like to contribute a benchmark for this. While I know that benchmarks aren't always blockers for future changes, I did come here because I felt like indexing xarray had become too slow for our application, so I would like to build off your monitoring tools.
I needed this change on the main branch to get asv to run:
diff --git a/asv_bench/asv.conf.json b/asv_bench/asv.conf.json
index 9dc86df7..4c717b7d 100644
--- a/asv_bench/asv.conf.json
+++ b/asv_bench/asv.conf.json
@@ -29,7 +29,7 @@
// If missing or the empty string, the tool will be automatically
// determined by looking for tools on the PATH environment
// variable.
- "environment_type": "mamba",
+ "environment_type": "conda",
"conda_channels": ["conda-forge"],
// timeout in seconds for installing any dependencies in environment
It seems that the existing benchmarks are adequate to generate the signal we need.
These are some micro optimizations to improve indexing within xarray objects.

Generally speaking, the use of isinstance in Python is pretty slow. It's pretty hard to avoid with the amount of "niceness" that xarray does. However, I feel like these changes are "ok" and kick up the performance of #9001 from … to …

With my "controversial" fix for _apply_indexes_fast: …
Without it: …
A nice 40% boost!

Benchmarked on … It's not the best processor, but it doesn't throttle!

I'll mark this as not draft when it is ready for review. done!

Concepts used:
* LazilyIndexedArray now stores the shape at creation time. Otherwise this was causing the shape to be recomputed many times during indexing operations that use operators like len or shape to broadcast the indexing tuple.
* _id_coord_names: avoid xindexes.group_by_index(), giving a 20% speedup.

Findings:
* _id_coord_names has terrible performance... I had to avoid it to get isel to speed up. It is likely that other code paths can be sped up the same way too. I just don't know how to use those code paths, so I'm hesitant to go down that rabbit hole myself.

whats-new.rst
api.rst

xref: #2799
xref: #7045
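To illustrate the LazilyIndexedArray point above (a simplified sketch, not the actual class): computing the shape once at construction avoids re-deriving it from the key every time len() or shape is consulted during indexing:

import numpy as np

class LazilyIndexedArraySketch:
    def __init__(self, array, key):
        self.array = array
        self.key = key
        # compute the broadcast shape once instead of rebuilding it in a
        # property on every access
        shape = []
        for size, k in zip(array.shape, key):
            if isinstance(k, slice):
                shape.append(len(range(*k.indices(size))))
            # integer indexers drop the dimension, so they contribute nothing
        self._shape = tuple(shape)

    @property
    def shape(self):
        return self._shape

arr = LazilyIndexedArraySketch(np.zeros((1024, 1024)), (slice(256, 512), slice(256, 512)))
print(arr.shape)  # (256, 256)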