Micro optimize dataset.isel for speed on large datasets #9003
Conversation
Force-pushed from 021ba45 to 9128c7c.
I'm happy to add benchmarks for these if you think it would help. That said, I would love to leave that addition for future work; my time for playing with this kind of speedup is up for the week.
Thanks. Do you see any changes in our asv benchmarks? We'd be happy to take updates for those too :)
I didn't get to running the asv benchmarks. The speedups here are more associated with isel on datasets that have many scalar variables. At a quick glance I don't think such a benchmark exists; I could create one.
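As a rough sketch of what such a benchmark could look like (this is hypothetical, not an existing benchmark in asv_bench; the class, variable names, and sizes are made up to mimic the metadata-heavy datasets described in this PR):

import numpy as np
import xarray as xr

class IselManyScalarVariables:
    """Time positional indexing on a dataset with one long dimension
    and many dimensionless "metadata" variables."""

    def setup(self):
        data_vars = {"signal": ("time", np.random.randn(10_000))}
        # ~80 scalar variables, mimicking metadata-heavy datasets
        data_vars.update({f"meta_{i}": ((), float(i)) for i in range(80)})
        self.ds = xr.Dataset(data_vars)

    def time_isel(self):
        self.ds.isel(time=100)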
Original:

# Fastpath, skip all of this for variables with no dimensions
# Keep the result cached for future dictionary update
elif var_dims := var.dims:

Suggested change:

elif var.ndim == 0:
    continue
else:
Does this work?
No wait, I spoke too soon; I had a typo. Oddly, it is slower...
diff --git a/xarray/core/dataset.py b/xarray/core/dataset.py
index ec756176..4e8c31e5 100644
--- a/xarray/core/dataset.py
+++ b/xarray/core/dataset.py
@@ -2987,22 +2987,20 @@ class Dataset(
if name in index_variables:
var = index_variables[name]
dims.update(zip(var.dims, var.shape))
- # Fastpath, skip all of this for variables with no dimensions
- # Keep the result cached for future dictionary update
- elif var_dims := var.dims:
+ elif var.ndim == 0:
+ continue
+ else:
# Large datasets with alot of metadata may have many scalars
# without any relevant dimensions for slicing.
# Pick those out quickly and avoid paying the cost below
# of resolving the var_indexers variables
- if var_indexer_keys := all_keys.intersection(var_dims):
+ if var_indexer_keys := all_keys.intersection(var.dims):
var_indexers = {k: indexers[k] for k in var_indexer_keys}
var = var.isel(var_indexers)
if drop and var.ndim == 0 and name in coord_names:
coord_names.remove(name)
continue
- # Update our reference to `var_dims` after the call to isel
- var_dims = var.dims
- dims.update(zip(var_dims, var.shape))
+ dims.update(zip(var.dims, var.shape))
variables[name] = var
return self._construct_direct(
This was slower... which is somewhat unexpected; ndim should be "instant".
Let me add a benchmark tonight to "show" explicitly that this is the better way; otherwise it will be too easy to undo.
My conclusion is that:
- len(tuple) seems to be pretty fast.
- But the .shape attribute is only resolved after 4-5 different Python indirections, going down through a LazilyIndexedArray, MemoryCachedArray, H5BackedArray (sorry, I'm not getting the class names right); ultimately it isn't readily available and needs to be resolved.
My little heuristic test, with my dataset (93 variables):
In [16]: %%timeit
...: for v in dataset._variables.values():
...: v.ndim
...:
119 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [17]: %%timeit
...: for v in dataset._variables.values():
...: v.shape
...:
105 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [18]: %%timeit
...: for v in dataset._variables.values():
...: v.dims
...:
7.66 µs ± 38.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [19]: %%timeit
...: for v in dataset._variables.values():
...: v._dims
...:
3.1 µs ± 22 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [20]: len(dataset._variables)
93
I mean, micro-optimizations are sometimes dumb. That is why I've been breaking them out into distinct ideas as I find them, but taken together they can add up.
So in other words, my hypothesis is that the use of _dims is really helpful because it avoids many indirections in shape, since dims is a "cached" version of the shape (where every number is replaced with a string).
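To illustrate that hypothesis, here is a toy sketch (not xarray's actual class layout; all names below are invented): an attribute stored directly on the object is a single lookup, while a shape that has to be forwarded through several wrapper layers pays a Python call per layer.

import timeit

class InnerArray:
    def __init__(self, shape):
        self.shape = shape

class Wrapper:
    # Stands in for layers like the lazily-indexed / memory-cached wrappers.
    def __init__(self, inner):
        self._inner = inner

    @property
    def shape(self):
        return self._inner.shape

class ToyVariable:
    def __init__(self, dims, data):
        self._dims = dims  # plain tuple, one attribute lookup
        self._data = data  # wrapped array, shape needs forwarding

    @property
    def dims(self):
        return self._dims

    @property
    def shape(self):
        return self._data.shape

    @property
    def ndim(self):
        return len(self.shape)

v = ToyVariable(("time",), Wrapper(Wrapper(Wrapper(InnerArray((10_000,))))))
print(timeit.timeit(lambda: v._dims, number=100_000))  # fastest: attribute access
print(timeit.timeit(lambda: v.dims, number=100_000))   # one property call
print(timeit.timeit(lambda: v.shape, number=100_000))  # walks the wrapper chain
print(timeit.timeit(lambda: v.ndim, number=100_000))   # shape chain plus len()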
len(v.dims) or len(v._dims) sounds OK to me. They're both readily understandable.
Just so I better understand the xarray style: the truthiness of tuples is not obvious enough, while len(tuple) is more obviously associated with a true/false check? Would a comment be OK if len(tuple) hurts performance?
It's not about style, but about readability and understandability.
I've read this snippet about 6 times now, but I still have to look at it closely to see what it does. The perf improvement is also sensitive to the order of iteration over variables (what if you alternated between 0D and 1D variables as you iterated through?)
This is why I'd prefer an explicit check for scalar variable. It's easy to see and reason about the special-case.
what if you alternated between 0D and 1D variables as you iterated through?
You know, this is something I've thought about a lot. I'm generally not too happy with this optimization.
This is why I'd prefer an explicit check for scalar variable. It's easy to see and reason about the special-case.
OK, understood. The challenge is that this PR doesn't do much on my benchmarks without #9002, and my current theory is that we are limited by calls to Python methods, so I feel like even len(tuple) will slow things down.
I'll try again, but if it's OK, I'm going to rebase onto #9002 until a resolution is found for those optimizations.
Force-pushed from ef95538 to 83b0599.
Timings compared on main, on this branch, and on the combined #9002 + this branch (output omitted).
This targets optimization for datasets with many "scalar" variables (that is, variables without any dimensions). This can happen when you have many pieces of small metadata that relate to various facts about an experimental condition.
For example, we have about 80 of these in our datasets (and I want to increase this number).
Our datasets are quite large (on the order of 1 TB uncompressed), so we often have one dimension that is in the tens of thousands.
However, it has become quite slow to index into the dataset.
We therefore often "carefully slice out the metadata we need" prior to doing anything with our dataset, but that isn't quite possible when you want to orchestrate things with a parent application.
These optimizations are likely "minor", but considering the results of the benchmark, I think they are quite worthwhile:
* main (as of #9001): 2.5k its/s
* With #9002: 4.2k its/s
* With this pull request (on top of #9002): 6.1k its/s
Thanks for considering.
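For anyone wanting to reproduce the setup locally, here is a minimal sketch of the kind of dataset this targets and a rough throughput measurement; the variable names, sizes, and iteration count are made up to roughly match the description above, not taken from our actual data.

import time
import numpy as np
import xarray as xr

# One long dimension plus many dimensionless "metadata" variables.
ds = xr.Dataset(
    {
        "signal": ("time", np.random.randn(50_000)),
        **{f"meta_{i}": ((), float(i)) for i in range(80)},
    }
)

# Measure isel throughput in iterations per second.
n = 2_000
start = time.perf_counter()
for i in range(n):
    ds.isel(time=i)
elapsed = time.perf_counter() - start
print(f"{n / elapsed:.0f} isel calls per second")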
xref: #2799
xref: #7045