Micro optimize dataset.isel for speed on large datasets #9003

Draft · wants to merge 2 commits into main

Conversation

hmaarrfk
Contributor

@hmaarrfk hmaarrfk commented May 6, 2024

This targets optimization for datasets with many "scalar" variables (that is, variables without any dimensions). This can happen when you have many pieces of small metadata that relate to various facts about an experimental condition.

For example, we have about 80 of these in our datasets (and I want to increase this number).

Our datasets are quite large (on the order of 1 TB uncompressed), so we often have one dimension that is in the tens of thousands.

However, it has become quite slow to index into the dataset.

We therefore often "carefully slice out the metadata we need" prior to doing anything with our dataset, but that isn't quite possible when you want to orchestrate things with a parent application.
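
A minimal sketch of the shape of such a dataset (the variable names and sizes below are made up for illustration, not our actual data):

import numpy as np
import xarray as xr

# One long dimension plus many dimensionless "metadata" variables.
data = {"image": (("time", "x"), np.zeros((10_000, 512)))}
data.update({f"meta_{i}": ((), i) for i in range(80)})  # ~80 scalar variables
ds = xr.Dataset(data)

# The operation this PR targets: only "image" carries the "time" dimension,
# yet isel still has to visit every variable in the dataset.
subset = ds.isel(time=slice(0, 100))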

These optimizations are likely "minor", but considering the benchmark results below, I think they are quite worthwhile.

Thanks for considering.

  • Closes #xxxx
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

xref: #2799
xref: #7045

@hmaarrfk hmaarrfk marked this pull request as ready for review May 6, 2024 01:58
@hmaarrfk
Contributor Author

hmaarrfk commented May 6, 2024

I'm happy to add benchmarks for these if you think it would help.

That said, I would love to leave that addition for future work. The time I can spend playing with this kind of speedup is used up for the week.

@dcherian dcherian added the run-benchmark Run the ASV benchmark workflow label May 6, 2024
@dcherian
Contributor

dcherian commented May 6, 2024

Thanks. Do you see any changes in our asv benchmarks in asv_bench/?

We'd be happy to take updates for those too :)

@hmaarrfk
Contributor Author

hmaarrfk commented May 6, 2024

Thanks. Do you see any changes in our asv benchmarks in asv_bench/?

I didn't get around to running asv locally (I was focused on getting pytest + mypy working).

The speedups here are more associated with:

  1. Few variables of interest in a dataset.
  2. Many variables with no dims.
  3. Slicing.

At a quick glance, I don't think such a benchmark exists. I could create one; a rough sketch follows.
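
A hypothetical asv-style benchmark for this pattern could look roughly like the following (class and variable names are illustrative; this is not an existing benchmark in asv_bench/):

import numpy as np
import xarray as xr

class IselManyScalarVariables:
    """Time isel on a dataset with one sliced dimension and many 0-d variables."""

    def setup(self):
        data = {"image": (("time", "x"), np.zeros((1000, 1000)))}
        data.update({f"meta_{i}": ((), i) for i in range(80)})
        self.ds = xr.Dataset(data)

    def time_isel_slice(self):
        self.ds.isel(time=slice(0, 100))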

Comment on lines +2990 to +2992
# Fastpath, skip all of this for variables with no dimensions
# Keep the result cached for future dictionary update
elif var_dims := var.dims:
Contributor

Suggested change
# Fastpath, skip all of this for variables with no dimensions
# Keep the result cached for future dictionary update
elif var_dims := var.dims:
elif var.ndim == 0:
continue
else:

Does this work?

This comment was marked as outdated.

Contributor Author

No wait, I spoke too soon. I had a typo. Oddly, it is slower...

Contributor Author

diff --git a/xarray/core/dataset.py b/xarray/core/dataset.py
index ec756176..4e8c31e5 100644
--- a/xarray/core/dataset.py
+++ b/xarray/core/dataset.py
@@ -2987,22 +2987,20 @@ class Dataset(
             if name in index_variables:
                 var = index_variables[name]
                 dims.update(zip(var.dims, var.shape))
-            # Fastpath, skip all of this for variables with no dimensions
-            # Keep the result cached for future dictionary update
-            elif var_dims := var.dims:
+            elif var.ndim == 0:
+                continue
+            else:
                 # Large datasets with alot of metadata may have many scalars
                 # without any relevant dimensions for slicing.
                 # Pick those out quickly and avoid paying the cost below
                 # of resolving the var_indexers variables
-                if var_indexer_keys := all_keys.intersection(var_dims):
+                if var_indexer_keys := all_keys.intersection(var.dims):
                     var_indexers = {k: indexers[k] for k in var_indexer_keys}
                     var = var.isel(var_indexers)
                     if drop and var.ndim == 0 and name in coord_names:
                         coord_names.remove(name)
                         continue
-                    # Update our reference to `var_dims` after the call to isel
-                    var_dims = var.dims
-                dims.update(zip(var_dims, var.shape))
+                dims.update(zip(var.dims, var.shape))
             variables[name] = var

         return self._construct_direct(

It was slower... which is somewhat unexpected; ndim should be "instant".

Contributor Author

Let me add a benchmark tonight to explicitly "show" that this is the better way; otherwise it will be too easy to undo.

Contributor Author

My conclusion is that:

  • len(tuple) seems to be pretty fast.
  • But the .shape attribute is only resolved after 4-5 levels of Python indirection, going down through a LazilyIndexedArray, MemoryCachedArray, H5BackedArray (sorry, I'm not getting the class names right); ultimately it isn't readily available and needs to be resolved.

My little heuristic test, with my dataset (93 variables):

In [16]: %%timeit
    ...: for v in dataset._variables.values():
    ...:     v.ndim
    ...:
119 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [17]: %%timeit
    ...: for v in dataset._variables.values():
    ...:     v.shape
    ...:
105 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [18]: %%timeit
    ...: for v in dataset._variables.values():
    ...:     v.dims
    ...:
7.66 µs ± 38.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [19]: %%timeit
    ...: for v in dataset._variables.values():
    ...:     v._dims
    ...:
3.1 µs ± 22 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [20]: len(dataset._variables)
93

I mean, micro-optimizations are sometimes dumb. That is why I've been breaking them out into distinct ideas as I find them, but taken together they can add up.

So in other words, my hypothesis is that the use of _dims is really helpful because it avoids the many indirections behind shape: dims is effectively a "cached" version of the shape (where every number is replaced with a string).
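
Roughly, the attribute chain looks like this (a paraphrase of what I understand the properties to do, not the exact xarray source):

class Variable:
    @property
    def dims(self):
        # a plain tuple stored directly on the Variable: one attribute lookup
        return self._dims

    @property
    def shape(self):
        # delegates to the wrapped array, which may itself be a stack of
        # lazy/caching wrappers around the on-disk backend array
        return self._data.shape

    @property
    def ndim(self):
        # pays the full .shape indirection chain just to take its length
        return len(self.shape)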

Contributor

len(v.dims) or len(v._dims) sounds OK to me. They're both readily understandable.

Contributor Author

Just so I better understand the xarray style:

The truthiness of tuples is not obvious enough, while len(tuple) is more obviously associated with a true/false statement?

Would a comment be OK if len(tuple) hurts performance?

Contributor

It's not about style, but about readability and understandability.

I've read this snippet about 6 times now, but I still have to look at it closely to see what it does. The perf improvement is also sensitive to the order of iteration over variables (what if you alternated between 0D and 1D variables as you iterated through?).

This is why I'd prefer an explicit check for scalar variable. It's easy to see and reason about the special-case.

Contributor Author

what if you alternated between 0D and 1D variable as you iterated through?

You know, this is something I've thought about a lot.

I'm generally not too happy with this optimization.

This is why I'd prefer an explicit check for scalar variable. It's easy to see and reason about the special-case.

OK, understood. The challenge is that this PR doesn't do much on my benchmarks without #9002, and my current theory is that we are limited by the number of Python method calls, so I feel like even len(tuple) will slow things down.

I'll try again, but if it's OK, I'm going to rebase onto #9002 until a resolution is found for those optimizations. A quick way to check the len() concern is sketched below.
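
For example, a quick (hypothetical, machine-dependent) micro-benchmark comparing the implicit truthiness check with an explicit len() call:

import timeit

dims = ("time", "x")
# implicit truthiness check, as in `elif var_dims := var.dims:`
print(timeit.timeit("if dims: pass", globals=globals()))
# explicit length check, as in `if len(var.dims):`
print(timeit.timeit("if len(dims): pass", globals=globals()))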

This targets optimization for datasets with many "scalar" variables
(that is, variables without any dimensions). This can happen when you
have many pieces of small metadata that relate to various facts about
an experimental condition.

For example, we have about 80 of these in our datasets (and I want to
increase this number).

Our datasets are quite large (on the order of 1 TB uncompressed), so we
often have one dimension that is in the tens of thousands.

However, it has become quite slow to index into the dataset.

We therefore often "carefully slice out the metadata we need" prior to
doing anything with our dataset, but that isn't quite possible when you
want to orchestrate things with a parent application.

These optimizations are likely "minor", but considering the results of
the benchmark, I think they are quite worthwhile:

* main (as of pydata#9001) - 2.5k its/s
* With pydata#9002 - 4.2k its/s
* With this pull request (on top of pydata#9002) - 6.1k its/s

Thanks for considering.
@hmaarrfk
Contributor Author

hmaarrfk commented May 7, 2024

On main:

[50.00%] ··· Running (indexing.IndexingDask.time_indexing_vectorized--).                                              
[56.25%] ··· indexing.Indexing.time_indexing_basic                                                                 ok 
[56.25%] ··· ================== ===========                                                                           
                    key                                                                                               
             ------------------ -----------                                                                           
                  1scalar        173±0.7μs                                                                            
                   1slice        180±0.9μs                                                                            
               1slice-1scalar     219±1μs                                                                             
              2slicess-1scalar    301±2μs  
             ================== ===========

[62.50%] ··· indexing.Indexing.time_indexing_basic_ds_large                                                        ok
[62.50%] ··· ================== =============
                    key                       
             ------------------ -------------
                  1scalar        3.07±0.02ms 
                   1slice        3.08±0.01ms 
               1slice-1scalar    3.17±0.01ms 
              2slicess-1scalar   3.30±0.02ms 
             ================== =============

On this branch:

[ 0.00%] ·· Benchmarking existing-py_home_mark_miniforge3_envs_xr_bin_python
[25.00%] ··· Running (indexing.Indexing.time_indexing_basic--)..
[75.00%] ··· indexing.Indexing.time_indexing_basic                                                                 ok
[75.00%] ··· ================== ===========
                    key                    
             ------------------ -----------
                  1scalar        172±0.9μs 
                   1slice        179±0.7μs 
               1slice-1scalar     217±1μs  
              2slicess-1scalar    299±1μs  
             ================== ===========

[100.00%] ··· indexing.Indexing.time_indexing_basic_ds_large                                                        ok
[100.00%] ··· ================== =============
                     key                      
              ------------------ -------------
                   1scalar        2.67±0.01ms 
                    1slice        2.67±0.01ms 
                1slice-1scalar    2.71±0.01ms 
               2slicess-1scalar   2.81±0.01ms 
              ================== =============

On the combined #9002 + this branch:

[ 0.00%] ·· Benchmarking existing-py_home_mark_miniforge3_envs_xr_bin_python
[25.00%] ··· Running (indexing.Indexing.time_indexing_basic--)..
[75.00%] ··· indexing.Indexing.time_indexing_basic                                                                 ok
[75.00%] ··· ================== ===========
                    key                    
             ------------------ -----------
                  1scalar        155±0.5μs 
                   1slice         146±1μs  
               1slice-1scalar     182±1μs  
              2slicess-1scalar    233±1μs  
             ================== ===========

[100.00%] ··· indexing.Indexing.time_indexing_basic_ds_large                                                        ok
[100.00%] ··· ================== =============
                     key                      
              ------------------ -------------
                   1scalar        2.67±0.01ms 
                    1slice        2.65±0.01ms 
                1slice-1scalar    2.71±0.02ms 
               2slicess-1scalar   2.77±0.01ms 
              ================== =============

@hmaarrfk hmaarrfk marked this pull request as draft May 18, 2024 14:31