Micro optimize dataset.isel for speed on large datasets #9003

Draft · wants to merge 2 commits into main

Conversation

hmaarrfk
Contributor

@hmaarrfk hmaarrfk commented May 6, 2024

This targets optimization for datasets with many "scalar" variables (that is, variables without any dimensions). This can happen when you have many pieces of small metadata that relate to various facts about an experimental condition.

For example, we have about 80 of these in our datasets (and I want to increase this number).

Our datasets are quite large (on the order of 1 TB uncompressed), so we often have one dimension that is in the tens of thousands.

However, it has become quite slow to index into the dataset.

We therefore often "carefully slice out the metadata we need" prior to doing anything with our dataset, but that isn't quite possible when you want to orchestrate things with a parent application.
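
A minimal sketch of the shape of such a dataset (the variable names and sizes below are made up for illustration, not our actual data):

import numpy as np
import xarray as xr

# One long dimension plus many dimensionless "metadata" variables.
data = {"image": (("time", "x"), np.zeros((10_000, 512)))}
data.update({f"meta_{i}": ((), i) for i in range(80)})  # ~80 scalar variables
ds = xr.Dataset(data)

# The operation this PR targets: only "image" carries the "time" dimension,
# yet isel still has to visit every variable in the dataset.
subset = ds.isel(time=slice(0, 100))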

These optimizations are likely "minor", but considering the benchmark results below, I think they are quite worthwhile.

Thanks for considering.

  • Closes #xxxx
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

xref: #2799
xref: #7045

@hmaarrfk hmaarrfk marked this pull request as ready for review May 6, 2024 01:58
@hmaarrfk
Contributor Author

hmaarrfk commented May 6, 2024

I'm happy to add benchmarks for these if you think it would help.

That said, I would love to leave that addition for future work. The time I can spend playing with this kind of speedup is used up for the week.

@dcherian dcherian added the run-benchmark Run the ASV benchmark workflow label May 6, 2024
@dcherian
Contributor

dcherian commented May 6, 2024

Thanks. Do you see any changes in our asv benchmarks in asv_bench/?

We'd be happy to take updates for those too :)

@hmaarrfk
Contributor Author

hmaarrfk commented May 6, 2024

Thanks. Do you see any changes in our asv benchmarks in asv_bench/?

I didn't get around to running asv locally (I was focused on getting pytest + mypy working).

The speedups here are more associated with:

  1. Few variables of interest in a dataset.
  2. Many variables with no dims.
  3. Slicing.

At a quick glance, I don't think such a benchmark exists. I could create one; a rough sketch follows.
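
A hypothetical asv-style benchmark for this pattern could look roughly like the following (class and variable names are illustrative; this is not an existing benchmark in asv_bench/):

import numpy as np
import xarray as xr

class IselManyScalarVariables:
    """Time isel on a dataset with one sliced dimension and many 0-d variables."""

    def setup(self):
        data = {"image": (("time", "x"), np.zeros((1000, 1000)))}
        data.update({f"meta_{i}": ((), i) for i in range(80)})
        self.ds = xr.Dataset(data)

    def time_isel_slice(self):
        self.ds.isel(time=slice(0, 100))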

Comment on lines +2990 to +2992
# Fastpath, skip all of this for variables with no dimensions
# Keep the result cached for future dictionary update
elif var_dims := var.dims:
Contributor

Suggested change
# Fastpath, skip all of this for variables with no dimensions
# Keep the result cached for future dictionary update
elif var_dims := var.dims:
elif var.ndim == 0:
continue
else:

Does this work?

This comment was marked as outdated.

Contributor Author

No wait, I spoke too soon. I had a typo. Oddly, it is slower...

Contributor Author

diff --git a/xarray/core/dataset.py b/xarray/core/dataset.py
index ec756176..4e8c31e5 100644
--- a/xarray/core/dataset.py
+++ b/xarray/core/dataset.py
@@ -2987,22 +2987,20 @@ class Dataset(
             if name in index_variables:
                 var = index_variables[name]
                 dims.update(zip(var.dims, var.shape))
-            # Fastpath, skip all of this for variables with no dimensions
-            # Keep the result cached for future dictionary update
-            elif var_dims := var.dims:
+            elif var.ndim == 0:
+                continue
+            else:
                 # Large datasets with alot of metadata may have many scalars
                 # without any relevant dimensions for slicing.
                 # Pick those out quickly and avoid paying the cost below
                 # of resolving the var_indexers variables
-                if var_indexer_keys := all_keys.intersection(var_dims):
+                if var_indexer_keys := all_keys.intersection(var.dims):
                     var_indexers = {k: indexers[k] for k in var_indexer_keys}
                     var = var.isel(var_indexers)
                     if drop and var.ndim == 0 and name in coord_names:
                         coord_names.remove(name)
                         continue
-                    # Update our reference to `var_dims` after the call to isel
-                    var_dims = var.dims
-                dims.update(zip(var_dims, var.shape))
+                dims.update(zip(var.dims, var.shape))
             variables[name] = var

         return self._construct_direct(

It was slower... which is somewhat unexpected; ndim should be "instant".

Contributor Author

Let me add a benchmark tonight to explicitly "show" that this is the better way; otherwise it will be too easy to undo.

Contributor Author

My conclusion is that:

  • len(tuple) seems to be pretty fast.
  • But the .shape attribute is only resolved after 4-5 levels of Python indirection, going down through a LazilyIndexedArray, MemoryCachedArray, H5BackedArray (sorry, I'm not getting the class names right); ultimately it isn't readily available and needs to be resolved.

My little heuristic test, with my dataset (93 variables):

In [16]: %%timeit
    ...: for v in dataset._variables.values():
    ...:     v.ndim
    ...:
119 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [17]: %%timeit
    ...: for v in dataset._variables.values():
    ...:     v.shape
    ...:
105 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [18]: %%timeit
    ...: for v in dataset._variables.values():
    ...:     v.dims
    ...:
7.66 µs ± 38.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [19]: %%timeit
    ...: for v in dataset._variables.values():
    ...:     v._dims
    ...:
3.1 µs ± 22 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [20]: len(dataset._variables)
93

I mean, micro-optimizations are sometimes dumb. That is why I've been breaking them out into distinct ideas as I find them, but taken together they can add up.

So in other words, my hypothesis is that the use of _dims is really helpful because it avoids the many indirections behind shape: dims is effectively a "cached" version of the shape (where every number is replaced with a string).
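
Roughly, the attribute chain looks like this (a paraphrase of what I understand the properties to do, not the exact xarray source):

class Variable:
    @property
    def dims(self):
        # a plain tuple stored directly on the Variable: one attribute lookup
        return self._dims

    @property
    def shape(self):
        # delegates to the wrapped array, which may itself be a stack of
        # lazy/caching wrappers around the on-disk backend array
        return self._data.shape

    @property
    def ndim(self):
        # pays the full .shape indirection chain just to take its length
        return len(self.shape)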

Contributor

len(v.dims) or len(v._dims) sounds OK to me. They're both readily understandable.

Contributor Author

Just so I better understand the xarray style:

The truthiness of tuples is not obvious enough, while len(tuple) is more obviously associated with a true/false statement?

Would a comment be OK if len(tuple) hurts performance?

Contributor

It's not about style, but about readability and understandability.

I've read this snippet about 6 times now, but I still have to look at it closely to see what it does. The perf improvement is also sensitive to the order of iteration over variables (what if you alternated between 0D and 1D variables as you iterated through?).

This is why I'd prefer an explicit check for scalar variable. It's easy to see and reason about the special-case.

Contributor Author

what if you alternated between 0D and 1D variable as you iterated through?

You know, this is something I've thought about a lot.

I'm generally not too happy with this optimization.

This is why I'd prefer an explicit check for scalar variable. It's easy to see and reason about the special-case.

OK, understood. The challenge is that this PR doesn't do much on my benchmarks without #9002, and my current theory is that we are limited by the number of Python method calls, so I feel like even len(tuple) will slow things down.

I'll try again, but if it's OK, I'm going to rebase onto #9002 until a resolution is found for those optimizations. A quick way to check the len() concern is sketched below.
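
For example, a quick (hypothetical, machine-dependent) micro-benchmark comparing the implicit truthiness check with an explicit len() call:

import timeit

dims = ("time", "x")
# implicit truthiness check, as in `elif var_dims := var.dims:`
print(timeit.timeit("if dims: pass", globals=globals()))
# explicit length check, as in `if len(var.dims):`
print(timeit.timeit("if len(dims): pass", globals=globals()))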

This targets optimization for datasets with many "scalar" variables
(that is, variables without any dimensions). This can happen when you
have many pieces of small metadata that relate to various facts about
an experimental condition.

For example, we have about 80 of these in our datasets (and I want to
increase this number).

Our datasets are quite large (on the order of 1 TB uncompressed), so we
often have one dimension that is in the tens of thousands.

However, it has become quite slow to index into the dataset.

We therefore often "carefully slice out the metadata we need" prior to
doing anything with our dataset, but that isn't quite possible when you
want to orchestrate things with a parent application.

These optimizations are likely "minor", but considering the results of
the benchmark, I think they are quite worthwhile:

* main (as of pydata#9001) - 2.5k its/s
* With pydata#9002 - 4.2k its/s
* With this pull request (on top of pydata#9002) - 6.1k its/s

Thanks for considering.
@hmaarrfk
Contributor Author

hmaarrfk commented May 7, 2024

On main:

[50.00%] ··· Running (indexing.IndexingDask.time_indexing_vectorized--).                                              
[56.25%] ··· indexing.Indexing.time_indexing_basic                                                                 ok 
[56.25%] ··· ================== ===========                                                                           
                    key                                                                                               
             ------------------ -----------                                                                           
                  1scalar        173±0.7μs                                                                            
                   1slice        180±0.9μs                                                                            
               1slice-1scalar     219±1μs                                                                             
              2slicess-1scalar    301±2μs  
             ================== ===========

[62.50%] ··· indexing.Indexing.time_indexing_basic_ds_large                                                        ok
[62.50%] ··· ================== =============
                    key                       
             ------------------ -------------
                  1scalar        3.07±0.02ms 
                   1slice        3.08±0.01ms 
               1slice-1scalar    3.17±0.01ms 
              2slicess-1scalar   3.30±0.02ms 
             ================== =============

On this branch:

[ 0.00%] ·· Benchmarking existing-py_home_mark_miniforge3_envs_xr_bin_python
[25.00%] ··· Running (indexing.Indexing.time_indexing_basic--)..
[75.00%] ··· indexing.Indexing.time_indexing_basic                                                                 ok
[75.00%] ··· ================== ===========
                    key                    
             ------------------ -----------
                  1scalar        172±0.9μs 
                   1slice        179±0.7μs 
               1slice-1scalar     217±1μs  
              2slicess-1scalar    299±1μs  
             ================== ===========

[100.00%] ··· indexing.Indexing.time_indexing_basic_ds_large                                                        ok
[100.00%] ··· ================== =============
                     key                      
              ------------------ -------------
                   1scalar        2.67±0.01ms 
                    1slice        2.67±0.01ms 
                1slice-1scalar    2.71±0.01ms 
               2slicess-1scalar   2.81±0.01ms 
              ================== =============

On the combined #9002 + this branch:

[ 0.00%] ·· Benchmarking existing-py_home_mark_miniforge3_envs_xr_bin_python
[25.00%] ··· Running (indexing.Indexing.time_indexing_basic--)..
[75.00%] ··· indexing.Indexing.time_indexing_basic                                                                 ok
[75.00%] ··· ================== ===========
                    key                    
             ------------------ -----------
                  1scalar        155±0.5μs 
                   1slice         146±1μs  
               1slice-1scalar     182±1μs  
              2slicess-1scalar    233±1μs  
             ================== ===========

[100.00%] ··· indexing.Indexing.time_indexing_basic_ds_large                                                        ok
[100.00%] ··· ================== =============
                     key                      
              ------------------ -------------
                   1scalar        2.67±0.01ms 
                    1slice        2.65±0.01ms 
                1slice-1scalar    2.71±0.02ms 
               2slicess-1scalar   2.77±0.01ms 
              ================== =============

@hmaarrfk hmaarrfk marked this pull request as draft May 18, 2024 14:31