Parallelising cubes concatenation #5750

Open · fnattino opened this issue Feb 16, 2024 · 5 comments · May be fixed by #5926

@fnattino (Contributor)

In ESMValTool, we have some recipes that require concatenating a long list of cubes.

This is done with `CubeList.concatenate`, which, as far as I understand, loops over the cubes, identifies cubes with compatible signatures, and checks whether the coordinates of matching cubes are equal (in `proto_cube.register`):

```python
for cube in cubes:
    name = cube.standard_name or cube.long_name
    proto_cubes = proto_cubes_by_name[name]
    registered = False

    # Register cube with an existing proto-cube.
    for proto_cube in proto_cubes:
        registered = proto_cube.register(
            cube,
            axis,
            error_on_mismatch,
            check_aux_coords,
            check_cell_measures,
            check_ancils,
            check_derived_coords,
        )
        if registered:
            axis = proto_cube.axis
            break
```

As in #5743, we observe a significant slowdown when we try to act on a list of cubes with lazy coordinates. These coordinates need to be computed (read from disk) in order to compare them with those of other cubes, and this is currently carried out sequentially. Of course we could skip the auxiliary/derived coordinate comparison (e.g. `check_aux_coords=False`, as in the snippet below), but we would like to keep all checks to make sure the concatenation is robust.
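For reference, this is how the check can be relaxed today (a sketch; the file pattern is made up for illustration):

```python
import iris

# Hypothetical input: many NetCDF files holding slices of one dataset.
cubes = iris.load("slices_*.nc")

# Skipping the aux-coord comparison avoids realising lazy coordinates,
# at the cost of weaker guarantees about the result.
result = cubes.concatenate(check_aux_coords=False)
```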

We are thinking of possible approaches to speed up the concatenation for similar use cases, i.e. long lists of cubes with lazy coordinates. One way to do this would be to "realise" all coordinates, which could be done in parallel for all cubes (see the sketch below), but the disadvantage is a considerably larger memory footprint.
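A rough illustration of this first option (a sketch only, not existing Iris API; it assumes the lazy points are Dask arrays, which is how Iris stores them):

```python
import dask

def realise_aux_coords(cubes):
    """Realise all lazy auxiliary-coordinate points in one parallel pass.

    Sketch only: trades memory for speed, since every coordinate array
    ends up fully loaded before the concatenation starts.
    """
    lazy_coords = [
        coord
        for cube in cubes
        for coord in cube.aux_coords
        if coord.has_lazy_points()
    ]
    # A single dask.compute call lets the scheduler batch the disk
    # reads, instead of triggering them one comparison at a time.
    realised = dask.compute(*(c.core_points() for c in lazy_coords))
    for coord, points in zip(lazy_coords, realised):
        coord.points = points
```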

Another potential strategy (suggested by my colleague @bouweandela) would be to load the coordinates, hash them, and store only the hashes, to be used later for comparisons between cubes. By running the coordinate loading and hashing for all cubes in parallel, one could get a considerable performance improvement without significantly increasing the memory footprint. Something along the lines of the sketch below.
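A hypothetical helper to make the idea concrete (not existing Iris API):

```python
import hashlib

import dask
import numpy as np

@dask.delayed
def _digest(points):
    # Dask realises lazy arrays passed into a delayed call, so the
    # disk read happens here, inside the parallel graph.
    return hashlib.sha256(np.ascontiguousarray(points).tobytes()).hexdigest()

def coordinate_hashes(cubes):
    """Per cube, digest every aux-coordinate points array, in parallel.

    Sketch only: equal digests mean bit-identical arrays (with
    overwhelming probability), so cube comparisons can use the small
    hashes instead of the full coordinate arrays.
    """
    tasks = [
        [_digest(coord.core_points()) for coord in cube.aux_coords]
        for cube in cubes
    ]
    # One compute call: loading and hashing run in parallel over cubes.
    return dask.compute(*tasks)
```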

What are your thoughts on this? How would you see the (optional) use of hashes for coordinate comparison when concatenating cubes?

@larsbarring (Contributor)

I may be out of my depth here, but is a [direct] hash of floating-point values really useful? I imagine that some rounding, possibly user-adjustable, would avoid an unwarranted down-to-the-last-bit agreement requirement.

@fnattino (Contributor, Author)

Hi @larsbarring, good point indeed. But I believe the current implementation is also based on exact array equality comparisons (without tolerance). As far as I understand, coordinate comparisons use `iris.util.array_equal`, which accounts for the potential presence of NaNs but does not include any tolerance factor. Illustrated below.
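To make the behaviour concrete (assuming the `withnans` keyword of `iris.util.array_equal`; compare with NumPy's tolerance-based check):

```python
import numpy as np
from iris.util import array_equal

a = np.array([1.0, np.nan])
b = a + 1e-12  # differs from `a` only in the last few bits

print(array_equal(a, a, withnans=True))   # True: NaNs compare equal
print(array_equal(a, b, withnans=True))   # False: exact, no tolerance
print(np.allclose(a, b, equal_nan=True))  # True: tolerance-based
```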

@larsbarring (Contributor) commented Feb 20, 2024

Yes, I am afraid that is the case (without knowing exactly how it is done), which has bitten us several times (e.g. here). Why enforce that level of precision when comparing floating-point values?

@trexfeathers (Contributor)

Let's have a go at the hashing solution!

@trexfeathers added this to the v3.10 milestone Feb 29, 2024
@larsbarring (Contributor)

Aah -- this sounds promising :-)) pinging @ljoakim for info
