Parallelising cubes concatenation #5750

Open · fnattino opened this issue Feb 16, 2024 · 5 comments · May be fixed by #5926

@fnattino (Contributor)

In ESMValTool, we have some recipes that require concatenating a long list of cubes.

This is done with `CubeList.concatenate`, which, as far as I understand, loops over the cubes, identifies cubes with compatible signatures, and checks whether the coordinates of matching cubes are equal (in `proto_cube.register`):

```python
for cube in cubes:
    name = cube.standard_name or cube.long_name
    proto_cubes = proto_cubes_by_name[name]
    registered = False

    # Register cube with an existing proto-cube.
    for proto_cube in proto_cubes:
        registered = proto_cube.register(
            cube,
            axis,
            error_on_mismatch,
            check_aux_coords,
            check_cell_measures,
            check_ancils,
            check_derived_coords,
        )
        if registered:
            axis = proto_cube.axis
            break
```

As in #5743, we observe a significant slowdown when we try to act on a list of cubes with lazy coordinates. These coordinates need to be computed (read from disk) in order to compare them with those of other cubes, and this is currently carried out sequentially. Of course we could skip the auxiliary/derived coordinate comparison (e.g. `check_aux_coords=False`, as in the snippet below), but we would like to keep all checks to make sure the concatenation is robust.
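For reference, this is how the check can be relaxed today (a sketch; the file pattern is made up for illustration):

```python
import iris

# Hypothetical input: many NetCDF files holding slices of one dataset.
cubes = iris.load("slices_*.nc")

# Skipping the aux-coord comparison avoids realising lazy coordinates,
# at the cost of weaker guarantees about the result.
result = cubes.concatenate(check_aux_coords=False)
```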

We are thinking of possible approaches to speed up the concatenation for similar use cases, i.e. long lists of cubes with lazy coordinates. One way to do this would be to "realise" all coordinates, which could be done in parallel for all cubes (see the sketch below), but the disadvantage is a considerably larger memory footprint.
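A rough illustration of this first option (a sketch only, not existing Iris API; it assumes the lazy points are Dask arrays, which is how Iris stores them):

```python
import dask

def realise_aux_coords(cubes):
    """Realise all lazy auxiliary-coordinate points in one parallel pass.

    Sketch only: trades memory for speed, since every coordinate array
    ends up fully loaded before the concatenation starts.
    """
    lazy_coords = [
        coord
        for cube in cubes
        for coord in cube.aux_coords
        if coord.has_lazy_points()
    ]
    # A single dask.compute call lets the scheduler batch the disk
    # reads, instead of triggering them one comparison at a time.
    realised = dask.compute(*(c.core_points() for c in lazy_coords))
    for coord, points in zip(lazy_coords, realised):
        coord.points = points
```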

Another potential strategy (suggested by my colleague @bouweandela) would be to load the coordinates, hash them, and store only the hashes, to be used later for comparisons between cubes. By running the coordinate loading and hashing for all cubes in parallel, one could get a considerable performance improvement without significantly increasing the memory footprint. Something along the lines of the sketch below.
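A hypothetical helper to make the idea concrete (not existing Iris API):

```python
import hashlib

import dask
import numpy as np

@dask.delayed
def _digest(points):
    # Dask realises lazy arrays passed into a delayed call, so the
    # disk read happens here, inside the parallel graph.
    return hashlib.sha256(np.ascontiguousarray(points).tobytes()).hexdigest()

def coordinate_hashes(cubes):
    """Per cube, digest every aux-coordinate points array, in parallel.

    Sketch only: equal digests mean bit-identical arrays (with
    overwhelming probability), so cube comparisons can use the small
    hashes instead of the full coordinate arrays.
    """
    tasks = [
        [_digest(coord.core_points()) for coord in cube.aux_coords]
        for cube in cubes
    ]
    # One compute call: loading and hashing run in parallel over cubes.
    return dask.compute(*tasks)
```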

What are your thoughts on this? How would you see the (optional) use of hashes for coordinate comparison when concatenating cubes?

@larsbarring (Contributor)

I may be out of my depth here, but is a [direct] hash of floating-point values really useful? I imagine that some rounding, possibly user-adjustable, would avoid an unwarranted down-to-the-last-bit agreement requirement.

@fnattino (Contributor, Author)

Hi @larsbarring, good point indeed. But I believe the current implementation is also based on exact array equality comparisons (without tolerance). As far as I understand, coordinate comparisons use `iris.util.array_equal`, which accounts for the potential presence of NaNs but does not include any tolerance factor. Illustrated below.
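To make the behaviour concrete (assuming the `withnans` keyword of `iris.util.array_equal`; compare with NumPy's tolerance-based check):

```python
import numpy as np
from iris.util import array_equal

a = np.array([1.0, np.nan])
b = a + 1e-12  # differs from `a` only in the last few bits

print(array_equal(a, a, withnans=True))   # True: NaNs compare equal
print(array_equal(a, b, withnans=True))   # False: exact, no tolerance
print(np.allclose(a, b, equal_nan=True))  # True: tolerance-based
```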

@larsbarring (Contributor) commented Feb 20, 2024

Yes, I am afraid that is the case (without knowing exactly how it is done), which has bitten us several times (e.g. here). Why enforce that level of precision when comparing floating-point values?

@trexfeathers (Contributor)

Let's have a go at the hashing solution!

@trexfeathers added this to the v3.10 milestone Feb 29, 2024
@larsbarring (Contributor)

Aah -- this sounds promising :-)) pinging @ljoakim for info
