Parallelising cubes concatenation #5750
Comments
I may be out of my depth here, but is a [direct] hash of floating-point values really useful? I imagine that some rounding, possibly user-adjustable, would avoid requiring down-to-the-last-bit agreement.
Hi @larsbarring, good point indeed. But I believe that the current implementation is also based on exact array equality comparisons (without tolerance)? As far as I understand, coordinate comparisons use
Yes, I am afraid that is the case (without knowing exactly how it is done), which has bitten us several times (e.g. here). Why enforce that level of precision when comparing floating-point values?
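To make the concern concrete, here is a minimal sketch (the function name `hash_array` and the rounding parameter are hypothetical, not Iris code) of why hashing raw float bytes demands bit-identical values, and how rounding before hashing restores a tolerance:

```python
import hashlib

import numpy as np


def hash_array(arr, decimals=None):
    """Hash an array's raw bytes; optionally round first to tolerate tiny differences."""
    if decimals is not None:
        arr = np.round(arr, decimals)
    return hashlib.sha256(np.ascontiguousarray(arr).tobytes()).hexdigest()


a = np.array([0.1 + 0.2])  # 0.30000000000000004 in binary floating point
b = np.array([0.3])

# Direct hashes differ because the values differ in the last bit.
print(hash_array(a) == hash_array(b))                          # False
# Rounding to a user-chosen precision before hashing makes them agree.
print(hash_array(a, decimals=9) == hash_array(b, decimals=9))  # True
```

The trade-off is that rounding turns the hash into an equivalence-class key, so two arrays that straddle a rounding boundary could still hash differently despite being within tolerance.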
Let's have a go at the hashing solution!
Aah -- this sounds promising :-)) pinging @ljoakim for info
In ESMValTool, we have some recipes that require concatenating a long list of cubes. This is done with CubeList.concatenate, which, as far as I understand, loops over the cubes, identifies cubes with compatible signatures, and checks whether the coordinates of matching cubes are equal (in proto_cube.register):

iris/lib/iris/_concatenate.py, lines 335 to 353 in f8a45be
As in #5743, we observe a significant slowdown when we try to act on a list of cubes with lazy coordinates. These coordinates need to be computed (read from disk) in order to compare them with those of other cubes, and this is currently carried out sequentially. Of course we could skip the auxiliary/derived coordinate comparison (e.g. check_aux_coords=False), but we would like to keep all checks to make sure the concatenation is robust.

We are considering possible approaches to speed up the concatenation for similar use cases, i.e. long lists of cubes with lazy coordinates. One way would be to realise all coordinates, which could be done in parallel for all cubes, but this has the disadvantage of a considerably larger memory footprint.
Another potential strategy (suggested by my colleague @bouweandela) would be to load the coordinates, hash them, and store only the hashes, to be used later for comparisons between cubes. By running the coordinate loading and hashing for all cubes in parallel, one could get a considerable performance improvement without significantly increasing the memory footprint.
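A minimal sketch of that strategy, assuming per-cube coordinate arrays and a hypothetical `coord_hash` helper (not Iris API): each array is hashed once, the full values can then be discarded, and later equality checks compare short digests instead of whole arrays.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def coord_hash(values):
    """Digest dtype, shape and raw bytes so equal arrays map to equal hashes."""
    h = hashlib.sha256()
    h.update(str(values.dtype).encode())
    h.update(str(values.shape).encode())
    h.update(np.ascontiguousarray(values).tobytes())
    return h.hexdigest()


# Hypothetical per-cube coordinate arrays; in practice these would be
# loaded lazily and realised only transiently for hashing.
cube_coords = [np.linspace(0.0, 1.0, 5) for _ in range(3)]

# Load-and-hash in parallel; only the small digests are retained.
with ThreadPoolExecutor() as pool:
    hashes = list(pool.map(coord_hash, cube_coords))

print(len(set(hashes)) == 1)  # True: all three coordinates match
```

Note that a raw-bytes hash reintroduces the exact-equality question raised in the comments: two arrays must be bit-identical to share a digest, unless the values are rounded to some tolerance before hashing.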
What are your thoughts on this? How would you see the (optional) use of hashes for coordinate comparison when concatenating cubes?