Ensure that initialize_lebedev is called --> Clean up spaghetti in cubature #2736
Previously, initialize_lebedev was never called, and in fact was getting optimized out of the module completely upon compilation. When lebedev_mapping_[] is then accessed across multiple OpenMP threads, the std::map is empty, and a deadlock can happen where two threads try to access-write (since [key] fills if key is not found), and the slightly slower thread ends up in a Bad State where it thinks there is a value but ends up infinitely looping on the lookup (the program will hang on []).
This only happens once every several thousand runs, and only when running with a high degree of parallelism in a system with many atoms. I cannot induce it in captivity, but I have observed it in the wild.
Anyway, [] accesses on std::map aren't thread-safe if you aren't super-duper sure the map is fully filled for all keys you'd ever look up, which *should* be the case if initialize_lebedev was ever called anywhere. But it wasn't, and that was Bad.
Now it's called exactly once (thanks, c++11's call_once!). The hangs should be gone, though I'll have to churn through another several thousand runs to be reasonably sure (as, again, it is a very rare kind of hang).
That said, as far as I can tell, besides one print function the resulting order_ that's assigned to is never *used*. Maybe a candidate to be axed in the future?
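A minimal sketch of the once-only initialization described above, assuming hypothetical names (`mapping`, `initialize_mapping`, `npoints_for_order`) rather than the actual psi4 code; the key/value pairs are illustrative only. `std::call_once` runs the fill exactly once even if several threads arrive together, and looking keys up with `find()` instead of `[]` means a lookup can never insert:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the static lookup table (illustrative values).
static std::map<int, int> mapping;
static std::once_flag mapping_once;

// Fills the table; intended to run exactly once before any lookup.
static void initialize_mapping() {
    mapping = {{3, 6}, {5, 14}, {7, 26}};
}

// Looks up a key that is expected to exist. find() never inserts, so once
// the table is filled, concurrent lookups only read the map.
int npoints_for_order(int order) {
    std::call_once(mapping_once, initialize_mapping);
    auto it = mapping.find(order);
    assert(it != mapping.end());
    return it->second;
}

// Runs the same lookup from several threads and collects each result.
std::vector<int> lookup_in_parallel(int order, int nthreads) {
    std::vector<int> results(static_cast<std::size_t>(nthreads));
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&results, order, t] {
            results[static_cast<std::size_t>(t)] = npoints_for_order(order);
        });
    }
    for (auto& w : workers) w.join();
    return results;
}
```

Each thread writes only its own slot of `results`, so the only shared state is the map, which is read-only after the one-time fill.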
More worryingly, I also see another psi4/psi4/src/psi4/libfock/cubature.cc Line 5065 in ac8f87a
Or have I misunderstood something basic? 😆
Thanks for the thorough report and fix. The seemingly extra declaration @susilehtola pointed out confused me, too, though perhaps https://stackoverflow.com/a/17392441 is the answer. The map looks to be from the boost-to-avoid-c++11-standard era, so it could be improved (https://github.com/psi4/psi4archive/blame/1.0.x/src/lib/libfock/cubature.h#L302). |
The "other" lebedev_mapping_: That's just the declaration for it.
The problem is that `SphericalGrid::build` is a _static_ method so the
constructor isn't called when that happens. Someone might yank out the
`new` there and still statically access `lebedev_mapping_` and we are back
where we started. There's no guarantee of construction, so I stapled it
into the one place it's actually used.
(And, again, I don't understand why it's used at all as it never seems to
show up downstream)
Even if it was in the constructor though (which would fire on the _new_) it
would still need the mutex to ensure it's initialized once as
lebedev_mapping_ is also (purposefully) static. No sense doing the rebuild
of the map on every single object instantiation.
…On Wed, Oct 5, 2022, 4:59 AM Susi Lehtola ***@***.***> wrote:
lebedev_mapping_ is a member of SphericalGrid, so initialize_lebedev()
should be called in the constructor of SphericalGrid. No need to add
mutexes etc.
More worryingly, I also see another lebedev_mapping_ in cubature.cc
https://github.com/psi4/psi4/blob/ac8f87a1dd3fdda2aabc3318713d6e5ce00e2c70/psi4/src/psi4/libfock/cubature.cc#L5065
The initialization code is pretty simple, however. One could just replace the psi4/psi4/src/psi4/libfock/cubature.cc Line 5065 in ac8f87a
This would also avoid the need for the mutexes. The code is hacky also elsewhere; |
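For illustration, a sketch of the initializer-based alternative raised in this comment, assuming a hypothetical accessor `lebedev_table()` and illustrative entries (this is not the actual psi4 table). It uses a function-local static, which is one variation on the idea: since C++11 such a static is initialized exactly once in a thread-safe way, so no explicit mutex or `call_once` is needed:

```cpp
#include <map>

// Hypothetical accessor; the keys and values are illustrative only.
const std::map<int, int>& lebedev_table() {
    // Initialized on first use; C++11 guarantees this happens once even
    // when several threads reach this line concurrently.
    static const std::map<int, int> table = {{3, 6}, {5, 14}, {7, 26}};
    return table;
}

// Returns the mapped value, or -1 when the key is absent. The map is const,
// so operator[] is unavailable and a lookup can never insert.
int lookup_or_minus_one(int order) {
    const auto& table = lebedev_table();
    auto it = table.find(order);
    return it == table.end() ? -1 : it->second;
}
```

Making the table `const` also turns an accidental `table[key]` into a compile error rather than a silent insertion.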
I will admit I considered that, then decided I didn't want to reformat the map into the initializer because it was long and I was lazy. My bigger concern is still: what was the consequence of
And I can't find anywhere that seems to use Unfortunately, trying to look further back in the history of
I'm running through tests now to see if just... removing this entirely breaks anything. Short of |
Haha yes spaghetti code! |
Anyone wanting to look back in psi4 history beyond 2016 (when history was rewritten to shrink the codebase by 90% by getting rid of boost tarballs) would need to use the psi4/psi4archive repo, https://github.com/psi4/psi4archive/blame/1.0.x/src/lib/libfock/cubature.h#L302 . Yes, acknowledged spaghetti :-( |
The original coder probably added it anticipating it would be useful, but it seems to have never been used. I'm happy to see unused code burn. |
Burn, baby, burn. Patch incoming... |
Shipping it. |
Both RadialGrid and SphericalGrid were constructed but never used by anything except a somewhat obscure print situation. All real logic handled by Mngr classes. Cleaning up to make things easier to hack on in the future. All tests pass, so whatever this code was added for was possibly never implemented, or unimportant. If something obscure in 3rd party ISA tools breaks, though, this is the commit to (probably) look at...
I think I have a version that maintains the nice printing functionality (and shoves the necessary info into a struct, and actually uses the nicer |
(Note that I expect this will probably fail to build or run but I cannot debug with an incremental build until late tonight at best, probably tomorrow) |
Clearly I can't get this working right now and I won't be able to touch this from a useful computer until tomorrow. Sorry. |
I think if rs[A] is so big that its length won't fit in a regular old `int`, we have a bigger problem...
I now have a version that compiles at home, though I'm not getting the narrowing warning-as-error that the auto-builder is. Still I've patched for it. This should work now...
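One common way to address a narrowing diagnostic of this kind, assuming the offending expression was a container size (`std::size_t`) being stored in an `int`; the helper name `checked_size` is hypothetical and the actual patch may differ:

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Converts a container size to int explicitly. The assert documents the
// assumption that the size fits; the cast makes the conversion intentional.
int checked_size(const std::vector<double>& values) {
    const std::size_t n = values.size();
    assert(n <= static_cast<std::size_t>(INT_MAX));
    return static_cast<int>(n);
}
```

The explicit `static_cast` states the intent, so compilers that reject implicit narrowing (for example inside brace initialization) accept it.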
This PR does more than the description tells me. Why are we removing so much? |
return std::shared_ptr<RadialGrid>(grid);
}
std::shared_ptr<RadialGrid> RadialGrid::build_treutler(int npoints, double alpha, int Z) {
This is the only Treutler version with correct Eta values. It does give slightly different results compared to the currently used "Treutler" where Eta is always 1. Though I had no time to try to confirm if this one is fully correct.
Even if it's correct, it was never actually used anywhere. The only use of the RadialGrid built was to provide data in print_details -- it's not accessed by anything or anyone outside of that function, at least as far as all of the psi4 tests can determine. In addition, the only calls to RadialGrid::build specifically were of the type RadialGrid::build("Unknown, ... which never invoked build_becke or build_treutler.
Yes, I tested it in a local build. So many parts of cubature.cc are messy, unfortunately. Could you move the Eta table to line 2045 as a comment and add a note that the eta mapping is missing, please?
When I added the table a long time ago I wasn't aware these routines were never called.
Done
psi4/src/psi4/libfock/cubature.h
Outdated
int npoints_;
/// Alpha scale (for user reference)
double alpha_;
/// Nodes (including alpha)
double* r_;
/// Weights (including alpha and r^2)
double* w_;
Please do not introduce raw pointers when their use can easily be avoided. You can just duplicate the data into std::vector<double>.
Fixed (I think? My C++ is very rusty, if you haven't noticed...).
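A minimal sketch of the change requested in this thread, assuming a hypothetical print-only record named `RadialGridInfo` and helper `make_radial_info` (the names and layout in the actual patch may differ). The data is copied into `std::vector<double>` so the record owns it and no manual `delete[]` is needed:

```cpp
#include <vector>

// Hypothetical print-only record; field meanings follow the snippet above.
struct RadialGridInfo {
    int npoints;
    double alpha;           // alpha scale (for user reference)
    std::vector<double> r;  // nodes (including alpha)
    std::vector<double> w;  // weights (including alpha and r^2)
};

// Copies the raw arrays into vectors so the record owns its own data.
RadialGridInfo make_radial_info(int npoints, double alpha, const double* r, const double* w) {
    RadialGridInfo info;
    info.npoints = npoints;
    info.alpha = alpha;
    info.r.assign(r, r + npoints);
    info.w.assign(w, w + npoints);
    return info;
}
```

Because the vectors manage their own storage, the struct can be copied or destroyed without any custom destructor.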
int Lsphere = spherical_grids_[A][R]->order();
printer->Printf(" Node: %4zu, R = %11.3E, WR = %11.3E, Nsphere = %6d, Lsphere = %6d\n", R, Rval, Wval,
Nsphere, Lsphere);
printer->Printf(" Atom: %4zu, Nrad = %6d, Alpha = %11.3E:\n", A,
Just to be clear on this point: radial_grids_ exists now as a convenience struct for the sake of this printout, but the data being printed actually existed and is used elsewhere. Is that right, or am I missing something?
Yes, radial_grids_ is a convenience struct. I checked, and it looks like, e.g., r and wr do get used downstream, or at least, used in the construction of the MassPoints that get push_back'd into the atomic_grids_ of MolecularGrid. I suppose I haven't checked what happens, long-term, with those, but hopefully something...
Okay, in that case, does it make more sense to not print this at all but rely on the user to call MassPoint.print() or MassPoint.print_details() when necessary?
I looked into this and my conclusion was that grid pruning and cutoff thresholds also act on the MassPoints, so recovering things like the initial Lebedev grid or number of grid points isn't possible. The MassPoints don't really carry around sufficient information to reconstruct the state printed in print_details. But I'm not very familiar with the codebase OR what's happening under the hood here, so another set of eyes to check my assertion that this Cannot Be Done would be appreciated.
However, if y'all are OK ditching print_details altogether, then all these problems go away. The replacement structs were added b/c of @susilehtola's requests to not break the existing print information.
Any information that isn't actually used except in printing, I'm okay removing.
I've made a new PR #2743 to JUST fix the hang bug, and then we can keep bikeshedding here on what we actually want to do about
Obviously this patch would undo all of that patch but I think the scope of this discussion exceeded the original PR statement. |
This is turning into a spaghetti PR, so breakup appreciated. |
Thanks for getting the most important part of the PR cleared away. As far as I can see, the next steps are:
This part of the code is not pretty, and I commend you for dealing with it at all. |
I'm not sure how much rebasing you want to do, but I think things are back to the state-of-the-world that we were discussing prior to #2743
"Used" is a funny word here. No tests when this print is removed, so it's not so important that someone wrote a test to preserve it. Furthermore, the If by "used" you mean that the piece of information represented in the print is used downstream: psi4/psi4/src/psi4/libfock/cubature.cc Lines 3814 to 3822 in 9db9100
I'm not sure if we care about diagnostic print info on the intermediaries (the radii of shells around atoms we care about, and the number of angular points we end up picking for each shell). Maybe we do? If so, a struct seems like the most reasonable solution. I suppose, all-in-all, it's not so much data to carry around. Appending that data onto every But we definitely should clean up/remove the rest of the old |
1 sounds good. On 2., yes, agreed that high-level, having both Are the |
Too many changes since original review
I've been quiet b/c I am hoping the other primary devs will chime in here -- I'm not sure how to move forwards given conflicting desires (and given that the immediate urgency of "this causes program hangs" has been fixed). |
This I would be in favour of removing the classes at the cost of removing the detailed printing. |
Closes #2735
Previously, initialize_lebedev was never called, and in fact was getting optimized out of the module completely upon compilation. When lebedev_mapping_[] is then accessed across multiple OpenMP threads, the std::map is empty, and a deadlock can happen where two threads try to access-write (since [key] fills if key is not found), and the slightly slower thread ends up in a Bad State where it thinks there is a value but ends up infinitely looping on the lookup (the program will hang on []).
This only happens once every several thousand runs, and only when running with a high degree of parallelism in a system with many atoms. I cannot induce it in captivity, but I have observed it in the wild.
Anyway, [] accesses on std::map aren't thread-safe if you aren't super-duper sure the map is fully filled for all keys you'd ever look up, which should be the case if initialize_lebedev was ever called anywhere. But it wasn't, and that was Bad.
Now it's called exactly once (thanks, c++11's call_once! I do see that this isn't used anywhere else in the code, but I do see mutex is imported in several files, so I don't think I'm adding any new deps here).
The hangs should be gone, though I'll have to churn through another several thousand runs to be reasonably sure (as, again, it is a very rare kind of hang). This will take me a few days to confirm, but given all debugging efforts point to this being the problem, I'm like 99% confident this will do the trick.
That said, as far as I can tell, besides one print function the resulting order_ that's assigned to is never used. Maybe a candidate to be axed in the future?
Description
Actually invokes initialize_lebedev before accessing lebedev_mapping_ to ensure the mapping has values, and prevents a deadlock when running in parallel
User API & Changelog headlines
Prevents a nasty, rare hang
Dev notes & details
See the main PR body
Questions
What does order_ actually do in SphericalGrid? It never appears to be used anywhere except one print function that also appears unused.
Checklist
Status