Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure that initialize_lebedev is called --> Clean up spaghetti in cubature #2736

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

tallakahath
Copy link
Contributor

@tallakahath tallakahath commented Oct 5, 2022

Closes #2735

Previously, initialize_lebedev was never called, and in fact was getting optimized out of the module completely upon compilation. When lebedev_mappping_[] is then accessed across multiple OpenMP threads, the std::map is empty, and a deadlock can happen where two threads try to access-write (since [key] fills if key is not found), and the slightly slower thread ends up in a Bad State where it thinks there is a value but ends up infinitely looping on the lookup (the program will hang on []).

This only happens once every several thousand runs, and only when running with a high degree of parallelism in a system with many atoms. I cannot induce it in captivity, but I have observed it in the wild.

Anyway, [] accesses on std::map aren't thread-safe if you aren't super-duper sure the map is fully filled for all keys you'd ever look up, which should be the case if initialize_lebedev was ever called anywhere. But it wasn't, and that was Bad.

Now it's called exactly once (thanks, c++11's call_once! I do see that this isn't used anywhere else in the code, but I do see mutex is imported in several files, so I don't think I'm adding any new deps here).

The hangs should be gone, though I'll have to churn through another several thousand runs to likely be sure (as, again, it is a very rare kind of hang). This will take me a few days to confirm, but given all debugging efforts point to this being the problem, I'm like 99% confident this will do the trick.

That said, as far as I can tell, besides one print function the resulting order_ that's assigned to is never used. Maybe a candidate to be axed in the future?

Description

Actually invokes initialize_lebedev before accessing lebedev_mapping_ to ensure the mapping has values, and prevents a deadlock when running in parallel

User API & Changelog headlines

Prevents a nasty, rare hang

Dev notes & details

See the main PR body

Questions

  • What does order_ actually do in SphericalGrid? It never appears to be used anywhere except one print function that also appears unused.

Checklist

  • No tests needed -- no observable changes and it's hard to write a test for a very rare race condition :)

Status

  • Ready for review
  • Ready for merge

Previously, initialize_lebedev was never called, and in fact was getting
optimized out of the module completely upon compilation. When
lebedev_mappping_[] is then accessed across multiple OpenMP threads, the
std::map is empty, and a deadlock can happen where two threads try to
access-write (since [key] fills if key is not found), and the slightly
slower thread ends up in a Bad State where it thinks there is a value
but ends up infinitely looping on the lookup (the program will hang on
[]).

This only happens once every several thousand runs, and only when
running with a high degree of parallelism in a system with many atoms. I
cannot induce it in captivity, but I have observed it in the wild.

Anyway, [] accesses on std::map aren't thread-safe if you aren't
super-duper sure the map is fully filled for all keys you'd ever look
up, which *should* be the case if initialize_lebedev was ever called
anywhere. But it wasn't, and that was Bad.

Now it's called exactly once (thanks, c++11's call_once!).

The hangs should be gone, though I'll have to churn through another
several thousand runs to likely be sure (as, again, it is a very rare
kind of hang).

That said, as far as I can tell, besides one print function the
resulting order_ that's assigned to is never *used*. Maybe a candidate
to be axed in the future?
@susilehtola
Copy link
Member

susilehtola commented Oct 5, 2022

lebedev_mapping_ is a member of SphericalGrid, so initialize_lebedev() should be called in the constructor of SphericalGrid. No need to add mutexes etc.

More worryingly, I also see another lebedev_mapping_ in cubature.cc

std::map<int, int> SphericalGrid::lebedev_mapping_;

Or have I misunderstood something basic? 😆

@loriab
Copy link
Member

loriab commented Oct 5, 2022

Thanks for the thorough report and fix. The seemingly extra declaration @susilehtola pointed out confused me, too, though perhaps https://stackoverflow.com/a/17392441 is the answer. The map looks to be from the boost-to-avoid-c++11-standard era, so it could be improved (https://github.com/psi4/psi4archive/blame/1.0.x/src/lib/libfock/cubature.h#L302).

@tallakahath
Copy link
Contributor Author

tallakahath commented Oct 5, 2022 via email

@susilehtola
Copy link
Member

The initialization code is pretty simple, however. One could just replace the initialize_lebedev function by an initializer list in

std::map<int, int> SphericalGrid::lebedev_mapping_;

This would also avoid the need for the mutexes.

The code is hacky also elsewhere; lebedev_error duplicates the same information. For what it's worth, I will be developing a general C++ Becke quadrature library that would eliminate these parts of the code.

@tallakahath
Copy link
Contributor Author

I will admit I considered that, then decided I didn't want to reformat the map into the initializer because it was long and I was lazy.

My bigger concern is still: what was the consequence of order_ being incorrectly 0 for so long, and who actually uses this property up-stream? The git commit where this was added says Hack DFT grids to retain indexing data for ISA but the only files touched were:

src/lib/libfock/cubature.cc.REMOVED.git-id
src/lib/libfock/cubature.h
src/lib/libfock/gridblocker.h

And I can't find anywhere that seems to use order_ or the order() method of SphericalGrid.

Unfortunately, trying to look further back in the history of cubature.cc results in several instances of psi4/src/psi4/libfock/cubature.cc.REMOVED.git-id which (I think?) means things were stripped from the repo at some point, and the resulting commit stuff is all assigned to the wrong person (e.g., nearly all of cubature.cc is assigned to one commit 7e4ecf9 from dgasmith).

print_details is the only place the spherical_grids_ are ever apparently used, and I can comment out the block where they print nicely. That function is only ever used in v.cc and only if print_ > 2. I don't think this is used much, and is creating unnecessary complexity, and doesn't feature in any tests I can find.

I'm running through tests now to see if just... removing this entirely breaks anything. Short of print_details being accessible through some print settings, it doesn't appear that any of this is even exposed on the python side anyway?

@susilehtola
Copy link
Member

I'm running through tests now to see if just... removing this entirely breaks anything. Short of print_details being accessible through some print settings, it doesn't appear that any of this is even exposed on the python side anyway?

Haha yes spaghetti code!

@loriab
Copy link
Member

loriab commented Oct 5, 2022

Anyone wanting to look back in psi4 history beyond 2016 (when history was rewritten to shrink the codebase by 90% by getting rid of boost tarballs) would need to use the psi4/psi4archive repo, https://github.com/psi4/psi4archive/blame/1.0.x/src/lib/libfock/cubature.h#L302 .

Yes, acknowledged spaghetti :-(

@JonathonMisiewicz
Copy link
Contributor

My bigger concern is still: what was the consequence of order_ being incorrectly 0 for so long, and who actually uses this property up-stream? The git commit where this was added says Hack DFT grids to retain indexing data for ISA but the only files touched were:

src/lib/libfock/cubature.cc.REMOVED.git-id
src/lib/libfock/cubature.h
src/lib/libfock/gridblocker.h

And I can't find anywhere that seems to use order_ or the order() method of SphericalGrid.

The original coder probably added it anticipating it would be useful, but it seems to have never been used. I'm happy to see unused code burn.

@tallakahath
Copy link
Contributor Author

100% tests passed, 0 tests failed out of 520

The original coder probably added it anticipating it would be useful, but it seems to have never been used. I'm happy to see unused code burn.

Burn, baby, burn. Patch incoming...

@tallakahath
Copy link
Contributor Author

RadialGrid also looks ripe for burning -- smoketests passed, running full tests now.

@tallakahath
Copy link
Contributor Author

100% tests passed, 0 tests failed out of 520

Shipping it.

Both RadialGrid and SphericalGrid were constructed but never used by
anything except a somewhat obscure print situation. All real logic
handled by Mngr classes. Cleaning up to make things easier to hack on in
the future.

All tests pass, so whatever this code was added for was possibly never
implemented, or unimportant. If something obscure in 3rd party ISA tools
breaks, though, this is the commit to (probably) look at...
@tallakahath
Copy link
Contributor Author

I think I have a version that maintains the nice printing functionality (and shoves the necessary info into a struct, and actually uses the nicer LebedevGridMgr::findOrderByNPoints function and other LebedevGridMgr functions) but I can't actually test if this builds until I get home due to #reasons. I'll shove it upstream on my local branch and see if it breaks in the test builder.

@tallakahath
Copy link
Contributor Author

(Note that I expect this will probably fail to build or run but I cannot debug with an incremental build until late tonight at best, probably tomorrow)

@tallakahath
Copy link
Contributor Author

Clearly I can't get this working right now and I won't be able to touch this from a useful computer until tomorrow. Sorry.

I think if rs[A] is so big that it's length won't fit in a regular old
`int`, we have a bigger problem...
@tallakahath
Copy link
Contributor Author

I now have a version that compiles at home, though I'm not getting the narrowing warning-as-error that the auto-builder is. Still I've patched for it. This should work now...

@hokru
Copy link
Member

hokru commented Oct 8, 2022

This PR does more than the description tells me. Why are we removing so much?


return std::shared_ptr<RadialGrid>(grid);
}
std::shared_ptr<RadialGrid> RadialGrid::build_treutler(int npoints, double alpha, int Z) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only Treutler version with correct Eta values. It does give slightly different results compared to the currently used "Treutler" where Eta is always 1. Though I had no time to try to confirm if this one is fully correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if it's correct, it was never actually used anywhere. The only use of the RadialGrid built was to provide data in print_details -- it's not accessed by anything or anyone outside of that function, at least as far as all of the psi4 tests can determine. In addition, the only calls to RadialGrid::build specifically were of the type RadialGrid::build("Unknown, ... which never invoked build_becke or build_treutler.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, i tested it in a local build. So many parts of cubature.cc are messy unfortunately. Could you move the Eta table to line 2045 as a comment and add a note that the eta mapping is missing, please?
When I added the table long time ago I wasn't aware these routines were never called.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 71 to 77
int npoints_;
/// Alpha scale (for user reference)
double alpha_;
/// Nodes (including alpha)
double* r_;
/// Weights (including alpha and r^2)
double* w_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not introduce raw pointers, when their use can easily be avoided. You can just duplicate the data into std::vector<double>.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed (I think? My C++ is very rusty, if you haven't noticed...).

int Lsphere = spherical_grids_[A][R]->order();
printer->Printf(" Node: %4zu, R = %11.3E, WR = %11.3E, Nsphere = %6d, Lsphere = %6d\n", R, Rval, Wval,
Nsphere, Lsphere);
printer->Printf(" Atom: %4zu, Nrad = %6d, Alpha = %11.3E:\n", A,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear on this point: radial_grids_ exists now as a convenience struct for the sake of this printout, but the data being printed actually existed and is used elsewhere. Is that right, or am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, radial_grids_ is a convenience struct. I checked, and it looks like, e.g., r and wr do get used downstream, or at least, used in the construction of the MassPoints that get push_back'd into the atomic_grids_ of MolecularGrid. I suppose I haven't checked what happens, long-term, with those, but hopefully something...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, in that case, does it make more sense to not print this at all but rely on the user to call MassPoint.print() or MassPoint.print_details() when necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into this and my conclusion was that grid pruning and cutoff thresholds also acts on the MassPoints, so recovering things like the initial lebedev grid or number of grid points isn't possible. The MassPoints don't really carry around sufficient information to reconstruct the state printed in print_details. But I'm not very familiar with the codebase OR what's happening under-the-hood here, so another set of eyes to check my assertion that this Cannot Be Done would be appreciated.

However, if ya'll are OK ditching print_details all together, then all these problems go away. The replacement structs were added b/c of @susilehtola 's requests to not break the existing print information.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any information that isn't actually used except in printing, I'm okay removing.

@tallakahath tallakahath changed the title Ensure that initialize_lebedev is called Ensure that initialize_lebedev is called --> Clean up spaghetti in cubature Oct 10, 2022
@tallakahath
Copy link
Contributor Author

I've made a new PR #2743 to JUST fix the hang bug, and then we can keep bikeshedding here on what we actually want to do about RadialGrid and SphericalGrid but have a fix in the codebase that actually stops the nasty hang that took me down this rabbit hole in the first place.

  • @susilehtola this initializes with an initializer list like you requested
  • @hokru this loses NO information over the prior state-of-the-world like you suggested was desirable (if I was reading correctly)
  • @JonathonMisiewicz I guess you won't like it because spaghetti didn't go away but it's short and sweet, at least.

Obviously this patch would undo all of that patch but I think the scope of this discussion exceeded the original PR statement.

@JonathonMisiewicz
Copy link
Contributor

This is turning into a spaghetti PR, so breakup appreciated.

@JonathonMisiewicz
Copy link
Contributor

Thanks for getting the most important part of the PR cleared away. As far as I can see, the next steps are:

  1. Fix the merge conflicts (annoying rebase)
  2. Determine what information in print_details is actually used. I want to retain print information that is actually used, but I can do away with the rest. Once we have that, we can assess if keeping a modified RadialGrid is the best way to keep the useful print information. (I think we've established that r and wr are used, but order is not?)

This part of the code is not pretty, and I commend you for dealing with it at all.

@tallakahath
Copy link
Contributor Author

  1. Fix the merge conflicts (annoying rebase)

I'm not sure how much rebasing you want to do, but I think things are back to the state-of-the-world that we were discussing prior to #2743

  1. Determine what information in print_details is actually used. I want to retain print information that is actually used, but I can do away with the rest. Once we have that, we can assess if keeping a modified RadialGrid is the best way to keep the useful print information. (I think we've established that r and wr are used, but order is not?)

"Used" is a funny word here. No tests when this print is removed, so it's not so important that someone wrote a test to preserve it. Furthermore, the order_ was always wrong. The other elements weren't, though, but you have to set a pretty high print level to actually see anything (v.cc requires print_ > 2).

If by "used" you mean that the piece of information represented in the print is used downstream: r and wr are the primary candidates here. That said, they aren't used as-is -- they're dotted through other elements found by a lookup in LebedevGridMgr and RadialGridMgr to make MassPoints (see

for (int j = 0; j < numAngPts; j++) {
MassPoint mp = {r[i] * anggrid[j].x, r[i] * anggrid[j].y, r[i] * anggrid[j].z,
wr[i] * anggrid[j].w};
mp = std_orientation.MoveIntoPosition(mp, A);
mp.w *= nuc.computeNuclearWeight(mp, A, stratmannCutoff);
if (std::abs(mp.w) > weightcut) {
atomic_grids_[A].push_back(mp);
}
)

I'm not sure if we care about diagnostic print info on the intermediaries (the radii of shells around atoms we care about, and the number of angular points we end up picking for each shell). Maybe we do? If so, a struct seems like the most reasonable solution. I suppose, all-in-all, it's not so much data to carry around. Appending that data onto every MassPoint however feels like overkill (and that's the only other place to put it).

But we definitely should clean up/remove the rest of the old RadialGrid and SphericalGrid, as these seem to only cause confusion about where changes should be made. RadialGridMgr and LebedevGridMgr are the classes of Actual Importance.

@JonathonMisiewicz
Copy link
Contributor

JonathonMisiewicz commented Oct 11, 2022

1 sounds good.

On 2., yes, agreed that high-level, having both Grid and GridMgr is confusing, and that's the issue to deal with. If theGrid classes are only used for print purposes, then while the print functionality belongs somewhere, it's not there. Keeping this old code around doesn't make that task easier doesn't make print setup in the correct place any easier. We should then just be rid of the classes entirely, printing included.

Are the Grid classes used for something other than printing?

@JonathonMisiewicz JonathonMisiewicz dismissed their stale review October 11, 2022 19:00

Too many changes since original review

@tallakahath
Copy link
Contributor Author

Are the Grid classes used for something other than printing?

Grid classes are only used for printing. Hence why I moved things to skinnier structs, so that it's clear to future devs that "nothing goes here but storage" (and maybe even with a big comment about "THIS CONTENT IS NOT REFERENCED EXCEPT BY A PRINT". There's no reasonable way I can see to migrate this into the MassPoints.

I've been quiet b/c I am hoping the other primary devs will chime in here -- I'm not sure how to move forwards given conflicting desires (and given that the immediate urgency of "this causes program hangs" has been fixed).

@hokru
Copy link
Member

hokru commented Oct 26, 2022

This print_details is more a debug printing for devs than for users. The quadrature points themselves can be printed with the BlockOPoints objects if needed. The r and wr are intermediates.

I would be in favour of removing the classes at the cost of removing the detailed printing.
It's better to re-think how to print this information instead of trying to fit it into these classes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

lebedev_mapping_ access is thread-unsafe (and also never initialized?)
5 participants