Use OMPI without LSF integration on LSF #12556
This has nothing to do with LSF. The error is:

Error: system limit exceeded on number of files that can be open
Node: sqg6e31

You need to increase the limit on the number of files that can be open, just like it says:
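Raising the open-file limit is usually done per shell session with `ulimit`; a minimal sketch follows (the value 4096 is an arbitrary example, and whether you can raise the soft limit depends on the hard limit your cluster admins configured):

```shell
# Inspect the current soft limit on open file descriptors
ulimit -Sn

# Raise the soft limit for this shell session; it cannot exceed the
# hard limit reported by `ulimit -Hn`, hence the fallback message
ulimit -Sn 4096 2>/dev/null || echo "hard limit too low to raise"
```

For a permanent change you would typically edit `/etc/security/limits.conf` (or ask the cluster admins to), since `ulimit` only affects the current shell and its children.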
I have tried to set it with … I am now trying to fix LSF integration for OMPI in Spack, but strangely, just adding …
And I can now also confirm that after … it runs just fine: no need to set any MCA parameters or change system limits. I am trying to fix the packages that broke downstream. I suspect this is because I haven't properly fixed the Spack package yet, so it's not clear whether I will succeed, as this breaks all packages downstream that depend on MPI.
@robertsawko Please keep us posted on any new issues.
Absolutely, I do want to get to the bottom of it. I have only just got access to another LSF cluster; my main one is actually down until tomorrow, possibly later, so I couldn't look into it this week.
Ah, sorry, I should have said: I started a Spack issue on LSF LIBDIR here, but last week my LSF cluster also went into a week-long maintenance, so I didn't have a computer to test on.
I confess to being puzzled as to how the LSF libdir can impact the MPI stack (outside of mpirun itself). Nothing in MPI depends on or integrates with LSF. |
Thanks, @rhc54 - you may be right. Maybe LSF is a red herring... When I compile OMPI manually I add all sorts of switches:
and my Spack spec was pretty basic:
So I need to test that. Specifically, I need to test whether adding knem and mxm explicitly, rather than relying on …, makes a difference.
Yes, the LSF integration may be a red herring. I am sorry. It looks like the error is caused by adding …, which I misread as being in the wrong directory. Now I can see clearly that all ranks were indeed starting in the correct working directory, and this error has something to do with the environment. For instance, here they're discussing it in the context of a container and running as root. There are a few posts with people trying to run as root, but that's not the case for me, so I am not sure what I've done wrong here. I am checking this more carefully now.
Hmm... it looks like the source of my problem is some inconsistency in the environment. If I run without -x flags, non-launch nodes are unaware of Spack. If I add the basic ones like PATH and LD_LIBRARY_PATH, OpenFOAM thinks I am running as root or something equivalent. I am trying to devise a sensible wrapper...
After many trials and not so many tribulations I managed to produce a wrapper which reproduces the Spack environment consistently across all nodes. I am happy for this to be closed, but could you please advise whether there's a better way to propagate the environment across all nodes? Maybe the LSF integration was doing just that? When I was running this:

nothing but the launch node would know about my Spack environment. Naively adding …
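The thread doesn't include the actual wrapper, but a minimal sketch of such a script might look like this (the Spack install path and environment name are placeholders, not taken from the thread):

```shell
#!/usr/bin/env bash
# mpi-wrapper.sh - recreate the Spack environment on each node
# before running the real command. Hypothetical paths/names below.
source /path/to/spack/share/spack/setup-env.sh   # placeholder Spack prefix
spack env activate my-openfoam-env                # placeholder env name
exec "$@"                                         # run the wrapped command
```

It would then be launched as something like `mpirun ... ./mpi-wrapper.sh solverName -parallel`, so every rank sets up Spack locally instead of relying on the launch node's environment being forwarded.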
Sorry, one more comment, as it is pertinent to my original question. I've run some more scripts and I can confirm that with LSF integration the launch-node environment is fully reproduced on all other nodes, whereas without LSF integration PATH et al. are set to system defaults. So this has been the source of my misery all along.
LSF automatically forwards your entire environment. However, ssh does not - so when launching via ssh, your environment will not get forwarded. Easiest way around that is to add the key envars to your login shell script (e.g., .bashrc). Trying to forward the entire environment under ssh would be problematic as there are limits to the size of the overall ssh string. So the only alternative solution is to ask that the user specify which envars should be forwarded. |
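The "user specifies which envars should be forwarded" option described above corresponds to OpenMPI's `-x` flag on `mpirun`; a sketch (the application name and rank count are placeholders):

```shell
# Forward only the named variables from the launch node to all ranks.
# Each -x takes one variable; values are read from the launch node.
mpirun -np 4 -x PATH -x LD_LIBRARY_PATH ./my_app   # ./my_app is a placeholder
```

This keeps the ssh launch string small, which is exactly the size concern mentioned above, at the cost of having to enumerate the variables yourself.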
I am going to close this, as it is really a solved problem (the wrapper), and fix the Spack package.
I believe this should be relatively simple, but I am struggling to find the right combination of switches.
My target application is quite complex: OpenFOAM, ParaView + Catalyst V2 with OSMESA, both using OpenMPI v5. I've used Spack to build it on an x86 RHEL7 cluster. Unfortunately, the Spack OpenMPI package doesn't support the lsf-libdir option, which on this cluster is required to correctly build OpenMPI with LSF integration. So I ended up with my whole stack built but no LSF integration. Lawless land, it seems.
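For context, when building OpenMPI by hand the missing option corresponds to the `--with-lsf-libdir` configure flag; a hedged sketch (the prefix and LSF paths below are placeholders for whatever the cluster actually uses):

```shell
# Hypothetical manual build with LSF integration; adjust paths to the
# cluster's real LSF installation (the libdir often sits under an
# architecture-specific subdirectory, hence the separate flag).
./configure --prefix="$HOME/opt/openmpi-5" \
            --with-lsf=/opt/lsf/10.1 \
            --with-lsf-libdir=/opt/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
make -j && make install
```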
I've already tested my setup for small jobs and now I am about to launch a medium-size job: 512 ranks, spanned over 16 nodes, 32 cores per node and 1 rank per physical core.
but I still end up seeing the following error.
Please advise if there's anything I could improve in my mpirun invocation. The displayed mapping looks correct, but clearly, before binding(?), something goes very wrong. Also, if you think it's impossible to correctly call mpirun for medium and large jobs without LSF integration, then I am happy to focus on fixing the Spack package. I had been meaning to do that for a while.
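The actual command from the report was lost, but an illustrative OpenMPI v5 invocation for the layout described above (512 ranks, 16 nodes, 32 ranks per node, one rank per physical core) could look like this; it is a sketch, not the reporter's command, and the hostfile name is a placeholder:

```shell
# ppr:32:node places 32 ranks on each node; --bind-to core pins each
# rank to one physical core; hosts.txt is a placeholder hostfile
mpirun -np 512 --map-by ppr:32:node --bind-to core \
       --hostfile hosts.txt ./solver
```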