Failure on systems without SSH #152

Open
folmos-at-orange opened this issue Apr 19, 2024 · 6 comments

@folmos-at-orange
folmos-at-orange commented Apr 19, 2024

I tried to run an MPI application in a basic Rocky Linux 9 container with the conda-forge openmpi package. The application failed with the error message below. The error seems to occur because no ssh client was found.

  • Should openmpi depend on openssh? The Ubuntu and Rocky Linux packages do.

Error message

--------------------------------------------------------------------------
The value of the MCA parameter "plm_rsh_agent" was set to a path
that could not be found:

  plm_rsh_agent: ssh : rsh

Please either unset the parameter, or check that the path is correct
--------------------------------------------------------------------------
[ba7c70732e0d:00625] [[INVALID],INVALID] FORCE-TERMINATE AT Not found:-13 - error plm_rsh_component.c(335)
[ba7c70732e0d:00624] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 716
[ba7c70732e0d:00624] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 172
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Unable to start a daemon on the local node" (-127) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[ba7c70732e0d:00624] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
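
A reproduction along these lines should show the same failure (the container image and the final command here are placeholders, not the exact setup from the report):

# Minimal container without an ssh client (the basic Rocky Linux 9 image ships none)
docker run --rm -it rockylinux:9 bash
# ... install conda/mamba by any preferred method, then:
conda create -y -n mpi -c conda-forge openmpi
conda run -n mpi mpiexec -n 1 hostname   # should fail with the plm_rsh_agent error above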
@dalcinl
Contributor

dalcinl commented Apr 19, 2024

I'm not really sure. The dependency is actually just a runtime one. I've raised this issue with the Open MPI folks, but I don't remember the status.

The thing is, Open MPI does not strictly need ssh when running on a single compute node, workstation, or laptop. Open MPI eagerly looks for ssh and fails if it is not found, even if ssh is never going to be used.

For example, in my CI tests, when running under a small Docker image that does not ship ssh by default, I just do the following:

export OMPI_MCA_plm_ssh_agent=false

Here, ssh_agent means the path to the ssh command (it has nothing to do with the usual SSH agent related to ssh -A). By setting the ssh command to false, Open MPI can safely continue running on a single node; if an attempt is ever made to actually use ssh, it will invoke the false command, which simply fails, and you should notice the issue.
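
For a one-off launch, the same MCA parameter can also be passed on the mpiexec command line (the application name below is just a placeholder):

# Equivalent to the environment variable, but scoped to a single invocation.
# "false" is the /usr/bin/false executable, not a boolean, so any attempted
# ssh launch simply fails and the problem becomes visible.
mpiexec --mca plm_ssh_agent false -n 2 ./my_app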

In my own particular and biased opinion, I hate dependencies that are not strictly needed, and I'm generally not in favor of forcibly adding such dependencies as default ones. I'm totally fine with deps marked as optional or recommended, or with mechanisms such as pip install package[feature1,feature2], but I don't think conda supports any such mechanism. Or am I wrong?

Additionally, I'm a bit afraid of the consequences of users inadvertently installing the conda-forge openssh package in their conda environment: from then on, any ssh invocation will use the binary from conda instead of the one from the system.

All that being said, if the rest of the Open MPI conda-forge user community feels that in this particular case it makes sense to add openssh as a dependency of openmpi, you will not hear any further objection from me. Of course, if any issue ever occurs because of such a change, I'll simply roll my eyes with a sad smile, immediately unsubscribe from any issue/PR related to the problem, and happily keep going with my short and intranscendental human life.

@folmos-at-orange
Author

@dalcinl thanks for your quick answer.

I agree that adding openssh as a dependency is too much, as it is not strictly needed. I wonder whether the Open MPI guys could fall back to single-node mode when ssh is not found; mpich doesn't have this problem.

I'll use your workaround for my CI problems as well as for running the packaged program I'm working on. I was thinking of adding openssh as a requirement for my package, but setting the env var is a better solution.
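
For the packaged program, a small launcher wrapper would probably be enough (the script and application names here are hypothetical):

#!/bin/sh
# Hypothetical launcher shipped with the package: intended for single-node use,
# so skip Open MPI's ssh lookup before starting the real application.
export OMPI_MCA_plm_ssh_agent=false
exec my_packaged_app "$@"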

@dalcinl
Contributor

dalcinl commented Apr 19, 2024

I wonder whether the Open MPI guys could fall back to single-node mode when ssh is not found

That would be the definitive solution, indeed. However, have a look at open-mpi/ompi#12386, although I'm a bit confused about this comment: open-mpi/ompi#12386 (comment). After reading all of that issue, at this point I'm not really sure what the actual intended behavior is. Maybe you should ask for clarification: is the absence of ssh supposed to cause a failure, even when running on a single node (or without an allocation)?

@njzjz
Member

njzjz commented May 2, 2024

export OMPI_MCA_plm_ssh_agent=false

We encountered the same error in conda-forge/ambertools-feedstock#133 when testing the conda package with Open MPI. Is it recommended to set this environment variable during the tests? The conda-forge documentation (https://conda-forge.org/docs/maintainer/knowledge_base/#message-passing-interface-mpi) may need to add this information.

@dalcinl
Contributor

dalcinl commented May 3, 2024

@njzjz Whether to set the variable or not depends on whether you have the ssh command or not. If you are running in a minimal container environment without the ssh command, then you either install openssh with the package manager in the container, or you set the variable to work around the issue.
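
For example, on a Rocky Linux based image either option could look like this (the package name differs between distros):

# Option 1: provide an ssh client via the container's package manager
dnf install -y openssh-clients

# Option 2: skip the ssh lookup entirely for single-node test runs
export OMPI_MCA_plm_ssh_agent=false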

PS: Regarding conda-forge documentation, I personally have no involvement with any of that. I guess you can raise an issue or submit a pull request with whatever clarification you consider appropriate.

@leofang
Member

leofang commented May 3, 2024

PS: Regarding conda-forge documentation, I personally have no involvement with any of that. I guess you can raise an issue or submit a pull request with whatever clarification you consider appropriate.

Feel free to open an issue in https://github.com/conda-forge/conda-forge.github.io.
