openmpi osc_ucx_component error #12517

Open
SLgramps opened this issue May 3, 2024 · 4 comments


SLgramps commented May 3, 2024

Thank you for taking the time to submit an issue!

Background information

I have been building/running some simple parallel code (specifically, from https://github.com/modern-fortran, any of the example projects such as tsunami or weather-buoys). When using openmpi v4.1.5 (Fedora 39), everything built/ran as expected. Upon upgrading to Fedora 40, openmpi was also upgraded to v5.0.2 and the code, though running correctly and finishing, issues copious amounts of osc_ucx_component.c:369 errors as shown in the attached screenshot. Downgrading openmpi back to v4.1.5 eliminates the errors.

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v4.1.5 (works correctly)
v5.0.2 (issues osc_ucx_component errors)

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

packages installed from fedora repositories, OpenCoarrays downloaded from opencoarrays.org and built locally
[Attached screenshot: Screenshot_20240430_142440]

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version: linux kernel 6.8.7 and several predecessors
  • Computer hardware: Intel i7 12700KF, Asus Z-690 motherboard
  • Network type:

Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

Using v5.0.2, code runs but issues copious osc_ucx_component.c:369 errors as shown in the attached screenshot
Using v4.1.5, code runs without issuing errors
Does not seem to be affected by the version of linux, gfortran, or OpenCoarrays used

  1. install gfortran from fedora repositories, build OpenCoarrays from opencoarrays.org
  2. git clone, for instance, https://github.com/modern-fortran/weather-buoys (or tsunami, ch08) and build using the included makefiles with either v4.1.5 or v5.0.2 openmpi
  3. run parallel versions of code using cafrun -n 2 (or other number of cores)

(Attached screenshot is from running tsunami/ch08. Code from weather-buoys also exhibits this behavior.)

Note: If you include verbatim output (or a code block), please use a GitHub Markdown code block like below:

shell$ mpirun -n 2 ./hello_world
@devreal devreal added the bug label May 3, 2024
@devreal devreal added this to the v5.0.4 milestone May 3, 2024
Contributor

devreal commented May 3, 2024

@janjust Looks like this might be triggered by failed initialization of ucp. The message is definitely misleading.

Contributor

janjust commented May 6, 2024

Yeah, looks that way. @SLgramps, I'm guessing this is a local machine, no network?

Contributor

janjust commented May 6, 2024

As a simple workaround, just disabling the component will get you around this error: -mca osc ^ucx
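For readers launching through cafrun rather than calling mpirun directly, here is a minimal sketch of two ways the suggested setting might be applied. It assumes cafrun starts the program through Open MPI's launcher, so an exported environment variable reaches it. The executable name ./my_program is hypothetical and stands in for whatever the makefile produces.

```shell
# Sketch only: ./my_program is a hypothetical executable name.

# Option 1: pass the MCA parameter on the mpirun command line, e.g.
#   mpirun --mca osc ^ucx -n 2 ./my_program

# Option 2: set the equivalent OMPI_MCA_<name> environment variable.
# Child processes inherit it, so it also applies when a wrapper
# script (assumed here: cafrun) invokes the Open MPI launcher.
export OMPI_MCA_osc='^ucx'

# Launch only if both the wrapper and the (hypothetical) binary exist.
if command -v cafrun >/dev/null 2>&1 && [ -x ./my_program ]; then
    cafrun -n 2 ./my_program
else
    echo "OMPI_MCA_osc=${OMPI_MCA_osc} is set; nothing launched"
fi
```

The leading ^ tells Open MPI to exclude the listed component from the osc framework, leaving the remaining one-sided communication components available. Quoting the value keeps the shell from treating ^ specially.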

Author

SLgramps commented May 6, 2024

Yes, this was a local machine. Your suggested workaround eliminates the error. Thanks for looking into this.
