
3rd-party: bump openpmix submodule #12532

Closed
wants to merge 1 commit

Conversation

@wenduwan (Contributor) commented May 8, 2024

Track upstream master branch

@wenduwan (Contributor, Author) commented May 8, 2024

--> Running example: hello_c
+ timeout -s SIGSEGV 4m mpirun --get-stack-traces --timeout 180 --hostfile /home/ec2-user/workspace/pen-mpi.pull_request-v2_PR-12532/hostfile -np 2 --bind-to none ./examples/hello_c
--------------------------------------------------------------------------
PMIx was unable to find a usable compression library
on the system. We will therefore be unable to compress
large data streams. This may result in longer-than-normal
startup times and larger memory footprints. We will
continue, but strongly recommend installing zlib or
a comparable compression library for better user experience.

You can suppress this warning by adding "pcompress_base_silence_warning=1"
to your PMIx MCA default parameter file, or by adding
"PMIX_MCA_pcompress_base_silence_warning=1" to your environment.
--------------------------------------------------------------------------
Hello, world, I am 0 of 2, (Open MPI v5.1.0a1, package: Open MPI ec2-user@ip-172-31-9-81.us-west-2.compute.internal Distribution, ident: 5.1.0a1, repo rev: v2.x-dev-11395-g7b821da, Unreleased developer copy, 179)
Hello, world, I am 1 of 2, (Open MPI v5.1.0a1, package: Open MPI ec2-user@ip-172-31-9-81.us-west-2.compute.internal Distribution, ident: 5.1.0a1, repo rev: v2.x-dev-11395-g7b821da, Unreleased developer copy, 179)
+ ret=0
+ test 0 -ne 0
+ run_example 'timeout -s SIGSEGV 1m ' ./examples/hello_c
++ basename ./examples/hello_c
+ example=hello_c
+ echo '--> Running example: hello_c'
--> Running example: hello_c
+ timeout -s SIGSEGV 1m ./examples/hello_c
[ip-172-31-9-81:30012] *** Process received signal ***
[ip-172-31-9-81:30012] Signal: Segmentation fault (11)
[ip-172-31-9-81:30012] Signal code: Address not mapped (1)
[ip-172-31-9-81:30012] Failing at address: 0xf6efb9e
[ip-172-31-9-81:30012] [ 0] /lib64/libpthread.so.0(+0xf630)[0x7f7f18947630]
[ip-172-31-9-81:30012] [ 1] /lib64/libc.so.6(_IO_vfprintf+0x4a79)[0x7f7f185b7079]
[ip-172-31-9-81:30012] [ 2] /lib64/libc.so.6(vasprintf+0xa3)[0x7f7f185e1e73]
[ip-172-31-9-81:30012] [ 3] /home/ec2-user/workspace/pen-mpi.pull_request-v2_PR-12532/install/lib/libpmix.so.2(pmix_vasprintf+0x9)[0x7f7f17d1b929]
[ip-172-31-9-81:30012] [ 4] /home/ec2-user/workspace/pen-mpi.pull_request-v2_PR-12532/install/lib/libpmix.so.2(pmix_asprintf+0x87)[0x7f7f17d1b9c7]
[ip-172-31-9-81:30012] [ 5] /home/ec2-user/workspace/pen-mpi.pull_request-v2_PR-12532/install/lib/libpmix.so.2(PMIx_Init+0x2581)[0x7f7f17c7a2f1]
[ip-172-31-9-81:30012] [ 6] /home/ec2-user/workspace/pen-mpi.pull_request-v2_PR-12532/install/lib/libmpi.so.0(ompi_rte_init+0x14f)[0x7f7f18bde5cf]
[ip-172-31-9-81:30012] [ 7] /home/ec2-user/workspace/pen-mpi.pull_request-v2_PR-12532/install/lib/libmpi.so.0(+0x935ac)[0x7f7f18be75ac]
[ip-172-31-9-81:30012] [ 8] /home/ec2-user/workspace/pen-mpi.pull_request-v2_PR-12532/install/lib/libmpi.so.0(ompi_mpi_instance_init+0x5b)[0x7f7f18be835b]
[ip-172-31-9-81:30012] [ 9] /home/ec2-user/workspace/pen-mpi.pull_request-v2_PR-12532/install/lib/libmpi.so.0(ompi_mpi_init+0xe1)[0x7f7f18bdb681]
[ip-172-31-9-81:30012] [10] /home/ec2-user/workspace/pen-mpi.pull_request-v2_PR-12532/install/lib/libmpi.so.0(MPI_Init+0x9b)[0x7f7f18c0b71b]
[ip-172-31-9-81:30012] [11] ./examples/hello_c[0x40086e]
[ip-172-31-9-81:30012] [12] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f7f1858c555]
[ip-172-31-9-81:30012] [13] ./examples/hello_c[0x400779]
[ip-172-31-9-81:30012] *** End of error message ***
.ci/community-jenkins/pr-builder.sh: line 278: 30011 Segmentation fault      ${1} ${2}
+ ret=139
+ test 139 -ne 0
+ echo 'Example failed: 139'
Example failed: 139
+ echo 'Command was: timeout -s SIGSEGV 1m  ./examples/hello_c'
Command was: timeout -s SIGSEGV 1m  ./examples/hello_c
+ exit 139
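
As an aside, the compression warning near the top of the log is unrelated to the failure; it can be silenced exactly as the message suggests, for example:

# Optional: silence the PMIx compression warning (variable name taken verbatim from the message above)
export PMIX_MCA_pcompress_base_silence_warning=1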

The failure looks real. @rhc54 This happens on the openpmix master branch. Does it ring a bell?
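
For context, the two runs in that log differ only in how the example is launched, and only the second one fails. Stripped of the timeout/hostfile wrappers and extra options, they are roughly:

# Run 1: launched through mpirun; completes and prints "Hello, world" from both ranks
mpirun -np 2 --bind-to none ./examples/hello_c

# Run 2: launched directly as a singleton (no mpirun); segfaults inside PMIx_Init during MPI_Init
./examples/hello_c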

@wenduwan (Contributor, Author) commented May 8, 2024

bot:aws:retest

@rhc54 (Contributor) commented May 8, 2024

> This happens on the openpmix master branch. Does it ring a bell?

Not really - I just tested it on my Docker cluster and it works fine.

@rhc54 (Contributor) commented May 8, 2024

If you can configure with --enable-debug, maybe we can get a little more insight?
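
(For reference, a debug rebuild from this branch would look roughly like the following; the prefix and job count are just illustrative:)

# Rebuild with debugging enabled (illustrative sketch)
./autogen.pl
./configure --prefix=$HOME/ompi-debug --enable-debug
make -j 8 install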

@rhc54 (Contributor) commented May 8, 2024

Also, I don't really understand that output. What does it mean that the hello_c app apparently ran, got through MPI_Init, printed its "hello" message, and then I see a bunch of output ending in a segfault? How can the procs have segfaulted in MPI_Init if we are seeing all their output, which clearly indicates they got past MPI_Init just fine?

@wenduwan (Contributor, Author) commented May 9, 2024

@rhc54 Thanks Ralph. I need to poke this more. Will let you know.

@hppritcha (Member) commented

Rather odd: if I build with --enable-debug and run mpi4py by hand, the singleton case doesn't fail.

@rhc54 (Contributor) commented May 10, 2024

Is it --enable-debug that makes the difference? Or is it running mpi4py by hand? Or does it require both (which would really seem weird)?

@rhc54 (Contributor) commented May 10, 2024

I found a bug in PRRTE's modex procedure that might be contributing here (enabling debug would have made a difference to what I found). Anyway, you might need to pull that fix in. I also did a little cleanup in PMIx, though I'm not sure whether that would contribute to what you are seeing either.

Should have included the relevant links:

openpmix/openpmix#3344
openpmix/prrte#1980
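
(One way to try those before they merge is to fetch the PR heads straight into the submodules; a sketch, assuming the submodules live at 3rd-party/openpmix and 3rd-party/prrte:)

# Fetch the upstream PR branches for local testing (PR numbers from the links above)
cd 3rd-party/openpmix && git fetch origin pull/3344/head && git checkout FETCH_HEAD && cd ../..
cd 3rd-party/prrte && git fetch origin pull/1980/head && git checkout FETCH_HEAD && cd ../..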

@wenduwan (Contributor, Author) commented

Testing new patches from the openpmix project.

@rhc54 (Contributor) commented May 10, 2024

Fix for singletons is here: openpmix/openpmix#3345

@wenduwan (Contributor, Author) commented

Switched to Ralph's pmix branch for testing.

@rhc54 (Contributor) commented May 10, 2024

FWIW: I believe some of these tests are failing because you are patching the local 3rd-party code instead of advancing a submodule pointer. The problem is that the tests are trying to recursively clone the repo in your branch, and they cannot do that if the branch isn't tied to a commit.

If you are pointing your submodule at my branch, be aware that my branch gets deleted once the PR is committed. Best to just wait for commit and then advance the submodule to the head of master.
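
(Once the upstream PR lands, advancing the pointer is roughly the following, assuming the submodule lives at 3rd-party/openpmix:)

# Pin the openpmix submodule to the new upstream master head
cd 3rd-party/openpmix
git fetch origin && git checkout origin/master
cd ../..
git add 3rd-party/openpmix
git commit -s -m "3rd-party: bump openpmix submodule"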

@wenduwan (Contributor, Author) commented

I don't remember seeing this before, though:

ERROR: testCreateFromGroup (test_comm.TestCommWorld.testCreateFromGroup)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/ompi/ompi/test/test_comm.py", line 176, in testCreateFromGroup
    comm = MPI.Intracomm.Create_from_group(group)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/mpi4py/MPI.src/Comm.pyx", line 2210, in mpi4py.MPI.Intracomm.Create_from_group
mpi4py.MPI.Exception: MPI_ERR_UNKNOWN: unknown error

======================================================================
ERROR: testCreateFromGroup (test_comm.TestCommWorldDup.testCreateFromGroup)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/ompi/ompi/test/test_comm.py", line 176, in testCreateFromGroup
    comm = MPI.Intracomm.Create_from_group(group)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/mpi4py/MPI.src/Comm.pyx", line 2210, in mpi4py.MPI.Intracomm.Create_from_group
mpi4py.MPI.Exception: MPI_ERR_UNKNOWN: unknown error

----------------------------------------------------------------------

@rhc54 (Contributor) commented May 10, 2024

Not familiar with that function, but I can try to take a look. If it involves PMIx_Group_construct, I'm working on that now and can see if there is anything relevant.

Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
@wenduwan (Contributor, Author) commented

@rhc54 IIRC you mentioned some WIP in pmix that might fix the create group failure. Is that something we can test?

@wenduwan (Contributor, Author) commented

AWS CI also failed. Many OMB/IMB benchmarks did not start.

@hppritcha (Member) commented

Yes, the failures are due to a group construct issue. The MPI create-from-group(s) methods use the PMIx group construct/destruct methods.

@rhc54 (Contributor) commented May 16, 2024

Yeah, I mentioned this at the RM meeting earlier this week. It will take me a while to fix; I've got a lot going on right now. The issue isn't in PMIx, but rather in PRRTE.

@wenduwan (Contributor, Author) commented

Sorry, Ralph is correct. He said the issue was in PRRTE.

hppritcha added a commit to hppritcha/ompi that referenced this pull request May 21, 2024
advance sha to e32e0179bc.

related to open-mpi#12532

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
@wenduwan (Contributor, Author) commented

Closing in favor of #12565

@wenduwan closed this May 23, 2024
hppritcha added a commit to hppritcha/ompi that referenced this pull request May 30, 2024
advance sha to e32e0179bc.

related to open-mpi#12532

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Labels: mpi4py-all (Run the optional mpi4py CI tests), Target: main

3 participants