BUG, MAINT: segfaults through libfabric->ucx #10001

Closed
tylerjereddy opened this issue Apr 19, 2024 · 33 comments

@tylerjereddy
Contributor

I'm seeing a segfault/backtrace for NVSHMEM -> libfabric -> ucx control flow in a 2-node test run of GROMACS on one of our supercomputers, with OpenMPI 5.0.2 on Cray Slingshot 11. What I'm really looking for is clear runtime error messages that tell me what is wrong (API or ABI version mismatches, etc.) before I ever get to a segfault. I've labelled this a bug on the sole basis that I shouldn't be able to segfault, but the error may instead lie in, e.g., the use of fi_getinfo() "upstream" of the segfault (i.e., perhaps NVSHMEM should handle their runtime check differently?).

/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp 1691 GDRCopy requested, but unused by transport. Disabling.

[nid001225:49728:0:49728] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x18)
==== backtrace (tid:  49728) ====
 0  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ddcazrdaa6qbatqyqgcr34p5ejdc47as/lib/libucs.so.0(ucs_handle_error+0x294) [0x14d6611e4394]
 1  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ddcazrdaa6qbatqyqgcr34p5ejdc47as/lib/libucs.so.0(+0x30564) [0x14d6611e4564]
 2  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ddcazrdaa6qbatqyqgcr34p5ejdc47as/lib/libucs.so.0(+0x3082e) [0x14d6611e482e]
 3  /lib64/libpthread.so.0(+0x168c0) [0x14d6774918c0]
 4  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/libfabric-1.20.1-ftn27snaykegoaq57c2e4khavl6jfzy7/lib/libfabric.so.1(fi_dupinfo+0x35b) [0x14d5d87d3bab]
 5  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/libfabric-1.20.1-ftn27snaykegoaq57c2e4khavl6jfzy7/lib/libfabric.so.1(fi_dupinfo+0x206) [0x14d5d87d9ac6]
 6  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/libfabric-1.20.1-ftn27snaykegoaq57c2e4khavl6jfzy7/lib/libfabric.so.1(fi_getinfo+0x2c) [0x14d5d87d9b9c]
 7  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(nvshmemt_init+0xe8b) [0x14d5d8880fcb]
 8  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xe6c89) [0x14d666414c89]
 9  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xec01c) [0x14d66641a01c]
10  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xede99) [0x14d66641be99]
11  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(nvshmemx_host_init+0xae) [0x14d66641c30e]
12  /lustre/scratch5/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/lib64/libcufftMp.so.11(+0x52e635) [0x14d66878b635]
13  /lustre/scratch5/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/lib64/libcufftMp.so.11(+0x11b4c3) [0x14d6683784c3]
14  /lustre/scratch5/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/lib64/libcufftMp.so.11(cufftXtMakePlanGuru64+0x120) [0x14d6683a4450]
15  /lustre/scratch5/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/lib64/libcufftMp.so.11(cufftXtMakePlanMany+0x135) [0x14d6683a5105]
16  /lustre/scratch5/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/lib64/libcufftMp.so.11(cufftMakePlanMany64+0x5d) [0x14d6683a2d7d]
17  /lustre/scratch5/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/lib64/libcufftMp.so.11(cufftMakePlanMany+0x10f) [0x14d6683a31bf]
18  /lustre/scratch5/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/lib64/libcufftMp.so.11(cufftMakePlan3d+0x36) [0x14d6683a3386]
19  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/lib64/libgromacs_mpi.so.9(_ZN3gmx8Gpu3dFft11ImplCuFftMpC2EbP19ompi_communicator_tNS_8ArrayRefIKiEES6_ibRK13DeviceContextRK12DeviceStreamPiSD_SD_PPfSF_+0xc89) [0x14d6795edaa9]
20  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/lib64/libgromacs_mpi.so.9(_ZN3gmx8Gpu3dFftC1ENS_10FftBackendEbP19ompi_communicator_tNS_8ArrayRefIKiEES6_ibRK13DeviceContextRK12DeviceStreamPiSD_SD_PPfSF_+0x193) [0x14d6795eea93]
21  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/lib64/libgromacs_mpi.so.9(_Z20pme_gpu_reinit_3dfftPK6PmeGpu+0x53b) [0x14d6795e945b]
22  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/lib64/libgromacs_mpi.so.9(_Z14pme_gpu_reinitP9gmx_pme_tPK13DeviceContextPK12DeviceStreamPK13PmeGpuProgramb+0x21b) [0x14d6795e9e8b]
23  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/lib64/libgromacs_mpi.so.9(_Z12gmx_pme_initPK9t_commrecRK13NumPmeDomainsPK10t_inputrecPA3_Kffbbbffi10PmeRunModeP6PmeGpuPK13DeviceContextPK12DeviceStreamPK13PmeGpuProgramRKN3gmx8MDLoggerE+0xafd) [0x14d67943da0d]
24  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/lib64/libgromacs_mpi.so.9(_ZN3gmx8Mdrunner8mdrunnerEv+0x646e) [0x14d6794bf31e]
25  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/bin/gmx_mpi() [0x408f17]
26  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/bin/gmx_mpi() [0x409029]
27  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/lib64/libgromacs_mpi.so.9(_ZN3gmx24CommandLineModuleManager3runEiPPc+0x33a) [0x14d678cd660a]
28  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/bin/gmx_mpi() [0x405bcc]
29  /lib64/libc.so.6(__libc_start_main+0xef) [0x14d6770bb29d]
30  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/bin/gmx_mpi() [0x405c3a]
=================================

I've talked to NVIDIA engineers about this, and the problem really isn't clear to them. I did some experiments with runtime swapping of libfabric versions. NVSHMEM was built from source against libfabric 1.20.1 from spack.

Here is what happens if I use libfabric 1.18.1 at runtime instead:

/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:1524: non-zero status: -61 No providers matched fi_getinfo query: -61: No data available

and then a segfault at ucx again.
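For readers decoding the status above: libfabric calls return 0 on success or a negated errno-style value on failure, and on Linux errno 61 is ENODATA, which matches the "No data available" text. The following is a minimal Python sketch of that decoding, not code from NVSHMEM or libfabric; `describe_fi_status` is a hypothetical helper name, and the errno numbering is platform-specific, so the output is only meaningful on the same kind of system that produced the log.

```python
import errno
import os


def describe_fi_status(status: int) -> str:
    """Render a libfabric-style return code (0 or a negated errno) as text.

    Hypothetical helper for illustration only. It relies on the local
    platform's errno table, so run it on the system that produced the log.
    """
    if status == 0:
        return "0: success"
    # libfabric reports failures as negative values; accept either sign.
    code = -status if status < 0 else status
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{status}: {name} ({os.strerror(code)})"


if __name__ == "__main__":
    # The status reported at libfabric.cpp:1524 in the log above.
    print(describe_fi_status(-61))
```

This only translates the number; it says nothing about why no provider matched the query.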

The local C++ code they're using looks like this:

1521     status = fi_getinfo(FI_VERSION(NVSHMEMT_LIBFABRIC_MAJ_VER, NVSHMEMT_LIBFABRIC_MIN_VER), NULL,
1522                         NULL, 0, &info, &returned_fabrics);
1523                 
1524     NVSHMEMI_NZ_ERROR_JMP(status, NVSHMEMX_ERROR_INTERNAL, out,
1525                           "No providers matched fi_getinfo query: %d: %s\n", status,
1526                           fi_strerror(status * -1)); 

where those two version variables are set to 1 and 5, respectively. I know I've had some success in the past using libfabric 1.18.1 if I build NVSHMEM against that directly and use it at runtime. Is there a good reason I wouldn't be able to use 1.20.1, and if so how should the NVSHMEM folks guard against it?
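As a rough illustration of the kind of guard asked about here, the sketch below (Python with ctypes, purely illustrative and not NVSHMEM code) mirrors libfabric's FI_VERSION, FI_MAJOR and FI_MINOR macros and asks whichever libfabric the loader resolves for its version through fi_version(). The helper names, the default soname `libfabric.so.1`, and the comparison policy in `check_versions` are assumptions made for the example; in particular, treating any build/runtime difference as worth a warning is a choice, not something libfabric requires.

```python
import ctypes
from typing import List, Optional, Tuple

Version = Tuple[int, int]


def fi_version(major: int, minor: int) -> int:
    """Mirror of the FI_VERSION macro: major in the high 16 bits."""
    return (major << 16) | minor


def split_fi_version(version: int) -> Version:
    """Mirror of FI_MAJOR / FI_MINOR."""
    return version >> 16, version & 0xFFFF


def runtime_libfabric_version(soname: str = "libfabric.so.1") -> Optional[Version]:
    """Ask the libfabric found by the dynamic loader for its version.

    Returns None if the library or the fi_version symbol cannot be loaded.
    """
    try:
        lib = ctypes.CDLL(soname)
        lib.fi_version.restype = ctypes.c_uint32
        lib.fi_version.argtypes = []
        return split_fi_version(lib.fi_version())
    except (OSError, AttributeError):
        return None


def check_versions(requested_api: Version, built_against: Version,
                   runtime: Optional[Version]) -> List[str]:
    """Return human-readable problems (hypothetical policy, see lead-in)."""
    if runtime is None:
        return ["libfabric could not be loaded at runtime"]
    problems = []
    if runtime < requested_api:
        problems.append(
            f"runtime libfabric {runtime[0]}.{runtime[1]} is older than the "
            f"requested API {requested_api[0]}.{requested_api[1]}")
    if runtime != built_against:
        problems.append(
            f"built against libfabric {built_against[0]}.{built_against[1]} "
            f"but running with {runtime[0]}.{runtime[1]}")
    return problems


if __name__ == "__main__":
    runtime = runtime_libfabric_version()
    print("runtime libfabric:", runtime)
    # Values from this report: API 1.5 requested, NVSHMEM built against 1.20.
    for line in check_versions((1, 5), (1, 20), runtime):
        print("warning:", line)
```

A check like this, run before fi_getinfo(), would at least turn a silent build/runtime mismatch into a readable message; it would not by itself explain a query that matches no providers.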

@tylerjereddy
Contributor Author

tylerjereddy commented Apr 19, 2024

Also, I'm pretty sure that in this case I rebuilt NVSHMEM against 1.18.1 and still had the same problem, so the workaround isn't that simple. (I did change some other things: I moved to a newer OpenMPI, having previously been on a release candidate of the 5.x series, and I'm using a newer NVSHMEM now.)

Probably the most useful question is this: do I have a shot at debugging or fixing this without needing to rebuild my whole dependency chain? If I could just adjust LD_LIBRARY_PATH to fix this, that would be amazing, but I suspect I won't be so lucky.
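One cheap check before rebuilding anything is to see which libfabric copies are reachable through LD_LIBRARY_PATH and in what order. The sketch below (Python; `scan_library_path` is a hypothetical helper, not an existing tool) walks the variable entry by entry and reports whether each entry is an absolute, existing directory and which `libfabric.so*` files it contains. It only approximates the dynamic loader: RPATH/RUNPATH entries baked into the binaries and the ld.so cache are not considered.

```python
import glob
import os
from typing import List, Tuple


def scan_library_path(path_value: str,
                      pattern: str = "libfabric.so*") -> List[Tuple[str, str, List[str]]]:
    """Walk a colon-separated search path in order.

    Returns one (entry, status, matches) tuple per non-empty entry, where
    status is 'ok', 'relative' (no leading slash, so it is resolved against
    the current directory) or 'missing' (absolute but not a directory).
    """
    report = []
    for entry in path_value.split(":"):
        if not entry:
            # The loader treats an empty entry as the current directory;
            # it is skipped here to keep the report short.
            continue
        if not os.path.isabs(entry):
            status = "relative"
        elif not os.path.isdir(entry):
            status = "missing"
        else:
            status = "ok"
        matches: List[str] = []
        if status == "ok":
            found = glob.glob(os.path.join(glob.escape(entry), pattern))
            matches = sorted(os.path.basename(p) for p in found)
        report.append((entry, status, matches))
    return report


if __name__ == "__main__":
    for entry, status, matches in scan_library_path(
            os.environ.get("LD_LIBRARY_PATH", "")):
        print(f"{status:8s} {entry} {matches}")
```

If more than one entry reports a libfabric, the first one in the list is the one picked up from LD_LIBRARY_PATH. Running the failing binary with glibc's `LD_DEBUG=libs` set shows the loader's actual search and is the more authoritative check.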

@tylerjereddy
Contributor Author

tylerjereddy commented Apr 22, 2024

Clean rebuild of dependency chain with libfabric at 1.18.1, OpenMPI at v5.0.3, and NVSHMEM at 2.10.1 produces a similar backtrace:

nid001481:109123:109123 [3] NVSHMEM INFO [1] heap base: 0x14c9e0000000 NVSHMEM_SYMMETRIC_SIZE 1073741824 total 2147483648 heapextra 285225000
nid001481:109123:109123 [3] NVSHMEM INFO [1] mspace ptr: 0x14cba6908340
nid001480:5921:5921 [3] NVSHMEM INFO IBRC transport skipped in favor of: libfabric

nid001481:109123:109123 [3] NVSHMEM INFO IBRC transport skipped in favor of: libfabric

/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp 1691 GDRCopy requested, but unused by transport. Disabling.

/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp 1691 GDRCopy requested, but unused by transport. Disabling.

/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:1524: non-zero status: -61 No providers matched fi_getinfo query: -61: No data available

[nid001481:109123:0:109123] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace (tid: 109123) ====
 0  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/libucs.so.0(ucs_handle_error+0x294) [0x14cb9f892394]
 1  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/libucs.so.0(+0x30564) [0x14cb9f892564]
 2  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/libucs.so.0(+0x3082e) [0x14cb9f89282e]
 3  /lib64/libpthread.so.0(+0x168c0) [0x14cbb64cb8c0]
 4  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(+0xb9fd) [0x14cb2022f9fd]
 5  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(nvshmemt_init+0x11d2) [0x14cb20231312]
 6  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xe6c89) [0x14cba4ac2c89]
 7  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xec01c) [0x14cba4ac801c]
 8  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xede99) [0x14cba4ac9e99]
 9  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(nvshmemx_host_init+0xae) [0x14cba4aca30e]
10  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/lib64/libcufftMp.so.11(+0x52e635) [0x14cba6e39635]
11  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/lib64/libcufftMp.so.11(+0x11b4c3) [0x14cba6a264c3]
12  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/lib64/libcufftMp.so.11(cufftXtMakePlanGuru64+0x120) [0x14cba6a52450]
13  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/lib64/libcufftMp.so.11(cufftXtMakePlanMany+0x135) [0x14cba6a53105]
14  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/lib64/libcufftMp.so.11(cufftMakePlanMany64+0x5d) [0x14cba6a50d7d]
15  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/lib64/libcufftMp.so.11(cufftMakePlanMany+0x10f) [0x14cba6a511bf]
16  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/lib64/libcufftMp.so.11(cufftMakePlan3d+0x36) [0x14cba6a51386]
17  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/lib64/libgromacs_mpi.so.9(_ZN3gmx8Gpu3dFft11ImplCuFftMpC2EbP19ompi_communicator_tNS_8ArrayRefIKiEES6_ibRK13DeviceContextRK12DeviceStreamPiSD_SD_PPfSF_+0xc89) [0x14cbb7c9c029]
18  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/lib64/libgromacs_mpi.so.9(_ZN3gmx8Gpu3dFftC1ENS_10FftBackendEbP19ompi_communicator_tNS_8ArrayRefIKiEES6_ibRK13DeviceContextRK12DeviceStreamPiSD_SD_PPfSF_+0x193) [0x14cbb7c9d013]
19  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/lib64/libgromacs_mpi.so.9(_Z20pme_gpu_reinit_3dfftPK6PmeGpu+0x53b) [0x14cbb7c979bb]
20  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/lib64/libgromacs_mpi.so.9(_Z14pme_gpu_reinitP9gmx_pme_tPK13DeviceContextPK12DeviceStreamPK13PmeGpuProgramb+0x21b) [0x14cbb7c983eb]
21  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/lib64/libgromacs_mpi.so.9(_Z12gmx_pme_initPK9t_commrecRK13NumPmeDomainsPK10t_inputrecPA3_Kffbbbffi10PmeRunModeP6PmeGpuPK13DeviceContextPK12DeviceStreamPK13PmeGpuProgramRKN3gmx8MDLoggerE+0xafd) [0x14cbb7aebf5d]
22  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/lib64/libgromacs_mpi.so.9(_ZN3gmx8Mdrunner8mdrunnerEv+0x646e) [0x14cbb7b6d87e]
23  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/bin/gmx_mpi() [0x409027]
24  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/bin/gmx_mpi() [0x409139]
25  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/lib64/libgromacs_mpi.so.9(_ZN3gmx24CommandLineModuleManager3runEiPPc+0x33a) [0x14cbb73846ba]
26  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/bin/gmx_mpi() [0x405cdc]
27  /lib64/libc.so.6(__libc_start_main+0xef) [0x14cbb576929d]
28  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/bin/gmx_mpi() [0x405d4a]
=================================
/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:1524: non-zero status: -61 No providers matched fi_getinfo query: -61: No data available

[nid001480:5921 :0:5921] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace (tid:   5921) ====
 0  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/libucs.so.0(ucs_handle_error+0x294) [0x147123185394]
 1  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/libucs.so.0(+0x30564) [0x147123185564]
 2  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/libucs.so.0(+0x3082e) [0x14712318582e]
 3  /lib64/libpthread.so.0(+0x168c0) [0x147139dbe8c0]
 4  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(+0xb9fd) [0x1470a48249fd]
 5  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(nvshmemt_init+0x11d2) [0x1470a4826312]
 6  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xe6c89) [0x1471283b5c89]
 7  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xec01c) [0x1471283bb01c]
 8  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xede99) [0x1471283bce99]
 9  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(nvshmemx_host_init+0xae) [0x1471283bd30e]
10  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/lib64/libcufftMp.so.11(+0x52e635) [0x14712a72c635]
11  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/lib64/libcufftMp.so.11(+0x11b4c3) [0x14712a3194c3]
12  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/lib64/libcufftMp.so.11(cufftXtMakePlanGuru64+0x120) [0x14712a345450]
13  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/lib64/libcufftMp.so.11(cufftXtMakePlanMany+0x135) [0x14712a346105]
14  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/lib64/libcufftMp.so.11(cufftMakePlanMany64+0x5d) [0x14712a343d7d]
15  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/lib64/libcufftMp.so.11(cufftMakePlanMany+0x10f) [0x14712a3441bf]
16  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/lib64/libcufftMp.so.11(cufftMakePlan3d+0x36) [0x14712a344386]
17  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/lib64/libgromacs_mpi.so.9(_ZN3gmx8Gpu3dFft11ImplCuFftMpC2EbP19ompi_communicator_tNS_8ArrayRefIKiEES6_ibRK13DeviceContextRK12DeviceStreamPiSD_SD_PPfSF_+0xc89) [0x14713b58f029]
18  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/lib64/libgromacs_mpi.so.9(_ZN3gmx8Gpu3dFftC1ENS_10FftBackendEbP19ompi_communicator_tNS_8ArrayRefIKiEES6_ibRK13DeviceContextRK12DeviceStreamPiSD_SD_PPfSF_+0x193) [0x14713b590013]
19  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/lib64/libgromacs_mpi.so.9(_Z20pme_gpu_reinit_3dfftPK6PmeGpu+0x53b) [0x14713b58a9bb]
20  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/lib64/libgromacs_mpi.so.9(_Z14pme_gpu_reinitP9gmx_pme_tPK13DeviceContextPK12DeviceStreamPK13PmeGpuProgramb+0x21b) [0x14713b58b3eb]
21  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/lib64/libgromacs_mpi.so.9(_Z12gmx_pme_initPK9t_commrecRK13NumPmeDomainsPK10t_inputrecPA3_Kffbbbffi10PmeRunModeP6PmeGpuPK13DeviceContextPK12DeviceStreamPK13PmeGpuProgramRKN3gmx8MDLoggerE+0xafd) [0x14713b3def5d]
22  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/lib64/libgromacs_mpi.so.9(_ZN3gmx8Mdrunner8mdrunnerEv+0x646e) [0x14713b46087e]
23  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/bin/gmx_mpi() [0x409027]
24  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/bin/gmx_mpi() [0x409139]
25  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/lib64/libgromacs_mpi.so.9(_ZN3gmx24CommandLineModuleManager3runEiPPc+0x33a) [0x14713ac776ba]
26  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/bin/gmx_mpi() [0x405cdc]
27  /lib64/libc.so.6(__libc_start_main+0xef) [0x14713905c29d]
28  /lustre/scratch5/treddy/march_april_2024_testing/gmx_install/bin/gmx_mpi() [0x405d4a]
=================================
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------

pmix is at 4.2.9.

@tylerjereddy
Contributor Author

Changing the minor version from 5 to 18 in the NVSHMEM source had no effect; I get the same errors:
src/modules/transport/libfabric/libfabric.h:#define NVSHMEMT_LIBFABRIC_MIN_VER 18 (this macro is passed to FI_VERSION).

@tylerjereddy
Contributor Author

I've simplified the reproducer to remove GROMACS entirely, using only the cuFFTMp example at: https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuFFTMp/samples/r2c_c2r_slabs_GROMACS.

Interactive run script for 2 nodes (4 A100 GPUs each):

#!/bin/bash -l
#

# setup the runtime environment
export NVSHMEM_DEBUG=TRACE
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/libfabric-1.18.1-opv2jutclmudyzxdeud4xjggqrubip3u/lib:$LD_LIBRARY_PATH"
export PATH="$PATH:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/bin"
export PATH="$PATH:/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/bin"
export LD_LIBRARY_PATH="/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/comm_libs/12.3/nccl/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/ucx:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/targets/x86_64-linux/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/targets/x86_64-linux/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/usr/projects/hpcsoft/cos2/chicoma/cuda-compat/12.0/:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/pmix-4.2.9-7kfa4s6dwyd5wlayw24vx7jai7d4oi4x/lib"
export NVSHMEM_DISABLE_CUDA_VMM=1
export FI_CXI_OPTIMIZED_MRS=false
export NVSHMEM_REMOTE_TRANSPORT=libfabric
export MPI_HOME=/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install
export CUFFT_LIB=/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib
export CUFFT_INC=/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/include/cufftmp
export NVSHMEM_LIB=/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib
export NVSHMEM_INC=/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/include

cd /lustre/scratch5/treddy/march_april_2024_testing/github_projects/CUDALibrarySamples/cuFFTMp/samples/r2c_c2r_slabs_GROMACS
make clean
make build
make run

Diff on above Makefile:

diff --git a/cuFFTMp/samples/r2c_c2r_slabs_GROMACS/Makefile b/cuFFTMp/samples/r2c_c2r_slabs_GROMACS/Makefile
index 5d9fa3e..64e39be 100644
--- a/cuFFTMp/samples/r2c_c2r_slabs_GROMACS/Makefile
+++ b/cuFFTMp/samples/r2c_c2r_slabs_GROMACS/Makefile
@@ -15,4 +15,4 @@ $(exe): $(exe).cu
 build: $(exe)
 
 run: $(exe)
-       LD_LIBRARY_PATH="${NVSHMEM_LIB}:${CUFFT_LIB}:${LD_LIBRARY_PATH}" mpirun -oversubscribe -n 4 $(exe) 
+       LD_LIBRARY_PATH="${NVSHMEM_LIB}:${CUFFT_LIB}:${LD_LIBRARY_PATH}" /lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/bin/mpirun -oversubscribe -n 8 -N 4 $(exe)

Output:

rm -rf cufftmp_r2c_c2r_slabs_GROMACS
/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/bin/../bin/nvcc cufftmp_r2c_c2r_slabs_GROMACS.cu -o cufftmp_r2c_c2r_slabs_GROMACS -std=c++17 --generate-code arch=compute_70,code=sm_70 --generate-code arch=compute_80,code=sm_80 --generate-code arch=compute_90,code=sm_90 -I/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/include/cufftmp -I/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/include -I/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/include -lcuda -L/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib -L/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib  -lcufftMp -lnvshmem_device -lnvshmem_host -L/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/lib -lmpi
LD_LIBRARY_PATH="/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib:/usr/projects/hpcsoft/cos2/chicoma/cuda-compat/12.0/:lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/targets/x86_64-linux/lib:/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib:lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/targets/x86_64-linux/lib:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/ucx:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/comm_libs/12.3/nccl/lib:/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/lib:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/libfabric-1.18.1-opv2jutclmudyzxdeud4xjggqrubip3u/lib:/opt/cray/pe/papi/7.0.0.2/lib64:/opt/cray/libfabric/1.15.2.0/lib64:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/pmix-4.2.9-7kfa4s6dwyd5wlayw24vx7jai7d4oi4x/lib" /lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/bin/mpirun -oversubscribe -n 8 -N 4 
cufftmp_r2c_c2r_slabs_GROMACS
Hello from rank 7/8 using GPU 3
Hello from rank 6/8 using GPU 2
Hello from rank 5/8 using GPU 1
Hello from rank 4/8 using GPU 0
Hello from rank 2/8 using GPU 2
Hello from rank 3/8 using GPU 3
Hello from rank 0/8 using GPU 0
Hello from rank 1/8 using GPU 1
NVSHMEM configuration:
  CUDA API                     12030
  CUDA Runtime                 12030
  CUDA Driver                  12000
  Build Timestamp              Apr 22 2024 12:58:35
  Build Variables             
	NVSHMEM_DEBUG=OFF NVSHMEM_DEVEL=OFF NVSHMEM_DEFAULT_PMI2=OFF
	NVSHMEM_DEFAULT_PMIX=OFF NVSHMEM_DEFAULT_UCX=OFF NVSHMEM_DISABLE_COLL_POLL=ON
	NVSHMEM_ENABLE_ALL_DEVICE_INLINING=OFF NVSHMEM_ENV_ALL=OFF
	NVSHMEM_GPU_COLL_USE_LDST=OFF NVSHMEM_IBGDA_SUPPORT=OFF
	NVSHMEM_IBGDA_SUPPORT_GPUMEM_ONLY=OFF NVSHMEM_IBDEVX_SUPPORT=OFF
	NVSHMEM_IBRC_SUPPORT=ON NVSHMEM_LIBFABRIC_SUPPORT=ON NVSHMEM_MPI_SUPPORT=ON
	NVSHMEM_NVTX=ON NVSHMEM_PMIX_SUPPORT=OFF NVSHMEM_SHMEM_SUPPORT=OFF
	NVSHMEM_TEST_STATIC_LIB=OFF NVSHMEM_TIMEOUT_DEVICE_POLLING=OFF NVSHMEM_TRACE=OFF
	NVSHMEM_UCX_SUPPORT=OFF NVSHMEM_USE_DLMALLOC=OFF NVSHMEM_USE_NCCL=OFF
	NVSHMEM_USE_GDRCOPY=ON NVSHMEM_VERBOSE=OFF
	CUDA_HOME=/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3
	GDRCOPY_HOME=/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/gdrcopy-2.3-ftyzikjaithdoznahhzpuecguynyqqyv
	LIBFABRIC_HOME=/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/libfabric-1.18.1-opv2jutclmudyzxdeud4xjggqrubip3u
	nid001233:91419:91419 [3] NVSHMEM INFO PE 3 (process) affinity to 128 CPUs:
     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
    27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
    54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
    81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105
    106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
    126 127 
nid001233:91417:91417 [1] NVSHMEM INFO PE 1 (process) affinity to 128 CPUs:
     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
    27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
    54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
    81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105
    106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
    126 127 
nid001233:91418:91418 [2] NVSHMEM INFO PE 2 (process) affinity to 128 CPUs:
     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
    27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
    54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
    81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105
    106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
    126 127 
MPI_HOME=/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install
	NCCL_HOME=/usr/local/nccl
	NVSHMEM_PREFIX=/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install
	PMIX_HOME=/usr
	SHMEM_HOME=/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install
	UCX_HOME=/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z

nid001233:91416:91416 [0] NVSHMEM INFO PE 0 (process) affinity to 128 CPUs:
     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
    27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
    54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
    81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105
    106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
    126 127 
nid001233:91417:91417 [1] NVSHMEM INFO cudaDriverVersion 12000
nid001233:91417:91417 [1] NVSHMEM INFO NVSHMEM symmetric heap kind = DEVICE selected
nid001233:91417:91417 [1] NVSHMEM INFO [1] nvshmemi_get_cucontext->cuCtxSynchronize->CUDA_SUCCESS) my_stream (nil)
nid001233:91417:91417 [1] NVSHMEM INFO in get_cucontext, queried and saved context for device: 1 context: 0x24444b0
nid001233:91418:91418 [2] NVSHMEM INFO cudaDriverVersion 12000
nid001233:91418:91418 [2] NVSHMEM INFO NVSHMEM symmetric heap kind = DEVICE selected
nid001233:91418:91418 [2] NVSHMEM INFO [2] nvshmemi_get_cucontext->cuCtxSynchronize->CUDA_SUCCESS) my_stream (nil)
nid001233:91418:91418 [2] NVSHMEM INFO in get_cucontext, queried and saved context for device: 2 context: 0x243c0f0
nid001233:91416:91416 [0] NVSHMEM INFO cudaDriverVersion 12000
nid001233:91419:91419 [3] NVSHMEM INFO cudaDriverVersion 12000
nid001233:91416:91416 [0] NVSHMEM INFO NVSHMEM symmetric heap kind = DEVICE selected
nid001233:91416:91416 [0] NVSHMEM INFO [0] nvshmemi_get_cucontext->cuCtxSynchronize->CUDA_SUCCESS) my_stream (nil)
nid001233:91416:91416 [0] NVSHMEM INFO in get_cucontext, queried and saved context for device: 0 context: 0x244c870
nid001233:91419:91419 [3] NVSHMEM INFO NVSHMEM symmetric heap kind = DEVICE selected
nid001233:91419:91419 [3] NVSHMEM INFO [3] nvshmemi_get_cucontext->cuCtxSynchronize->CUDA_SUCCESS) my_stream (nil)
nid001233:91419:91419 [3] NVSHMEM INFO in get_cucontext, queried and saved context for device: 3 context: 0x2433d30
nid001233:91417:91417 [1] NVSHMEM INFO [1] nvshmemi_get_cucontext->cuCtxGetDevice->0(CUDA_ERROR_INVALID_CONTEXT 201) cuStreamCreateWithPriority my_stream 0x40c32c0
nid001233:91418:91418 [2] NVSHMEM INFO [2] nvshmemi_get_cucontext->cuCtxGetDevice->0(CUDA_ERROR_INVALID_CONTEXT 201) cuStreamCreateWithPriority my_stream 0x40c36a0
nid001233:91416:91416 [0] NVSHMEM INFO [0] nvshmemi_get_cucontext->cuCtxGetDevice->0(CUDA_ERROR_INVALID_CONTEXT 201) cuStreamCreateWithPriority my_stream 0x40c3fe0
nid001233:91419:91419 [3] NVSHMEM INFO [3] nvshmemi_get_cucontext->cuCtxGetDevice->0(CUDA_ERROR_INVALID_CONTEXT 201) cuStreamCreateWithPriority my_stream 0x40c2ff0
nid001417:92704:92704 [3] NVSHMEM INFO PE 7 (process) affinity to 128 CPUs:
     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
    27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
    54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
    81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105
    106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
    126 127 
nid001417:92703:92703 [2] NVSHMEM INFO PE 6 (process) affinity to 128 CPUs:
     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
    27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
    54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
    81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105
    106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
    126 127 
nid001417:92701:92701 [0] NVSHMEM INFO PE 4 (process) affinity to 128 CPUs:
     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
    27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
    54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
    81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105
    106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
    126 127 
nid001417:92702:92702 [1] NVSHMEM INFO PE 5 (process) affinity to 128 CPUs:
     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
    27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
    54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
    81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105
    106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
    126 127 
nid001417:92703:92703 [2] NVSHMEM INFO cudaDriverVersion 12000
nid001417:92703:92703 [2] NVSHMEM INFO NVSHMEM symmetric heap kind = DEVICE selected
nid001417:92703:92703 [2] NVSHMEM INFO [6] nvshmemi_get_cucontext->cuCtxSynchronize->CUDA_SUCCESS) my_stream (nil)
nid001417:92703:92703 [2] NVSHMEM INFO in get_cucontext, queried and saved context for device: 2 context: 0x243bfe0
nid001417:92704:92704 [3] NVSHMEM INFO cudaDriverVersion 12000
nid001417:92704:92704 [3] NVSHMEM INFO NVSHMEM symmetric heap kind = DEVICE selected
nid001417:92702:92702 [1] NVSHMEM INFO cudaDriverVersion 12000
nid001417:92702:92702 [1] NVSHMEM INFO NVSHMEM symmetric heap kind = DEVICE selected
nid001417:92702:92702 [1] NVSHMEM INFO [5] nvshmemi_get_cucontext->cuCtxSynchronize->CUDA_SUCCESS) my_stream (nil)
nid001417:92702:92702 [1] NVSHMEM INFO in get_cucontext, queried and saved context for device: 1 context: 0x24443a0
nid001417:92701:92701 [0] NVSHMEM INFO cudaDriverVersion 12000
nid001417:92701:92701 [0] NVSHMEM INFO NVSHMEM symmetric heap kind = DEVICE selected
nid001417:92701:92701 [0] NVSHMEM INFO [4] nvshmemi_get_cucontext->cuCtxSynchronize->CUDA_SUCCESS) my_stream (nil)
nid001417:92701:92701 [0] NVSHMEM INFO in get_cucontext, queried and saved context for device: 0 context: 0x244c760
nid001417:92704:92704 [3] NVSHMEM INFO [7] nvshmemi_get_cucontext->cuCtxSynchronize->CUDA_SUCCESS) my_stream (nil)
nid001417:92704:92704 [3] NVSHMEM INFO in get_cucontext, queried and saved context for device: 3 context: 0x2433c20
nid001417:92703:92703 [2] NVSHMEM INFO [6] nvshmemi_get_cucontext->cuCtxGetDevice->0(CUDA_ERROR_INVALID_CONTEXT 201) cuStreamCreateWithPriority my_stream 0x40c3010
nid001417:92704:92704 [3] NVSHMEM INFO [7] nvshmemi_get_cucontext->cuCtxGetDevice->0(CUDA_ERROR_INVALID_CONTEXT 201) cuStreamCreateWithPriority my_stream 0x40c3200
nid001417:92701:92701 [0] NVSHMEM INFO [4] nvshmemi_get_cucontext->cuCtxGetDevice->0(CUDA_ERROR_INVALID_CONTEXT 201) cuStreamCreateWithPriority my_stream 0x40c3850
nid001417:92702:92702 [1] NVSHMEM INFO [5] nvshmemi_get_cucontext->cuCtxGetDevice->0(CUDA_ERROR_INVALID_CONTEXT 201) cuStreamCreateWithPriority my_stream 0x40c3160
nid001233:91416:91416 [0] NVSHMEM INFO nvshmemi_setup_local_heap, heapextra = 285225000
nid001233:91418:91418 [2] NVSHMEM INFO nvshmemi_setup_local_heap, heapextra = 285225000
nid001233:91419:91419 [3] NVSHMEM INFO nvshmemi_setup_local_heap, heapextra = 285225000
nid001417:92703:92703 [2] NVSHMEM INFO nvshmemi_setup_local_heap, heapextra = 285225000
nid001233:91417:91417 [1] NVSHMEM INFO nvshmemi_setup_local_heap, heapextra = 285225000
nid001417:92701:92701 [0] NVSHMEM INFO nvshmemi_setup_local_heap, heapextra = 285225000
nid001417:92704:92704 [3] NVSHMEM INFO nvshmemi_setup_local_heap, heapextra = 285225000
nid001417:92702:92702 [1] NVSHMEM INFO nvshmemi_setup_local_heap, heapextra = 285225000
nid001233:91418:91418 [2] NVSHMEM INFO [2] heap base: 0x145b60000000 NVSHMEM_SYMMETRIC_SIZE 1073741824 total 2147483648 heapextra 285225000
nid001233:91418:91418 [2] NVSHMEM INFO [2] mspace ptr: 0x145c5e3c8340
nid001233:91416:91416 [0] NVSHMEM INFO [0] heap base: 0x1493c0000000 NVSHMEM_SYMMETRIC_SIZE 1073741824 total 2147483648 heapextra 285225000
nid001233:91416:91416 [0] NVSHMEM INFO [0] mspace ptr: 0x1494b6f4a340
nid001233:91419:91419 [3] NVSHMEM INFO [3] heap base: 0x147540000000 NVSHMEM_SYMMETRIC_SIZE 1073741824 total 2147483648 heapextra 285225000
nid001233:91419:91419 [3] NVSHMEM INFO [3] mspace ptr: 0x147640955340
nid001233:91417:91417 [1] NVSHMEM INFO [1] heap base: 0x145740000000 NVSHMEM_SYMMETRIC_SIZE 1073741824 total 2147483648 heapextra 285225000
nid001233:91417:91417 [1] NVSHMEM INFO [1] mspace ptr: 0x14584210a340
nid001417:92703:92703 [2] NVSHMEM INFO [6] heap base: 0x14b6a0000000 NVSHMEM_SYMMETRIC_SIZE 1073741824 total 2147483648 heapextra 285225000
nid001417:92703:92703 [2] NVSHMEM INFO [6] mspace ptr: 0x14b7a273a340
nid001417:92701:92701 [0] NVSHMEM INFO [4] heap base: 0x14fc60000000 NVSHMEM_SYMMETRIC_SIZE 1073741824 total 2147483648 heapextra 285225000
nid001417:92701:92701 [0] NVSHMEM INFO [4] mspace ptr: 0x14fd620fd340
nid001417:92704:92704 [3] NVSHMEM INFO [7] heap base: 0x145460000000 NVSHMEM_SYMMETRIC_SIZE 1073741824 total 2147483648 heapextra 285225000
nid001417:92704:92704 [3] NVSHMEM INFO [7] mspace ptr: 0x145554875340
nid001417:92702:92702 [1] NVSHMEM INFO [5] heap base: 0x14d120000000 NVSHMEM_SYMMETRIC_SIZE 1073741824 total 2147483648 heapextra 285225000
nid001417:92702:92702 [1] NVSHMEM INFO [5] mspace ptr: 0x14d212cc9340
nid001417:92703:92703 [2] NVSHMEM INFO IBRC transport skipped in favor of: libfabric

nid001417:92702:92702 [1] NVSHMEM INFO IBRC transport skipped in favor of: libfabric

nid001417:92701:92701 [0] NVSHMEM INFO IBRC transport skipped in favor of: libfabric

nid001417:92704:92704 [3] NVSHMEM INFO IBRC transport skipped in favor of: libfabric

nid001233:91417:91417 [1] NVSHMEM INFO IBRC transport skipped in favor of: libfabric

nid001233:91418:91418 [2] NVSHMEM INFO IBRC transport skipped in favor of: libfabric

nid001233:91419:91419 [3] NVSHMEM INFO IBRC transport skipped in favor of: libfabric

nid001233:91416:91416 [0] NVSHMEM INFO IBRC transport skipped in favor of: libfabric

/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp 1691 GDRCopy requested, but unused by transport. Disabling.

/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp 1691 GDRCopy requested, but unused by transport. Disabling.

/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp 1691 GDRCopy requested, but unused by transport. Disabling.

/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp 1691 GDRCopy requested, but unused by transport. Disabling.

/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp 1691 GDRCopy requested, but unused by transport. Disabling.

/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp 1691 GDRCopy requested, but unused by transport. Disabling.

/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp 1691 GDRCopy requested, but unused by transport. Disabling.

/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp 1691 GDRCopy requested, but unused by transport. Disabling.

/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:1524: non-zero status: -61 No providers matched fi_getinfo query: -61: No data available

/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:1524: non-zero status: -61 No providers matched fi_getinfo query: -61: No data available

/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:1524: non-zero status: -61 No providers matched fi_getinfo query: -61: No data available

[nid001417:92703:0:92703] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:1524: non-zero status: -61 No providers matched fi_getinfo query: -61: No data available

[nid001417:92702:0:92702] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
[nid001417:92704:0:92704] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
[nid001417:92701:0:92701] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:1524: non-zero status: -61 No providers matched fi_getinfo query: -61: No data available

/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:1524: non-zero status: -61 No providers matched fi_getinfo query: -61: No data available

/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:1524: non-zero status: -61 No providers matched fi_getinfo query: -61: No data available

/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:1524: non-zero status: -61 No providers matched fi_getinfo query: -61: No data available

[nid001233:91416:0:91416] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
[nid001233:91418:0:91418] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
[nid001233:91419:0:91419] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
[nid001233:91417:0:91417] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace (tid:  91419) ====
 0  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/libucs.so.0(ucs_handle_error+0x294) [0x14763c255394]
 1  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/libucs.so.0(+0x30564) [0x14763c255564]
 2  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/libucs.so.0(+0x3082e) [0x14763c25582e]
 3  /lib64/libpthread.so.0(+0x168c0) [0x14763e27c8c0]
 4  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(+0xb9fd) [0x1476240ec9fd]
 5  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(nvshmemt_init+0x11d2) [0x1476240ee312]
 6  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xe6c89) [0x14763eb0fc89]
 7  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xec01c) [0x14763eb1501c]
 8  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xede99) [0x14763eb16e99]
 9  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(nvshmemx_host_init+0xae) [0x14763eb1730e]
10  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x52e635) [0x147640e86635]
11  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x11b4c3) [0x147640a734c3]
12  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanGuru64+0x120) [0x147640a9f450]
13  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanMany+0x135) [0x147640aa0105]
14  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany64+0x5d) [0x147640a9dd7d]
15  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany+0x10f) [0x147640a9e1bf]
==== backtrace (tid:  91416) ====
 0  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/libucs.so.0(ucs_handle_error+0x294) [0x1494b284a394]
 1  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/libucs.so.0(+0x30564) [0x1494b284a564]
 2  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/libucs.so.0(+0x3082e) [0x1494b284a82e]
 3  /lib64/libpthread.so.0(+0x168c0) [0x1494b48718c0]
 4  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(+0xb9fd) [0x1494a01069fd]
 5  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(nvshmemt_init+0x11d2) [0x1494a0108312]
 6  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xe6c89) [0x1494b5104c89]
 7  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xec01c) [0x1494b510a01c]
 8  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xede99) [0x1494b510be99]
 9  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(nvshmemx_host_init+0xae) [0x1494b510c30e]
10  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x52e635) [0x1494b747b635]
11  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x11b4c3) [0x1494b70684c3]
12  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanGuru64+0x120) [0x1494b7094450]
13  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanMany+0x135) [0x1494b7095105]
14  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany64+0x5d) [0x1494b7092d7d]
15  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany+0x10f) [0x1494b70931bf]
==== backtrace (tid:  91418) ====
 0  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/libucs.so.0(ucs_handle_error+0x294) [0x145c59cc8394]
 1  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/libucs.so.0(+0x30564) [0x145c59cc8564]
 2  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/libucs.so.0(+0x3082e) [0x145c59cc882e]
 3  /lib64/libpthread.so.0(+0x168c0) [0x145c5bcef8c0]
 4  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(+0xb9fd) [0x145c4c2989fd]
 5  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(nvshmemt_init+0x11d2) [0x145c4c29a312]
 6  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xe6c89) [0x145c5c582c89]
 7  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xec01c) [0x145c5c58801c]
 8  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xede99) [0x145c5c589e99]
 9  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(nvshmemx_host_init+0xae) [0x145c5c58a30e]
10  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x52e635) [0x145c5e8f9635]
11  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x11b4c3) [0x145c5e4e64c3]
12  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanGuru64+0x120) [0x145c5e512450]
==== backtrace (tid:  91417) ====
 0  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/libucs.so.0(ucs_handle_error+0x294) [0x14583da0a394]
 1  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/libucs.so.0(+0x30564) [0x14583da0a564]
 2  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/libucs.so.0(+0x3082e) [0x14583da0a82e]
 3  /lib64/libpthread.so.0(+0x168c0) [0x14583fa318c0]
 4  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(+0xb9fd) [0x14581432d9fd]
 5  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(nvshmemt_init+0x11d2) [0x14581432f312]
 6  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xe6c89) [0x1458402c4c89]
 7  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xec01c) [0x1458402ca01c]
 8  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xede99) [0x1458402cbe99]
 9  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(nvshmemx_host_init+0xae) [0x1458402cc30e]
10  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x52e635) [0x14584263b635]
11  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x11b4c3) [0x1458422284c3]
12  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanGuru64+0x120) [0x145842254450]
13  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanMany+0x135) [0x145842255105]
14  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany64+0x5d) [0x145842252d7d]
15  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany+0x10f) [0x1458422531bf]
16  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlan3d+0x36) [0x147640a9e386]
17  cufftmp_r2c_c2r_slabs_GROMACS() [0x4067f9]
18  cufftmp_r2c_c2r_slabs_GROMACS() [0x407ba7]
19  /lib64/libc.so.6(__libc_start_main+0xef) [0x14763d31529d]
20  cufftmp_r2c_c2r_slabs_GROMACS() [0x40591a]
=================================
16  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlan3d+0x36) [0x1494b7093386]
17  cufftmp_r2c_c2r_slabs_GROMACS() [0x4067f9]
18  cufftmp_r2c_c2r_slabs_GROMACS() [0x407ba7]
19  /lib64/libc.so.6(__libc_start_main+0xef) [0x1494b390a29d]
20  cufftmp_r2c_c2r_slabs_GROMACS() [0x40591a]
=================================
13  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanMany+0x135) [0x145c5e513105]
14  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany64+0x5d) [0x145c5e510d7d]
15  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany+0x10f) [0x145c5e5111bf]
16  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlan3d+0x36) [0x145c5e511386]
17  cufftmp_r2c_c2r_slabs_GROMACS() [0x4067f9]
18  cufftmp_r2c_c2r_slabs_GROMACS() [0x407ba7]
19  /lib64/libc.so.6(__libc_start_main+0xef) [0x145c5ad8829d]
20  cufftmp_r2c_c2r_slabs_GROMACS() [0x40591a]
=================================
16  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlan3d+0x36) [0x145842253386]
17  cufftmp_r2c_c2r_slabs_GROMACS() [0x4067f9]
18  cufftmp_r2c_c2r_slabs_GROMACS() [0x407ba7]
19  /lib64/libc.so.6(__libc_start_main+0xef) [0x14583eaca29d]
20  cufftmp_r2c_c2r_slabs_GROMACS() [0x40591a]
=================================
==== backtrace (tid:  92702) ====
 0  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/libucs.so.0(ucs_handle_error+0x294) [0x14d20e5c9394]
 1  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/libucs.so.0(+0x30564) [0x14d20e5c9564]
 2  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/libucs.so.0(+0x3082e) [0x14d20e5c982e]
 3  /lib64/libpthread.so.0(+0x168c0) [0x14d2105f08c0]
 4  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(+0xb9fd) [0x14d1e03d89fd]
 5  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(nvshmemt_init+0x11d2) [0x14d1e03da312]
 6  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xe6c89) [0x14d210e83c89]
 7  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xec01c) [0x14d210e8901c]
 8  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xede99) [0x14d210e8ae99]
 9  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(nvshmemx_host_init+0xae) [0x14d210e8b30e]
10  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x52e635) [0x14d2131fa635]
11  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x11b4c3) [0x14d212de74c3]
12  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanGuru64+0x120) [0x14d212e13450]
13  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanMany+0x135) [0x14d212e14105]
14  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany64+0x5d) [0x14d212e11d7d]
15  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany+0x10f) [0x14d212e121bf]
16  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlan3d+0x36) [0x14d212e12386]
17  cufftmp_r2c_c2r_slabs_GROMACS() [0x4067f9]
18  cufftmp_r2c_c2r_slabs_GROMACS() [0x407ba7]
19  /lib64/libc.so.6(__libc_start_main+0xef) [0x14d20f68929d]
20  cufftmp_r2c_c2r_slabs_GROMACS() [0x40591a]
=================================
==== backtrace (tid:  92703) ====
 0  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/libucs.so.0(ucs_handle_error+0x294) [0x14b79e03a394]
 1  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/libucs.so.0(+0x30564) [0x14b79e03a564]
 2  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/libucs.so.0(+0x3082e) [0x14b79e03a82e]
 3  /lib64/libpthread.so.0(+0x168c0) [0x14b7a00618c0]
 4  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(+0xb9fd) [0x14b7723cd9fd]
 5  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(nvshmemt_init+0x11d2) [0x14b7723cf312]
 6  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xe6c89) [0x14b7a08f4c89]
 7  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xec01c) [0x14b7a08fa01c]
 8  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xede99) [0x14b7a08fbe99]
 9  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(nvshmemx_host_init+0xae) [0x14b7a08fc30e]
10  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x52e635) [0x14b7a2c6b635]
11  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x11b4c3) [0x14b7a28584c3]
12  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanGuru64+0x120) [0x14b7a2884450]
13  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanMany+0x135) [0x14b7a2885105]
14  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany64+0x5d) [0x14b7a2882d7d]
==== backtrace (tid:  92701) ====
 0  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/libucs.so.0(ucs_handle_error+0x294) [0x14fd5d9fd394]
 1  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/libucs.so.0(+0x30564) [0x14fd5d9fd564]
 2  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/libucs.so.0(+0x3082e) [0x14fd5d9fd82e]
 3  /lib64/libpthread.so.0(+0x168c0) [0x14fd5fa248c0]
 4  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(+0xb9fd) [0x14fd34b2d9fd]
 5  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(nvshmemt_init+0x11d2) [0x14fd34b2f312]
 6  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xe6c89) [0x14fd602b7c89]
 7  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xec01c) [0x14fd602bd01c]
 8  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xede99) [0x14fd602bee99]
 9  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(nvshmemx_host_init+0xae) [0x14fd602bf30e]
10  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x52e635) [0x14fd6262e635]
11  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x11b4c3) [0x14fd6221b4c3]
12  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanGuru64+0x120) [0x14fd62247450]
13  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanMany+0x135) [0x14fd62248105]
14  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany64+0x5d) [0x14fd62245d7d]
15  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany+0x10f) [0x14fd622461bf]
16  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlan3d+0x36) [0x14fd62246386]
17  cufftmp_r2c_c2r_slabs_GROMACS() [0x4067f9]
18  cufftmp_r2c_c2r_slabs_GROMACS() [0x407ba7]
19  /lib64/libc.so.6(__libc_start_main+0xef) [0x14fd5eabd29d]
20  cufftmp_r2c_c2r_slabs_GROMACS() [0x40591a]
=================================
==== backtrace (tid:  92704) ====
 0  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/libucs.so.0(ucs_handle_error+0x294) [0x145550175394]
 1  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/libucs.so.0(+0x30564) [0x145550175564]
 2  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z/lib/libucs.so.0(+0x3082e) [0x14555017582e]
 3  /lib64/libpthread.so.0(+0x168c0) [0x14555219c8c0]
 4  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(+0xb9fd) [0x1454eca479fd]
 5  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(nvshmemt_init+0x11d2) [0x1454eca49312]
 6  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xe6c89) [0x145552a2fc89]
 7  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xec01c) [0x145552a3501c]
 8  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xede99) [0x145552a36e99]
 9  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(nvshmemx_host_init+0xae) [0x145552a3730e]
10  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x52e635) [0x145554da6635]
11  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x11b4c3) [0x1455549934c3]
12  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanGuru64+0x120) [0x1455549bf450]
13  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanMany+0x135) [0x1455549c0105]
14  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany64+0x5d) [0x1455549bdd7d]
15  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany+0x10f) [0x1455549be1bf]
16  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlan3d+0x36) [0x1455549be386]
17  cufftmp_r2c_c2r_slabs_GROMACS() [0x4067f9]
18  cufftmp_r2c_c2r_slabs_GROMACS() [0x407ba7]
19  /lib64/libc.so.6(__libc_start_main+0xef) [0x14555123529d]
20  cufftmp_r2c_c2r_slabs_GROMACS() [0x40591a]
=================================
15  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany+0x10f) [0x14b7a28831bf]
16  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlan3d+0x36) [0x14b7a2883386]
17  cufftmp_r2c_c2r_slabs_GROMACS() [0x4067f9]
18  cufftmp_r2c_c2r_slabs_GROMACS() [0x407ba7]
19  /lib64/libc.so.6(__libc_start_main+0xef) [0x14b79f0fa29d]
20  cufftmp_r2c_c2r_slabs_GROMACS() [0x40591a]
=================================
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------
make: *** [Makefile:18: run] Error 139

@j-xiong
Contributor

j-xiong commented Apr 23, 2024

@tylerjereddy We occasionally see a similar segfault in the finalization phase, inside ucp_worker_destroy(). The exact cause has not been identified; the most likely guess is a race condition inside ucx.

I don't see any libfabric-related symbols in your trace. I would suggest running with debug builds of libfabric and ucx to help locate where the segfault happens.
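
One way to check which shared objects actually appear in a trace like the one above is to parse the frames and count them by module. The following is a minimal sketch in Python, assuming the frame layout printed by the UCX error handler in this thread (`index  module(symbol+offset) [address]`); the helper names `parse_frame` and `modules_in_trace` are hypothetical and not part of UCX, libfabric, or NVSHMEM.

```python
import os
import re
from collections import Counter

# Frame layout assumed from the backtraces in this thread, e.g.
#   " 5  /path/lib.so.1(nvshmemt_init+0x11d2) [0x14d1e03da312]"
FRAME_RE = re.compile(
    r"^\s*(?P<index>\d+)\s+(?P<module>\S+?)\((?P<location>[^)]*)\)"
    r"\s+\[(?P<address>0x[0-9a-fA-F]+)\]\s*$"
)


def parse_frame(line):
    """Return a dict describing one backtrace frame, or None for non-frame lines."""
    match = FRAME_RE.match(line)
    if match is None:
        return None
    location = match.group("location")
    # "symbol+0x..." yields the symbol; "+0x..." or "" means no symbol was resolved.
    symbol = location.split("+", 1)[0] or None
    return {
        "index": int(match.group("index")),
        "module": match.group("module"),
        "symbol": symbol,
        "address": match.group("address"),
    }


def modules_in_trace(lines):
    """Count frames per module basename, skipping separators and other text."""
    counts = Counter()
    for line in lines:
        frame = parse_frame(line)
        if frame is not None:
            counts[os.path.basename(frame["module"])] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "==== backtrace (tid:  49728) ====",
        " 3  /lib64/libpthread.so.0(+0x168c0) [0x14d2105f08c0]",
        " 5  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install"
        "/lib/nvshmem_transport_libfabric.so.1(nvshmemt_init+0x11d2) [0x14d1e03da312]",
        "17  cufftmp_r2c_c2r_slabs_GROMACS() [0x4067f9]",
    ]
    for name, count in modules_in_trace(sample).items():
        print(name, count)
```

Note that `nvshmem_transport_libfabric.so.1` is the NVSHMEM transport module rather than libfabric itself, so a count of zero for modules named `libfabric.so*` would be consistent with the observation above. Frames without a resolved symbol (such as `+0xb9fd`) would still need a debug build or a symbolizer to interpret.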

@tylerjereddy
Contributor Author

tylerjereddy commented Apr 24, 2024

@j-xiong I swapped in debug versions of ucx and libfabric and added the log below the fold. I also set FI_LOG_LEVEL=debug. This assumes that swapping LD_LIBRARY_PATH entries is sufficient and that I don't need to rebuild/re-link things.
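
The LD_LIBRARY_PATH assumption can be partially sanity-checked before launching. The sketch below, in Python, flags search-path entries that are not absolute (at least two entries in the command below appear to lack a leading slash) and reports the first directory that would supply a given library name. The helper names `relative_entries` and `first_provider` are hypothetical. This is only a partial check: an RPATH or RUNPATH embedded in a binary can also influence which copy the dynamic loader picks, so the result is a hint rather than proof of what gets loaded.

```python
import os


def relative_entries(search_path):
    """Return entries that are not absolute paths.

    The dynamic loader resolves such entries relative to the current
    working directory; an empty entry also means the current directory.
    """
    return [entry for entry in search_path.split(":") if not os.path.isabs(entry)]


def first_provider(search_path, soname, exists=os.path.exists):
    """Return the first <entry>/<soname> that exists, or None.

    `exists` is injectable so the lookup can be exercised without
    touching the real filesystem.
    """
    for entry in search_path.split(":"):
        candidate = os.path.join(entry or ".", soname)
        if exists(candidate):
            return candidate
    return None


if __name__ == "__main__":
    value = os.environ.get("LD_LIBRARY_PATH", "")
    for entry in relative_entries(value):
        print("not an absolute path:", repr(entry))
    # Library names taken from the backtrace; adjust as needed.
    for soname in ("libucs.so.0", "libfabric.so.1"):
        print(soname, "->", first_provider(value, soname))
```

Running it with the same LD_LIBRARY_PATH value used for mpirun would show whether the debug ucx and libfabric directories come first in the search order for those library names.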

rm -rf cufftmp_r2c_c2r_slabs_GROMACS
/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/bin/../bin/nvcc cufftmp_r2c_c2r_slabs_GROMACS.cu -o cufftmp_r2c_c2r_slabs_GROMACS -std=c++17 --generate-code arch=compute_70,code=sm_70 --generate-code arch=compute_80,code=sm_80 --generate-code arch=compute_90,code=sm_90 -I/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/include/cufftmp -I/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/include -I/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/include -lcuda -L/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib -L/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib  -lcufftMp -lnvshmem_device -lnvshmem_host -L/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/lib -lmpi
LD_LIBRARY_PATH="/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib:/usr/projects/hpcsoft/cos2/chicoma/cuda-compat/12.0/:lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/targets/x86_64-linux/lib:/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib:lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/targets/x86_64-linux/lib:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ddcazrdaa6qbatqyqgcr34p5ejdc47as/lib/ucx:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ddcazrdaa6qbatqyqgcr34p5ejdc47as/lib:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/comm_libs/12.3/nccl/lib:/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/lib:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/libfabric-1.18.1-val5ydwfbxqr7fvv5xpryk73qkxatlrg/lib:/opt/cray/pe/papi/7.0.0.2/lib64:/opt/cray/libfabric/1.15.2.0/lib64:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/pmix-4.2.9-7kfa4s6dwyd5wlayw24vx7jai7d4oi4x/lib" /lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/bin/mpirun -oversubscribe -n 8 -N 4 
cufftmp_r2c_c2r_slabs_GROMACS
Hello from rank 3/8 using GPU 3
Hello from rank 2/8 using GPU 2
Hello from rank 1/8 using GPU 1
Hello from rank 0/8 using GPU 0
Hello from rank 4/8 using GPU 0
Hello from rank 6/8 using GPU 2
Hello from rank 7/8 using GPU 3
Hello from rank 5/8 using GPU 1
NVSHMEM configuration:
  CUDA API                     12030
  CUDA Runtime                 12030
  CUDA Driver                  12000
  Build Timestamp              Apr 22 2024 12:58:35
  Build Variables             
	NVSHMEM_DEBUG=OFF NVSHMEM_DEVEL=OFF NVSHMEM_DEFAULT_PMI2=OFF
	NVSHMEM_DEFAULT_PMIX=OFF NVSHMEM_DEFAULT_UCX=OFF NVSHMEM_DISABLE_COLL_POLL=ON
	NVSHMEM_ENABLE_ALL_DEVICE_INLINING=OFF NVSHMEM_ENV_ALL=OFF
	NVSHMEM_GPU_COLL_USE_LDST=OFF NVSHMEM_IBGDA_SUPPORT=OFF
	NVSHMEM_IBGDA_SUPPORT_GPUMEM_ONLY=OFF NVSHMEM_IBDEVX_SUPPORT=OFF
	NVSHMEM_IBRC_SUPPORT=ON NVSHMEM_LIBFABRIC_SUPPORT=ON NVSHMEM_MPI_SUPPORT=ON
	NVSHMEM_NVTX=ON NVSHMEM_PMIX_SUPPORT=OFF NVSHMEM_SHMEM_SUPPORT=OFF
	NVSHMEM_TEST_STATIC_LIB=OFF NVSHMEM_TIMEOUT_DEVICE_POLLING=OFF NVSHMEM_TRACE=OFF
	NVSHMEM_UCX_SUPPORT=OFF NVSHMEM_USE_DLMALLOC=OFF NVSHMEM_USE_NCCL=OFF
	NVSHMEM_USE_GDRCOPY=ON NVSHMEM_VERBOSE=OFF
	CUDA_HOME=/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3
	GDRCOPY_HOME=/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/gdrcopy-2.3-ftyzikjaithdoznahhzpuecguynyqqyv
	LIBFABRIC_HOME=/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/libfabric-1.18.1-opv2jutclmudyzxdeud4xjggqrubip3u
	MPI_HOME=/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install
	NCCL_HOME=/usr/local/nccl
	NVSHMEM_PREFIX=/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install
	PMIX_HOME=/usr
	SHMEM_HOME=/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install
	UCX_HOME=/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-jgvunnzolzn5rwsnl3e7pbak7fir4c2z

nid001500:60834:60834 [3] NVSHMEM INFO PE 3 (process) affinity to 128 CPUs:
     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
    27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
    54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
    81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105
    106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
    126 127 
nid001500:60833:60833 [2] NVSHMEM INFO PE 2 (process) affinity to 128 CPUs:
     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
    27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
    54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
    81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105
    106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
    126 127 
nid001500:60832:60832 [1] NVSHMEM INFO PE 1 (process) affinity to 128 CPUs:
     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
    27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
    54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
    81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105
    106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
    126 127 
nid001500:60831:60831 [0] NVSHMEM INFO PE 0 (process) affinity to 128 CPUs:
     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
    27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
    54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
    81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105
    106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
    126 127 
nid001504:71717:71717 [2] NVSHMEM INFO PE 6 (process) affinity to 128 CPUs:
     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
    27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
    54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
    81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105
    106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
    126 127 
nid001504:71715:71715 [0] NVSHMEM INFO PE 4 (process) affinity to 128 CPUs:
     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
    27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
    54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
    81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105
    106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
    126 127 
nid001504:71716:71716 [1] NVSHMEM INFO PE 5 (process) affinity to 128 CPUs:
     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
    27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
    54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
    81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105
    106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
    126 127 
nid001504:71718:71718 [3] NVSHMEM INFO PE 7 (process) affinity to 128 CPUs:
     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
    27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
    54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
    81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105
    106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
    126 127 
nid001500:60833:60833 [2] NVSHMEM INFO cudaDriverVersion 12000
nid001500:60833:60833 [2] NVSHMEM INFO NVSHMEM symmetric heap kind = DEVICE selected
nid001500:60833:60833 [2] NVSHMEM INFO [2] nvshmemi_get_cucontext->cuCtxSynchronize->CUDA_SUCCESS) my_stream (nil)
nid001500:60833:60833 [2] NVSHMEM INFO in get_cucontext, queried and saved context for device: 2 context: 0x243c110
nid001500:60834:60834 [3] NVSHMEM INFO cudaDriverVersion 12000
nid001500:60834:60834 [3] NVSHMEM INFO NVSHMEM symmetric heap kind = DEVICE selected
nid001500:60834:60834 [3] NVSHMEM INFO [3] nvshmemi_get_cucontext->cuCtxSynchronize->CUDA_SUCCESS) my_stream (nil)
nid001500:60834:60834 [3] NVSHMEM INFO in get_cucontext, queried and saved context for device: 3 context: 0x2433d50
nid001500:60832:60832 [1] NVSHMEM INFO cudaDriverVersion 12000
nid001500:60832:60832 [1] NVSHMEM INFO NVSHMEM symmetric heap kind = DEVICE selected
nid001500:60832:60832 [1] NVSHMEM INFO [1] nvshmemi_get_cucontext->cuCtxSynchronize->CUDA_SUCCESS) my_stream (nil)
nid001500:60832:60832 [1] NVSHMEM INFO in get_cucontext, queried and saved context for device: 1 context: 0x24444d0
nid001504:71715:71715 [0] NVSHMEM INFO cudaDriverVersion 12000
nid001504:71715:71715 [0] NVSHMEM INFO NVSHMEM symmetric heap kind = DEVICE selected
nid001504:71715:71715 [0] NVSHMEM INFO [4] nvshmemi_get_cucontext->cuCtxSynchronize->CUDA_SUCCESS) my_stream (nil)
nid001500:60833:60833 [2] NVSHMEM INFO [2] nvshmemi_get_cucontext->cuCtxGetDevice->0(CUDA_ERROR_INVALID_CONTEXT 201) cuStreamCreateWithPriority my_stream 0x40c32c0
nid001500:60834:60834 [3] NVSHMEM INFO [3] nvshmemi_get_cucontext->cuCtxGetDevice->0(CUDA_ERROR_INVALID_CONTEXT 201) cuStreamCreateWithPriority my_stream 0x40c3160
nid001500:60831:60831 [0] NVSHMEM INFO cudaDriverVersion 12000
nid001500:60832:60832 [1] NVSHMEM INFO [1] nvshmemi_get_cucontext->cuCtxGetDevice->0(CUDA_ERROR_INVALID_CONTEXT 201) cuStreamCreateWithPriority my_stream 0x40c3270
nid001500:60831:60831 [0] NVSHMEM INFO NVSHMEM symmetric heap kind = DEVICE selected
nid001500:60831:60831 [0] NVSHMEM INFO [0] nvshmemi_get_cucontext->cuCtxSynchronize->CUDA_SUCCESS) my_stream (nil)
nid001500:60831:60831 [0] NVSHMEM INFO in get_cucontext, queried and saved context for device: 0 context: 0x244c890
nid001504:71715:71715 [0] NVSHMEM INFO in get_cucontext, queried and saved context for device: 0 context: 0x244c770
nid001504:71718:71718 [3] NVSHMEM INFO cudaDriverVersion 12000
nid001504:71718:71718 [3] NVSHMEM INFO NVSHMEM symmetric heap kind = DEVICE selected
nid001504:71717:71717 [2] NVSHMEM INFO cudaDriverVersion 12000
nid001504:71717:71717 [2] NVSHMEM INFO NVSHMEM symmetric heap kind = DEVICE selected
nid001504:71718:71718 [3] NVSHMEM INFO [7] nvshmemi_get_cucontext->cuCtxSynchronize->CUDA_SUCCESS) my_stream (nil)
nid001504:71718:71718 [3] NVSHMEM INFO in get_cucontext, queried and saved context for device: 3 context: 0x2433c30
nid001500:60831:60831 [0] NVSHMEM INFO [0] nvshmemi_get_cucontext->cuCtxGetDevice->0(CUDA_ERROR_INVALID_CONTEXT 201) cuStreamCreateWithPriority my_stream 0x40c3f20
nid001504:71717:71717 [2] NVSHMEM INFO [6] nvshmemi_get_cucontext->cuCtxSynchronize->CUDA_SUCCESS) my_stream (nil)
nid001504:71717:71717 [2] NVSHMEM INFO in get_cucontext, queried and saved context for device: 2 context: 0x243bff0
nid001504:71716:71716 [1] NVSHMEM INFO cudaDriverVersion 12000
nid001504:71716:71716 [1] NVSHMEM INFO NVSHMEM symmetric heap kind = DEVICE selected
nid001504:71716:71716 [1] NVSHMEM INFO [5] nvshmemi_get_cucontext->cuCtxSynchronize->CUDA_SUCCESS) my_stream (nil)
nid001504:71716:71716 [1] NVSHMEM INFO in get_cucontext, queried and saved context for device: 1 context: 0x24443b0
nid001504:71715:71715 [0] NVSHMEM INFO [4] nvshmemi_get_cucontext->cuCtxGetDevice->0(CUDA_ERROR_INVALID_CONTEXT 201) cuStreamCreateWithPriority my_stream 0x40c3990
nid001504:71718:71718 [3] NVSHMEM INFO [7] nvshmemi_get_cucontext->cuCtxGetDevice->0(CUDA_ERROR_INVALID_CONTEXT 201) cuStreamCreateWithPriority my_stream 0x40c2cb0
nid001504:71717:71717 [2] NVSHMEM INFO [6] nvshmemi_get_cucontext->cuCtxGetDevice->0(CUDA_ERROR_INVALID_CONTEXT 201) cuStreamCreateWithPriority my_stream 0x40c34c0
nid001504:71716:71716 [1] NVSHMEM INFO [5] nvshmemi_get_cucontext->cuCtxGetDevice->0(CUDA_ERROR_INVALID_CONTEXT 201) cuStreamCreateWithPriority my_stream 0x40c2ea0
nid001500:60833:60833 [2] NVSHMEM INFO nvshmemi_setup_local_heap, heapextra = 285225000
nid001500:60831:60831 [0] NVSHMEM INFO nvshmemi_setup_local_heap, heapextra = 285225000
nid001500:60834:60834 [3] NVSHMEM INFO nvshmemi_setup_local_heap, heapextra = 285225000
nid001500:60832:60832 [1] NVSHMEM INFO nvshmemi_setup_local_heap, heapextra = 285225000
nid001504:71717:71717 [2] NVSHMEM INFO nvshmemi_setup_local_heap, heapextra = 285225000
nid001504:71715:71715 [0] NVSHMEM INFO nvshmemi_setup_local_heap, heapextra = 285225000
nid001504:71718:71718 [3] NVSHMEM INFO nvshmemi_setup_local_heap, heapextra = 285225000
nid001504:71716:71716 [1] NVSHMEM INFO nvshmemi_setup_local_heap, heapextra = 285225000
nid001504:71717:71717 [2] NVSHMEM INFO [6] heap base: 0x147a00000000 NVSHMEM_SYMMETRIC_SIZE 1073741824 total 2147483648 heapextra 285225000
nid001504:71717:71717 [2] NVSHMEM INFO [6] mspace ptr: 0x147afba7c340
nid001504:71718:71718 [3] NVSHMEM INFO [7] heap base: 0x14c3c0000000 NVSHMEM_SYMMETRIC_SIZE 1073741824 total 2147483648 heapextra 285225000
nid001504:71718:71718 [3] NVSHMEM INFO [7] mspace ptr: 0x14c4b2073340
nid001504:71716:71716 [1] NVSHMEM INFO [5] heap base: 0x14d700000000 NVSHMEM_SYMMETRIC_SIZE 1073741824 total 2147483648 heapextra 285225000
nid001504:71716:71716 [1] NVSHMEM INFO [5] mspace ptr: 0x14d7f1ab7340
nid001504:71715:71715 [0] NVSHMEM INFO [4] heap base: 0x1457c0000000 NVSHMEM_SYMMETRIC_SIZE 1073741824 total 2147483648 heapextra 285225000
nid001504:71715:71715 [0] NVSHMEM INFO [4] mspace ptr: 0x1458c7180340
nid001500:60833:60833 [2] NVSHMEM INFO [2] heap base: 0x14e9e0000000 NVSHMEM_SYMMETRIC_SIZE 1073741824 total 2147483648 heapextra 285225000
nid001500:60833:60833 [2] NVSHMEM INFO [2] mspace ptr: 0x14eaccb8e340
nid001500:60831:60831 [0] NVSHMEM INFO [0] heap base: 0x14e060000000 NVSHMEM_SYMMETRIC_SIZE 1073741824 total 2147483648 heapextra 285225000
nid001500:60831:60831 [0] NVSHMEM INFO [0] mspace ptr: 0x14e155715340
nid001500:60834:60834 [3] NVSHMEM INFO [3] heap base: 0x14e0e0000000 NVSHMEM_SYMMETRIC_SIZE 1073741824 total 2147483648 heapextra 285225000
nid001500:60834:60834 [3] NVSHMEM INFO [3] mspace ptr: 0x14e1e421c340
nid001500:60832:60832 [1] NVSHMEM INFO [1] heap base: 0x1517e0000000 NVSHMEM_SYMMETRIC_SIZE 1073741824 total 2147483648 heapextra 285225000
nid001500:60832:60832 [1] NVSHMEM INFO [1] mspace ptr: 0x1518d516b340
nid001504:71715:71715 [0] NVSHMEM INFO IBRC transport skipped in favor of: libfabric

nid001504:71718:71718 [3] NVSHMEM INFO IBRC transport skipped in favor of: libfabric

/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp 1691 GDRCopy requested, but unused by transport. Disabling.

libfabric:71715:1713994966::core:core:fi_param_define_():251<debug> registered var perf_cntr
libfabric:71715:1713994966::core:core:fi_param_get_():372<info> variable perf_cntr=<not set>
libfabric:71715:1713994966::core:core:fi_param_define_():251<debug> registered var hook
libfabric:71715:1713994966::core:core:fi_param_get_():372<info> variable hook=<not set>
libfabric:71715:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:71715:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:71715:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ZE not supported
libfabric:71715:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:71715:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:71715:1713994966::core:core:fi_param_define_():251<debug> registered var hmem_disable_p2p
libfabric:71715:1713994966::core:core:fi_param_get_():372<info> variable hmem_disable_p2p=<not set>
libfabric:71715:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cache_max_size
libfabric:71715:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cache_max_count
libfabric:71715:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cache_monitor
libfabric:71715:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cuda_cache_monitor_enabled
libfabric:71715:1713994966::core:core:fi_param_define_():251<debug> registered var mr_rocr_cache_monitor_enabled
libfabric:71715:1713994966::core:core:fi_param_define_():251<debug> registered var mr_ze_cache_monitor_enabled
libfabric:71715:1713994966::core:core:fi_param_get_():372<info> variable mr_cache_max_size=<not set>
libfabric:71715:1713994966::core:core:fi_param_get_():372<info> variable mr_cache_max_count=<not set>
libfabric:71715:1713994966::core:core:fi_param_get_():372<info> variable mr_cache_monitor=<not set>
libfabric:71715:1713994966::core:core:fi_param_get_():372<info> variable mr_cuda_cache_monitor_enabled=<not set>
libfabric:71715:1713994966::core:core:fi_param_get_():372<info> variable mr_rocr_cache_monitor_enabled=<not set>
libfabric:71715:1713994966::core:core:fi_param_get_():372<info> variable mr_ze_cache_monitor_enabled=<not set>
libfabric:71715:1713994966::core:mr:ofi_default_cache_size():78<info> default cache size=1053667248
libfabric:71715:1713994966::core:core:fi_param_define_():251<debug> registered var provider
libfabric:71715:1713994966::core:core:fi_param_get_():372<info> variable provider=<not set>
libfabric:71715:1713994966::core:core:fi_param_define_():251<debug> registered var fork_unsafe
libfabric:71715:1713994966::core:core:fi_param_define_():251<debug> registered var universe_size
libfabric:71715:1713994966::core:core:fi_param_get_():372<info> variable universe_size=<not set>
libfabric:71715:1713994966::core:core:fi_param_define_():251<debug> registered var av_remove_cleanup
libfabric:71715:1713994966::core:core:fi_param_get_():372<info> variable av_remove_cleanup=<not set>
libfabric:71715:1713994966::core:core:fi_param_define_():251<debug> registered var offload_coll_provider
libfabric:71715:1713994966::core:core:fi_param_get_():372<info> variable offload_coll_provider=<not set>
libfabric:71715:1713994966::core:core:fi_param_define_():251<debug> registered var provider_path
libfabric:71715:1713994966::core:core:fi_param_get_():372<info> variable provider_path=<not set>
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libefa-fi.so
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libefa-fi.so): libefa-fi.so: cannot open shared object file: No such file or directory
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libpsm2-fi.so
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libpsm2-fi.so): libpsm2-fi.so: cannot open shared object file: No such file or directory
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libopx-fi.so
/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp 1691 GDRCopy requested, but unused by transport. Disabling.

libfabric:71718:1713994966::core:core:fi_param_define_():251<debug> registered var perf_cntr
libfabric:71718:1713994966::core:core:fi_param_get_():372<info> variable perf_cntr=<not set>
libfabric:71718:1713994966::core:core:fi_param_define_():251<debug> registered var hook
libfabric:71718:1713994966::core:core:fi_param_get_():372<info> variable hook=<not set>
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libopx-fi.so): libopx-fi.so: cannot open shared object file: No such file or directory
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libpsm-fi.so
libfabric:71718:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:71718:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:71718:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ZE not supported
libfabric:71718:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:71718:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:71718:1713994966::core:core:fi_param_define_():251<debug> registered var hmem_disable_p2p
libfabric:71718:1713994966::core:core:fi_param_get_():372<info> variable hmem_disable_p2p=<not set>
libfabric:71718:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cache_max_size
libfabric:71718:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cache_max_count
libfabric:71718:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cache_monitor
libfabric:71718:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cuda_cache_monitor_enabled
libfabric:71718:1713994966::core:core:fi_param_define_():251<debug> registered var mr_rocr_cache_monitor_enabled
libfabric:71718:1713994966::core:core:fi_param_define_():251<debug> registered var mr_ze_cache_monitor_enabled
libfabric:71718:1713994966::core:core:fi_param_get_():372<info> variable mr_cache_max_size=<not set>
libfabric:71718:1713994966::core:core:fi_param_get_():372<info> variable mr_cache_max_count=<not set>
libfabric:71718:1713994966::core:core:fi_param_get_():372<info> variable mr_cache_monitor=<not set>
libfabric:71718:1713994966::core:core:fi_param_get_():372<info> variable mr_cuda_cache_monitor_enabled=<not set>
libfabric:71718:1713994966::core:core:fi_param_get_():372<info> variable mr_rocr_cache_monitor_enabled=<not set>
libfabric:71718:1713994966::core:core:fi_param_get_():372<info> variable mr_ze_cache_monitor_enabled=<not set>
libfabric:71718:1713994966::core:mr:ofi_default_cache_size():78<info> default cache size=1053667248
libfabric:71718:1713994966::core:core:fi_param_define_():251<debug> registered var provider
libfabric:71718:1713994966::core:core:fi_param_get_():372<info> variable provider=<not set>
libfabric:71718:1713994966::core:core:fi_param_define_():251<debug> registered var fork_unsafe
libfabric:71718:1713994966::core:core:fi_param_define_():251<debug> registered var universe_size
libfabric:71718:1713994966::core:core:fi_param_get_():372<info> variable universe_size=<not set>
libfabric:71718:1713994966::core:core:fi_param_define_():251<debug> registered var av_remove_cleanup
libfabric:71718:1713994966::core:core:fi_param_get_():372<info> variable av_remove_cleanup=<not set>
libfabric:71718:1713994966::core:core:fi_param_define_():251<debug> registered var offload_coll_provider
libfabric:71718:1713994966::core:core:fi_param_get_():372<info> variable offload_coll_provider=<not set>
libfabric:71718:1713994966::core:core:fi_param_define_():251<debug> registered var provider_path
libfabric:71718:1713994966::core:core:fi_param_get_():372<info> variable provider_path=<not set>
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libefa-fi.so
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libpsm-fi.so): libpsm-fi.so: cannot open shared object file: No such file or directory
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libusnic-fi.so
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libusnic-fi.so): libusnic-fi.so: cannot open shared object file: No such file or directory
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libgni-fi.so
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libefa-fi.so): libefa-fi.so: cannot open shared object file: No such file or directory
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libpsm2-fi.so
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libgni-fi.so): libgni-fi.so: cannot open shared object file: No such file or directory
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libbgq-fi.so
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libpsm2-fi.so): libpsm2-fi.so: cannot open shared object file: No such file or directory
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libopx-fi.so
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libbgq-fi.so): libbgq-fi.so: cannot open shared object file: No such file or directory
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libverbs-fi.so
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libopx-fi.so): libopx-fi.so: cannot open shared object file: No such file or directory
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libpsm-fi.so
nid001504:71717:71717 [2] NVSHMEM INFO IBRC transport skipped in favor of: libfabric

libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libverbs-fi.so): libverbs-fi.so: cannot open shared object file: No such file or directory
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libnetdir-fi.so
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libpsm-fi.so): libpsm-fi.so: cannot open shared object file: No such file or directory
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libusnic-fi.so
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libnetdir-fi.so): libnetdir-fi.so: cannot open shared object file: No such file or directory
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libpsm3-fi.so
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libusnic-fi.so): libusnic-fi.so: cannot open shared object file: No such file or directory
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libgni-fi.so
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libpsm3-fi.so): libpsm3-fi.so: cannot open shared object file: No such file or directory
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libucx-fi.so
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libgni-fi.so): libgni-fi.so: cannot open shared object file: No such file or directory
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libbgq-fi.so
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libucx-fi.so): libucx-fi.so: cannot open shared object file: No such file or directory
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib librxm-fi.so
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libbgq-fi.so): libbgq-fi.so: cannot open shared object file: No such file or directory
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libverbs-fi.so
/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp 1691 GDRCopy requested, but unused by transport. Disabling.

libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(librxm-fi.so): librxm-fi.so: cannot open shared object file: No such file or directory
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib librxd-fi.so
libfabric:71717:1713994966::core:core:fi_param_define_():251<debug> registered var perf_cntr
libfabric:71717:1713994966::core:core:fi_param_get_():372<info> variable perf_cntr=<not set>
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libverbs-fi.so): libverbs-fi.so: cannot open shared object file: No such file or directory
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libnetdir-fi.so
libfabric:71717:1713994966::core:core:fi_param_define_():251<debug> registered var hook
libfabric:71717:1713994966::core:core:fi_param_get_():372<info> variable hook=<not set>
libfabric:71717:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:71717:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:71717:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ZE not supported
libfabric:71717:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:71717:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:71717:1713994966::core:core:fi_param_define_():251<debug> registered var hmem_disable_p2p
libfabric:71717:1713994966::core:core:fi_param_get_():372<info> variable hmem_disable_p2p=<not set>
libfabric:71717:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cache_max_size
libfabric:71717:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cache_max_count
libfabric:71717:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cache_monitor
libfabric:71717:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cuda_cache_monitor_enabled
libfabric:71717:1713994966::core:core:fi_param_define_():251<debug> registered var mr_rocr_cache_monitor_enabled
libfabric:71717:1713994966::core:core:fi_param_define_():251<debug> registered var mr_ze_cache_monitor_enabled
libfabric:71717:1713994966::core:core:fi_param_get_():372<info> variable mr_cache_max_size=<not set>
libfabric:71717:1713994966::core:core:fi_param_get_():372<info> variable mr_cache_max_count=<not set>
libfabric:71717:1713994966::core:core:fi_param_get_():372<info> variable mr_cache_monitor=<not set>
libfabric:71717:1713994966::core:core:fi_param_get_():372<info> variable mr_cuda_cache_monitor_enabled=<not set>
libfabric:71717:1713994966::core:core:fi_param_get_():372<info> variable mr_rocr_cache_monitor_enabled=<not set>
libfabric:71717:1713994966::core:core:fi_param_get_():372<info> variable mr_ze_cache_monitor_enabled=<not set>
libfabric:71717:1713994966::core:mr:ofi_default_cache_size():78<info> default cache size=1053667248
libfabric:71717:1713994966::core:core:fi_param_define_():251<debug> registered var provider
libfabric:71717:1713994966::core:core:fi_param_get_():372<info> variable provider=<not set>
libfabric:71717:1713994966::core:core:fi_param_define_():251<debug> registered var fork_unsafe
libfabric:71717:1713994966::core:core:fi_param_define_():251<debug> registered var universe_size
libfabric:71717:1713994966::core:core:fi_param_get_():372<info> variable universe_size=<not set>
libfabric:71717:1713994966::core:core:fi_param_define_():251<debug> registered var av_remove_cleanup
libfabric:71717:1713994966::core:core:fi_param_get_():372<info> variable av_remove_cleanup=<not set>
nid001504:71716:71716 [1] NVSHMEM INFO IBRC transport skipped in favor of: libfabric

libfabric:71717:1713994966::core:core:fi_param_define_():251<debug> registered var offload_coll_provider
libfabric:71717:1713994966::core:core:fi_param_get_():372<info> variable offload_coll_provider=<not set>
libfabric:71717:1713994966::core:core:fi_param_define_():251<debug> registered var provider_path
libfabric:71717:1713994966::core:core:fi_param_get_():372<info> variable provider_path=<not set>
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libefa-fi.so
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(librxd-fi.so): librxd-fi.so: cannot open shared object file: No such file or directory
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libshm-fi.so
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libnetdir-fi.so): libnetdir-fi.so: cannot open shared object file: No such file or directory
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libpsm3-fi.so
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libshm-fi.so): libshm-fi.so: cannot open shared object file: No such file or directory
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libudp-fi.so
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libpsm3-fi.so): libpsm3-fi.so: cannot open shared object file: No such file or directory
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libucx-fi.so
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libefa-fi.so): libefa-fi.so: cannot open shared object file: No such file or directory
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libpsm2-fi.so
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libudp-fi.so): libudp-fi.so: cannot open shared object file: No such file or directory
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libtcp-fi.so
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libpsm2-fi.so): libpsm2-fi.so: cannot open shared object file: No such file or directory
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libopx-fi.so
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libucx-fi.so): libucx-fi.so: cannot open shared object file: No such file or directory
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib librxm-fi.so
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libtcp-fi.so): libtcp-fi.so: cannot open shared object file: No such file or directory
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libsockets-fi.so
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(librxm-fi.so): librxm-fi.so: cannot open shared object file: No such file or directory
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib librxd-fi.so
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libopx-fi.so): libopx-fi.so: cannot open shared object file: No such file or directory
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libpsm-fi.so
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(librxd-fi.so): librxd-fi.so: cannot open shared object file: No such file or directory
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libshm-fi.so
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libpsm-fi.so): libpsm-fi.so: cannot open shared object file: No such file or directory
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libusnic-fi.so
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libsockets-fi.so): libsockets-fi.so: cannot open shared object file: No such file or directory
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libnet-fi.so
/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp 1691 GDRCopy requested, but unused by transport. Disabling.

libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libshm-fi.so): libshm-fi.so: cannot open shared object file: No such file or directory
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libudp-fi.so
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libusnic-fi.so): libusnic-fi.so: cannot open shared object file: No such file or directory
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libgni-fi.so
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libnet-fi.so): libnet-fi.so: cannot open shared object file: No such file or directory
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_perf-fi.so
libfabric:71716:1713994966::core:core:fi_param_define_():251<debug> registered var perf_cntr
libfabric:71716:1713994966::core:core:fi_param_get_():372<info> variable perf_cntr=<not set>
libfabric:71716:1713994966::core:core:fi_param_define_():251<debug> registered var hook
libfabric:71716:1713994966::core:core:fi_param_get_():372<info> variable hook=<not set>
libfabric:71716:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:71716:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:71716:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ZE not supported
libfabric:71716:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:71716:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:71716:1713994966::core:core:fi_param_define_():251<debug> registered var hmem_disable_p2p
libfabric:71716:1713994966::core:core:fi_param_get_():372<info> variable hmem_disable_p2p=<not set>
libfabric:71716:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cache_max_size
libfabric:71716:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cache_max_count
libfabric:71716:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cache_monitor
libfabric:71716:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cuda_cache_monitor_enabled
libfabric:71716:1713994966::core:core:fi_param_define_():251<debug> registered var mr_rocr_cache_monitor_enabled
libfabric:71716:1713994966::core:core:fi_param_define_():251<debug> registered var mr_ze_cache_monitor_enabled
libfabric:71716:1713994966::core:core:fi_param_get_():372<info> variable mr_cache_max_size=<not set>
libfabric:71716:1713994966::core:core:fi_param_get_():372<info> variable mr_cache_max_count=<not set>
libfabric:71716:1713994966::core:core:fi_param_get_():372<info> variable mr_cache_monitor=<not set>
libfabric:71716:1713994966::core:core:fi_param_get_():372<info> variable mr_cuda_cache_monitor_enabled=<not set>
libfabric:71716:1713994966::core:core:fi_param_get_():372<info> variable mr_rocr_cache_monitor_enabled=<not set>
libfabric:71716:1713994966::core:core:fi_param_get_():372<info> variable mr_ze_cache_monitor_enabled=<not set>
libfabric:71716:1713994966::core:mr:ofi_default_cache_size():78<info> default cache size=1053667248
libfabric:71716:1713994966::core:core:fi_param_define_():251<debug> registered var provider
libfabric:71716:1713994966::core:core:fi_param_get_():372<info> variable provider=<not set>
libfabric:71716:1713994966::core:core:fi_param_define_():251<debug> registered var fork_unsafe
libfabric:71716:1713994966::core:core:fi_param_define_():251<debug> registered var universe_size
libfabric:71716:1713994966::core:core:fi_param_get_():372<info> variable universe_size=<not set>
libfabric:71716:1713994966::core:core:fi_param_define_():251<debug> registered var av_remove_cleanup
libfabric:71716:1713994966::core:core:fi_param_get_():372<info> variable av_remove_cleanup=<not set>
libfabric:71716:1713994966::core:core:fi_param_define_():251<debug> registered var offload_coll_provider
libfabric:71716:1713994966::core:core:fi_param_get_():372<info> variable offload_coll_provider=<not set>
libfabric:71716:1713994966::core:core:fi_param_define_():251<debug> registered var provider_path
libfabric:71716:1713994966::core:core:fi_param_get_():372<info> variable provider_path=<not set>
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libefa-fi.so
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libgni-fi.so): libgni-fi.so: cannot open shared object file: No such file or directory
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libbgq-fi.so
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libudp-fi.so): libudp-fi.so: cannot open shared object file: No such file or directory
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libtcp-fi.so
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_perf-fi.so): libhook_perf-fi.so: cannot open shared object file: No such file or directory
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_trace-fi.so
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libtcp-fi.so): libtcp-fi.so: cannot open shared object file: No such file or directory
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libsockets-fi.so
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libbgq-fi.so): libbgq-fi.so: cannot open shared object file: No such file or directory
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libverbs-fi.so
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_trace-fi.so): libhook_trace-fi.so: cannot open shared object file: No such file or directory
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_debug-fi.so
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libefa-fi.so): libefa-fi.so: cannot open shared object file: No such file or directory
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libpsm2-fi.so
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libsockets-fi.so): libsockets-fi.so: cannot open shared object file: No such file or directory
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libnet-fi.so
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libverbs-fi.so): libverbs-fi.so: cannot open shared object file: No such file or directory
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libnetdir-fi.so
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_debug-fi.so): libhook_debug-fi.so: cannot open shared object file: No such file or directory
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_noop-fi.so
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libpsm2-fi.so): libpsm2-fi.so: cannot open shared object file: No such file or directory
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libopx-fi.so
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libnetdir-fi.so): libnetdir-fi.so: cannot open shared object file: No such file or directory
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libpsm3-fi.so
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libnet-fi.so): libnet-fi.so: cannot open shared object file: No such file or directory
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_perf-fi.so
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_noop-fi.so): libhook_noop-fi.so: cannot open shared object file: No such file or directory
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_hmem-fi.so
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libopx-fi.so): libopx-fi.so: cannot open shared object file: No such file or directory
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libpsm-fi.so
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libpsm3-fi.so): libpsm3-fi.so: cannot open shared object file: No such file or directory
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libucx-fi.so
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_perf-fi.so): libhook_perf-fi.so: cannot open shared object file: No such file or directory
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_trace-fi.so
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_hmem-fi.so): libhook_hmem-fi.so: cannot open shared object file: No such file or directory
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_dmabuf_peer_mem-fi.so
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libpsm-fi.so): libpsm-fi.so: cannot open shared object file: No such file or directory
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libusnic-fi.so
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_trace-fi.so): libhook_trace-fi.so: cannot open shared object file: No such file or directory
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_debug-fi.so
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libucx-fi.so): libucx-fi.so: cannot open shared object file: No such file or directory
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib librxm-fi.so
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_dmabuf_peer_mem-fi.so): libhook_dmabuf_peer_mem-fi.so: cannot open shared object file: No such file or directory
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libcoll-fi.so
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libusnic-fi.so): libusnic-fi.so: cannot open shared object file: No such file or directory
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libgni-fi.so
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_debug-fi.so): libhook_debug-fi.so: cannot open shared object file: No such file or directory
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_noop-fi.so
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(librxm-fi.so): librxm-fi.so: cannot open shared object file: No such file or directory
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib librxd-fi.so
libfabric:71715:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libcoll-fi.so): libcoll-fi.so: cannot open shared object file: No such file or directory
libfabric:71715:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71715:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71715:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71715:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71715:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71715:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71715:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71715:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71715:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71715:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71715:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71715:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71715:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71715:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71715:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71715:1713994966::udp:core:fi_param_define_():251<debug> registered var iface
libfabric:71715:1713994966::core:core:ofi_register_provider():466<info> registering provider: udp (118.10)
libfabric:71715:1713994966::sockets:core:fi_param_define_():251<debug> registered var pe_waittime
libfabric:71715:1713994966::sockets:core:fi_param_define_():251<debug> registered var conn_timeout
libfabric:71715:1713994966::sockets:core:fi_param_define_():251<debug> registered var max_conn_retry
libfabric:71715:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_conn_map_sz
libfabric:71715:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_av_sz
libfabric:71715:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_cq_sz
libfabric:71715:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_eq_sz
libfabric:71715:1713994966::sockets:core:fi_param_define_():251<debug> registered var pe_affinity
libfabric:71715:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_enable
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libgni-fi.so): libgni-fi.so: cannot open shared object file: No such file or directory
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libbgq-fi.so
libfabric:71715:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_time
libfabric:71715:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_intvl
libfabric:71715:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_probes
libfabric:71715:1713994966::sockets:core:fi_param_define_():251<debug> registered var iface
libfabric:71715:1713994966::sockets:core:fi_param_define_():251<debug> registered var max_buf_sz
libfabric:71715:1713994966::sockets:core:fi_param_define_():251<debug> registered var dgram_drop_rate
libfabric:71715:1713994966::core:core:ofi_register_provider():466<info> registering provider: sockets (118.10)
libfabric:71715:1713994966::tcp:core:fi_param_define_():251<debug> registered var prov_name
libfabric:71715:1713994966::tcp:core:fi_param_get_():372<info> variable prov_name=<not set>
libfabric:71715:1713994966::tcp:core:fi_param_define_():251<debug> registered var iface
libfabric:71715:1713994966::tcp:core:fi_param_define_():251<debug> registered var port_low_range
libfabric:71715:1713994966::tcp:core:fi_param_define_():251<debug> registered var port_high_range
libfabric:71715:1713994966::tcp:core:fi_param_get_():372<info> variable port_high_range=<not set>
libfabric:71715:1713994966::tcp:core:fi_param_get_():372<info> variable port_low_range=<not set>
libfabric:71715:1713994966::tcp:core:fi_param_define_():251<debug> registered var tx_size
libfabric:71715:1713994966::tcp:core:fi_param_define_():251<debug> registered var rx_size
libfabric:71715:1713994966::tcp:core:fi_param_get_():372<info> variable tx_size=<not set>
libfabric:71715:1713994966::tcp:core:fi_param_get_():372<info> variable rx_size=<not set>
libfabric:71715:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_inject
libfabric:71715:1713994966::tcp:core:fi_param_get_():372<info> variable max_inject=<not set>
libfabric:71715:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_saved
libfabric:71715:1713994966::tcp:core:fi_param_get_():372<info> variable max_saved=<not set>
libfabric:71715:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_saved_size
libfabric:71715:1713994966::tcp:core:fi_param_get_():372<info> variable max_saved_size=<not set>
libfabric:71715:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_rx_size
libfabric:71715:1713994966::tcp:core:fi_param_get_():372<info> variable max_rx_size=<not set>
libfabric:71715:1713994966::tcp:core:fi_param_define_():251<debug> registered var nodelay
libfabric:71715:1713994966::tcp:core:fi_param_get_():372<info> variable nodelay=<not set>
libfabric:71715:1713994966::tcp:core:fi_param_define_():251<debug> registered var staging_sbuf_size
libfabric:71715:1713994966::tcp:core:fi_param_define_():251<debug> registered var prefetch_rbuf_size
libfabric:71715:1713994966::tcp:core:fi_param_define_():251<debug> registered var zerocopy_size
libfabric:71715:1713994966::tcp:core:fi_param_get_():372<info> variable staging_sbuf_size=<not set>
libfabric:71715:1713994966::tcp:core:fi_param_get_():372<info> variable prefetch_rbuf_size=<not set>
libfabric:71715:1713994966::tcp:core:fi_param_get_():372<info> variable zerocopy_size=<not set>
libfabric:71715:1713994966::tcp:core:fi_param_define_():251<debug> registered var trace_msg
libfabric:71715:1713994966::tcp:core:fi_param_get_():372<info> variable trace_msg=<not set>
libfabric:71715:1713994966::tcp:core:fi_param_define_():251<debug> registered var disable_auto_progress
libfabric:71715:1713994966::tcp:core:fi_param_get_():372<info> variable disable_auto_progress=<not set>
libfabric:71715:1713994966::tcp:core:fi_param_define_():251<debug> registered var io_uring
libfabric:71715:1713994966::tcp:core:fi_param_get_():372<info> variable io_uring=<not set>
libfabric:71715:1713994966::core:core:ofi_register_provider():466<info> registering provider: tcp (118.10)
libfabric:71715:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_perf (118.10)
libfabric:71715:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_trace (118.10)
libfabric:71715:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_debug (118.10)
libfabric:71715:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:71715:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:71715:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ZE not supported
libfabric:71715:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:71715:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:71715:1713994966::core:core:fi_param_define_():251<debug> registered var hmem_disable_p2p
libfabric:71715:1713994966::core:core:fi_param_get_():372<info> variable hmem_disable_p2p=<not set>
libfabric:71715:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_hmem (118.10)
libfabric:71715:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_dmabuf_peer_mem (118.10)
libfabric:71715:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_noop (118.10)
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_noop-fi.so): libhook_noop-fi.so: cannot open shared object file: No such file or directory
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_hmem-fi.so
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(librxd-fi.so): librxd-fi.so: cannot open shared object file: No such file or directory
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libshm-fi.so
libfabric:71715:1713994966::core:core:ofi_register_provider():466<info> registering provider: off_coll (118.10)
libfabric:71715:1713994966::udp:core:util_getinfo():157<debug> checking info
libfabric:71715:1713994966::udp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:71715:1713994966::udp:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:71715:1713994966::udp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:71715:1713994966::core:core:fi_getinfo_():1251<info> fi_getinfo: provider udp returned -61 (No data available)
libfabric:71715:1713994966::tcp:core:util_getinfo():157<debug> checking info
libfabric:71715:1713994966::tcp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:71715:1713994966::tcp:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:71715:1713994966::tcp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:71715:1713994966::tcp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:71715:1713994966::tcp:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:71715:1713994966::tcp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:71715:1713994966::tcp:core:ofi_check_info():1060<info> Unsupported capabilities
libfabric:71715:1713994966::tcp:core:ofi_check_info():1061<info> Supported: FI_MSG, FI_RMA, FI_TAGGED, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_RMA_PMEM, FI_LOCAL_COMM, FI_REMOTE_COMM, FI_RMA_EVENT, FI_SOURCE, FI_DIRECTED_RECV
libfabric:71715:1713994966::tcp:core:ofi_check_info():1061<info> Requested: FI_RMA, FI_ATOMIC, FI_FENCE, FI_HMEM
libfabric:71715:1713994966::core:core:fi_getinfo_():1251<info> fi_getinfo: provider tcp returned -61 (No data available)
libfabric:71715:1713994966::sockets:core:util_getinfo():157<debug> checking info
libfabric:71715:1713994966::sockets:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:71715:1713994966::sockets:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:71715:1713994966::sockets:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:71715:1713994966::sockets:core:ofi_check_info():1060<info> Unsupported capabilities
libfabric:71715:1713994966::sockets:core:ofi_check_info():1061<info> Supported: FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_FENCE, FI_RMA_PMEM, FI_LOCAL_COMM, FI_REMOTE_COMM, FI_SHARED_AV, FI_RMA_EVENT, FI_SOURCE, FI_NAMED_RX_CTX, FI_DIRECTED_RECV
libfabric:71715:1713994966::sockets:core:ofi_check_info():1061<info> Requested: FI_RMA, FI_ATOMIC, FI_FENCE, FI_HMEM
libfabric:71715:1713994966::sockets:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:71715:1713994966::sockets:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:71715:1713994966::sockets:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:71715:1713994966::core:core:fi_getinfo_():1251<info> fi_getinfo: provider sockets returned -61 (No data available)
/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:1524: non-zero status: -61 No providers matched fi_getinfo query: -61: No data available

[nid001504:71715:0:71715] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libbgq-fi.so): libbgq-fi.so: cannot open shared object file: No such file or directory
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libverbs-fi.so
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_hmem-fi.so): libhook_hmem-fi.so: cannot open shared object file: No such file or directory
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_dmabuf_peer_mem-fi.so
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libshm-fi.so): libshm-fi.so: cannot open shared object file: No such file or directory
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libudp-fi.so
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libverbs-fi.so): libverbs-fi.so: cannot open shared object file: No such file or directory
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libnetdir-fi.so
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_dmabuf_peer_mem-fi.so): libhook_dmabuf_peer_mem-fi.so: cannot open shared object file: No such file or directory
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libcoll-fi.so
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libudp-fi.so): libudp-fi.so: cannot open shared object file: No such file or directory
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libtcp-fi.so
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libnetdir-fi.so): libnetdir-fi.so: cannot open shared object file: No such file or directory
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libpsm3-fi.so
libfabric:71718:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libcoll-fi.so): libcoll-fi.so: cannot open shared object file: No such file or directory
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libtcp-fi.so): libtcp-fi.so: cannot open shared object file: No such file or directory
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libsockets-fi.so
libfabric:71718:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71718:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71718:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71718:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71718:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71718:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71718:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71718:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71718:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71718:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71718:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71718:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71718:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71718:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71718:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71718:1713994966::udp:core:fi_param_define_():251<debug> registered var iface
libfabric:71718:1713994966::core:core:ofi_register_provider():466<info> registering provider: udp (118.10)
libfabric:71718:1713994966::sockets:core:fi_param_define_():251<debug> registered var pe_waittime
libfabric:71718:1713994966::sockets:core:fi_param_define_():251<debug> registered var conn_timeout
libfabric:71718:1713994966::sockets:core:fi_param_define_():251<debug> registered var max_conn_retry
libfabric:71718:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_conn_map_sz
libfabric:71718:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_av_sz
libfabric:71718:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_cq_sz
libfabric:71718:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_eq_sz
libfabric:71718:1713994966::sockets:core:fi_param_define_():251<debug> registered var pe_affinity
libfabric:71718:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_enable
libfabric:71718:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_time
libfabric:71718:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_intvl
libfabric:71718:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_probes
libfabric:71718:1713994966::sockets:core:fi_param_define_():251<debug> registered var iface
libfabric:71718:1713994966::sockets:core:fi_param_define_():251<debug> registered var max_buf_sz
libfabric:71718:1713994966::sockets:core:fi_param_define_():251<debug> registered var dgram_drop_rate
libfabric:71718:1713994966::core:core:ofi_register_provider():466<info> registering provider: sockets (118.10)
libfabric:71718:1713994966::tcp:core:fi_param_define_():251<debug> registered var prov_name
libfabric:71718:1713994966::tcp:core:fi_param_get_():372<info> variable prov_name=<not set>
libfabric:71718:1713994966::tcp:core:fi_param_define_():251<debug> registered var iface
libfabric:71718:1713994966::tcp:core:fi_param_define_():251<debug> registered var port_low_range
libfabric:71718:1713994966::tcp:core:fi_param_define_():251<debug> registered var port_high_range
libfabric:71718:1713994966::tcp:core:fi_param_get_():372<info> variable port_high_range=<not set>
libfabric:71718:1713994966::tcp:core:fi_param_get_():372<info> variable port_low_range=<not set>
libfabric:71718:1713994966::tcp:core:fi_param_define_():251<debug> registered var tx_size
libfabric:71718:1713994966::tcp:core:fi_param_define_():251<debug> registered var rx_size
libfabric:71718:1713994966::tcp:core:fi_param_get_():372<info> variable tx_size=<not set>
libfabric:71718:1713994966::tcp:core:fi_param_get_():372<info> variable rx_size=<not set>
libfabric:71718:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_inject
libfabric:71718:1713994966::tcp:core:fi_param_get_():372<info> variable max_inject=<not set>
libfabric:71718:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_saved
libfabric:71718:1713994966::tcp:core:fi_param_get_():372<info> variable max_saved=<not set>
libfabric:71718:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_saved_size
libfabric:71718:1713994966::tcp:core:fi_param_get_():372<info> variable max_saved_size=<not set>
libfabric:71718:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_rx_size
libfabric:71718:1713994966::tcp:core:fi_param_get_():372<info> variable max_rx_size=<not set>
libfabric:71718:1713994966::tcp:core:fi_param_define_():251<debug> registered var nodelay
libfabric:71718:1713994966::tcp:core:fi_param_get_():372<info> variable nodelay=<not set>
libfabric:71718:1713994966::tcp:core:fi_param_define_():251<debug> registered var staging_sbuf_size
libfabric:71718:1713994966::tcp:core:fi_param_define_():251<debug> registered var prefetch_rbuf_size
libfabric:71718:1713994966::tcp:core:fi_param_define_():251<debug> registered var zerocopy_size
libfabric:71718:1713994966::tcp:core:fi_param_get_():372<info> variable staging_sbuf_size=<not set>
libfabric:71718:1713994966::tcp:core:fi_param_get_():372<info> variable prefetch_rbuf_size=<not set>
libfabric:71718:1713994966::tcp:core:fi_param_get_():372<info> variable zerocopy_size=<not set>
libfabric:71718:1713994966::tcp:core:fi_param_define_():251<debug> registered var trace_msg
libfabric:71718:1713994966::tcp:core:fi_param_get_():372<info> variable trace_msg=<not set>
libfabric:71718:1713994966::tcp:core:fi_param_define_():251<debug> registered var disable_auto_progress
libfabric:71718:1713994966::tcp:core:fi_param_get_():372<info> variable disable_auto_progress=<not set>
libfabric:71718:1713994966::tcp:core:fi_param_define_():251<debug> registered var io_uring
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libpsm3-fi.so): libpsm3-fi.so: cannot open shared object file: No such file or directory
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libucx-fi.so
libfabric:71718:1713994966::tcp:core:fi_param_get_():372<info> variable io_uring=<not set>
libfabric:71718:1713994966::core:core:ofi_register_provider():466<info> registering provider: tcp (118.10)
libfabric:71718:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_perf (118.10)
libfabric:71718:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_trace (118.10)
libfabric:71718:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_debug (118.10)
libfabric:71718:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:71718:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:71718:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ZE not supported
libfabric:71718:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:71718:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:71718:1713994966::core:core:fi_param_define_():251<debug> registered var hmem_disable_p2p
libfabric:71718:1713994966::core:core:fi_param_get_():372<info> variable hmem_disable_p2p=<not set>
libfabric:71718:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_hmem (118.10)
libfabric:71718:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_dmabuf_peer_mem (118.10)
libfabric:71718:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_noop (118.10)
libfabric:71718:1713994966::core:core:ofi_register_provider():466<info> registering provider: off_coll (118.10)
libfabric:71718:1713994966::udp:core:util_getinfo():157<debug> checking info
libfabric:71718:1713994966::udp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:71718:1713994966::udp:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:71718:1713994966::udp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:71718:1713994966::core:core:fi_getinfo_():1251<info> fi_getinfo: provider udp returned -61 (No data available)
libfabric:71718:1713994966::tcp:core:util_getinfo():157<debug> checking info
libfabric:71718:1713994966::tcp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:71718:1713994966::tcp:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:71718:1713994966::tcp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:71718:1713994966::tcp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:71718:1713994966::tcp:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:71718:1713994966::tcp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:71718:1713994966::tcp:core:ofi_check_info():1060<info> Unsupported capabilities
libfabric:71718:1713994966::tcp:core:ofi_check_info():1061<info> Supported: FI_MSG, FI_RMA, FI_TAGGED, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_RMA_PMEM, FI_LOCAL_COMM, FI_REMOTE_COMM, FI_RMA_EVENT, FI_SOURCE, FI_DIRECTED_RECV
libfabric:71718:1713994966::tcp:core:ofi_check_info():1061<info> Requested: FI_RMA, FI_ATOMIC, FI_FENCE, FI_HMEM
libfabric:71718:1713994966::core:core:fi_getinfo_():1251<info> fi_getinfo: provider tcp returned -61 (No data available)
libfabric:71718:1713994966::sockets:core:util_getinfo():157<debug> checking info
libfabric:71718:1713994966::sockets:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:71718:1713994966::sockets:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:71718:1713994966::sockets:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:71718:1713994966::sockets:core:ofi_check_info():1060<info> Unsupported capabilities
libfabric:71718:1713994966::sockets:core:ofi_check_info():1061<info> Supported: FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_FENCE, FI_RMA_PMEM, FI_LOCAL_COMM, FI_REMOTE_COMM, FI_SHARED_AV, FI_RMA_EVENT, FI_SOURCE, FI_NAMED_RX_CTX, FI_DIRECTED_RECV
libfabric:71718:1713994966::sockets:core:ofi_check_info():1061<info> Requested: FI_RMA, FI_ATOMIC, FI_FENCE, FI_HMEM
libfabric:71718:1713994966::sockets:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:71718:1713994966::sockets:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:71718:1713994966::sockets:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:71718:1713994966::core:core:fi_getinfo_():1251<info> fi_getinfo: provider sockets returned -61 (No data available)
/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:1524: non-zero status: -61 No providers matched fi_getinfo query: -61: No data available

libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libsockets-fi.so): libsockets-fi.so: cannot open shared object file: No such file or directory
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libnet-fi.so
[nid001504:71718:0:71718] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libucx-fi.so): libucx-fi.so: cannot open shared object file: No such file or directory
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib librxm-fi.so
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libnet-fi.so): libnet-fi.so: cannot open shared object file: No such file or directory
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_perf-fi.so
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(librxm-fi.so): librxm-fi.so: cannot open shared object file: No such file or directory
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib librxd-fi.so
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_perf-fi.so): libhook_perf-fi.so: cannot open shared object file: No such file or directory
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_trace-fi.so
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(librxd-fi.so): librxd-fi.so: cannot open shared object file: No such file or directory
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libshm-fi.so
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_trace-fi.so): libhook_trace-fi.so: cannot open shared object file: No such file or directory
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_debug-fi.so
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libshm-fi.so): libshm-fi.so: cannot open shared object file: No such file or directory
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libudp-fi.so
==== backtrace (tid:  71715) ====
 0  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ddcazrdaa6qbatqyqgcr34p5ejdc47as/lib/libucs.so.0(ucs_handle_error+0x294) [0x1458c2a80394]
 1  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ddcazrdaa6qbatqyqgcr34p5ejdc47as/lib/libucs.so.0(+0x30564) [0x1458c2a80564]
==== backtrace (tid:  71718) ====
 0  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ddcazrdaa6qbatqyqgcr34p5ejdc47as/lib/libucs.so.0(ucs_handle_error+0x294) [0x14c4ad973394]
 1  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ddcazrdaa6qbatqyqgcr34p5ejdc47as/lib/libucs.so.0(+0x30564) [0x14c4ad973564]
 2  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ddcazrdaa6qbatqyqgcr34p5ejdc47as/lib/libucs.so.0(+0x3082e) [0x14c4ad97382e]
 3  /lib64/libpthread.so.0(+0x168c0) [0x14c4af99a8c0]
 4  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(+0xb9fd) [0x14c4a008e9fd]
 5  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(nvshmemt_init+0x11d2) [0x14c4a0090312]
 6  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xe6c89) [0x14c4b022dc89]
 7  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xec01c) [0x14c4b023301c]
 8  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xede99) [0x14c4b0234e99]
 9  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(nvshmemx_host_init+0xae) [0x14c4b023530e]
10  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x52e635) [0x14c4b25a4635]
11  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x11b4c3) [0x14c4b21914c3]
 2  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ddcazrdaa6qbatqyqgcr34p5ejdc47as/lib/libucs.so.0(+0x3082e) [0x1458c2a8082e]
 3  /lib64/libpthread.so.0(+0x168c0) [0x1458c4aa78c0]
 4  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(+0xb9fd) [0x1458b03269fd]
 5  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(nvshmemt_init+0x11d2) [0x1458b0328312]
 6  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xe6c89) [0x1458c533ac89]
 7  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xec01c) [0x1458c534001c]
 8  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xede99) [0x1458c5341e99]
 9  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(nvshmemx_host_init+0xae) [0x1458c534230e]
10  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x52e635) [0x1458c76b1635]
11  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x11b4c3) [0x1458c729e4c3]
12  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanGuru64+0x120) [0x1458c72ca450]
13  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanMany+0x135) [0x1458c72cb105]
14  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany64+0x5d) [0x1458c72c8d7d]
15  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany+0x10f) [0x1458c72c91bf]
16  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlan3d+0x36) [0x1458c72c9386]
17  cufftmp_r2c_c2r_slabs_GROMACS() [0x4067f9]
18  cufftmp_r2c_c2r_slabs_GROMACS() [0x407ba7]
19  /lib64/libc.so.6(__libc_start_main+0xef) [0x1458c3b4029d]
20  cufftmp_r2c_c2r_slabs_GROMACS() [0x40591a]
=================================
12  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanGuru64+0x120) [0x14c4b21bd450]
13  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanMany+0x135) [0x14c4b21be105]
14  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany64+0x5d) [0x14c4b21bbd7d]
15  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany+0x10f) [0x14c4b21bc1bf]
16  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlan3d+0x36) [0x14c4b21bc386]
17  cufftmp_r2c_c2r_slabs_GROMACS() [0x4067f9]
18  cufftmp_r2c_c2r_slabs_GROMACS() [0x407ba7]
19  /lib64/libc.so.6(__libc_start_main+0xef) [0x14c4aea3329d]
20  cufftmp_r2c_c2r_slabs_GROMACS() [0x40591a]
=================================
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_debug-fi.so): libhook_debug-fi.so: cannot open shared object file: No such file or directory
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_noop-fi.so
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libudp-fi.so): libudp-fi.so: cannot open shared object file: No such file or directory
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libtcp-fi.so
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_noop-fi.so): libhook_noop-fi.so: cannot open shared object file: No such file or directory
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_hmem-fi.so
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libtcp-fi.so): libtcp-fi.so: cannot open shared object file: No such file or directory
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libsockets-fi.so
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_hmem-fi.so): libhook_hmem-fi.so: cannot open shared object file: No such file or directory
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_dmabuf_peer_mem-fi.so
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libsockets-fi.so): libsockets-fi.so: cannot open shared object file: No such file or directory
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libnet-fi.so
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_dmabuf_peer_mem-fi.so): libhook_dmabuf_peer_mem-fi.so: cannot open shared object file: No such file or directory
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libcoll-fi.so
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libnet-fi.so): libnet-fi.so: cannot open shared object file: No such file or directory
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_perf-fi.so
libfabric:71717:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libcoll-fi.so): libcoll-fi.so: cannot open shared object file: No such file or directory
libfabric:71717:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71717:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71717:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71717:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71717:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71717:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71717:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71717:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71717:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71717:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71717:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71717:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71717:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71717:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71717:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71717:1713994966::udp:core:fi_param_define_():251<debug> registered var iface
libfabric:71717:1713994966::core:core:ofi_register_provider():466<info> registering provider: udp (118.10)
libfabric:71717:1713994966::sockets:core:fi_param_define_():251<debug> registered var pe_waittime
libfabric:71717:1713994966::sockets:core:fi_param_define_():251<debug> registered var conn_timeout
libfabric:71717:1713994966::sockets:core:fi_param_define_():251<debug> registered var max_conn_retry
libfabric:71717:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_conn_map_sz
libfabric:71717:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_av_sz
libfabric:71717:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_cq_sz
libfabric:71717:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_eq_sz
libfabric:71717:1713994966::sockets:core:fi_param_define_():251<debug> registered var pe_affinity
libfabric:71717:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_enable
libfabric:71717:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_time
libfabric:71717:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_intvl
libfabric:71717:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_probes
libfabric:71717:1713994966::sockets:core:fi_param_define_():251<debug> registered var iface
libfabric:71717:1713994966::sockets:core:fi_param_define_():251<debug> registered var max_buf_sz
libfabric:71717:1713994966::sockets:core:fi_param_define_():251<debug> registered var dgram_drop_rate
libfabric:71717:1713994966::core:core:ofi_register_provider():466<info> registering provider: sockets (118.10)
libfabric:71717:1713994966::tcp:core:fi_param_define_():251<debug> registered var prov_name
libfabric:71717:1713994966::tcp:core:fi_param_get_():372<info> variable prov_name=<not set>
libfabric:71717:1713994966::tcp:core:fi_param_define_():251<debug> registered var iface
libfabric:71717:1713994966::tcp:core:fi_param_define_():251<debug> registered var port_low_range
libfabric:71717:1713994966::tcp:core:fi_param_define_():251<debug> registered var port_high_range
libfabric:71717:1713994966::tcp:core:fi_param_get_():372<info> variable port_high_range=<not set>
libfabric:71717:1713994966::tcp:core:fi_param_get_():372<info> variable port_low_range=<not set>
libfabric:71717:1713994966::tcp:core:fi_param_define_():251<debug> registered var tx_size
libfabric:71717:1713994966::tcp:core:fi_param_define_():251<debug> registered var rx_size
libfabric:71717:1713994966::tcp:core:fi_param_get_():372<info> variable tx_size=<not set>
libfabric:71717:1713994966::tcp:core:fi_param_get_():372<info> variable rx_size=<not set>
libfabric:71717:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_inject
libfabric:71717:1713994966::tcp:core:fi_param_get_():372<info> variable max_inject=<not set>
libfabric:71717:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_saved
libfabric:71717:1713994966::tcp:core:fi_param_get_():372<info> variable max_saved=<not set>
libfabric:71717:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_saved_size
libfabric:71717:1713994966::tcp:core:fi_param_get_():372<info> variable max_saved_size=<not set>
libfabric:71717:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_rx_size
libfabric:71717:1713994966::tcp:core:fi_param_get_():372<info> variable max_rx_size=<not set>
libfabric:71717:1713994966::tcp:core:fi_param_define_():251<debug> registered var nodelay
libfabric:71717:1713994966::tcp:core:fi_param_get_():372<info> variable nodelay=<not set>
libfabric:71717:1713994966::tcp:core:fi_param_define_():251<debug> registered var staging_sbuf_size
libfabric:71717:1713994966::tcp:core:fi_param_define_():251<debug> registered var prefetch_rbuf_size
libfabric:71717:1713994966::tcp:core:fi_param_define_():251<debug> registered var zerocopy_size
libfabric:71717:1713994966::tcp:core:fi_param_get_():372<info> variable staging_sbuf_size=<not set>
libfabric:71717:1713994966::tcp:core:fi_param_get_():372<info> variable prefetch_rbuf_size=<not set>
libfabric:71717:1713994966::tcp:core:fi_param_get_():372<info> variable zerocopy_size=<not set>
libfabric:71717:1713994966::tcp:core:fi_param_define_():251<debug> registered var trace_msg
libfabric:71717:1713994966::tcp:core:fi_param_get_():372<info> variable trace_msg=<not set>
libfabric:71717:1713994966::tcp:core:fi_param_define_():251<debug> registered var disable_auto_progress
libfabric:71717:1713994966::tcp:core:fi_param_get_():372<info> variable disable_auto_progress=<not set>
libfabric:71717:1713994966::tcp:core:fi_param_define_():251<debug> registered var io_uring
libfabric:71717:1713994966::tcp:core:fi_param_get_():372<info> variable io_uring=<not set>
libfabric:71717:1713994966::core:core:ofi_register_provider():466<info> registering provider: tcp (118.10)
libfabric:71717:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_perf (118.10)
libfabric:71717:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_trace (118.10)
libfabric:71717:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_debug (118.10)
libfabric:71717:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:71717:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:71717:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ZE not supported
libfabric:71717:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:71717:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:71717:1713994966::core:core:fi_param_define_():251<debug> registered var hmem_disable_p2p
libfabric:71717:1713994966::core:core:fi_param_get_():372<info> variable hmem_disable_p2p=<not set>
libfabric:71717:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_hmem (118.10)
libfabric:71717:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_dmabuf_peer_mem (118.10)
libfabric:71717:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_noop (118.10)
libfabric:71717:1713994966::core:core:ofi_register_provider():466<info> registering provider: off_coll (118.10)
libfabric:71717:1713994966::udp:core:util_getinfo():157<debug> checking info
libfabric:71717:1713994966::udp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:71717:1713994966::udp:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:71717:1713994966::udp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:71717:1713994966::core:core:fi_getinfo_():1251<info> fi_getinfo: provider udp returned -61 (No data available)
libfabric:71717:1713994966::tcp:core:util_getinfo():157<debug> checking info
libfabric:71717:1713994966::tcp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:71717:1713994966::tcp:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:71717:1713994966::tcp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:71717:1713994966::tcp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:71717:1713994966::tcp:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:71717:1713994966::tcp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:71717:1713994966::tcp:core:ofi_check_info():1060<info> Unsupported capabilities
libfabric:71717:1713994966::tcp:core:ofi_check_info():1061<info> Supported: FI_MSG, FI_RMA, FI_TAGGED, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_RMA_PMEM, FI_LOCAL_COMM, FI_REMOTE_COMM, FI_RMA_EVENT, FI_SOURCE, FI_DIRECTED_RECV
libfabric:71717:1713994966::tcp:core:ofi_check_info():1061<info> Requested: FI_RMA, FI_ATOMIC, FI_FENCE, FI_HMEM
libfabric:71717:1713994966::core:core:fi_getinfo_():1251<info> fi_getinfo: provider tcp returned -61 (No data available)
libfabric:71717:1713994966::sockets:core:util_getinfo():157<debug> checking info
libfabric:71717:1713994966::sockets:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:71717:1713994966::sockets:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:71717:1713994966::sockets:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:71717:1713994966::sockets:core:ofi_check_info():1060<info> Unsupported capabilities
libfabric:71717:1713994966::sockets:core:ofi_check_info():1061<info> Supported: FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_FENCE, FI_RMA_PMEM, FI_LOCAL_COMM, FI_REMOTE_COMM, FI_SHARED_AV, FI_RMA_EVENT, FI_SOURCE, FI_NAMED_RX_CTX, FI_DIRECTED_RECV
libfabric:71717:1713994966::sockets:core:ofi_check_info():1061<info> Requested: FI_RMA, FI_ATOMIC, FI_FENCE, FI_HMEM
libfabric:71717:1713994966::sockets:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:71717:1713994966::sockets:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:71717:1713994966::sockets:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:71717:1713994966::core:core:fi_getinfo_():1251<info> fi_getinfo: provider sockets returned -61 (No data available)
/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:1524: non-zero status: -61 No providers matched fi_getinfo query: -61: No data available

libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_perf-fi.so): libhook_perf-fi.so: cannot open shared object file: No such file or directory
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_trace-fi.so
[nid001504:71717:0:71717] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace (tid:  71717) ====
 0  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ddcazrdaa6qbatqyqgcr34p5ejdc47as/lib/libucs.so.0(ucs_handle_error+0x294) [0x147af737c394]
 1  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ddcazrdaa6qbatqyqgcr34p5ejdc47as/lib/libucs.so.0(+0x30564) [0x147af737c564]
 2  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ddcazrdaa6qbatqyqgcr34p5ejdc47as/lib/libucs.so.0(+0x3082e) [0x147af737c82e]
 3  /lib64/libpthread.so.0(+0x168c0) [0x147af93a38c0]
 4  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(+0xb9fd) [0x147ae04009fd]
 5  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(nvshmemt_init+0x11d2) [0x147ae0402312]
 6  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xe6c89) [0x147af9c36c89]
 7  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xec01c) [0x147af9c3c01c]
 8  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xede99) [0x147af9c3de99]
 9  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(nvshmemx_host_init+0xae) [0x147af9c3e30e]
10  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x52e635) [0x147afbfad635]
11  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x11b4c3) [0x147afbb9a4c3]
nid001500:60834:60834 [3] NVSHMEM INFO IBRC transport skipped in favor of: libfabric

12  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanGuru64+0x120) [0x147afbbc6450]
13  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanMany+0x135) [0x147afbbc7105]
14  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany64+0x5d) [0x147afbbc4d7d]
15  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany+0x10f) [0x147afbbc51bf]
16  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlan3d+0x36) [0x147afbbc5386]
17  cufftmp_r2c_c2r_slabs_GROMACS() [0x4067f9]
18  cufftmp_r2c_c2r_slabs_GROMACS() [0x407ba7]
19  /lib64/libc.so.6(__libc_start_main+0xef) [0x147af843c29d]
20  cufftmp_r2c_c2r_slabs_GROMACS() [0x40591a]
=================================
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_trace-fi.so): libhook_trace-fi.so: cannot open shared object file: No such file or directory
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_debug-fi.so
nid001500:60832:60832 [1] NVSHMEM INFO IBRC transport skipped in favor of: libfabric

libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_debug-fi.so): libhook_debug-fi.so: cannot open shared object file: No such file or directory
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_noop-fi.so
nid001500:60833:60833 [2] NVSHMEM INFO IBRC transport skipped in favor of: libfabric

nid001500:60831:60831 [0] NVSHMEM INFO IBRC transport skipped in favor of: libfabric

/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp 1691 GDRCopy requested, but unused by transport. Disabling.

libfabric:60834:1713994966::core:core:fi_param_define_():251<debug> registered var perf_cntr
libfabric:60834:1713994966::core:core:fi_param_get_():372<info> variable perf_cntr=<not set>
libfabric:60834:1713994966::core:core:fi_param_define_():251<debug> registered var hook
libfabric:60834:1713994966::core:core:fi_param_get_():372<info> variable hook=<not set>
libfabric:60834:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:60834:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:60834:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ZE not supported
libfabric:60834:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:60834:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:60834:1713994966::core:core:fi_param_define_():251<debug> registered var hmem_disable_p2p
libfabric:60834:1713994966::core:core:fi_param_get_():372<info> variable hmem_disable_p2p=<not set>
libfabric:60834:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cache_max_size
libfabric:60834:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cache_max_count
libfabric:60834:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cache_monitor
libfabric:60834:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cuda_cache_monitor_enabled
libfabric:60834:1713994966::core:core:fi_param_define_():251<debug> registered var mr_rocr_cache_monitor_enabled
libfabric:60834:1713994966::core:core:fi_param_define_():251<debug> registered var mr_ze_cache_monitor_enabled
libfabric:60834:1713994966::core:core:fi_param_get_():372<info> variable mr_cache_max_size=<not set>
libfabric:60834:1713994966::core:core:fi_param_get_():372<info> variable mr_cache_max_count=<not set>
libfabric:60834:1713994966::core:core:fi_param_get_():372<info> variable mr_cache_monitor=<not set>
libfabric:60834:1713994966::core:core:fi_param_get_():372<info> variable mr_cuda_cache_monitor_enabled=<not set>
libfabric:60834:1713994966::core:core:fi_param_get_():372<info> variable mr_rocr_cache_monitor_enabled=<not set>
libfabric:60834:1713994966::core:core:fi_param_get_():372<info> variable mr_ze_cache_monitor_enabled=<not set>
libfabric:60834:1713994966::core:mr:ofi_default_cache_size():78<info> default cache size=1053684672
libfabric:60834:1713994966::core:core:fi_param_define_():251<debug> registered var provider
libfabric:60834:1713994966::core:core:fi_param_get_():372<info> variable provider=<not set>
libfabric:60834:1713994966::core:core:fi_param_define_():251<debug> registered var fork_unsafe
libfabric:60834:1713994966::core:core:fi_param_define_():251<debug> registered var universe_size
libfabric:60834:1713994966::core:core:fi_param_get_():372<info> variable universe_size=<not set>
libfabric:60834:1713994966::core:core:fi_param_define_():251<debug> registered var av_remove_cleanup
libfabric:60834:1713994966::core:core:fi_param_get_():372<info> variable av_remove_cleanup=<not set>
libfabric:60834:1713994966::core:core:fi_param_define_():251<debug> registered var offload_coll_provider
libfabric:60834:1713994966::core:core:fi_param_get_():372<info> variable offload_coll_provider=<not set>
libfabric:60834:1713994966::core:core:fi_param_define_():251<debug> registered var provider_path
libfabric:60834:1713994966::core:core:fi_param_get_():372<info> variable provider_path=<not set>
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libefa-fi.so
/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp 1691 GDRCopy requested, but unused by transport. Disabling.

libfabric:60832:1713994966::core:core:fi_param_define_():251<debug> registered var perf_cntr
libfabric:60832:1713994966::core:core:fi_param_get_():372<info> variable perf_cntr=<not set>
libfabric:60832:1713994966::core:core:fi_param_define_():251<debug> registered var hook
libfabric:60832:1713994966::core:core:fi_param_get_():372<info> variable hook=<not set>
libfabric:60832:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:60832:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:60832:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ZE not supported
libfabric:60832:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:60832:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:60832:1713994966::core:core:fi_param_define_():251<debug> registered var hmem_disable_p2p
libfabric:60832:1713994966::core:core:fi_param_get_():372<info> variable hmem_disable_p2p=<not set>
libfabric:60832:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cache_max_size
libfabric:60832:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cache_max_count
libfabric:60832:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cache_monitor
libfabric:60832:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cuda_cache_monitor_enabled
libfabric:60832:1713994966::core:core:fi_param_define_():251<debug> registered var mr_rocr_cache_monitor_enabled
libfabric:60832:1713994966::core:core:fi_param_define_():251<debug> registered var mr_ze_cache_monitor_enabled
libfabric:60832:1713994966::core:core:fi_param_get_():372<info> variable mr_cache_max_size=<not set>
libfabric:60832:1713994966::core:core:fi_param_get_():372<info> variable mr_cache_max_count=<not set>
libfabric:60832:1713994966::core:core:fi_param_get_():372<info> variable mr_cache_monitor=<not set>
libfabric:60832:1713994966::core:core:fi_param_get_():372<info> variable mr_cuda_cache_monitor_enabled=<not set>
libfabric:60832:1713994966::core:core:fi_param_get_():372<info> variable mr_rocr_cache_monitor_enabled=<not set>
libfabric:60832:1713994966::core:core:fi_param_get_():372<info> variable mr_ze_cache_monitor_enabled=<not set>
libfabric:60832:1713994966::core:mr:ofi_default_cache_size():78<info> default cache size=1053684672
libfabric:60832:1713994966::core:core:fi_param_define_():251<debug> registered var provider
libfabric:60832:1713994966::core:core:fi_param_get_():372<info> variable provider=<not set>
libfabric:60832:1713994966::core:core:fi_param_define_():251<debug> registered var fork_unsafe
libfabric:60832:1713994966::core:core:fi_param_define_():251<debug> registered var universe_size
libfabric:60832:1713994966::core:core:fi_param_get_():372<info> variable universe_size=<not set>
libfabric:60832:1713994966::core:core:fi_param_define_():251<debug> registered var av_remove_cleanup
libfabric:60832:1713994966::core:core:fi_param_get_():372<info> variable av_remove_cleanup=<not set>
libfabric:60832:1713994966::core:core:fi_param_define_():251<debug> registered var offload_coll_provider
libfabric:60832:1713994966::core:core:fi_param_get_():372<info> variable offload_coll_provider=<not set>
libfabric:60832:1713994966::core:core:fi_param_define_():251<debug> registered var provider_path
libfabric:60832:1713994966::core:core:fi_param_get_():372<info> variable provider_path=<not set>
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libefa-fi.so
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libefa-fi.so): libefa-fi.so: cannot open shared object file: No such file or directory
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libpsm2-fi.so
/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp 1691 GDRCopy requested, but unused by transport. Disabling.

libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_noop-fi.so): libhook_noop-fi.so: cannot open shared object file: No such file or directory
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_hmem-fi.so
libfabric:60833:1713994966::core:core:fi_param_define_():251<debug> registered var perf_cntr
libfabric:60833:1713994966::core:core:fi_param_get_():372<info> variable perf_cntr=<not set>
libfabric:60833:1713994966::core:core:fi_param_define_():251<debug> registered var hook
libfabric:60833:1713994966::core:core:fi_param_get_():372<info> variable hook=<not set>
libfabric:60833:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:60833:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:60833:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ZE not supported
libfabric:60833:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:60833:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:60833:1713994966::core:core:fi_param_define_():251<debug> registered var hmem_disable_p2p
libfabric:60833:1713994966::core:core:fi_param_get_():372<info> variable hmem_disable_p2p=<not set>
libfabric:60833:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cache_max_size
libfabric:60833:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cache_max_count
libfabric:60833:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cache_monitor
libfabric:60833:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cuda_cache_monitor_enabled
libfabric:60833:1713994966::core:core:fi_param_define_():251<debug> registered var mr_rocr_cache_monitor_enabled
libfabric:60833:1713994966::core:core:fi_param_define_():251<debug> registered var mr_ze_cache_monitor_enabled
libfabric:60833:1713994966::core:core:fi_param_get_():372<info> variable mr_cache_max_size=<not set>
libfabric:60833:1713994966::core:core:fi_param_get_():372<info> variable mr_cache_max_count=<not set>
libfabric:60833:1713994966::core:core:fi_param_get_():372<info> variable mr_cache_monitor=<not set>
libfabric:60833:1713994966::core:core:fi_param_get_():372<info> variable mr_cuda_cache_monitor_enabled=<not set>
libfabric:60833:1713994966::core:core:fi_param_get_():372<info> variable mr_rocr_cache_monitor_enabled=<not set>
libfabric:60833:1713994966::core:core:fi_param_get_():372<info> variable mr_ze_cache_monitor_enabled=<not set>
libfabric:60833:1713994966::core:mr:ofi_default_cache_size():78<info> default cache size=1053684672
libfabric:60833:1713994966::core:core:fi_param_define_():251<debug> registered var provider
libfabric:60833:1713994966::core:core:fi_param_get_():372<info> variable provider=<not set>
libfabric:60833:1713994966::core:core:fi_param_define_():251<debug> registered var fork_unsafe
libfabric:60833:1713994966::core:core:fi_param_define_():251<debug> registered var universe_size
libfabric:60833:1713994966::core:core:fi_param_get_():372<info> variable universe_size=<not set>
libfabric:60833:1713994966::core:core:fi_param_define_():251<debug> registered var av_remove_cleanup
libfabric:60833:1713994966::core:core:fi_param_get_():372<info> variable av_remove_cleanup=<not set>
libfabric:60833:1713994966::core:core:fi_param_define_():251<debug> registered var offload_coll_provider
libfabric:60833:1713994966::core:core:fi_param_get_():372<info> variable offload_coll_provider=<not set>
libfabric:60833:1713994966::core:core:fi_param_define_():251<debug> registered var provider_path
libfabric:60833:1713994966::core:core:fi_param_get_():372<info> variable provider_path=<not set>
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libefa-fi.so
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libpsm2-fi.so): libpsm2-fi.so: cannot open shared object file: No such file or directory
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libopx-fi.so
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libefa-fi.so): libefa-fi.so: cannot open shared object file: No such file or directory
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libpsm2-fi.so
/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp 1691 GDRCopy requested, but unused by transport. Disabling.

libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libopx-fi.so): libopx-fi.so: cannot open shared object file: No such file or directory
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libpsm-fi.so
libfabric:60831:1713994966::core:core:fi_param_define_():251<debug> registered var perf_cntr
libfabric:60831:1713994966::core:core:fi_param_get_():372<info> variable perf_cntr=<not set>
libfabric:60831:1713994966::core:core:fi_param_define_():251<debug> registered var hook
libfabric:60831:1713994966::core:core:fi_param_get_():372<info> variable hook=<not set>
libfabric:60831:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:60831:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:60831:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ZE not supported
libfabric:60831:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:60831:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:60831:1713994966::core:core:fi_param_define_():251<debug> registered var hmem_disable_p2p
libfabric:60831:1713994966::core:core:fi_param_get_():372<info> variable hmem_disable_p2p=<not set>
libfabric:60831:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cache_max_size
libfabric:60831:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cache_max_count
libfabric:60831:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cache_monitor
libfabric:60831:1713994966::core:core:fi_param_define_():251<debug> registered var mr_cuda_cache_monitor_enabled
libfabric:60831:1713994966::core:core:fi_param_define_():251<debug> registered var mr_rocr_cache_monitor_enabled
libfabric:60831:1713994966::core:core:fi_param_define_():251<debug> registered var mr_ze_cache_monitor_enabled
libfabric:60831:1713994966::core:core:fi_param_get_():372<info> variable mr_cache_max_size=<not set>
libfabric:60831:1713994966::core:core:fi_param_get_():372<info> variable mr_cache_max_count=<not set>
libfabric:60831:1713994966::core:core:fi_param_get_():372<info> variable mr_cache_monitor=<not set>
libfabric:60831:1713994966::core:core:fi_param_get_():372<info> variable mr_cuda_cache_monitor_enabled=<not set>
libfabric:60831:1713994966::core:core:fi_param_get_():372<info> variable mr_rocr_cache_monitor_enabled=<not set>
libfabric:60831:1713994966::core:core:fi_param_get_():372<info> variable mr_ze_cache_monitor_enabled=<not set>
libfabric:60831:1713994966::core:mr:ofi_default_cache_size():78<info> default cache size=1053684672
libfabric:60831:1713994966::core:core:fi_param_define_():251<debug> registered var provider
libfabric:60831:1713994966::core:core:fi_param_get_():372<info> variable provider=<not set>
libfabric:60831:1713994966::core:core:fi_param_define_():251<debug> registered var fork_unsafe
libfabric:60831:1713994966::core:core:fi_param_define_():251<debug> registered var universe_size
libfabric:60831:1713994966::core:core:fi_param_get_():372<info> variable universe_size=<not set>
libfabric:60831:1713994966::core:core:fi_param_define_():251<debug> registered var av_remove_cleanup
libfabric:60831:1713994966::core:core:fi_param_get_():372<info> variable av_remove_cleanup=<not set>
libfabric:60831:1713994966::core:core:fi_param_define_():251<debug> registered var offload_coll_provider
libfabric:60831:1713994966::core:core:fi_param_get_():372<info> variable offload_coll_provider=<not set>
libfabric:60831:1713994966::core:core:fi_param_define_():251<debug> registered var provider_path
libfabric:60831:1713994966::core:core:fi_param_get_():372<info> variable provider_path=<not set>
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libefa-fi.so
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libpsm2-fi.so): libpsm2-fi.so: cannot open shared object file: No such file or directory
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libopx-fi.so
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libefa-fi.so): libefa-fi.so: cannot open shared object file: No such file or directory
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libpsm2-fi.so
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libpsm-fi.so): libpsm-fi.so: cannot open shared object file: No such file or directory
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libusnic-fi.so
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libopx-fi.so): libopx-fi.so: cannot open shared object file: No such file or directory
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libpsm-fi.so
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libpsm2-fi.so): libpsm2-fi.so: cannot open shared object file: No such file or directory
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libopx-fi.so
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libusnic-fi.so): libusnic-fi.so: cannot open shared object file: No such file or directory
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libgni-fi.so
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libefa-fi.so): libefa-fi.so: cannot open shared object file: No such file or directory
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libpsm2-fi.so
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libpsm-fi.so): libpsm-fi.so: cannot open shared object file: No such file or directory
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libusnic-fi.so
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libopx-fi.so): libopx-fi.so: cannot open shared object file: No such file or directory
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libpsm-fi.so
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libgni-fi.so): libgni-fi.so: cannot open shared object file: No such file or directory
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libbgq-fi.so
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libpsm2-fi.so): libpsm2-fi.so: cannot open shared object file: No such file or directory
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libopx-fi.so
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_hmem-fi.so): libhook_hmem-fi.so: cannot open shared object file: No such file or directory
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_dmabuf_peer_mem-fi.so
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libpsm-fi.so): libpsm-fi.so: cannot open shared object file: No such file or directory
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libusnic-fi.so
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libusnic-fi.so): libusnic-fi.so: cannot open shared object file: No such file or directory
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libgni-fi.so
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libbgq-fi.so): libbgq-fi.so: cannot open shared object file: No such file or directory
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libverbs-fi.so
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libopx-fi.so): libopx-fi.so: cannot open shared object file: No such file or directory
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libpsm-fi.so
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libusnic-fi.so): libusnic-fi.so: cannot open shared object file: No such file or directory
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libgni-fi.so
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libgni-fi.so): libgni-fi.so: cannot open shared object file: No such file or directory
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libbgq-fi.so
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libverbs-fi.so): libverbs-fi.so: cannot open shared object file: No such file or directory
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libnetdir-fi.so
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libpsm-fi.so): libpsm-fi.so: cannot open shared object file: No such file or directory
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libusnic-fi.so
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libgni-fi.so): libgni-fi.so: cannot open shared object file: No such file or directory
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libbgq-fi.so
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libbgq-fi.so): libbgq-fi.so: cannot open shared object file: No such file or directory
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libverbs-fi.so
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libnetdir-fi.so): libnetdir-fi.so: cannot open shared object file: No such file or directory
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libpsm3-fi.so
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libusnic-fi.so): libusnic-fi.so: cannot open shared object file: No such file or directory
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libgni-fi.so
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libbgq-fi.so): libbgq-fi.so: cannot open shared object file: No such file or directory
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libverbs-fi.so
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libverbs-fi.so): libverbs-fi.so: cannot open shared object file: No such file or directory
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libnetdir-fi.so
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libpsm3-fi.so): libpsm3-fi.so: cannot open shared object file: No such file or directory
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libucx-fi.so
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libgni-fi.so): libgni-fi.so: cannot open shared object file: No such file or directory
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libbgq-fi.so
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libverbs-fi.so): libverbs-fi.so: cannot open shared object file: No such file or directory
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libnetdir-fi.so
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libnetdir-fi.so): libnetdir-fi.so: cannot open shared object file: No such file or directory
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libpsm3-fi.so
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libucx-fi.so): libucx-fi.so: cannot open shared object file: No such file or directory
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib librxm-fi.so
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libbgq-fi.so): libbgq-fi.so: cannot open shared object file: No such file or directory
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libverbs-fi.so
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_dmabuf_peer_mem-fi.so): libhook_dmabuf_peer_mem-fi.so: cannot open shared object file: No such file or directory
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libcoll-fi.so
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libnetdir-fi.so): libnetdir-fi.so: cannot open shared object file: No such file or directory
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libpsm3-fi.so
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libpsm3-fi.so): libpsm3-fi.so: cannot open shared object file: No such file or directory
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libucx-fi.so
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(librxm-fi.so): librxm-fi.so: cannot open shared object file: No such file or directory
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib librxd-fi.so
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libverbs-fi.so): libverbs-fi.so: cannot open shared object file: No such file or directory
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libnetdir-fi.so
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libpsm3-fi.so): libpsm3-fi.so: cannot open shared object file: No such file or directory
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libucx-fi.so
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libucx-fi.so): libucx-fi.so: cannot open shared object file: No such file or directory
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib librxm-fi.so
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(librxd-fi.so): librxd-fi.so: cannot open shared object file: No such file or directory
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libshm-fi.so
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libnetdir-fi.so): libnetdir-fi.so: cannot open shared object file: No such file or directory
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libpsm3-fi.so
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libucx-fi.so): libucx-fi.so: cannot open shared object file: No such file or directory
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib librxm-fi.so
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(librxm-fi.so): librxm-fi.so: cannot open shared object file: No such file or directory
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib librxd-fi.so
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libshm-fi.so): libshm-fi.so: cannot open shared object file: No such file or directory
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libudp-fi.so
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libpsm3-fi.so): libpsm3-fi.so: cannot open shared object file: No such file or directory
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libucx-fi.so
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(librxm-fi.so): librxm-fi.so: cannot open shared object file: No such file or directory
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib librxd-fi.so
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(librxd-fi.so): librxd-fi.so: cannot open shared object file: No such file or directory
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libshm-fi.so
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libucx-fi.so): libucx-fi.so: cannot open shared object file: No such file or directory
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib librxm-fi.so
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libudp-fi.so): libudp-fi.so: cannot open shared object file: No such file or directory
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libtcp-fi.so
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(librxd-fi.so): librxd-fi.so: cannot open shared object file: No such file or directory
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libshm-fi.so
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libshm-fi.so): libshm-fi.so: cannot open shared object file: No such file or directory
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libudp-fi.so
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(librxm-fi.so): librxm-fi.so: cannot open shared object file: No such file or directory
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib librxd-fi.so
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libtcp-fi.so): libtcp-fi.so: cannot open shared object file: No such file or directory
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libsockets-fi.so
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libshm-fi.so): libshm-fi.so: cannot open shared object file: No such file or directory
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libudp-fi.so
libfabric:71716:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libcoll-fi.so): libcoll-fi.so: cannot open shared object file: No such file or directory
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libudp-fi.so): libudp-fi.so: cannot open shared object file: No such file or directory
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libtcp-fi.so
libfabric:71716:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71716:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71716:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71716:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71716:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71716:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71716:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(librxd-fi.so): librxd-fi.so: cannot open shared object file: No such file or directory
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libshm-fi.so
libfabric:71716:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71716:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71716:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71716:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71716:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71716:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71716:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71716:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:71716:1713994966::udp:core:fi_param_define_():251<debug> registered var iface
libfabric:71716:1713994966::core:core:ofi_register_provider():466<info> registering provider: udp (118.10)
libfabric:71716:1713994966::sockets:core:fi_param_define_():251<debug> registered var pe_waittime
libfabric:71716:1713994966::sockets:core:fi_param_define_():251<debug> registered var conn_timeout
libfabric:71716:1713994966::sockets:core:fi_param_define_():251<debug> registered var max_conn_retry
libfabric:71716:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_conn_map_sz
libfabric:71716:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_av_sz
libfabric:71716:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_cq_sz
libfabric:71716:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_eq_sz
libfabric:71716:1713994966::sockets:core:fi_param_define_():251<debug> registered var pe_affinity
libfabric:71716:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_enable
libfabric:71716:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_time
libfabric:71716:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_intvl
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libsockets-fi.so): libsockets-fi.so: cannot open shared object file: No such file or directory
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libnet-fi.so
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libudp-fi.so): libudp-fi.so: cannot open shared object file: No such file or directory
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libtcp-fi.so
libfabric:71716:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_probes
libfabric:71716:1713994966::sockets:core:fi_param_define_():251<debug> registered var iface
libfabric:71716:1713994966::sockets:core:fi_param_define_():251<debug> registered var max_buf_sz
libfabric:71716:1713994966::sockets:core:fi_param_define_():251<debug> registered var dgram_drop_rate
libfabric:71716:1713994966::core:core:ofi_register_provider():466<info> registering provider: sockets (118.10)
libfabric:71716:1713994966::tcp:core:fi_param_define_():251<debug> registered var prov_name
libfabric:71716:1713994966::tcp:core:fi_param_get_():372<info> variable prov_name=<not set>
libfabric:71716:1713994966::tcp:core:fi_param_define_():251<debug> registered var iface
libfabric:71716:1713994966::tcp:core:fi_param_define_():251<debug> registered var port_low_range
libfabric:71716:1713994966::tcp:core:fi_param_define_():251<debug> registered var port_high_range
libfabric:71716:1713994966::tcp:core:fi_param_get_():372<info> variable port_high_range=<not set>
libfabric:71716:1713994966::tcp:core:fi_param_get_():372<info> variable port_low_range=<not set>
libfabric:71716:1713994966::tcp:core:fi_param_define_():251<debug> registered var tx_size
libfabric:71716:1713994966::tcp:core:fi_param_define_():251<debug> registered var rx_size
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libtcp-fi.so): libtcp-fi.so: cannot open shared object file: No such file or directory
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libsockets-fi.so
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libshm-fi.so): libshm-fi.so: cannot open shared object file: No such file or directory
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libudp-fi.so
libfabric:71716:1713994966::tcp:core:fi_param_get_():372<info> variable tx_size=<not set>
libfabric:71716:1713994966::tcp:core:fi_param_get_():372<info> variable rx_size=<not set>
libfabric:71716:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_inject
libfabric:71716:1713994966::tcp:core:fi_param_get_():372<info> variable max_inject=<not set>
libfabric:71716:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_saved
libfabric:71716:1713994966::tcp:core:fi_param_get_():372<info> variable max_saved=<not set>
libfabric:71716:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_saved_size
libfabric:71716:1713994966::tcp:core:fi_param_get_():372<info> variable max_saved_size=<not set>
libfabric:71716:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_rx_size
libfabric:71716:1713994966::tcp:core:fi_param_get_():372<info> variable max_rx_size=<not set>
libfabric:71716:1713994966::tcp:core:fi_param_define_():251<debug> registered var nodelay
libfabric:71716:1713994966::tcp:core:fi_param_get_():372<info> variable nodelay=<not set>
libfabric:71716:1713994966::tcp:core:fi_param_define_():251<debug> registered var staging_sbuf_size
libfabric:71716:1713994966::tcp:core:fi_param_define_():251<debug> registered var prefetch_rbuf_size
libfabric:71716:1713994966::tcp:core:fi_param_define_():251<debug> registered var zerocopy_size
libfabric:71716:1713994966::tcp:core:fi_param_get_():372<info> variable staging_sbuf_size=<not set>
libfabric:71716:1713994966::tcp:core:fi_param_get_():372<info> variable prefetch_rbuf_size=<not set>
libfabric:71716:1713994966::tcp:core:fi_param_get_():372<info> variable zerocopy_size=<not set>
libfabric:71716:1713994966::tcp:core:fi_param_define_():251<debug> registered var trace_msg
libfabric:71716:1713994966::tcp:core:fi_param_get_():372<info> variable trace_msg=<not set>
libfabric:71716:1713994966::tcp:core:fi_param_define_():251<debug> registered var disable_auto_progress
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libnet-fi.so): libnet-fi.so: cannot open shared object file: No such file or directory
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_perf-fi.so
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libtcp-fi.so): libtcp-fi.so: cannot open shared object file: No such file or directory
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libsockets-fi.so
libfabric:71716:1713994966::tcp:core:fi_param_get_():372<info> variable disable_auto_progress=<not set>
libfabric:71716:1713994966::tcp:core:fi_param_define_():251<debug> registered var io_uring
libfabric:71716:1713994966::tcp:core:fi_param_get_():372<info> variable io_uring=<not set>
libfabric:71716:1713994966::core:core:ofi_register_provider():466<info> registering provider: tcp (118.10)
libfabric:71716:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_perf (118.10)
libfabric:71716:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_trace (118.10)
libfabric:71716:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_debug (118.10)
libfabric:71716:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:71716:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:71716:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ZE not supported
libfabric:71716:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:71716:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:71716:1713994966::core:core:fi_param_define_():251<debug> registered var hmem_disable_p2p
libfabric:71716:1713994966::core:core:fi_param_get_():372<info> variable hmem_disable_p2p=<not set>
libfabric:71716:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_hmem (118.10)
libfabric:71716:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_dmabuf_peer_mem (118.10)
libfabric:71716:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_noop (118.10)
libfabric:71716:1713994966::core:core:ofi_register_provider():466<info> registering provider: off_coll (118.10)
libfabric:71716:1713994966::udp:core:util_getinfo():157<debug> checking info
libfabric:71716:1713994966::udp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:71716:1713994966::udp:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:71716:1713994966::udp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:71716:1713994966::core:core:fi_getinfo_():1251<info> fi_getinfo: provider udp returned -61 (No data available)
libfabric:71716:1713994966::tcp:core:util_getinfo():157<debug> checking info
libfabric:71716:1713994966::tcp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:71716:1713994966::tcp:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:71716:1713994966::tcp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:71716:1713994966::tcp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:71716:1713994966::tcp:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:71716:1713994966::tcp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:71716:1713994966::tcp:core:ofi_check_info():1060<info> Unsupported capabilities
libfabric:71716:1713994966::tcp:core:ofi_check_info():1061<info> Supported: FI_MSG, FI_RMA, FI_TAGGED, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_RMA_PMEM, FI_LOCAL_COMM, FI_REMOTE_COMM, FI_RMA_EVENT, FI_SOURCE, FI_DIRECTED_RECV
libfabric:71716:1713994966::tcp:core:ofi_check_info():1061<info> Requested: FI_RMA, FI_ATOMIC, FI_FENCE, FI_HMEM
libfabric:71716:1713994966::core:core:fi_getinfo_():1251<info> fi_getinfo: provider tcp returned -61 (No data available)
libfabric:71716:1713994966::sockets:core:util_getinfo():157<debug> checking info
libfabric:71716:1713994966::sockets:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:71716:1713994966::sockets:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:71716:1713994966::sockets:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:71716:1713994966::sockets:core:ofi_check_info():1060<info> Unsupported capabilities
libfabric:71716:1713994966::sockets:core:ofi_check_info():1061<info> Supported: FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_FENCE, FI_RMA_PMEM, FI_LOCAL_COMM, FI_REMOTE_COMM, FI_SHARED_AV, FI_RMA_EVENT, FI_SOURCE, FI_NAMED_RX_CTX, FI_DIRECTED_RECV
libfabric:71716:1713994966::sockets:core:ofi_check_info():1061<info> Requested: FI_RMA, FI_ATOMIC, FI_FENCE, FI_HMEM
libfabric:71716:1713994966::sockets:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:71716:1713994966::sockets:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:71716:1713994966::sockets:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:71716:1713994966::core:core:fi_getinfo_():1251<info> fi_getinfo: provider sockets returned -61 (No data available)
/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:1524: non-zero status: -61 No providers matched fi_getinfo query: -61: No data available

libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libsockets-fi.so): libsockets-fi.so: cannot open shared object file: No such file or directory
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libnet-fi.so
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libudp-fi.so): libudp-fi.so: cannot open shared object file: No such file or directory
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libtcp-fi.so
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libsockets-fi.so): libsockets-fi.so: cannot open shared object file: No such file or directory
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libnet-fi.so
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_perf-fi.so): libhook_perf-fi.so: cannot open shared object file: No such file or directory
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_trace-fi.so
[nid001504:71716:0:71716] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace (tid:  71716) ====
 0  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ddcazrdaa6qbatqyqgcr34p5ejdc47as/lib/libucs.so.0(ucs_handle_error+0x294) [0x14d7ed3b7394]
 1  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ddcazrdaa6qbatqyqgcr34p5ejdc47as/lib/libucs.so.0(+0x30564) [0x14d7ed3b7564]
 2  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ddcazrdaa6qbatqyqgcr34p5ejdc47as/lib/libucs.so.0(+0x3082e) [0x14d7ed3b782e]
 3  /lib64/libpthread.so.0(+0x168c0) [0x14d7ef3de8c0]
 4  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(+0xb9fd) [0x14d7d1d3d9fd]
 5  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(nvshmemt_init+0x11d2) [0x14d7d1d3f312]
 6  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xe6c89) [0x14d7efc71c89]
 7  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xec01c) [0x14d7efc7701c]
 8  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xede99) [0x14d7efc78e99]
 9  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(nvshmemx_host_init+0xae) [0x14d7efc7930e]
10  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x52e635) [0x14d7f1fe8635]
11  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x11b4c3) [0x14d7f1bd54c3]
12  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanGuru64+0x120) [0x14d7f1c01450]
13  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanMany+0x135) [0x14d7f1c02105]
14  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany64+0x5d) [0x14d7f1bffd7d]
15  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany+0x10f) [0x14d7f1c001bf]
16  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlan3d+0x36) [0x14d7f1c00386]
17  cufftmp_r2c_c2r_slabs_GROMACS() [0x4067f9]
18  cufftmp_r2c_c2r_slabs_GROMACS() [0x407ba7]
19  /lib64/libc.so.6(__libc_start_main+0xef) [0x14d7ee47729d]
20  cufftmp_r2c_c2r_slabs_GROMACS() [0x40591a]
=================================
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libnet-fi.so): libnet-fi.so: cannot open shared object file: No such file or directory
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_perf-fi.so
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libtcp-fi.so): libtcp-fi.so: cannot open shared object file: No such file or directory
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libsockets-fi.so
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libnet-fi.so): libnet-fi.so: cannot open shared object file: No such file or directory
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_perf-fi.so
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_trace-fi.so): libhook_trace-fi.so: cannot open shared object file: No such file or directory
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_debug-fi.so
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_perf-fi.so): libhook_perf-fi.so: cannot open shared object file: No such file or directory
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_trace-fi.so
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libsockets-fi.so): libsockets-fi.so: cannot open shared object file: No such file or directory
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libnet-fi.so
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_perf-fi.so): libhook_perf-fi.so: cannot open shared object file: No such file or directory
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_trace-fi.so
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_debug-fi.so): libhook_debug-fi.so: cannot open shared object file: No such file or directory
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_noop-fi.so
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_trace-fi.so): libhook_trace-fi.so: cannot open shared object file: No such file or directory
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_debug-fi.so
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libnet-fi.so): libnet-fi.so: cannot open shared object file: No such file or directory
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_perf-fi.so
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_trace-fi.so): libhook_trace-fi.so: cannot open shared object file: No such file or directory
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_debug-fi.so
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_noop-fi.so): libhook_noop-fi.so: cannot open shared object file: No such file or directory
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_hmem-fi.so
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_debug-fi.so): libhook_debug-fi.so: cannot open shared object file: No such file or directory
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_noop-fi.so
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_perf-fi.so): libhook_perf-fi.so: cannot open shared object file: No such file or directory
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_trace-fi.so
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_debug-fi.so): libhook_debug-fi.so: cannot open shared object file: No such file or directory
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_noop-fi.so
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_hmem-fi.so): libhook_hmem-fi.so: cannot open shared object file: No such file or directory
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_dmabuf_peer_mem-fi.so
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_trace-fi.so): libhook_trace-fi.so: cannot open shared object file: No such file or directory
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_debug-fi.so
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_noop-fi.so): libhook_noop-fi.so: cannot open shared object file: No such file or directory
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_hmem-fi.so
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_noop-fi.so): libhook_noop-fi.so: cannot open shared object file: No such file or directory
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_hmem-fi.so
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_dmabuf_peer_mem-fi.so): libhook_dmabuf_peer_mem-fi.so: cannot open shared object file: No such file or directory
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libcoll-fi.so
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_debug-fi.so): libhook_debug-fi.so: cannot open shared object file: No such file or directory
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_noop-fi.so
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_hmem-fi.so): libhook_hmem-fi.so: cannot open shared object file: No such file or directory
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_dmabuf_peer_mem-fi.so
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_hmem-fi.so): libhook_hmem-fi.so: cannot open shared object file: No such file or directory
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_dmabuf_peer_mem-fi.so
libfabric:60834:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libcoll-fi.so): libcoll-fi.so: cannot open shared object file: No such file or directory
libfabric:60834:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60834:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60834:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60834:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60834:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60834:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60834:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60834:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60834:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60834:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60834:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60834:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60834:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60834:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60834:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60834:1713994966::udp:core:fi_param_define_():251<debug> registered var iface
libfabric:60834:1713994966::core:core:ofi_register_provider():466<info> registering provider: udp (118.10)
libfabric:60834:1713994966::sockets:core:fi_param_define_():251<debug> registered var pe_waittime
libfabric:60834:1713994966::sockets:core:fi_param_define_():251<debug> registered var conn_timeout
libfabric:60834:1713994966::sockets:core:fi_param_define_():251<debug> registered var max_conn_retry
libfabric:60834:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_conn_map_sz
libfabric:60834:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_av_sz
libfabric:60834:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_cq_sz
libfabric:60834:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_eq_sz
libfabric:60834:1713994966::sockets:core:fi_param_define_():251<debug> registered var pe_affinity
libfabric:60834:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_enable
libfabric:60834:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_time
libfabric:60834:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_intvl
libfabric:60834:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_probes
libfabric:60834:1713994966::sockets:core:fi_param_define_():251<debug> registered var iface
libfabric:60834:1713994966::sockets:core:fi_param_define_():251<debug> registered var max_buf_sz
libfabric:60834:1713994966::sockets:core:fi_param_define_():251<debug> registered var dgram_drop_rate
libfabric:60834:1713994966::core:core:ofi_register_provider():466<info> registering provider: sockets (118.10)
libfabric:60834:1713994966::tcp:core:fi_param_define_():251<debug> registered var prov_name
libfabric:60834:1713994966::tcp:core:fi_param_get_():372<info> variable prov_name=<not set>
libfabric:60834:1713994966::tcp:core:fi_param_define_():251<debug> registered var iface
libfabric:60834:1713994966::tcp:core:fi_param_define_():251<debug> registered var port_low_range
libfabric:60834:1713994966::tcp:core:fi_param_define_():251<debug> registered var port_high_range
libfabric:60834:1713994966::tcp:core:fi_param_get_():372<info> variable port_high_range=<not set>
libfabric:60834:1713994966::tcp:core:fi_param_get_():372<info> variable port_low_range=<not set>
libfabric:60834:1713994966::tcp:core:fi_param_define_():251<debug> registered var tx_size
libfabric:60834:1713994966::tcp:core:fi_param_define_():251<debug> registered var rx_size
libfabric:60834:1713994966::tcp:core:fi_param_get_():372<info> variable tx_size=<not set>
libfabric:60834:1713994966::tcp:core:fi_param_get_():372<info> variable rx_size=<not set>
libfabric:60834:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_inject
libfabric:60834:1713994966::tcp:core:fi_param_get_():372<info> variable max_inject=<not set>
libfabric:60834:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_saved
libfabric:60834:1713994966::tcp:core:fi_param_get_():372<info> variable max_saved=<not set>
libfabric:60834:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_saved_size
libfabric:60834:1713994966::tcp:core:fi_param_get_():372<info> variable max_saved_size=<not set>
libfabric:60834:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_rx_size
libfabric:60834:1713994966::tcp:core:fi_param_get_():372<info> variable max_rx_size=<not set>
libfabric:60834:1713994966::tcp:core:fi_param_define_():251<debug> registered var nodelay
libfabric:60834:1713994966::tcp:core:fi_param_get_():372<info> variable nodelay=<not set>
libfabric:60834:1713994966::tcp:core:fi_param_define_():251<debug> registered var staging_sbuf_size
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_noop-fi.so): libhook_noop-fi.so: cannot open shared object file: No such file or directory
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_hmem-fi.so
libfabric:60834:1713994966::tcp:core:fi_param_define_():251<debug> registered var prefetch_rbuf_size
libfabric:60834:1713994966::tcp:core:fi_param_define_():251<debug> registered var zerocopy_size
libfabric:60834:1713994966::tcp:core:fi_param_get_():372<info> variable staging_sbuf_size=<not set>
libfabric:60834:1713994966::tcp:core:fi_param_get_():372<info> variable prefetch_rbuf_size=<not set>
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_dmabuf_peer_mem-fi.so): libhook_dmabuf_peer_mem-fi.so: cannot open shared object file: No such file or directory
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libcoll-fi.so
libfabric:60834:1713994966::tcp:core:fi_param_get_():372<info> variable zerocopy_size=<not set>
libfabric:60834:1713994966::tcp:core:fi_param_define_():251<debug> registered var trace_msg
libfabric:60834:1713994966::tcp:core:fi_param_get_():372<info> variable trace_msg=<not set>
libfabric:60834:1713994966::tcp:core:fi_param_define_():251<debug> registered var disable_auto_progress
libfabric:60834:1713994966::tcp:core:fi_param_get_():372<info> variable disable_auto_progress=<not set>
libfabric:60834:1713994966::tcp:core:fi_param_define_():251<debug> registered var io_uring
libfabric:60834:1713994966::tcp:core:fi_param_get_():372<info> variable io_uring=<not set>
libfabric:60834:1713994966::core:core:ofi_register_provider():466<info> registering provider: tcp (118.10)
libfabric:60834:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_perf (118.10)
libfabric:60834:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_trace (118.10)
libfabric:60834:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_debug (118.10)
libfabric:60834:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:60834:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:60834:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ZE not supported
libfabric:60834:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:60834:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:60834:1713994966::core:core:fi_param_define_():251<debug> registered var hmem_disable_p2p
libfabric:60834:1713994966::core:core:fi_param_get_():372<info> variable hmem_disable_p2p=<not set>
libfabric:60834:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_hmem (118.10)
libfabric:60834:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_dmabuf_peer_mem (118.10)
libfabric:60834:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_noop (118.10)
libfabric:60834:1713994966::core:core:ofi_register_provider():466<info> registering provider: off_coll (118.10)
libfabric:60834:1713994966::udp:core:util_getinfo():157<debug> checking info
libfabric:60834:1713994966::udp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:60834:1713994966::udp:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:60834:1713994966::udp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_dmabuf_peer_mem-fi.so): libhook_dmabuf_peer_mem-fi.so: cannot open shared object file: No such file or directory
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libcoll-fi.so
libfabric:60834:1713994966::core:core:fi_getinfo_():1251<info> fi_getinfo: provider udp returned -61 (No data available)
libfabric:60834:1713994966::tcp:core:util_getinfo():157<debug> checking info
libfabric:60834:1713994966::tcp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:60834:1713994966::tcp:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:60834:1713994966::tcp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:60834:1713994966::tcp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:60834:1713994966::tcp:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:60834:1713994966::tcp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:60834:1713994966::tcp:core:ofi_check_info():1060<info> Unsupported capabilities
libfabric:60834:1713994966::tcp:core:ofi_check_info():1061<info> Supported: FI_MSG, FI_RMA, FI_TAGGED, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_RMA_PMEM, FI_LOCAL_COMM, FI_REMOTE_COMM, FI_RMA_EVENT, FI_SOURCE, FI_DIRECTED_RECV
libfabric:60834:1713994966::tcp:core:ofi_check_info():1061<info> Requested: FI_RMA, FI_ATOMIC, FI_FENCE, FI_HMEM
libfabric:60834:1713994966::core:core:fi_getinfo_():1251<info> fi_getinfo: provider tcp returned -61 (No data available)
libfabric:60834:1713994966::sockets:core:util_getinfo():157<debug> checking info
libfabric:60834:1713994966::sockets:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:60834:1713994966::sockets:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:60834:1713994966::sockets:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:60834:1713994966::sockets:core:ofi_check_info():1060<info> Unsupported capabilities
libfabric:60834:1713994966::sockets:core:ofi_check_info():1061<info> Supported: FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_FENCE, FI_RMA_PMEM, FI_LOCAL_COMM, FI_REMOTE_COMM, FI_SHARED_AV, FI_RMA_EVENT, FI_SOURCE, FI_NAMED_RX_CTX, FI_DIRECTED_RECV
libfabric:60834:1713994966::sockets:core:ofi_check_info():1061<info> Requested: FI_RMA, FI_ATOMIC, FI_FENCE, FI_HMEM
libfabric:60834:1713994966::sockets:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:60834:1713994966::sockets:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:60834:1713994966::sockets:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:60834:1713994966::core:core:fi_getinfo_():1251<info> fi_getinfo: provider sockets returned -61 (No data available)
/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:1524: non-zero status: -61 No providers matched fi_getinfo query: -61: No data available

[nid001500:60834:0:60834] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_hmem-fi.so): libhook_hmem-fi.so: cannot open shared object file: No such file or directory
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libhook_dmabuf_peer_mem-fi.so
libfabric:60832:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libcoll-fi.so): libcoll-fi.so: cannot open shared object file: No such file or directory
libfabric:60832:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60832:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60832:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60832:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60832:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60832:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60832:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60832:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60832:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60832:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60832:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60832:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60832:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60832:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60832:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60832:1713994966::udp:core:fi_param_define_():251<debug> registered var iface
libfabric:60832:1713994966::core:core:ofi_register_provider():466<info> registering provider: udp (118.10)
libfabric:60832:1713994966::sockets:core:fi_param_define_():251<debug> registered var pe_waittime
libfabric:60832:1713994966::sockets:core:fi_param_define_():251<debug> registered var conn_timeout
libfabric:60832:1713994966::sockets:core:fi_param_define_():251<debug> registered var max_conn_retry
libfabric:60832:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_conn_map_sz
libfabric:60832:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_av_sz
libfabric:60832:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_cq_sz
libfabric:60832:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_eq_sz
libfabric:60832:1713994966::sockets:core:fi_param_define_():251<debug> registered var pe_affinity
libfabric:60832:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_enable
libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libcoll-fi.so): libcoll-fi.so: cannot open shared object file: No such file or directory
libfabric:60832:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_time
libfabric:60832:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_intvl
libfabric:60832:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_probes
libfabric:60832:1713994966::sockets:core:fi_param_define_():251<debug> registered var iface
libfabric:60832:1713994966::sockets:core:fi_param_define_():251<debug> registered var max_buf_sz
libfabric:60832:1713994966::sockets:core:fi_param_define_():251<debug> registered var dgram_drop_rate
libfabric:60832:1713994966::core:core:ofi_register_provider():466<info> registering provider: sockets (118.10)
libfabric:60833:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60833:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60833:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60833:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60833:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60833:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60832:1713994966::tcp:core:fi_param_define_():251<debug> registered var prov_name
libfabric:60832:1713994966::tcp:core:fi_param_get_():372<info> variable prov_name=<not set>
libfabric:60832:1713994966::tcp:core:fi_param_define_():251<debug> registered var iface
libfabric:60833:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60833:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60833:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60833:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60833:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60833:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60833:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60833:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60832:1713994966::tcp:core:fi_param_define_():251<debug> registered var port_low_range
libfabric:60832:1713994966::tcp:core:fi_param_define_():251<debug> registered var port_high_range
libfabric:60832:1713994966::tcp:core:fi_param_get_():372<info> variable port_high_range=<not set>
libfabric:60832:1713994966::tcp:core:fi_param_get_():372<info> variable port_low_range=<not set>
libfabric:60832:1713994966::tcp:core:fi_param_define_():251<debug> registered var tx_size
libfabric:60833:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60833:1713994966::udp:core:fi_param_define_():251<debug> registered var iface
libfabric:60833:1713994966::core:core:ofi_register_provider():466<info> registering provider: udp (118.10)
==== backtrace (tid:  60834) ====
 0  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ddcazrdaa6qbatqyqgcr34p5ejdc47as/lib/libucs.so.0(ucs_handle_error+0x294) [0x14e1dfb1c394]
 1  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ddcazrdaa6qbatqyqgcr34p5ejdc47as/lib/libucs.so.0(+0x30564) [0x14e1dfb1c564]
 2  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ddcazrdaa6qbatqyqgcr34p5ejdc47as/lib/libucs.so.0(+0x3082e) [0x14e1dfb1c82e]
 3  /lib64/libpthread.so.0(+0x168c0) [0x14e1e1b438c0]
 4  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(+0xb9fd) [0x14e1dc0ec9fd]
 5  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(nvshmemt_init+0x11d2) [0x14e1dc0ee312]
 6  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xe6c89) [0x14e1e23d6c89]
libfabric:60832:1713994966::tcp:core:fi_param_define_():251<debug> registered var rx_size
libfabric:60832:1713994966::tcp:core:fi_param_get_():372<info> variable tx_size=<not set>
libfabric:60832:1713994966::tcp:core:fi_param_get_():372<info> variable rx_size=<not set>
libfabric:60832:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_inject
libfabric:60832:1713994966::tcp:core:fi_param_get_():372<info> variable max_inject=<not set>
libfabric:60832:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_saved
libfabric:60832:1713994966::tcp:core:fi_param_get_():372<info> variable max_saved=<not set>
libfabric:60833:1713994966::sockets:core:fi_param_define_():251<debug> registered var pe_waittime
libfabric:60833:1713994966::sockets:core:fi_param_define_():251<debug> registered var conn_timeout
libfabric:60833:1713994966::sockets:core:fi_param_define_():251<debug> registered var max_conn_retry
libfabric:60833:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_conn_map_sz
libfabric:60833:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_av_sz
libfabric:60833:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_cq_sz
 7  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xec01c) [0x14e1e23dc01c]
 8  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xede99) [0x14e1e23dde99]
 9  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(nvshmemx_host_init+0xae) [0x14e1e23de30e]
10  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x52e635) [0x14e1e474d635]
11  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x11b4c3) [0x14e1e433a4c3]
12  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanGuru64+0x120) [0x14e1e4366450]
13  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanMany+0x135) [0x14e1e4367105]
14  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany64+0x5d) [0x14e1e4364d7d]
15  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany+0x10f) [0x14e1e43651bf]
16  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlan3d+0x36) [0x14e1e4365386]
17  cufftmp_r2c_c2r_slabs_GROMACS() [0x4067f9]
18  cufftmp_r2c_c2r_slabs_GROMACS() [0x407ba7]
19  /lib64/libc.so.6(__libc_start_main+0xef) [0x14e1e0bdc29d]
20  cufftmp_r2c_c2r_slabs_GROMACS() [0x40591a]
libfabric:60832:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_saved_size
libfabric:60832:1713994966::tcp:core:fi_param_get_():372<info> variable max_saved_size=<not set>
libfabric:60832:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_rx_size
libfabric:60832:1713994966::tcp:core:fi_param_get_():372<info> variable max_rx_size=<not set>
libfabric:60832:1713994966::tcp:core:fi_param_define_():251<debug> registered var nodelay
libfabric:60832:1713994966::tcp:core:fi_param_get_():372<info> variable nodelay=<not set>
libfabric:60832:1713994966::tcp:core:fi_param_define_():251<debug> registered var staging_sbuf_size
libfabric:60833:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_eq_sz
libfabric:60833:1713994966::sockets:core:fi_param_define_():251<debug> registered var pe_affinity
libfabric:60833:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_enable
libfabric:60833:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_time
libfabric:60833:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_intvl
libfabric:60833:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_probes
libfabric:60833:1713994966::sockets:core:fi_param_define_():251<debug> registered var iface
libfabric:60833:1713994966::sockets:core:fi_param_define_():251<debug> registered var max_buf_sz
=================================
libfabric:60832:1713994966::tcp:core:fi_param_define_():251<debug> registered var prefetch_rbuf_size
libfabric:60832:1713994966::tcp:core:fi_param_define_():251<debug> registered var zerocopy_size
libfabric:60832:1713994966::tcp:core:fi_param_get_():372<info> variable staging_sbuf_size=<not set>
libfabric:60832:1713994966::tcp:core:fi_param_get_():372<info> variable prefetch_rbuf_size=<not set>
libfabric:60832:1713994966::tcp:core:fi_param_get_():372<info> variable zerocopy_size=<not set>
libfabric:60832:1713994966::tcp:core:fi_param_define_():251<debug> registered var trace_msg
libfabric:60832:1713994966::tcp:core:fi_param_get_():372<info> variable trace_msg=<not set>
libfabric:60832:1713994966::tcp:core:fi_param_define_():251<debug> registered var disable_auto_progress
libfabric:60833:1713994966::sockets:core:fi_param_define_():251<debug> registered var dgram_drop_rate
libfabric:60833:1713994966::core:core:ofi_register_provider():466<info> registering provider: sockets (118.10)
libfabric:60833:1713994966::tcp:core:fi_param_define_():251<debug> registered var prov_name
libfabric:60833:1713994966::tcp:core:fi_param_get_():372<info> variable prov_name=<not set>
libfabric:60833:1713994966::tcp:core:fi_param_define_():251<debug> registered var iface
libfabric:60832:1713994966::tcp:core:fi_param_get_():372<info> variable disable_auto_progress=<not set>
libfabric:60832:1713994966::tcp:core:fi_param_define_():251<debug> registered var io_uring
libfabric:60832:1713994966::tcp:core:fi_param_get_():372<info> variable io_uring=<not set>
libfabric:60832:1713994966::core:core:ofi_register_provider():466<info> registering provider: tcp (118.10)
libfabric:60833:1713994966::tcp:core:fi_param_define_():251<debug> registered var port_low_range
libfabric:60833:1713994966::tcp:core:fi_param_define_():251<debug> registered var port_high_range
libfabric:60833:1713994966::tcp:core:fi_param_get_():372<info> variable port_high_range=<not set>
libfabric:60833:1713994966::tcp:core:fi_param_get_():372<info> variable port_low_range=<not set>
libfabric:60833:1713994966::tcp:core:fi_param_define_():251<debug> registered var tx_size
libfabric:60832:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_perf (118.10)
libfabric:60832:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_trace (118.10)
libfabric:60832:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_debug (118.10)
libfabric:60833:1713994966::tcp:core:fi_param_define_():251<debug> registered var rx_size
libfabric:60833:1713994966::tcp:core:fi_param_get_():372<info> variable tx_size=<not set>
libfabric:60833:1713994966::tcp:core:fi_param_get_():372<info> variable rx_size=<not set>
libfabric:60833:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_inject
libfabric:60833:1713994966::tcp:core:fi_param_get_():372<info> variable max_inject=<not set>
libfabric:60832:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:60832:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:60832:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ZE not supported
libfabric:60832:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:60832:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:60832:1713994966::core:core:fi_param_define_():251<debug> registered var hmem_disable_p2p
libfabric:60833:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_saved
libfabric:60833:1713994966::tcp:core:fi_param_get_():372<info> variable max_saved=<not set>
libfabric:60833:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_saved_size
libfabric:60833:1713994966::tcp:core:fi_param_get_():372<info> variable max_saved_size=<not set>
libfabric:60832:1713994966::core:core:fi_param_get_():372<info> variable hmem_disable_p2p=<not set>
libfabric:60832:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_hmem (118.10)
libfabric:60832:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_dmabuf_peer_mem (118.10)
libfabric:60833:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_rx_size
libfabric:60833:1713994966::tcp:core:fi_param_get_():372<info> variable max_rx_size=<not set>
libfabric:60833:1713994966::tcp:core:fi_param_define_():251<debug> registered var nodelay
libfabric:60833:1713994966::tcp:core:fi_param_get_():372<info> variable nodelay=<not set>
libfabric:60833:1713994966::tcp:core:fi_param_define_():251<debug> registered var staging_sbuf_size
libfabric:60833:1713994966::tcp:core:fi_param_define_():251<debug> registered var prefetch_rbuf_size
libfabric:60832:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_noop (118.10)
libfabric:60832:1713994966::core:core:ofi_register_provider():466<info> registering provider: off_coll (118.10)
libfabric:60833:1713994966::tcp:core:fi_param_define_():251<debug> registered var zerocopy_size
libfabric:60833:1713994966::tcp:core:fi_param_get_():372<info> variable staging_sbuf_size=<not set>
libfabric:60833:1713994966::tcp:core:fi_param_get_():372<info> variable prefetch_rbuf_size=<not set>
libfabric:60833:1713994966::tcp:core:fi_param_get_():372<info> variable zerocopy_size=<not set>
libfabric:60833:1713994966::tcp:core:fi_param_define_():251<debug> registered var trace_msg
libfabric:60832:1713994966::udp:core:util_getinfo():157<debug> checking info
libfabric:60833:1713994966::tcp:core:fi_param_get_():372<info> variable trace_msg=<not set>
libfabric:60833:1713994966::tcp:core:fi_param_define_():251<debug> registered var disable_auto_progress
libfabric:60833:1713994966::tcp:core:fi_param_get_():372<info> variable disable_auto_progress=<not set>
libfabric:60833:1713994966::tcp:core:fi_param_define_():251<debug> registered var io_uring
libfabric:60832:1713994966::udp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:60832:1713994966::udp:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:60832:1713994966::udp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libhook_dmabuf_peer_mem-fi.so): libhook_dmabuf_peer_mem-fi.so: cannot open shared object file: No such file or directory
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():606<debug> opening provider lib libcoll-fi.so
libfabric:60833:1713994966::tcp:core:fi_param_get_():372<info> variable io_uring=<not set>
libfabric:60833:1713994966::core:core:ofi_register_provider():466<info> registering provider: tcp (118.10)
libfabric:60833:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_perf (118.10)
libfabric:60833:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_trace (118.10)
libfabric:60833:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_debug (118.10)
libfabric:60832:1713994966::core:core:fi_getinfo_():1251<info> fi_getinfo: provider udp returned -61 (No data available)
libfabric:60832:1713994966::tcp:core:util_getinfo():157<debug> checking info
libfabric:60832:1713994966::tcp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:60832:1713994966::tcp:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:60832:1713994966::tcp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:60832:1713994966::tcp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:60832:1713994966::tcp:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:60832:1713994966::tcp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:60832:1713994966::tcp:core:ofi_check_info():1060<info> Unsupported capabilities
libfabric:60832:1713994966::tcp:core:ofi_check_info():1061<info> Supported: FI_MSG, FI_RMA, FI_TAGGED, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_RMA_PMEM, FI_LOCAL_COMM, FI_REMOTE_COMM, FI_RMA_EVENT, FI_SOURCE, FI_DIRECTED_RECV
libfabric:60833:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:60833:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:60833:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ZE not supported
libfabric:60833:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:60833:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:60833:1713994966::core:core:fi_param_define_():251<debug> registered var hmem_disable_p2p
libfabric:60833:1713994966::core:core:fi_param_get_():372<info> variable hmem_disable_p2p=<not set>
libfabric:60832:1713994966::tcp:core:ofi_check_info():1061<info> Requested: FI_RMA, FI_ATOMIC, FI_FENCE, FI_HMEM
libfabric:60832:1713994966::core:core:fi_getinfo_():1251<info> fi_getinfo: provider tcp returned -61 (No data available)
libfabric:60832:1713994966::sockets:core:util_getinfo():157<debug> checking info
libfabric:60832:1713994966::sockets:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:60832:1713994966::sockets:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:60832:1713994966::sockets:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:60833:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_hmem (118.10)
libfabric:60833:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_dmabuf_peer_mem (118.10)
libfabric:60833:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_noop (118.10)
libfabric:60833:1713994966::core:core:ofi_register_provider():466<info> registering provider: off_coll (118.10)
libfabric:60832:1713994966::sockets:core:ofi_check_info():1060<info> Unsupported capabilities
libfabric:60832:1713994966::sockets:core:ofi_check_info():1061<info> Supported: FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_FENCE, FI_RMA_PMEM, FI_LOCAL_COMM, FI_REMOTE_COMM, FI_SHARED_AV, FI_RMA_EVENT, FI_SOURCE, FI_NAMED_RX_CTX, FI_DIRECTED_RECV
libfabric:60832:1713994966::sockets:core:ofi_check_info():1061<info> Requested: FI_RMA, FI_ATOMIC, FI_FENCE, FI_HMEM
libfabric:60832:1713994966::sockets:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:60832:1713994966::sockets:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:60832:1713994966::sockets:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:60833:1713994966::udp:core:util_getinfo():157<debug> checking info
libfabric:60832:1713994966::core:core:fi_getinfo_():1251<info> fi_getinfo: provider sockets returned -61 (No data available)
/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:1524: non-zero status: -61 No providers matched fi_getinfo query: -61: No data available

libfabric:60833:1713994966::udp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:60833:1713994966::udp:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:60833:1713994966::udp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:60833:1713994966::core:core:fi_getinfo_():1251<info> fi_getinfo: provider udp returned -61 (No data available)
libfabric:60833:1713994966::tcp:core:util_getinfo():157<debug> checking info
libfabric:60833:1713994966::tcp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:60833:1713994966::tcp:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:60833:1713994966::tcp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:60833:1713994966::tcp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:60833:1713994966::tcp:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:60833:1713994966::tcp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:60833:1713994966::tcp:core:ofi_check_info():1060<info> Unsupported capabilities
libfabric:60833:1713994966::tcp:core:ofi_check_info():1061<info> Supported: FI_MSG, FI_RMA, FI_TAGGED, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_RMA_PMEM, FI_LOCAL_COMM, FI_REMOTE_COMM, FI_RMA_EVENT, FI_SOURCE, FI_DIRECTED_RECV
[nid001500:60832:0:60832] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
libfabric:60833:1713994966::tcp:core:ofi_check_info():1061<info> Requested: FI_RMA, FI_ATOMIC, FI_FENCE, FI_HMEM
libfabric:60833:1713994966::core:core:fi_getinfo_():1251<info> fi_getinfo: provider tcp returned -61 (No data available)
libfabric:60833:1713994966::sockets:core:util_getinfo():157<debug> checking info
libfabric:60833:1713994966::sockets:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:60833:1713994966::sockets:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:60833:1713994966::sockets:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:60833:1713994966::sockets:core:ofi_check_info():1060<info> Unsupported capabilities
libfabric:60833:1713994966::sockets:core:ofi_check_info():1061<info> Supported: FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_FENCE, FI_RMA_PMEM, FI_LOCAL_COMM, FI_REMOTE_COMM, FI_SHARED_AV, FI_RMA_EVENT, FI_SOURCE, FI_NAMED_RX_CTX, FI_DIRECTED_RECV
libfabric:60833:1713994966::sockets:core:ofi_check_info():1061<info> Requested: FI_RMA, FI_ATOMIC, FI_FENCE, FI_HMEM
libfabric:60833:1713994966::sockets:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:60833:1713994966::sockets:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:60833:1713994966::sockets:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:60833:1713994966::core:core:fi_getinfo_():1251<info> fi_getinfo: provider sockets returned -61 (No data available)
/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:1524: non-zero status: -61 No providers matched fi_getinfo query: -61: No data available

[nid001500:60833:0:60833] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace (tid:  60832) ====
 0  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ddcazrdaa6qbatqyqgcr34p5ejdc47as/lib/libucs.so.0(ucs_handle_error+0x294) [0x1518d0a6b394]
 1  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ddcazrdaa6qbatqyqgcr34p5ejdc47as/lib/libucs.so.0(+0x30564) [0x1518d0a6b564]
 2  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ddcazrdaa6qbatqyqgcr34p5ejdc47as/lib/libucs.so.0(+0x3082e) [0x1518d0a6b82e]
 3  /lib64/libpthread.so.0(+0x168c0) [0x1518d2a928c0]
 4  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(+0xb9fd) [0x1518cc03b9fd]
 5  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(nvshmemt_init+0x11d2) [0x1518cc03d312]
 6  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xe6c89) [0x1518d3325c89]
 7  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xec01c) [0x1518d332b01c]
 8  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xede99) [0x1518d332ce99]
 9  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(nvshmemx_host_init+0xae) [0x1518d332d30e]
10  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x52e635) [0x1518d569c635]
11  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x11b4c3) [0x1518d52894c3]
12  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanGuru64+0x120) [0x1518d52b5450]
13  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanMany+0x135) [0x1518d52b6105]
14  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany64+0x5d) [0x1518d52b3d7d]
15  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany+0x10f) [0x1518d52b41bf]
16  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlan3d+0x36) [0x1518d52b4386]
17  cufftmp_r2c_c2r_slabs_GROMACS() [0x4067f9]
==== backtrace (tid:  60833) ====
 0  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ddcazrdaa6qbatqyqgcr34p5ejdc47as/lib/libucs.so.0(ucs_handle_error+0x294) [0x14eac848e394]
 1  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ddcazrdaa6qbatqyqgcr34p5ejdc47as/lib/libucs.so.0(+0x30564) [0x14eac848e564]
 2  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ddcazrdaa6qbatqyqgcr34p5ejdc47as/lib/libucs.so.0(+0x3082e) [0x14eac848e82e]
18  cufftmp_r2c_c2r_slabs_GROMACS() [0x407ba7]
19  /lib64/libc.so.6(__libc_start_main+0xef) [0x1518d1b2b29d]
20  cufftmp_r2c_c2r_slabs_GROMACS() [0x40591a]
=================================
 3  /lib64/libpthread.so.0(+0x168c0) [0x14eaca4b58c0]
 4  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(+0xb9fd) [0x14eab031a9fd]
 5  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(nvshmemt_init+0x11d2) [0x14eab031c312]
 6  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xe6c89) [0x14eacad48c89]
 7  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xec01c) [0x14eacad4e01c]
 8  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xede99) [0x14eacad4fe99]
 9  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(nvshmemx_host_init+0xae) [0x14eacad5030e]
10  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x52e635) [0x14eacd0bf635]
11  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x11b4c3) [0x14eacccac4c3]
12  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanGuru64+0x120) [0x14eacccd8450]
13  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanMany+0x135) [0x14eacccd9105]
14  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany64+0x5d) [0x14eacccd6d7d]
15  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany+0x10f) [0x14eacccd71bf]
16  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlan3d+0x36) [0x14eacccd7386]
17  cufftmp_r2c_c2r_slabs_GROMACS() [0x4067f9]
18  cufftmp_r2c_c2r_slabs_GROMACS() [0x407ba7]
19  /lib64/libc.so.6(__libc_start_main+0xef) [0x14eac954e29d]
20  cufftmp_r2c_c2r_slabs_GROMACS() [0x40591a]
=================================
libfabric:60831:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libcoll-fi.so): libcoll-fi.so: cannot open shared object file: No such file or directory
libfabric:60831:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60831:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60831:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60831:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60831:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60831:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60831:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60831:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60831:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60831:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60831:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60831:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60831:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60831:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60831:1713994966::core:core:ofi_register_provider():461<debug> no provider structure or name
libfabric:60831:1713994966::udp:core:fi_param_define_():251<debug> registered var iface
libfabric:60831:1713994966::core:core:ofi_register_provider():466<info> registering provider: udp (118.10)
libfabric:60831:1713994966::sockets:core:fi_param_define_():251<debug> registered var pe_waittime
libfabric:60831:1713994966::sockets:core:fi_param_define_():251<debug> registered var conn_timeout
libfabric:60831:1713994966::sockets:core:fi_param_define_():251<debug> registered var max_conn_retry
libfabric:60831:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_conn_map_sz
libfabric:60831:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_av_sz
libfabric:60831:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_cq_sz
libfabric:60831:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_eq_sz
libfabric:60831:1713994966::sockets:core:fi_param_define_():251<debug> registered var pe_affinity
libfabric:60831:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_enable
libfabric:60831:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_time
libfabric:60831:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_intvl
libfabric:60831:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_probes
libfabric:60831:1713994966::sockets:core:fi_param_define_():251<debug> registered var iface
libfabric:60831:1713994966::sockets:core:fi_param_define_():251<debug> registered var max_buf_sz
libfabric:60831:1713994966::sockets:core:fi_param_define_():251<debug> registered var dgram_drop_rate
libfabric:60831:1713994966::core:core:ofi_register_provider():466<info> registering provider: sockets (118.10)
libfabric:60831:1713994966::tcp:core:fi_param_define_():251<debug> registered var prov_name
libfabric:60831:1713994966::tcp:core:fi_param_get_():372<info> variable prov_name=<not set>
libfabric:60831:1713994966::tcp:core:fi_param_define_():251<debug> registered var iface
libfabric:60831:1713994966::tcp:core:fi_param_define_():251<debug> registered var port_low_range
libfabric:60831:1713994966::tcp:core:fi_param_define_():251<debug> registered var port_high_range
libfabric:60831:1713994966::tcp:core:fi_param_get_():372<info> variable port_high_range=<not set>
libfabric:60831:1713994966::tcp:core:fi_param_get_():372<info> variable port_low_range=<not set>
libfabric:60831:1713994966::tcp:core:fi_param_define_():251<debug> registered var tx_size
libfabric:60831:1713994966::tcp:core:fi_param_define_():251<debug> registered var rx_size
libfabric:60831:1713994966::tcp:core:fi_param_get_():372<info> variable tx_size=<not set>
libfabric:60831:1713994966::tcp:core:fi_param_get_():372<info> variable rx_size=<not set>
libfabric:60831:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_inject
libfabric:60831:1713994966::tcp:core:fi_param_get_():372<info> variable max_inject=<not set>
libfabric:60831:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_saved
libfabric:60831:1713994966::tcp:core:fi_param_get_():372<info> variable max_saved=<not set>
libfabric:60831:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_saved_size
libfabric:60831:1713994966::tcp:core:fi_param_get_():372<info> variable max_saved_size=<not set>
libfabric:60831:1713994966::tcp:core:fi_param_define_():251<debug> registered var max_rx_size
libfabric:60831:1713994966::tcp:core:fi_param_get_():372<info> variable max_rx_size=<not set>
libfabric:60831:1713994966::tcp:core:fi_param_define_():251<debug> registered var nodelay
libfabric:60831:1713994966::tcp:core:fi_param_get_():372<info> variable nodelay=<not set>
libfabric:60831:1713994966::tcp:core:fi_param_define_():251<debug> registered var staging_sbuf_size
libfabric:60831:1713994966::tcp:core:fi_param_define_():251<debug> registered var prefetch_rbuf_size
libfabric:60831:1713994966::tcp:core:fi_param_define_():251<debug> registered var zerocopy_size
libfabric:60831:1713994966::tcp:core:fi_param_get_():372<info> variable staging_sbuf_size=<not set>
libfabric:60831:1713994966::tcp:core:fi_param_get_():372<info> variable prefetch_rbuf_size=<not set>
libfabric:60831:1713994966::tcp:core:fi_param_get_():372<info> variable zerocopy_size=<not set>
libfabric:60831:1713994966::tcp:core:fi_param_define_():251<debug> registered var trace_msg
libfabric:60831:1713994966::tcp:core:fi_param_get_():372<info> variable trace_msg=<not set>
libfabric:60831:1713994966::tcp:core:fi_param_define_():251<debug> registered var disable_auto_progress
libfabric:60831:1713994966::tcp:core:fi_param_get_():372<info> variable disable_auto_progress=<not set>
libfabric:60831:1713994966::tcp:core:fi_param_define_():251<debug> registered var io_uring
libfabric:60831:1713994966::tcp:core:fi_param_get_():372<info> variable io_uring=<not set>
libfabric:60831:1713994966::core:core:ofi_register_provider():466<info> registering provider: tcp (118.10)
libfabric:60831:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_perf (118.10)
libfabric:60831:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_trace (118.10)
libfabric:60831:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_debug (118.10)
libfabric:60831:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:60831:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:60831:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_ZE not supported
libfabric:60831:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:60831:1713994966::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:60831:1713994966::core:core:fi_param_define_():251<debug> registered var hmem_disable_p2p
libfabric:60831:1713994966::core:core:fi_param_get_():372<info> variable hmem_disable_p2p=<not set>
libfabric:60831:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_hmem (118.10)
libfabric:60831:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_dmabuf_peer_mem (118.10)
libfabric:60831:1713994966::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_noop (118.10)
libfabric:60831:1713994966::core:core:ofi_register_provider():466<info> registering provider: off_coll (118.10)
libfabric:60831:1713994966::udp:core:util_getinfo():157<debug> checking info
libfabric:60831:1713994966::udp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:60831:1713994966::udp:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:60831:1713994966::udp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:60831:1713994966::core:core:fi_getinfo_():1251<info> fi_getinfo: provider udp returned -61 (No data available)
libfabric:60831:1713994966::tcp:core:util_getinfo():157<debug> checking info
libfabric:60831:1713994966::tcp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:60831:1713994966::tcp:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:60831:1713994966::tcp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:60831:1713994966::tcp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:60831:1713994966::tcp:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:60831:1713994966::tcp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:60831:1713994966::tcp:core:ofi_check_info():1060<info> Unsupported capabilities
libfabric:60831:1713994966::tcp:core:ofi_check_info():1061<info> Supported: FI_MSG, FI_RMA, FI_TAGGED, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_RMA_PMEM, FI_LOCAL_COMM, FI_REMOTE_COMM, FI_RMA_EVENT, FI_SOURCE, FI_DIRECTED_RECV
libfabric:60831:1713994966::tcp:core:ofi_check_info():1061<info> Requested: FI_RMA, FI_ATOMIC, FI_FENCE, FI_HMEM
libfabric:60831:1713994966::core:core:fi_getinfo_():1251<info> fi_getinfo: provider tcp returned -61 (No data available)
libfabric:60831:1713994966::sockets:core:util_getinfo():157<debug> checking info
libfabric:60831:1713994966::sockets:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:60831:1713994966::sockets:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:60831:1713994966::sockets:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:60831:1713994966::sockets:core:ofi_check_info():1060<info> Unsupported capabilities
libfabric:60831:1713994966::sockets:core:ofi_check_info():1061<info> Supported: FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_FENCE, FI_RMA_PMEM, FI_LOCAL_COMM, FI_REMOTE_COMM, FI_SHARED_AV, FI_RMA_EVENT, FI_SOURCE, FI_NAMED_RX_CTX, FI_DIRECTED_RECV
libfabric:60831:1713994966::sockets:core:ofi_check_info():1061<info> Requested: FI_RMA, FI_ATOMIC, FI_FENCE, FI_HMEM
libfabric:60831:1713994966::sockets:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:60831:1713994966::sockets:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:60831:1713994966::sockets:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:60831:1713994966::core:core:fi_getinfo_():1251<info> fi_getinfo: provider sockets returned -61 (No data available)
/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:1524: non-zero status: -61 No providers matched fi_getinfo query: -61: No data available

[nid001500:60831:0:60831] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace (tid:  60831) ====
 0  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ddcazrdaa6qbatqyqgcr34p5ejdc47as/lib/libucs.so.0(ucs_handle_error+0x294) [0x14e151015394]
 1  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ddcazrdaa6qbatqyqgcr34p5ejdc47as/lib/libucs.so.0(+0x30564) [0x14e151015564]
 2  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ddcazrdaa6qbatqyqgcr34p5ejdc47as/lib/libucs.so.0(+0x3082e) [0x14e15101582e]
 3  /lib64/libpthread.so.0(+0x168c0) [0x14e15303c8c0]
 4  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(+0xb9fd) [0x14e12c1b99fd]
 5  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/nvshmem_transport_libfabric.so.1(nvshmemt_init+0x11d2) [0x14e12c1bb312]
 6  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xe6c89) [0x14e1538cfc89]
 7  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xec01c) [0x14e1538d501c]
 8  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(+0xede99) [0x14e1538d6e99]
 9  /lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib/libnvshmem_host.so.2(nvshmemx_host_init+0xae) [0x14e1538d730e]
10  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x52e635) [0x14e155c46635]
11  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(+0x11b4c3) [0x14e1558334c3]
12  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanGuru64+0x120) [0x14e15585f450]
13  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftXtMakePlanMany+0x135) [0x14e155860105]
14  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany64+0x5d) [0x14e15585dd7d]
15  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlanMany+0x10f) [0x14e15585e1bf]
16  /lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib/libcufftMp.so.11(cufftMakePlan3d+0x36) [0x14e15585e386]
17  cufftmp_r2c_c2r_slabs_GROMACS() [0x4067f9]
18  cufftmp_r2c_c2r_slabs_GROMACS() [0x407ba7]
19  /lib64/libc.so.6(__libc_start_main+0xef) [0x14e1520d529d]
20  cufftmp_r2c_c2r_slabs_GROMACS() [0x40591a]
=================================
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------
make: *** [Makefile:18: run] Error 139

I'm not sure I see anything more helpful in the backtraces. The added debug log info may help, though.

So far, it looks like NVSHMEM tries to call fi_getinfo(), and this function fails to find any providers, but somehow the control flow continues to the point of a segfault.

From the extra fabric debug info, if I do grep -E "cannot open shared object file" out.txt | wc -l I see 200 such "missing" libs. For example, I don't see this shared lib in a standard ucx install:

libfabric:60833:1713994966::core:core:ofi_reg_dl_prov():610<debug> dlopen(libucx-fi.so): libucx-fi.so: cannot open shared object file: No such file or directory

Maybe the 200 missing *-fi.so shared libs are a hint? I'll see what I can do about experimenting with more explicit fabric options in my libfabric builds with spack, though I don't think I've needed that previously.

Edit: spack install libfabric@1.18.1+debug fabrics=mlx,sockets,tcp,udp,ucx ^ucx+debug and using that version via LD_LIBRARY_PATH seems to be of no use either.

@j-xiong
Contributor

j-xiong commented Apr 25, 2024

@tylerjereddy Those warnings about missing *-fi.so are fine. Those files only exist for providers built as dl libraries, which most providers are not by default.

Based on the log, the libfabric installation doesn't have the ucx provider at all. The top part of the stack trace indicates that the ucx error occurred in another thread (probably spawned by another part of NVSHMEM or OpenMPI).

 0  libucs.so.0(ucs_handle_error+0x294) [0x14e151015394]
 1  libucs.so.0(+0x30564) [0x14e151015564]
 2  libucs.so.0(+0x3082e) [0x14e15101582e]
 3  /lib64/libpthread.so.0(+0x168c0) [0x14e15303c8c0]
 4  nvshmem_transport_libfabric.so.1(+0xb9fd) [0x14e12c1b99fd]

@tylerjereddy
Contributor Author

@j-xiong interesting...

If I do fi_info -l I see the providers I built libfabric with listed near the top:

ucx:
    version: 118.10
udp:
    version: 118.10
tcp:
    version: 118.10
sockets:
    version: 118.10
<snip>

However, fi_info -p ucx returns the same -61 that NVSHMEM reports in its C-level query, while fi_info -p for udp and sockets both succeed, dumping a bunch of output.

So, what's the conclusion here? The -l command suggests consistency with how I built libfabric (fabrics=mlx,sockets,tcp,ucx,udp), but the -p probes suggest that there's a runtime issue picking up ucx only? Is the sane debugging approach to then just try a bunch of different builds of ucx and load them in via LD_LIBRARY_PATH?

I'll take a shot at turning on even more ucx options at compile time. I see they have a detailed backtrace option in addition to debug, and some threading options I suppose I could try turning on.

@tylerjereddy
Contributor Author

tylerjereddy commented Apr 25, 2024

Turning on a ton more ucx compile options does change the backtrace as shown below. Actually, that's a bit strange: I don't even see ucx on the backtrace proper anymore (fi_info -p ucx still fails to find the provider, though).

spack install ucx+cuda+gdrcopy+debug+backtrace_detail+assertions+dm+logging+parameter_checking+thread_multiple %gcc@12.2.0

/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:1524: non-zero status: -61 No providers matched fi_getinfo query: -61: No data available

[nid001249:102664:0:102664] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
[nid001249:102665:0:102665] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
[nid001249:102666:0:102666] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace (tid: 102663) ====
 0 0x00000000000168c0 __funlockfile()  ???:0
 1 0x000000000000b9fd nvshmemt_libfabric_finalize()  libfabric.cpp:0
 2 0x000000000000d312 nvshmemt_init()  ???:0
 3 0x00000000000e6c89 nvshmemi_transport_init()  :0  
 4 0x00000000000ec01c nvshmemi_common_init()  :0  
 5 0x00000000000ede99 nvshmemi_try_common_init()  :0  
 6 0x00000000000ee30e nvshmemx_host_init()  ???:0
 7 0x000000000052e635 cufftMpDestroyReshape()  ???:0
 8 0x000000000011b4c3 cufftXtSetCallbackSharedSize()  ???:0
 9 0x0000000000147450 cufftXtMakePlanGuru64()  ???:0
10 0x0000000000148105 cufftXtMakePlanMany()  ???:0
11 0x0000000000145d7d cufftMakePlanMany64()  ???:0
12 0x00000000001461bf cufftMakePlanMany()  ???:0
13 0x0000000000146386 cufftMakePlan3d()  ???:0
14 0x0000000000406922 run_r2c_c2r_slabs()  ???:0
15 0x0000000000407fbb main()  ???:0
16 0x000000000003529d __libc_start_main()  ???:0
17 0x0000000000405a5a _start()  /home/abuild/rpmbuild/BUILD/glibc-2.31/csu/../sysdeps/x86_64/start.S:120

@j-xiong
Contributor

j-xiong commented Apr 25, 2024

The output of fi_info with the -l option is generated differently from without the -l option. With the -l option, no attempt is made to check if the providers are usable. It just returns the name and version of the available providers.

The fact that fi_info -p ucx returns -61 indicates that the ucx provider either doesn't exist or is not usable. That the provider appears in the output of fi_info -l suggests the latter. What puzzled me was that the output with FI_LOG_LEVEL=debug doesn't have any line related to the ucx provider. For example, for the sockets provider we can see:

libfabric:60831:1713994966::sockets:core:fi_param_define_():251<debug> registered var pe_waittime
libfabric:60831:1713994966::sockets:core:fi_param_define_():251<debug> registered var conn_timeout
libfabric:60831:1713994966::sockets:core:fi_param_define_():251<debug> registered var max_conn_retry
libfabric:60831:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_conn_map_sz
libfabric:60831:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_av_sz
libfabric:60831:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_cq_sz
libfabric:60831:1713994966::sockets:core:fi_param_define_():251<debug> registered var def_eq_sz
libfabric:60831:1713994966::sockets:core:fi_param_define_():251<debug> registered var pe_affinity
libfabric:60831:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_enable
libfabric:60831:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_time
libfabric:60831:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_intvl
libfabric:60831:1713994966::sockets:core:fi_param_define_():251<debug> registered var keepalive_probes
libfabric:60831:1713994966::sockets:core:fi_param_define_():251<debug> registered var iface
libfabric:60831:1713994966::sockets:core:fi_param_define_():251<debug> registered var max_buf_sz
libfabric:60831:1713994966::sockets:core:fi_param_define_():251<debug> registered var dgram_drop_rate

The ucx provider has similar parameter definitions at the very beginning of the provider initialization code and we expect to see similar output for the ucx provider.
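On a large debug log, one way to check for this is to pull the provider field out of each line and see whether ucx ever appears. The sketch below assumes the `libfabric:<pid>:<timestamp>::<provider>:<subsystem>:<function>` layout seen in the log lines quoted in this thread; the helper names `provider_field` and `log_mentions_provider` are hypothetical and not part of libfabric.

```cpp
#include <optional>
#include <sstream>
#include <string>

// Extract the provider field from one FI_LOG_LEVEL=debug line.
// Assumed layout (taken from the log lines quoted in this thread):
//   libfabric:<pid>:<timestamp>::<provider>:<subsystem>:<function>():<line><level> msg
std::optional<std::string> provider_field(const std::string& line) {
    const std::string prefix = "libfabric:";
    if (line.compare(0, prefix.size(), prefix) != 0) return std::nullopt;
    const auto sep = line.find("::");
    if (sep == std::string::npos) return std::nullopt;
    const auto start = sep + 2;
    const auto end = line.find(':', start);
    if (end == std::string::npos) return std::nullopt;
    return line.substr(start, end - start);
}

// True if any line of the log text was emitted on behalf of `provider`.
bool log_mentions_provider(const std::string& log_text, const std::string& provider) {
    std::istringstream in(log_text);
    std::string line;
    while (std::getline(in, line)) {
        const auto field = provider_field(line);
        if (field && *field == provider) return true;
    }
    return false;
}
```

Calling `log_mentions_provider(text, "ucx")` on the captured output gives a quick yes/no before reading the log in detail. Lines tagged `core` that merely report on another provider are deliberately not counted.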

Could you try again with FI_LOG_LEVEL=debug fi_info -p ucx?

@tylerjereddy
Contributor Author

Sure, see the attached file below (it is too large to paste in its entirety). I do see libfabric:114048:1714069851::ucx:core:ucx_getinfo():256<info> no ucx device is found. before the -61 return value, maybe that's what you're looking for.

out.txt

Are there some common causes of ucx providers not being usable?

@j-xiong
Contributor

j-xiong commented Apr 25, 2024

Yes, that's the info I want to see.

The ucx provider looks for devices under /sys/class/infiniband with vendor id 0x15b3 (NVIDIA). It appears that no such device exists on your system.

See the code here: https://github.com/ofiwg/libfabric/blob/v1.18.x/prov/ucx/src/ucx_init.c#L207
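To reproduce that check by hand, a rough sketch follows. It is not the provider's actual code: it assumes each entry under the class directory exposes a `device/vendor` file holding a hex string such as `0x15b3` (the usual sysfs layout for PCI devices), and the function name `count_devices_with_vendor` is hypothetical.

```cpp
#include <exception>
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Count entries under `class_dir` whose device/vendor file matches `vendor_id`.
// Assumption: each entry exposes <entry>/device/vendor containing a hex string
// such as "0x15b3", as PCI devices usually do in sysfs.
int count_devices_with_vendor(const fs::path& class_dir, unsigned long vendor_id) {
    std::error_code ec;
    if (!fs::is_directory(class_dir, ec)) return 0;  // missing dir: no devices
    int matches = 0;
    for (const auto& entry : fs::directory_iterator(class_dir, ec)) {
        std::ifstream vendor_file(entry.path() / "device" / "vendor");
        std::string text;
        if (!(vendor_file >> text)) continue;  // unreadable or empty: skip
        try {
            // Base 16 accepts an optional "0x" prefix.
            if (std::stoul(text, nullptr, 16) == vendor_id) ++matches;
        } catch (const std::exception&) {
            // Not a number: ignore this entry.
        }
    }
    return matches;
}
```

On a node where /sys/class/infiniband is empty or absent, `count_devices_with_vendor("/sys/class/infiniband", 0x15b3)` returns 0, which would be consistent with the "no ucx device is found" message.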

@tylerjereddy
Contributor Author

The /sys/class/infiniband folder is indeed empty, but is that surprising given that this is a Cray Slingshot 11 machine that doesn't use InfiniBand? (The fact that we're using SS11 has been the source of many problems, and is indeed the reason we need super-new OpenMPI as well, for libfabric support.)

@j-xiong
Contributor

j-xiong commented Apr 25, 2024

Not surprising at all. That just confirms that libfabric not picking up the ucx provider is expected behavior.

Back to the segfault, what is going on inside nvshmemt_libfabric_finalize()?

[nid001249:102664:0:102664] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
[nid001249:102665:0:102665] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
[nid001249:102666:0:102666] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace (tid: 102663) ====
 0 0x00000000000168c0 __funlockfile()  ???:0
 1 0x000000000000b9fd nvshmemt_libfabric_finalize()  libfabric.cpp:0
 2 0x000000000000d312 nvshmemt_init()  ???:0
 3 0x00000000000e6c89 nvshmemi_transport_init()  :0  
 4 0x00000000000ec01c nvshmemi_common_init()  :0  
 5 0x00000000000ede99 nvshmemi_try_common_init()  :0  
 6 0x00000000000ee30e nvshmemx_host_init()  ???:0

@tylerjereddy
Contributor Author

Using good old "printf peppering," it seems to be this segment of that function that is called last (control flow never exceeds checkpoint 7b):

Final section of C++ function source that gets printed before crash
printf("** nvshmemt_libfabric_finalize checkpoint 7\n");

if (libfabric_state->addresses) {
    for (int i = 0; i < NVSHMEMT_LIBFABRIC_DEFAULT_NUM_EPS; i++) {
        printf("** nvshmemt_libfabric_finalize checkpoint 7b\n");
        status = fi_close(&libfabric_state->addresses[i]->fid);
        printf("** nvshmemt_libfabric_finalize checkpoint 7c\n");
        if (status) {
            NVSHMEMI_WARN_PRINT("Unable to close fabric address vector: %d: %s\n", status,
                                fi_strerror(status * -1));
        }
        printf("** nvshmemt_libfabric_finalize checkpoint 7d\n");
    }
}
printf("** nvshmemt_libfabric_finalize checkpoint 8\n");

Based on grepping the output log with grep -E "checkpoint" out.txt, where this is repeated for each rank it seems:

** nvshmemt_libfabric_finalize checkpoint 1
** nvshmemt_libfabric_finalize checkpoint 2
** nvshmemt_libfabric_finalize checkpoint 3
** nvshmemt_libfabric_finalize checkpoint 4
** nvshmemt_libfabric_finalize checkpoint 6
** nvshmemt_libfabric_finalize checkpoint 7
** nvshmemt_libfabric_finalize checkpoint 7b

So, crash at fi_close() call, or one of the structure member accesses therein?
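One observation that may help narrow this down: the reported fault address is 0x10, which is what loading the ops pointer through a null object would produce on a 64-bit build. In libfabric's public header, struct fid holds fclass, context, and ops in that order, and fi_close() calls fid->ops->close(fid). The mock below mirrors that field order to show where ops lands; the `mock_` types and the helper function are hypothetical stand-ins, not the real libfabric definitions.

```cpp
#include <cstddef>

// Hypothetical stand-ins mirroring the field order of libfabric's struct fid.
struct mock_fi_ops;
struct mock_fid {
    std::size_t fclass;
    void* context;
    mock_fi_ops* ops;
};

// Mirrors the shape of an address vector object: the fid is the first member.
struct mock_fid_av {
    mock_fid fid;
};

// Address that would be read when loading `ops` through a null object pointer.
constexpr std::size_t ops_read_address_for_null_object() {
    return offsetof(mock_fid_av, fid) + offsetof(mock_fid, ops);
}
```

On an LP64 platform this evaluates to 0x10, matching the address in the segfault message, so a null libfabric_state->addresses[i] entry is a plausible culprit rather than something deeper inside the close implementation.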

@j-xiong
Contributor

j-xiong commented Apr 25, 2024

So the failure happened when closing the endpoints.

Question: Since fi_getinfo() previously returned -61, is there another fi_getinfo() call that succeeded later? If so, which provider was actually being used? If not, how does the code reach the point of closing an endpoint that was never opened?

@seth-howell

Hi all,

The segfault is happening in an error path inside of nvshmemt_init. It seems we are missing a null check for endpoints in that error path (I'm filing a bug for this internally). We should fail gracefully here, but do not.
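A minimal sketch of the kind of guard meant here, using hypothetical stand-in types rather than the real NVSHMEM or libfabric structures: the cleanup loop skips slots that were never opened, so an early failure such as fi_getinfo() returning -61 cannot lead to a null dereference during finalize.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for an opened fabric resource.
struct Resource {
    bool open = true;
};

// Hypothetical stand-in for fi_close(): returns 0 on success.
inline int close_resource(Resource* r) {
    r->open = false;  // would crash if r were null, like the real call
    return 0;
}

constexpr std::size_t kNumEps = 2;  // placeholder for the real endpoint count

struct TransportState {
    std::array<Resource*, kNumEps> addresses{};  // null until opened
};

// Close only the slots that were actually opened; returns how many were closed.
int finalize(TransportState* state) {
    if (state == nullptr) return 0;
    int closed = 0;
    for (Resource*& slot : state->addresses) {
        if (slot == nullptr) continue;  // never opened on the error path
        if (close_resource(slot) == 0) ++closed;
        slot = nullptr;  // avoid a double close if finalize runs again
    }
    return closed;
}
```

The design choice is to make finalize safe to call from any point in initialization, including before any resource exists, rather than relying on the caller to track how far setup got.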

I see two issues before this one that lead to us entering this error path in the first place though:

  1. Since you are using Slingshot 11, you need to have the CXI provider built and enabled in your libfabric installation. We can't find the CXI provider which supports the features called out here. It seems that provider wasn't compiled into your libfabric build @tylerjereddy. Looking at the source, I don't see the CXI provider in this repository until somewhere between 1.20.2 and 1.21.0. I don't know anything beyond this about compiling or loading the CXI provider with versions of libfabric <1.21.0. I have only used pre-compiled libfabric + CXI.

  2. I see errors with respect to not having cuda support. This is also a compile-time option.

libfabric:114111:1714069859::core:core:ofi_hmem_init():414<info> Hmem iface FI_HMEM_CUDA not supported

These errors will disable HMEM CUDA support required by NVSHMEM to register device memory. This is a configure-time option in libfabric. If you pass --with-cuda[=dir] to configure, it will fail if it can't find the directory.

@tylerjereddy
Contributor Author

CXI is closed source, so for now I switched to our HPC module libfabric/1.15.2.0, which does have CXI, but apparently not the CUDA requirement mentioned above (see output log below). spack doesn't even have a CUDA option for libfabric config, so I guess I'll need to check with local HPC to see if they can do it for me or send me their build instructions for CXI-enabled libfabric so I can also add CUDA.

out_with_cxi.txt

@raffenet
Contributor

@tylerjereddy #9835 the CXI provider is open source, but it cannot be built on any machine we have come across without significant hacks.

@tylerjereddy
Contributor Author

Ok, I'm referring to the build system we usually use which indicates: https://github.com/spack/spack/blob/develop/var/spack/repos/builtin/packages/libfabric/package.py#L87

CXI is a closed source package and only exists when an external.

Maybe they haven't updated yet, but the effect is the same, I guess.

@tylerjereddy
Contributor Author

On latest libfabric main branch:

  • ./autogen.sh
  • ./configure --prefix=/lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom --enable-debug --enable-cxi=/usr --disable-opx --disable-psm2 --disable-efa --disable-rxm --disable-sockets --enable-psm3 --enable-tcp --enable-verbs --with-cuda=/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3

I get:

configure: *** Configuring cxi provider
checking cxi_prov_hw.h usability... no
checking cxi_prov_hw.h presence... yes
configure: WARNING: cxi_prov_hw.h: present but cannot be compiled
configure: WARNING: cxi_prov_hw.h:     check for missing prerequisite headers?
configure: WARNING: cxi_prov_hw.h: see the Autoconf documentation
configure: WARNING: cxi_prov_hw.h:     section "Present But Cannot Be Compiled"
configure: WARNING: cxi_prov_hw.h: proceeding with the compiler's result
configure: WARNING:     ## ------------------------------------------ ##
configure: WARNING:     ## Report this to ofiwg@lists.openfabrics.org ##
configure: WARNING:     ## ------------------------------------------ ##

The libfabric install we have was apparently done by HPE directly, on a version that precedes direct support for CXI, so it may be patched in some way, as Seth had suggested to me at some point. --enable-cxi=/usr is based on HPC support basically just finding some of the relevant files for CXI in subdirs there.

@raffenet
Contributor

@tylerjereddy the configure issues are fixed in #9793 if you want to try and cherry-pick those commits. Make sure to re-run autogen.sh.

@tylerjereddy
Contributor Author

tylerjereddy commented Apr 27, 2024

@raffenet Cool, I ended up having to use https://github.com/thomasgillis/libfabric/tree/dev-cxi after following some more cross-linked issues/PRs that complain about LANL-specific problems, and this does seem to get me a new segfault at runtime at least, and I think the problem is still related to the CUDA provider based on the fi_info -p cuda output in the log below.

out_cxi_and_cuda_libfabric.txt

It looks like libfabric does correctly identify the number of CUDA devices based on the verbose output there, so some progress is being made. Just before the call to fi_info -p cuda "fails" with -61, I see libfabric:60885:1714254829::core:core:fi_getinfo_():1304<info> fi_getinfo: provider ofi_mrail returned -61 (No data available). Do I need that ofi_mrail provider as well here, or?

Edit: here's my current configure line for libfabric:

./configure --prefix=/lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom --enable-cxi=/usr --disable-opx --disable-psm2 --disable-efa --disable-rxm --disable-sockets --enable-psm3 --enable-tcp --enable-verbs --with-cuda=/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3

@tylerjereddy
Contributor Author

cc @hppritcha as well perhaps

@tylerjereddy
Contributor Author

tylerjereddy commented Apr 28, 2024

I checked that adding --enable-mrail to the config line above, followed by a clean rebuild, doesn't help; fi_info -p cuda and fi_info -p ofi_mrail both still fail the same way, it seems.

out_cxi_and_cuda_and_mrail_libfabric.txt

I think I'm still misunderstanding something though, because I can get the same kinds of errors with fi_info -p junk, rather than getting an error about a provider name that can't possibly exist. That's pretty confusing! What's the idiomatic way to probe the CUDA support of libfabric then? Something must be different about the CUDA support.

I wonder if this is relevant:

libfabric:9451:1714321826::cxi:fabric:cxip_nic_get_ss_env_get_vni():23<info> nid001204: SLINGSHOT_VNIS not found

@j-xiong
Contributor

j-xiong commented Apr 29, 2024

@tylerjereddy You don't need to enable the mrail provider. The line fi_getinfo: provider ofi_mrail returned -61 in the debug trace is normal; it just says the mrail provider can't be used.

There is no provider called cuda, so passing -p cuda is equivalent to using -p junk. Usually, the provider would pick the GPU/device interface (or HMEM interface in libfabric's terms) based on either application input (e.g. the iface passed to fi_mr_regattr) or some autodetect logic. You can force the use of cuda only by setting FI_HMEM=cuda. That setting will prevent other HMEM interfaces from being initialized.

@tylerjereddy
Contributor Author

I get the same backtraces with export FI_HMEM=cuda, darn:

out_fi_hmem_cuda.txt

@tylerjereddy
Contributor Author

@seth-howell I built NVSHMEM with NVSHMEM_DEBUG=1 (along with verbose libfabric settings) and tried to pull out some relevant portions of the 150 MB log file for the 2-node cuFFTMp reproducer. In this scenario, it looks like we get things erroring out instead of hitting a segfault. This output isn't in order, just grepped based on things that look suspicious.

Looks like a mixture of remote memory access (RMA), GDRCopy, CXI, and CUDA-related complaints, but not sure if it is clear what needs to be addressed.

libfabric:98917:1714598377::cxi:ep_data:cxip_rma_emit_dma():329<warn> nid001448: TXC (0x1b32:0): Failed to emit dma command: -11:Resource temporarily unavailable
libfabric:98917:1714598377::cxi:ep_data:cxip_rma_common():635<warn> nid001448: TXC (0x1b32:0): DMA RMA write failed: buf=0x150278b80600 len=80 rkey=0 roffset=0x18b806a0 nic=0x1b72 pid=0 pid_idx=117
nid001449:67129:67129 [2] NVSHMEM INFO [6] nvshmemi_get_cucontext->cuCtxGetDevice->0(CUDA_ERROR_INVALID_CONTEXT 201) cuStreamCreateWithPriority my_stream 0x4e147b0
nid001448:98917:98917 [1] NVSHMEM INFO [1] status 0 cudaErrorInvalidValue 1 cudaErrorInvalidSymbol 13 cudaErrorInvalidMemcpyDirection 21 cudaErrorNoKernelImageForDevice 209
libfabric:98918:1714598371::core:core:cuda_gdrcopy_dev_unregister():362<warn> gdr_unmap failed! error: Invalid argument
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:nvshmemt_libfabric_rma:517: Received an error when trying to post an RMA operation.
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/include/internal/common/nvshmem_internal.h:nvshmemi_process_multisend_rma:302: aborting due to error in process_channel_dma
libfabric:67128:1714598429::core:core:cuda_gdrcopy_dev_register():333<warn> gdr_map failed! error: Cannot allocate memory
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp 1706 GDRCopy requested, but unused by transport. Disabling
libfabric:98717:1714598347::core:core:fi_param_get_():372<info> variable hmem_cuda_use_gdrcopy=<not set>
libfabric:67130:1714598364::core:core:fi_param_get_():399<info> read bool var hmem_cuda_use_gdrcopy=1
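
Grepping a 150 MB log by eye is error-prone, so here is a minimal sketch of how the libfabric warnings could be tallied by provider and function. It assumes every libfabric line has the same shape as the ones in the excerpt above; the regex and the `summarize_warnings` helper are made up for illustration and are not part of libfabric or NVSHMEM.

```python
import re
from collections import Counter

# Matches libfabric log lines shaped like the excerpt above, i.e.
# libfabric:<pid>:<timestamp>::<provider>:<subsystem>:<function>():<line><level> <message>
LIBFABRIC_LINE = re.compile(
    r"^libfabric:(?P<pid>\d+):(?P<ts>\d+)::(?P<prov>\w+):(?P<subsys>\w+):"
    r"(?P<func>\w+)\(\):(?P<line>\d+)<(?P<level>\w+)>\s*(?P<msg>.*)$"
)


def summarize_warnings(lines):
    """Count <warn> lines per (provider, function); other lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LIBFABRIC_LINE.match(line.strip())
        if match and match.group("level") == "warn":
            counts[(match.group("prov"), match.group("func"))] += 1
    return counts


# Four lines copied from the excerpt above; a real run would pass an open file.
sample = [
    "libfabric:98917:1714598377::cxi:ep_data:cxip_rma_emit_dma():329<warn> "
    "nid001448: TXC (0x1b32:0): Failed to emit dma command: "
    "-11:Resource temporarily unavailable",
    "libfabric:98918:1714598371::core:core:cuda_gdrcopy_dev_unregister():362<warn> "
    "gdr_unmap failed! error: Invalid argument",
    "libfabric:67128:1714598429::core:core:cuda_gdrcopy_dev_register():333<warn> "
    "gdr_map failed! error: Cannot allocate memory",
    "libfabric:67130:1714598364::core:core:fi_param_get_():399<info> "
    "read bool var hmem_cuda_use_gdrcopy=1",
]

counts = summarize_warnings(sample)
for (prov, func), n in counts.most_common():
    print(f"{n:6d}  {prov}:{func}")
```

For the full log, the same function can be fed a file object opened with `errors="replace"`, so the file is streamed line by line rather than loaded into memory.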

@tylerjereddy
Contributor Author

Here's a shorter output log with different debug/verbosity settings and with export GDRCOPY_ENABLE_LOGGING=1 turned on. It prints extra ERR: mh is not mapped yet lines before hitting the segfault via NVSHMEM->CXI->CUDA (gdrcopy).

/lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/bin/fi_info
fi_info -l:
verbs:
    version: 120.0
cxi:
    version: 0.1
psm3:
    version: 305.1010
ofi_rxd:
    version: 120.0
shm:
    version: 120.0
udp:
    version: 120.0
tcp:
    version: 120.0
ofi_hook_perf:
    version: 120.0
ofi_hook_trace:
    version: 120.0
ofi_hook_debug:
    version: 120.0
ofi_hook_noop:
    version: 120.0
ofi_hook_hmem:
    version: 120.0
ofi_hook_dmabuf_peer_mem:
    version: 120.0
off_coll:
    version: 120.0
sm2:
    version: 120.0
ofi_mrail:
    version: 120.0
fi_info -p cxi:
provider: cxi
    fabric: cxi
    domain: cxi0
    version: 0.1
    type: FI_EP_RDM
    protocol: FI_PROTO_CXI
provider: cxi
    fabric: cxi
    domain: cxi1
    version: 0.1
    type: FI_EP_RDM
    protocol: FI_PROTO_CXI
rm -rf cufftmp_r2c_c2r_slabs_GROMACS
/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/bin/../bin/nvcc cufftmp_r2c_c2r_slabs_GROMACS.cu -o cufftmp_r2c_c2r_slabs_GROMACS -std=c++17 --generate-code arch=compute_70,code=sm_70 --generate-code arch=compute_80,code=sm_80 --generate-code arch=compute_90,code=sm_90 -I/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/include/cufftmp -I/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/include -I/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/include -lcuda -L/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib -L/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib  -lcufftMp -lnvshmem_device -lnvshmem_host -L/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/lib -lmpi
LD_LIBRARY_PATH="/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib:/usr/projects/hpcsoft/cos2/chicoma/cuda-compat/12.0/:lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/targets/x86_64-linux/lib:/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib:lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/targets/x86_64-linux/lib:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-hc255f5j4fcqhtufeisjj3pytrkv4dqt/lib/ucx:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-hc255f5j4fcqhtufeisjj3pytrkv4dqt/lib:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/comm_libs/12.3/nccl/lib:/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/lib:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/lib64:/lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/lib:/opt/cray/pe/papi/7.0.0.2/lib64:/opt/cray/libfabric/1.15.2.0/lib64:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/pmix-4.2.9-7kfa4s6dwyd5wlayw24vx7jai7d4oi4x/lib" 
/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/bin/mpirun -oversubscribe -n 8 -N 4 cufftmp_r2c_c2r_slabs_GROMACS
Hello from rank 3/8 using GPU 3
Hello from rank 7/8 using GPU 3
Hello from rank 4/8 using GPU 0
Hello from rank 2/8 using GPU 2
Hello from rank 6/8 using GPU 2
Hello from rank 0/8 using GPU 0
Hello from rank 1/8 using GPU 1
Hello from rank 5/8 using GPU 1
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.

/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.

/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.

/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.

/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.

/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.

/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.

/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.

ERR:  mh is not mapped yet
ERR:  mh is not mapped yet
[nid001448:105413:0:105413] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x18)
ERR:  mh is not mapped yet
ERR:  mh is not mapped yet
ERR:  mh is not mapped yet
ERR:  mh is not mapped yet
==== backtrace (tid: 105413) ====
 0 0x00000000000168c0 __funlockfile()  ???:0
 1 0x0000000000001aa7 gdr_unmap()  ???:0
 2 0x0000000000032d92 cuda_gdrcopy_dev_unregister()  :0
 3 0x00000000000a488f cxip_unmap()  :0
 4 0x000000000008c165 cxip_rma_cb()  cxip_rma.c:0
 5 0x00000000000adfe5 cxip_evtq_progress()  :0
 6 0x0000000000081695 cxip_ep_progress()  :0
 7 0x000000000008b6c9 cxip_cntr_read()  cxip_cntr.c:0
 8 0x000000000000e7d3 nvshmemt_libfabric_quiet()  /lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/include/rdma/fi_eq.h:441
 9 0x00000000000d653a nvshmem_quiet()  /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/comm/quiet.cpp:51
10 0x000000000004525d nvshmemi_barrier()  /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:19
11 0x00000000000456b3 nvshmem_barrier_all()  /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:39
12 0x00000000000456b3 nvshmem_barrier_all()  /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:41
13 0x0000000000500a66 cufftMpDestroyReshape()  ???:0
14 0x00000000004ff598 cufftMpDestroyReshape()  ???:0
15 0x000000000015893a cufftMpAttachComm()  ???:0
16 0x00000000004e058f cufftMpDestroyReshape()  ???:0
17 0x00000000004e0a85 cufftMpDestroyReshape()  ???:0
18 0x000000000014cb6e cufftMpAttachComm()  ???:0
19 0x000000000011bf4f cufftXtSetCallbackSharedSize()  ???:0
20 0x0000000000147511 cufftXtMakePlanGuru64()  ???:0
21 0x0000000000148105 cufftXtMakePlanMany()  ???:0
22 0x0000000000145d7d cufftMakePlanMany64()  ???:0
23 0x00000000001461bf cufftMakePlanMany()  ???:0
24 0x0000000000146386 cufftMakePlan3d()  ???:0
25 0x0000000000406619 run_r2c_c2r_slabs()  ???:0
26 0x00000000004079c7 main()  ???:0
27 0x000000000003529d __libc_start_main()  ???:0
28 0x000000000040573a _start()  /home/abuild/rpmbuild/BUILD/glibc-2.31/csu/../sysdeps/x86_64/start.S:120
=================================
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------
make: *** [Makefile:18: run] Error 139

@tylerjereddy
Contributor Author

It looks like an NVIDIA engineer responded to the ERR: mh is not mapped yet type of error at NVIDIA/gdrcopy#242 (comment), but needed a custom build of gdrcopy to get to the bottom of it. I'm assuming that we're OK to use the Spack build, as reported by spack find -v gdrcopy:

gdrcopy@2.3+cuda build_system=makefile cuda_arch=none patches=c5efec1

@tylerjereddy
Contributor Author

When I build gdrcopy master branch by hand, in the interest of pulling in NVIDIA/gdrcopy#248, and then rebuild NVSHMEM 2.10.1 to point at that version of gdrcopy, I do see different behavior.

In particular, I see the cuFFTMp reproducer code hang on two nodes instead of segfaulting. Not sure if this might be diagnostically useful; cc @pakmarkthub.

@tylerjereddy
Contributor Author

tylerjereddy commented May 6, 2024

I guess the Slingshot 11 network was having problems (based on feedback from local HPC). When I re-run today with gdrcopy master branch I see this output/backtrace:

<snip>
Hello from rank 7/8 using GPU 3
Hello from rank 4/8 using GPU 0
Hello from rank 5/8 using GPU 1
Hello from rank 6/8 using GPU 2
Hello from rank 3/8 using GPU 3
Hello from rank 1/8 using GPU 1
Hello from rank 2/8 using GPU 2
Hello from rank 0/8 using GPU 0
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.

/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.

/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.

/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.

/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.

/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.

/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.

/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.

ERR:  mh is not mapped yet
[nid001217:115514:0:115701] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x18)
ERR:  mh is not mapped yet
ERR:  mh is not mapped yet
==== backtrace (tid: 115701) ====
 0 0x00000000000168c0 __funlockfile()  ???:0
 1 0x0000000000001aa7 gdr_unmap()  ???:0
 2 0x0000000000032d92 cuda_gdrcopy_dev_unregister()  :0
 3 0x00000000000a488f cxip_unmap()  :0
 4 0x000000000008c165 cxip_rma_cb()  cxip_rma.c:0
 5 0x00000000000adfe5 cxip_evtq_progress()  :0
 6 0x0000000000081695 cxip_ep_progress()  :0
 7 0x000000000008b599 cxip_cntr_readerr()  cxip_cntr.c:0
 8 0x000000000000dfc2 nvshmemt_libfabric_progress()  /lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/include/rdma/fi_eq.h:446
 9 0x00000000000e4bad progress_transports()  /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/proxy/proxy.cpp:963
10 0x00000000000e51b9 progress()  /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/proxy/proxy.cpp:992
11 0x000000000000a6ea start_thread()  ???:0
12 0x0000000000117a6f __GI___clone()  ???:0
=================================
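
The backtrace above (gdrcopy master) and the earlier one (Spack gdrcopy 2.3) can be compared mechanically to see where they agree. This is a minimal sketch, assuming UCX-style frame lines as printed in this thread; the `frame_names` and `common_innermost` helpers are hypothetical names, and only frames 0 through 8 of each backtrace are copied in.

```python
import re

# Matches frame lines such as:
#  1 0x0000000000001aa7 gdr_unmap()  ???:0
FRAME = re.compile(r"^\s*(\d+)\s+0x[0-9a-fA-F]+\s+(\w+)\(\)")


def frame_names(backtrace_text):
    """Return function names in frame order (innermost first)."""
    names = []
    for line in backtrace_text.splitlines():
        match = FRAME.match(line)
        if match:
            names.append(match.group(2))
    return names


def common_innermost(a, b):
    """Return the leading run of frames that both call chains share."""
    shared = []
    for x, y in zip(a, b):
        if x != y:
            break
        shared.append(x)
    return shared


# Frames 0-8 of the earlier backtrace (tid 105413).
first_bt = """
 0 0x00000000000168c0 __funlockfile()  ???:0
 1 0x0000000000001aa7 gdr_unmap()  ???:0
 2 0x0000000000032d92 cuda_gdrcopy_dev_unregister()  :0
 3 0x00000000000a488f cxip_unmap()  :0
 4 0x000000000008c165 cxip_rma_cb()  cxip_rma.c:0
 5 0x00000000000adfe5 cxip_evtq_progress()  :0
 6 0x0000000000081695 cxip_ep_progress()  :0
 7 0x000000000008b6c9 cxip_cntr_read()  cxip_cntr.c:0
 8 0x000000000000e7d3 nvshmemt_libfabric_quiet()  /lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/include/rdma/fi_eq.h:441
"""

# Frames 0-8 of the later backtrace (tid 115701).
second_bt = """
 0 0x00000000000168c0 __funlockfile()  ???:0
 1 0x0000000000001aa7 gdr_unmap()  ???:0
 2 0x0000000000032d92 cuda_gdrcopy_dev_unregister()  :0
 3 0x00000000000a488f cxip_unmap()  :0
 4 0x000000000008c165 cxip_rma_cb()  cxip_rma.c:0
 5 0x00000000000adfe5 cxip_evtq_progress()  :0
 6 0x0000000000081695 cxip_ep_progress()  :0
 7 0x000000000008b599 cxip_cntr_readerr()  cxip_cntr.c:0
 8 0x000000000000dfc2 nvshmemt_libfabric_progress()  /lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/include/rdma/fi_eq.h:446
"""

first = frame_names(first_bt)
second = frame_names(second_bt)
shared = common_innermost(first, second)

print("shared innermost frames:", " <- ".join(shared))
print("first diverges at:", first[len(shared)])
print("second diverges at:", second[len(shared)])
```

On these frames, both crashes share the seven innermost frames, from gdr_unmap() up through cxip_ep_progress(). They differ only in the caller: cxip_cntr_read() under nvshmemt_libfabric_quiet() in the earlier run, and cxip_cntr_readerr() under nvshmemt_libfabric_progress() on the proxy thread in this one.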

@tylerjereddy
Contributor Author

I believe using the correct CXI provider build of libfabric resolved the original segfaults here.

We're now seeing other crashes in gdrcopy, quite possibly from libfabric incantations, per the cross-linked issue above. However, I think it is best if I close this and open a separate issue if I manage to get a minimum viable reproducer for the libfabric side.
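
For anyone hitting something similar: the LD_LIBRARY_PATH in the run logs earlier in this thread lists both the custom libfabric install and /opt/cray/libfabric/1.15.2.0/lib64, and a couple of entries are missing their leading slash. Below is a minimal sketch of a sanity check for that kind of search path. It assumes the library file is named `libfabric.so.1`; the `inspect_search_path` helper and the fake filesystem are hypothetical, and the check ignores RPATH/RUNPATH and anything else the dynamic loader consults.

```python
import os
import posixpath


def inspect_search_path(ld_library_path, exists=os.path.exists):
    """Return (relative_entries, dirs_with_libfabric), both in search order."""
    # Empty entries are dropped here for simplicity.
    entries = [e for e in ld_library_path.split(":") if e]
    relative = [e for e in entries if not posixpath.isabs(e)]
    with_libfabric = [
        e for e in entries if exists(posixpath.join(e, "libfabric.so.1"))
    ]
    return relative, with_libfabric


# Three entries copied from the LD_LIBRARY_PATH in the run log, in their
# original relative order.
path = ":".join([
    "lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/"
    "spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/"
    "nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/"
    "cuda/12.3/targets/x86_64-linux/lib",
    "/lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/lib",
    "/opt/cray/libfabric/1.15.2.0/lib64",
])

# Stand-in for the real filesystem (an assumption, so the sketch runs anywhere):
# pretend both libfabric directories contain libfabric.so.1.
fake_fs = {
    "/lustre/scratch5/treddy/march_april_2024_testing/"
    "libfabric_install_custom/lib/libfabric.so.1",
    "/opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1",
}

relative, with_libfabric = inspect_search_path(path, exists=fake_fs.__contains__)
print("relative entries:", relative)
print("libfabric found in:", with_libfabric)
```

On a real node, calling `inspect_search_path(os.environ.get("LD_LIBRARY_PATH", ""))` uses the actual filesystem. More than one directory in the second list means the build that gets loaded depends on ordering, which is worth confirming with `ldd` or similar.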

4 participants