segmentation fault with numpy on POWER9 (only) when using FlexiBLAS #17

Open
boegel opened this issue May 24, 2021 · 13 comments

Comments

@boegel

boegel commented May 24, 2021

I'm seeing a Segmentation fault when running the numpy 1.20.3 tests when using FlexiBLAS 3.0.4 with OpenBLAS 0.3.15, but not when linking to OpenBLAS 0.3.15 directly, which tells me FlexiBLAS is somehow causing the segmentation fault...

I'm not seeing this problem on Intel (Haswell, Skylake X), AMD (Rome), or Arm (AWS Graviton2).

Here's a partial backtrace I obtained when running the numpy tests via gdb:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007ffff4887530 in dnrm2_k () from /home/centos/EasyBuild/software/OpenBLAS/0.3.15-GCC-10.3.0/lib/../lib64/libopenblas.so.0
Missing separate debuginfos, use: yum debuginfo-install libxcrypt-4.1.1-4.el8.ppc64le
(gdb) bt
#0  0x00007ffff4887530 in dnrm2_k () from /home/centos/EasyBuild/software/OpenBLAS/0.3.15-GCC-10.3.0/lib/../lib64/libopenblas.so.0
#1  0x00007ffff453d788 in dnrm2_ () from /home/centos/EasyBuild/software/OpenBLAS/0.3.15-GCC-10.3.0/lib/../lib64/libopenblas.so.0
#2  0x00007ffff62cfd9c in dnrm2_ () from /home/centos/EasyBuild/software/FlexiBLAS/3.0.4-GCC-10.3.0/lib64/libflexiblas.so.3
#3  0x00007ffff4d7816c in dgeev_ () from /home/centos/EasyBuild/software/OpenBLAS/0.3.15-GCC-10.3.0/lib/../lib64/libopenblas.so.0
#4  0x00007ffff639e8e4 in dgeev_ () from /home/centos/EasyBuild/software/FlexiBLAS/3.0.4-GCC-10.3.0/lib64/libflexiblas.so.3
#5  0x00007fff7364b334 in call_dgeev (params=0x7ffffffe63b0) at numpy/linalg/umath_linalg.c.src:2292
#6  DOUBLE_eig_wrapper (JOBVL=JOBVL@entry=78 'N', JOBVR=JOBVR@entry=86 'V', args=0x7fff50dad120, dimensions=<optimized out>, steps=<optimized out>) at numpy/linalg/umath_linalg.c.src:2292
#7  0x00007fff7364c02c in DOUBLE_eig (args=<optimized out>, dimensions=<optimized out>, steps=<optimized out>, __NPY_UNUSED_TAGGEDfunc=<optimized out>) at numpy/linalg/umath_linalg.c.src:2336
#8  0x00007ffff6a5d294 in PyUFunc_GeneralizedFunction (op=0x7ffffffe8200, kwds=0x0, args=0x7fff50dad0f0, ufunc=0x0) at numpy/core/src/umath/ufunc_object.c:2986
#9  PyUFunc_GenericFunction_int (ufunc=<optimized out>, ufunc@entry=0x7fff736c1130, args=args@entry=0x7fff50f88820, kwds=kwds@entry=0x7fff50e79c00, op=op@entry=0x7ffffffe8200)
    at numpy/core/src/umath/ufunc_object.c:3119
#10 0x00007ffff6a5f740 in ufunc_generic_call (ufunc=0x7fff736c1130, args=0x7fff50f88820, kwds=0x7fff50e79c00) at numpy/core/src/umath/ufunc_object.c:4747
...

This only happens when numpy is linked with FlexiBLAS:

$ ldd $(python -c "import numpy; print(numpy.core._multiarray_umath.__file__)") | grep blas
	libflexiblas.so.3 => /home/centos/EasyBuild/software/FlexiBLAS/3.0.4-GCC-10.3.0/lib64/libflexiblas.so.3 (0x0000200000570000)

Any ideas on what may be causing this segmentation fault?

I tried using ulimit -s unlimited (default is 8192 on that system), no change.

After export FLEXIBLAS=netlib to make FlexiBLAS use the fallback netlib backend, the segmentation fault doesn't happen either...
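A comparison like this can be scripted so that each backend runs in its own process and a crash is detected from the exit status. The sketch below is only an illustration: the helper names `run_with_backend` and `died_from_segfault` are made up for this example, and it relies on the POSIX convention that Python's `subprocess` reports death by signal N as return code -N.

```python
import os
import signal
import subprocess
import sys


def run_with_backend(backend, code):
    """Run a Python snippet in a child process with FLEXIBLAS set to `backend`.

    `backend` is passed through unchanged; "netlib" is the value used in
    this thread, any other backend name is an assumption about the local
    FlexiBLAS configuration.
    """
    env = dict(os.environ, FLEXIBLAS=backend)
    return subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
    )


def died_from_segfault(returncode):
    """Return True if the child was killed by SIGSEGV (POSIX only)."""
    return returncode == -signal.SIGSEGV
```

For example, `run_with_backend("netlib", "import numpy as np; np.linalg.test()")` would run the numpy linalg tests against the fallback backend, and `died_from_segfault(result.returncode)` tells a crash apart from an ordinary test failure.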

@grisuthedragon
Member

Can you provide the backtrace with debug information? What does it look like in valgrind?

@boegel
Author

boegel commented May 25, 2021

Backtrace with debug info:

#0  dnrm2_k (n=2, x=<optimized out>, inc_x=1) at ../kernel/power/../arm/nrm2.c:69
#1  0x00007ffff453d788 in dnrm2_ (N=<optimized out>, x=<optimized out>, INCX=<optimized out>) at nrm2.c:61
#2  0x00007ffff62cf9fc in dnrm2_ (n=<optimized out>, x=<optimized out>, incx=<optimized out>) at /tmp/centos/FlexiBLAS/3.0.4/GCC-10.3.0/flexiblas-3.0.4/src/wrapper_blas_gnu.c:2899
#3  0x00007ffff4d788ec in dgeev (jobvl=..., jobvr=..., n=2, a=..., lda=<optimized out>, wr=..., wi=..., vl=..., ldvl=2, vr=..., ldvr=2, work=..., lwork=260, info=<optimized out>, _jobvl=140737323525740, _jobvr=8) at dgeev.f:490
#4  0x00007ffff639e594 in dgeev_ (jobvl=0x7ffffffe655c "NV", jobvr=0x7ffffffe655d "V", n=0x7ffffffe6548, a=0x7fff650fc3a0, lda=0x7ffffffe654c, wr=0x7fff650fc3c0, wi=0x7fff650fc3d0, vl=0x7fff650fc3e0, ldvl=0x7ffffffe6550, vr=0x7fff650fc3e0, ldvr=0x7ffffffe6554,
    work=0x7fff650458d0, lwork=0x7ffffffe6558, info=0x7ffffffe6560) at /tmp/centos/FlexiBLAS/3.0.4/GCC-10.3.0/flexiblas-3.0.4/src/lapack_interface/wrapper/dgeev.c:80
#5  0x00007fff7364b334 in call_dgeev (params=0x7ffffffe6500) at numpy/linalg/umath_linalg.c.src:2292
#6  DOUBLE_eig_wrapper (JOBVL=JOBVL@entry=78 'N', JOBVR=JOBVR@entry=86 'V', args=0x7fff5142d4a0, dimensions=<optimized out>, steps=<optimized out>) at numpy/linalg/umath_linalg.c.src:2292
#7  0x00007fff7364c02c in DOUBLE_eig (args=<optimized out>, dimensions=<optimized out>, steps=<optimized out>, __NPY_UNUSED_TAGGEDfunc=<optimized out>) at numpy/linalg/umath_linalg.c.src:2336
#8  0x00007ffff6a5d294 in PyUFunc_GeneralizedFunction (op=0x7ffffffe8270, kwds=0x0, args=0x7fff5142d470, ufunc=0x0) at numpy/core/src/umath/ufunc_object.c:2986
#9  PyUFunc_GenericFunction_int (ufunc=<optimized out>, ufunc@entry=0x7fff736c1130, args=args@entry=0x7fff5005aca0, kwds=kwds@entry=0x7fff50e7a700, op=op@entry=0x7ffffffe8270) at numpy/core/src/umath/ufunc_object.c:3119
#10 0x00007ffff6a5f740 in ufunc_generic_call (ufunc=0x7fff736c1130, args=0x7fff5005aca0, kwds=0x7fff50e7a700) at numpy/core/src/umath/ufunc_object.c:4747
...

I'll look into valgrind too.

@boegel
Author

boegel commented May 25, 2021

@grisuthedragon No segmentation fault occurs when running under Valgrind, it seems (though it reports a bunch of unrelated "Invalid read of size 4" cases in Python itself). So I'm afraid that's a dead end...

@grisuthedragon
Member

That's weird. I'll try to compile FlexiBLAS + NumPy on my POWER system as soon as possible.

@boegel
Author

boegel commented May 25, 2021

To quickly trigger the segfault, you can use python -c "import numpy as np; np.linalg.test()".

@Flamefire

Flamefire commented May 26, 2021

I tried this too on a real ppc machine, and the minimal reproducer for "issues" I got is python -c "import numpy as np; np.linalg.test(verbose=3, extra_argv=['-k', 'TestEigvals and test_sq_cases'])", which either segfaults with a double free or fails the test (it works with OpenBLAS directly).

I also see messages in stderr:

 ** On entry to DGEHRD parameter number  8 had an illegal value
 ** On entry to DGEHRD parameter number  8 had an illegal value
 ** On entry to DORGHR parameter number  8 had an illegal value
 ** On entry to DGEHRD parameter number  8 had an illegal value
 ** On entry to DGEHRD parameter number  8 had an illegal value
 ** On entry to DORGHR parameter number  8 had an illegal value
 ** On entry to DGEHRD parameter number  8 had an illegal value
 ** On entry to DGEHRD parameter number  8 had an illegal value
 ** On entry to DORGHR parameter number  8 had an illegal value
 ** On entry to ZGEHRD parameter number  5 had an illegal value
 ** On entry to ZHSEQR parameter number  7 had an illegal value

Those are from the numpy xerbla error handler, and I guess they are a good hint as to the real problem.

@Flamefire

More minimal reproducer: python -c "from numpy import array, linalg; linalg.eigvals(array([[1., 2.], [3., 4.]]))"
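Since this reproducer can fail the test rather than crash, it may help to compare the result against the known answer. The following is a minimal sketch: the function names are made up for illustration, and the expected values come from the characteristic polynomial x^2 - 5x - 2 of the matrix [[1, 2], [3, 4]] (trace 5, determinant -2).

```python
import math


def expected_eigvals():
    """Analytic eigenvalues of [[1, 2], [3, 4]], sorted ascending."""
    trace, det = 5.0, -2.0
    disc = math.sqrt(trace * trace - 4.0 * det)  # sqrt(33)
    return sorted([(trace - disc) / 2.0, (trace + disc) / 2.0])


def check_eigvals(tol=1e-9):
    """Return True if numpy's eigvals matches the analytic result.

    numpy is imported lazily; this is the call that crashed or returned
    wrong results in the affected POWER9 setup.
    """
    import numpy as np

    computed = sorted(np.linalg.eigvals(np.array([[1.0, 2.0], [3.0, 4.0]])).real)
    return all(abs(c - e) < tol for c, e in zip(computed, expected_eigvals()))
```

On a working installation `check_eigvals()` should return True; on an affected one it would be expected to either crash the interpreter or return False.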

I suspect a stack overflow due to GCC misoptimizing OpenBLAS, which becomes apparent with FlexiBLAS because FlexiBLAS uses the stack to save a register, which then gets overwritten by the bug. I reported this as https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

@grisuthedragon
Member

@Flamefire
Thanks for the work and for identifying where this behaviour comes from. Let's wait until the GCC guys react and see how they view this problem.

@Flamefire

The IBM compiler guys are looking into this. It does indeed seem to be a compiler issue, present since GCC 7. So I'd say this can be closed, as nothing can be done here short of providing a better error message.

@boegel
Author

boegel commented Jan 9, 2022

@Flamefire Any updates on this?

@boegel
Author

boegel commented Oct 12, 2022

Small update from our side: we've side-stepped this problem by compiling OpenBLAS with -fstack-protector-strong on POWER; see easybuilders/easybuild-easyconfigs#15885 for more information.

@Flamefire

The GCC developers determined this to be a bug in usage, related to the Fortran calling convention:

As described in (https://gcc.gnu.org/onlinedocs/gfortran/Argument-passing-conventions.html), since the first parameter to DGEBAL is of type CHARACTER, there is an extra hidden argument. Change the call to DGEBAL from dgebal (the flexiBLAS wrapper routine) to take an extra argument. This causes the compiler to allocate a parameter save area in dgebal's frame, as there are now 9 parameters but only 8 parameter registers.
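The argument count in that explanation can be worked through in a few lines. This is only a counting sketch, not FlexiBLAS or GCC code: the signatures are the standard LAPACK argument lists for DGEBAL and DGEEV, the limit of 8 parameter registers is the figure quoted above, and the helper names are made up for illustration.

```python
PARAM_REGISTERS = 8  # figure quoted in the GCC explanation above

# (name, Fortran type) pairs from the LAPACK interfaces of both routines.
DGEBAL = [
    ("JOB", "CHARACTER"), ("N", "INTEGER"), ("A", "DOUBLE"), ("LDA", "INTEGER"),
    ("ILO", "INTEGER"), ("IHI", "INTEGER"), ("SCALE", "DOUBLE"), ("INFO", "INTEGER"),
]
DGEEV = [
    ("JOBVL", "CHARACTER"), ("JOBVR", "CHARACTER"), ("N", "INTEGER"),
    ("A", "DOUBLE"), ("LDA", "INTEGER"), ("WR", "DOUBLE"), ("WI", "DOUBLE"),
    ("VL", "DOUBLE"), ("LDVL", "INTEGER"), ("VR", "DOUBLE"), ("LDVR", "INTEGER"),
    ("WORK", "DOUBLE"), ("LWORK", "INTEGER"), ("INFO", "INTEGER"),
]


def total_arguments(signature):
    """Visible arguments plus one hidden length per CHARACTER argument."""
    hidden = sum(1 for _, ftype in signature if ftype == "CHARACTER")
    return len(signature) + hidden


def exceeds_parameter_registers(signature):
    """True if the call needs more slots than there are parameter registers."""
    return total_arguments(signature) > PARAM_REGISTERS


print("DGEBAL:", total_arguments(DGEBAL))  # 8 visible + 1 hidden = 9
print("DGEEV:", total_arguments(DGEEV))    # 14 visible + 2 hidden = 16
```

A wrapper that declares only the 8 visible DGEBAL arguments stays within the register limit from the compiler's point of view, while the Fortran callee expects 9. The two hidden DGEEV lengths are visible as `_jobvl` and `_jobvr` in the debug backtrace earlier in this thread.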

@grisuthedragon
Member

@Flamefire
I know about these extra arguments, but for compatibility reasons in the early days of FlexiBLAS, we neglected them. Even using CBLAS/LAPACKE from the reference implementation can lead to this issue, since they "forget" about these additional parameters as well.

For FlexiBLAS I will do some tests and, if successful, integrate it in the next release.
