[BUG]: Tests fail with "Core dumped" when using latest jax(lib) 0.4.28 #578

Closed
GaetanLepage opened this issue May 11, 2024 · 16 comments

@GaetanLepage

GaetanLepage commented May 11, 2024

Problem description

When running the test suite while the latest jax/jaxlib (v0.4.28) is installed, pytest will suddenly crash with Aborted (core dumped) after the tests have (supposedly) all succeeded.
This weird behavior doesn't happen if I uninstall the jax library (the tests are then skipped and pytest quits without error).

More interestingly, pytest runs fine when I use the latest commit (c545446 as of today). The crash only occurs on tag v1.9.2.

Context: Updating jax in the nixpkgs repo.

Reproducible example code

No response

@GaetanLepage
Author

Logs:

Executing pytestCheckPhase
[  2%] Building CXX object tests/CMakeFiles/nanobind-static.dir/__/src/nb_internals.cpp.o
[  4%] Building CXX object tests/CMakeFiles/inter_module.dir/inter_module.cpp.o
[  6%] Building CXX object tests/CMakeFiles/nanobind-static.dir/__/src/nb_ndarray.cpp.o
[  8%] Building CXX object tests/CMakeFiles/nanobind-static.dir/__/src/nb_func.cpp.o
[ 10%] Building CXX object tests/CMakeFiles/nanobind-static.dir/__/src/nb_type.cpp.o
[ 14%] Building CXX object tests/CMakeFiles/nanobind-static.dir/__/src/common.cpp.o
[ 14%] Building CXX object tests/CMakeFiles/nanobind-static.dir/__/src/nb_enum.cpp.o
[ 16%] Building CXX object tests/CMakeFiles/nanobind-static.dir/__/src/nb_static_property.cpp.o
[ 18%] Building CXX object tests/CMakeFiles/nanobind-static.dir/__/src/error.cpp.o
[ 20%] Building CXX object tests/CMakeFiles/nanobind-static.dir/__/src/trampoline.cpp.o
[ 22%] Building CXX object tests/CMakeFiles/nanobind-static.dir/__/src/implicit.cpp.o
[ 25%] Linking CXX shared library libinter_module.so
[ 25%] Built target inter_module
[ 27%] Linking CXX static library libnanobind-static.a
[ 27%] Built target nanobind-static
[ 29%] Building CXX object tests/CMakeFiles/test_functions_ext.dir/test_functions.cpp.o
[ 31%] Building CXX object tests/CMakeFiles/test_stl_ext.dir/test_stl.cpp.o
[ 33%] Building CXX object tests/CMakeFiles/test_classes_ext.dir/test_classes.cpp.o
[ 35%] Building CXX object tests/CMakeFiles/test_bind_vector_ext.dir/test_stl_bind_vector.cpp.o
[ 39%] Building CXX object tests/CMakeFiles/test_chrono_ext.dir/test_chrono.cpp.o
[ 39%] Building CXX object tests/CMakeFiles/test_bind_map_ext.dir/test_stl_bind_map.cpp.o
[ 41%] Building CXX object tests/CMakeFiles/test_eval_ext.dir/test_eval.cpp.o
[ 43%] Building CXX object tests/CMakeFiles/test_ndarray_ext.dir/test_ndarray.cpp.o
[ 45%] Building CXX object tests/CMakeFiles/test_exception_ext.dir/test_exception.cpp.o
[ 47%] Building CXX object tests/CMakeFiles/test_inter_module_1_ext.dir/test_inter_module_1.cpp.o
[ 50%] Building CXX object tests/CMakeFiles/test_issue_ext.dir/test_issue.cpp.o
[ 54%] Building CXX object tests/CMakeFiles/test_inter_module_2_ext.dir/test_inter_module_2.cpp.o
[ 54%] Building CXX object tests/CMakeFiles/test_eigen_ext.dir/test_eigen.cpp.o
[ 56%] Building CXX object tests/CMakeFiles/test_intrusive_ext.dir/test_intrusive.cpp.o
[ 58%] Building CXX object tests/CMakeFiles/test_make_iterator_ext.dir/test_make_iterator.cpp.o
[ 62%] Building CXX object tests/CMakeFiles/test_holders_ext.dir/test_holders.cpp.o
[ 62%] Building CXX object tests/CMakeFiles/test_enum_ext.dir/test_enum.cpp.o
[ 64%] Building CXX object tests/CMakeFiles/test_intrusive_ext.dir/test_intrusive_impl.cpp.o
[ 66%] Linking CXX shared module test_inter_module_2_ext.cpython-311-x86_64-linux-gnu.so
[ 68%] Linking CXX shared module test_inter_module_1_ext.cpython-311-x86_64-linux-gnu.so
/build/source/tests/test_classes.cpp: In lambda function:
/build/source/tests/test_classes.cpp:448:97: warning: redundant move in return statement [-Wredundant-move]
  448 |     m.def("test_type_object_t", [](nb::type_object_t<Struct> h) -> nb::object { return std::move(h); });
      |                                                                                        ~~~~~~~~~^~~
/build/source/tests/test_classes.cpp:448:97: note: remove ‘std::move’ call
[ 70%] Linking CXX shared module test_enum_ext.cpython-311-x86_64-linux-gnu.so
[ 70%] Built target test_inter_module_2_ext
[ 70%] Built target test_inter_module_1_ext
[ 72%] Linking CXX shared module test_intrusive_ext.cpython-311-x86_64-linux-gnu.so
[ 75%] Linking CXX shared module test_exception_ext.cpython-311-x86_64-linux-gnu.so
[ 77%] Linking CXX shared module test_eval_ext.cpython-311-x86_64-linux-gnu.so
[ 77%] Built target test_enum_ext
[ 77%] Built target test_intrusive_ext
[ 77%] Built target test_exception_ext
[ 77%] Built target test_eval_ext
[ 79%] Linking CXX shared module test_chrono_ext.cpython-311-x86_64-linux-gnu.so
[ 81%] Linking CXX shared module test_make_iterator_ext.cpython-311-x86_64-linux-gnu.so
[ 83%] Linking CXX shared module test_issue_ext.cpython-311-x86_64-linux-gnu.so
[ 83%] Built target test_chrono_ext
[ 83%] Built target test_make_iterator_ext
[ 83%] Built target test_issue_ext
[ 85%] Linking CXX shared module test_functions_ext.cpython-311-x86_64-linux-gnu.so
[ 85%] Built target test_functions_ext
[ 87%] Linking CXX shared module test_holders_ext.cpython-311-x86_64-linux-gnu.so
[ 89%] Linking CXX shared module test_ndarray_ext.cpython-311-x86_64-linux-gnu.so
[ 89%] Built target test_holders_ext
[ 89%] Built target test_ndarray_ext
[ 91%] Linking CXX shared module test_classes_ext.cpython-311-x86_64-linux-gnu.so
[ 91%] Built target test_classes_ext
[ 93%] Linking CXX shared module test_bind_vector_ext.cpython-311-x86_64-linux-gnu.so
[ 93%] Built target test_bind_vector_ext
[ 95%] Linking CXX shared module test_stl_ext.cpython-311-x86_64-linux-gnu.so
[ 95%] Built target test_stl_ext
[ 97%] Linking CXX shared module test_eigen_ext.cpython-311-x86_64-linux-gnu.so
[ 97%] Built target test_eigen_ext
[100%] Linking CXX shared module test_bind_map_ext.cpython-311-x86_64-linux-gnu.so
[100%] Built target test_bind_map_ext
============================= test session starts ==============================
platform linux -- Python 3.11.9, pytest-8.1.1, pluggy-1.4.0
rootdir: /build/source
configfile: pyproject.toml
testpaths: tests
collected 395 items

tests/test_chrono.py ..............................................      [ 11%]
tests/test_classes.py ........................................           [ 21%]
tests/test_eigen.py .................................................... [ 34%]
................................                                         [ 43%]
tests/test_enum.py ........                                              [ 45%]
tests/test_eval.py ....                                                  [ 46%]
tests/test_exception.py ....................                             [ 51%]
tests/test_functions.py .............................................    [ 62%]
tests/test_holders.py .................                                  [ 66%]
tests/test_inter_module.py .                                             [ 67%]
tests/test_intrusive.py ....                                             [ 68%]
tests/test_issue.py ...                                                  [ 68%]
tests/test_make_iterator.py ....                                         [ 69%]
tests/test_ndarray.py .................................s....             [ 79%]
tests/test_stl.py ...................................................... [ 93%]
..................                                                       [ 97%]
tests/test_stl_bind_map.py ....                                          [ 98%]
tests/test_stl_bind_vector.py .....                                      [100%]

=============================== warnings summary ===============================
../../nix/store/ysh9x36sy135cmf3ypiawjgavg6b7zfh-python3.11-tensorflow-2.13.0/lib/python3.11/site-packages/tensorflow/__init__.py:29
  /nix/store/ysh9x36sy135cmf3ypiawjgavg6b7zfh-python3.11-tensorflow-2.13.0/lib/python3.11/site-packages/tensorflow/__init__.py:29: DeprecationWarning: The distutils package is deprecated and slated for removal in Python 3.12. Use setuptools or check PEP 632 for potential alternatives
    import distutils as _distutils

../../nix/store/ysh9x36sy135cmf3ypiawjgavg6b7zfh-python3.11-tensorflow-2.13.0/lib/python3.11/site-packages/tensorflow/python/debug/cli/debugger_cli_common.py:19
  /nix/store/ysh9x36sy135cmf3ypiawjgavg6b7zfh-python3.11-tensorflow-2.13.0/lib/python3.11/site-packages/tensorflow/python/debug/cli/debugger_cli_common.py:19: DeprecationWarning: module 'sre_constants' is deprecated
    import sre_constants

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================= 394 passed, 1 skipped, 2 warnings in 29.63s ==================
/nix/store/f0v37ivq8lvim98w23x1fl31wslbryy1-pytest-check-hook/nix-support/setup-hook: line 53:   833 Aborted                 (core dumped) /nix/store/lpi16513bai8kg2bd841745vzk72475x-python3-3.11.9/bin/python3.11 -m pytest

GaetanLepage added a commit to GaetanLepage/nixpkgs that referenced this issue May 11, 2024
@GaetanLepage
Author

More precisely, the issue has been fixed since (or by) eed8201.

@wjakob
Owner

wjakob commented May 11, 2024

Would you be able to provide a backtrace of a crashing build in debug mode? It's really difficult to say what the problem might be based on this information.

GaetanLepage added a commit to GaetanLepage/nixpkgs that referenced this issue May 12, 2024
@GaetanLepage
Author

> Would you be able to provide a backtrace of a crashing build in debug mode? It's really difficult to say what the problem might be based on this information.

I am not sure how to extract such a backtrace, given that it is the python process itself that is crashing.
We will most likely wait for the next release of nanobind and disable the jax-related tests in the meantime.

@wjakob
Owner

wjakob commented May 13, 2024

I'm concerned that there may be another issue. The commit you listed doesn't really explain why one version crashes and the other one works. Could you run pytest with gdb --args python3 -m pytest, after having made a debug build? Then, when you encounter the failure, print "bt" to get a backtrace.

@wjakob
Owner

wjakob commented May 22, 2024

ping @GaetanLepage

@GaetanLepage
Author

> I'm concerned that there may be another issue. The commit you listed doesn't really explain why one version crashes and the other one works. Could you run pytest with gdb --args python3 -m pytest, after having made a debug build? Then, when you encounter the failure, print "bt" to get a backtrace.

Sorry for the delay :/
I have been trying to get this working, but it is not very easy within the nix sandbox.
I did compile the tests using make -d to get the debugging symbols in.
However, when I run the tests with gdb I don't get any interesting output.
It says, before running the tests:

Reading symbols from python...
(No debugging symbols found in python)

Am I doing something wrong?

@wjakob
Owner

wjakob commented May 27, 2024

Hi @GaetanLepage,

It's expected that Python itself would not have interesting debug symbols; it's the plugin that will provide them. To get CMake to build the nanobind test suite with debug symbols, I don't think that make -d is enough. You need to run the CMake process with -DCMAKE_BUILD_TYPE=Debug and then compile. After starting gdb with the arguments I specified earlier, you need to enter run so that it actually launches the process. This should then reproduce your crash. At that point, you can enter bt to get the backtrace.
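
The steps above can be sketched as a small Python helper that only assembles and prints the command lines. This is an illustration under assumptions rather than a tested recipe for this repository: the `debug_build_commands` name, the `build` directory, and the use of gdb's batch mode (`-batch` with `-ex`) instead of typing `run` and `bt` interactively are not from the thread, and the directory from which pytest has to be started depends on the local setup.

```python
import shlex


def debug_build_commands(source_dir=".", build_dir="build"):
    """Return command lines for a CMake debug build followed by a gdb run.

    The directory names are placeholders; adjust them to the local checkout.
    """
    return [
        # Configure with debug symbols, as suggested in the comment above.
        ["cmake", "-S", source_dir, "-B", build_dir, "-DCMAKE_BUILD_TYPE=Debug"],
        # Compile the extension modules used by the test suite.
        ["cmake", "--build", build_dir],
        # -batch makes gdb exit when done; each -ex runs one gdb command in
        # order, so this starts the process and prints a backtrace once it stops.
        ["gdb", "-batch", "-ex", "run", "-ex", "bt",
         "--args", "python3", "-m", "pytest"],
    ]


if __name__ == "__main__":
    for command in debug_build_commands():
        print(shlex.join(command))
```

Printing the commands instead of executing them keeps the helper free of side effects; they can be pasted into a shell one at a time.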

@GaetanLepage
Author

Thank you for those precise instructions.
I was able to perform those operations by exiting the sandbox.
Fortunately, I was able to replicate the crash.

Here is the backtrace:

========================================================================== 391 passed, 4 skipped in 14.49s ===========================================================================

Thread 1 "pt_main_thread" received signal SIGABRT, Aborted.
0x00007ffff76a2efc in __pthread_kill_implementation () from /nix/store/k7zgvzp2r31zkg9xqgjim7mbknryv6bs-glibc-2.39-52/lib/libc.so.6
(gdb) 
(gdb) bt
#0  0x00007ffff76a2efc in __pthread_kill_implementation () from /nix/store/k7zgvzp2r31zkg9xqgjim7mbknryv6bs-glibc-2.39-52/lib/libc.so.6
#1  0x00007ffff7652e86 in raise () from /nix/store/k7zgvzp2r31zkg9xqgjim7mbknryv6bs-glibc-2.39-52/lib/libc.so.6
#2  0x00007ffff763b935 in abort () from /nix/store/k7zgvzp2r31zkg9xqgjim7mbknryv6bs-glibc-2.39-52/lib/libc.so.6
#3  0x00007ffff63aa137 in nanobind::detail::internals_cleanup () at /home/gaetan/temp/nanobind/src/nb_internals.cpp:312
#4  0x00007ffff7b72b71 in Py_FinalizeEx.part.0 () from /nix/store/7hnr99nxrd2aw6lghybqdmkckq60j6l9-python3-3.11.9/lib/libpython3.11.so.1.0
#5  0x00007ffff7b79248 in Py_RunMain () from /nix/store/7hnr99nxrd2aw6lghybqdmkckq60j6l9-python3-3.11.9/lib/libpython3.11.so.1.0
#6  0x00007ffff763d10e in __libc_start_call_main () from /nix/store/k7zgvzp2r31zkg9xqgjim7mbknryv6bs-glibc-2.39-52/lib/libc.so.6
#7  0x00007ffff763d1c9 in __libc_start_main_impl () from /nix/store/k7zgvzp2r31zkg9xqgjim7mbknryv6bs-glibc-2.39-52/lib/libc.so.6
#8  0x0000000000401075 in _start ()

@GaetanLepage
Author

We are preparing the upgrade to nanobind 2.0 and there, this issue does not occur.
The tests work fine, and pytest exits properly, even though jax and jaxlib are present in the environment.

@wjakob
Owner

wjakob commented May 27, 2024

Awesome. One last question: which version of nanobind is this? Can you tell me what's on nb_internals.cpp line 312 in your version?

@GaetanLepage
Author

GaetanLepage commented May 27, 2024

> Awesome. One last question: which version of nanobind is this? Can you tell me what's on nb_internals.cpp line 312 in your version?

This is on the v1.9.2 tag of nanobind. The same process on tag v2.0.0 does not crash.

Here are lines 311 - 313 of nb_internals.cpp:

        #if defined(NB_ABORT_ON_LEAK)
            abort(); // Extra-strict behavior for the CI server
        #endif

@wjakob
Owner

wjakob commented May 27, 2024

OK, so this is intentional. There is a reference leak, and the test suite crashes at the end to draw everyone's attention to it. (Reference leaks are detected at the very end, when the interpreter shuts down, and at that point aborting is the only way to make sure the issue doesn't go unnoticed.) I will close this then.
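
The mechanism described here can be illustrated with a small Python analogue. It is not nanobind's code: `Tracked`, `live`, `cleanup`, and the `ABORT_ON_LEAK` environment variable are hypothetical stand-ins for the instance bookkeeping and the `NB_ABORT_ON_LEAK` compile-time switch quoted above. The sketch runs a child interpreter that checks for unreleased objects in an `atexit` hook and aborts when strict mode is on, which is why the parent sees a signal even though all the work before shutdown succeeded. It assumes a POSIX system, where `subprocess` reports death by signal as a negative return code.

```python
import os
import signal
import subprocess
import sys
import textwrap

# Program executed in a child interpreter. All names in it are illustrative.
CHILD = textwrap.dedent(
    """
    import atexit, os, sys

    live = set()  # hypothetical table of live wrapped instances

    class Tracked:
        def __init__(self, name):
            self.name = name
            live.add(name)

        def release(self):
            live.discard(self.name)

    def cleanup():
        # Runs at interpreter shutdown, when nothing else can report the leak.
        if live:
            print("leaked instances:", sorted(live), file=sys.stderr, flush=True)
            if os.environ.get("ABORT_ON_LEAK") == "1":
                os.abort()  # raises SIGABRT, like abort() in C

    atexit.register(cleanup)

    a, b = Tracked("a"), Tracked("b")
    a.release()  # "b" is never released, so it is reported at exit
    """
)


def run_child(abort_on_leak):
    """Run the child program with or without the strict abort-on-leak mode."""
    env = dict(os.environ, ABORT_ON_LEAK="1" if abort_on_leak else "0")
    return subprocess.run(
        [sys.executable, "-c", CHILD], env=env, capture_output=True, text=True
    )


if __name__ == "__main__":
    strict = run_child(abort_on_leak=True)
    lenient = run_child(abort_on_leak=False)
    print("strict run aborted:", strict.returncode == -signal.SIGABRT)
    print("lenient run exit code:", lenient.returncode)
```

With the strict mode off, the leak is still printed but the process exits normally, which mirrors a build without the abort switch.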

@wjakob wjakob closed this as completed May 27, 2024
@wjakob
Owner

wjakob commented May 27, 2024

Thanks for helping to localize the issue!

@wjakob
Owner

wjakob commented May 27, 2024

The issue here very likely lies with one of the other tensor frameworks. They sometimes hold on to the last ndarray converted and don't release a reference to a nanobind object by the time this shutdown routine is called. It's a benign issue.
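
A minimal Python sketch of that kind of retention, using only the standard library: `Array` and `Framework` are hypothetical stand-ins, not JAX or nanobind types, and the caching behavior is an assumption made for illustration. A weak reference is used to observe whether the object is still alive after the caller has dropped its own reference.

```python
import gc
import weakref


class Array:
    """Stand-in for an array object handed over by a binding layer."""


class Framework:
    """Hypothetical consumer that caches the last array it converted."""

    def __init__(self):
        self._last_converted = None

    def convert(self, array):
        self._last_converted = array  # extra reference kept by the cache

    def clear_cache(self):
        self._last_converted = None


def survives_conversion(clear_cache):
    """Return True if the array is still alive after the caller drops it."""
    framework = Framework()
    array = Array()
    probe = weakref.ref(array)  # observes the object without keeping it alive
    framework.convert(array)
    del array  # the caller's own reference is gone
    if clear_cache:
        framework.clear_cache()
    gc.collect()
    return probe() is not None


if __name__ == "__main__":
    print("alive with cache kept:", survives_conversion(clear_cache=False))
    print("alive with cache cleared:", survives_conversion(clear_cache=True))
```

As long as the cache holds its entry, the object outlives every reference the caller controls, so a shutdown-time leak check would still see it.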

@GaetanLepage
Author

OK, great! Thanks for your patience.
So this was caused by JAX somehow? What have you changed since then that makes this issue go away?
