Recurring MPI segmentation fault #2641

Open
Helveg opened this issue Dec 13, 2023 · 5 comments

@Helveg
Contributor

Helveg commented Dec 13, 2023

Context

In many of my packages NEURON and MPI have to interact, and unless I import mpi4py before NEURON tries any MPI init, I get stuck at the following point:

[fv-az740-439:03379] *** Process received signal ***
[fv-az740-439:03379] Signal: Segmentation fault (11)
[fv-az740-439:03379] Associated errno: Unknown error 32526 (32526)
[fv-az740-439:03379] Signal code:  (0)
[fv-az740-439:03379] Failing at address: 0x7f0e9083dd40
[fv-az740-439:03379] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f0e9f842520]
[fv-az740-439:03379] [ 1] /opt/hostedtoolcache/Python/3.10.13/x64/lib/python3.10/site-packages/neuron/.data/lib/libnrniv.so(+0x291c40)[0x7f0e90491c40]
[fv-az740-439:03379] [ 2] /opt/hostedtoolcache/Python/3.10.13/x64/lib/python3.10/site-packages/neuron/.data/lib/libnrniv.so(+0x293659)[0x7f0e90493659]
[fv-az740-439:03379] [ 3] /opt/hostedtoolcache/Python/3.10.13/x64/lib/python3.10/site-packages/neuron/.data/lib/libnrniv.so(_ZN3BBS15netpar_mindelayEd+0x21)[0x7f0e90495491]
[fv-az740-439:03379] [ 4] /opt/hostedtoolcache/Python/3.10.13/x64/lib/python3.10/site-packages/neuron/.data/lib/libnrniv.so(hoc_call_ob_proc+0x1b7)[0x7f0e90500757]
[fv-az740-439:03379] [ 5] /opt/hostedtoolcache/Python/3.10.13/x64/lib/python3.10/site-packages/neuron/.data/lib/libnrniv.so(_Z20hoc_object_componentv+0x383)[0x7f0e905013a3]
[fv-az740-439:03379] [ 6] /opt/hostedtoolcache/Python/3.10.13/x64/lib/python3.10/site-packages/neuron/.data/lib/libnrnpython3.so(+0x11db1)[0x7f0e901c1db1]
[fv-az740-439:03379] [ 7] /opt/hostedtoolcache/Python/3.10.13/x64/lib/python3.10/site-packages/neuron/.data/lib/libnrnpython3.so(+0x18a62)[0x7f0e901c8a62]
[fv-az740-439:03379] [ 8] /opt/hostedtoolcache/Python/3.10.13/x64/lib/python3.10/site-packages/neuron/.data/lib/libnrniv.so(_ZN10OcJumpImpl7fpycallEPFPvS0_S0_ES0_S0_+0x42)[0x7f0e904b4b72]
[fv-az740-439:03379] [ 9] /opt/hostedtoolcache/Python/3.10.13/x64/lib/python3.10/site-packages/neuron/.data/lib/libnrnpython3.so(+0x1305c)[0x7f0e901c305c]
[fv-az740-439:03379] [10] /opt/hostedtoolcache/Python/3.10.13/x64/lib/libpython3.10.so.1.0(_PyObject_MakeTpCall+0x7f)[0x7f0e9fd52c7f]
[fv-az740-439:03379] [11] /opt/hostedtoolcache/Python/3.10.13/x64/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x7491)[0x7f0e9fdba271]
[fv-az740-439:03379] [12] /opt/hostedtoolcache/Python/3.10.13/x64/lib/libpython3.10.so.1.0(+0x1b2006)[0x7f0e9fdb2006]
[fv-az740-439:03379] [13] /opt/hostedtoolcache/Python/3.10.13/x64/lib/libpython3.10.so.1.0(+0x154db1)[0x7f0e9fd54db1]
[fv-az740-439:03379] [14] /opt/hostedtoolcache/Python/3.10.13/x64/lib/libpython3.10.so.1.0(+0x22c144)[0x7f0e9fe2c144]
[fv-az740-439:03379] [15] /opt/hostedtoolcache/Python/3.10.13/x64/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x7491)[0x7f0e9fdba271]
[fv-az740-439:03379] [16] /opt/hostedtoolcache/Python/3.10.13/x64/lib/libpython3.10.so.1.0(+0x1b2006)[0x7f0e9fdb2006]
[fv-az740-439:03379] [17] /opt/hostedtoolcache/Python/3.10.13/x64/lib/libpython3.10.so.1.0(+0x154db1)[0x7f0e9fd54db1]
[fv-az740-439:03379] [18] /opt/hostedtoolcache/Python/3.10.13/x64/lib/libpython3.10.so.1.0(+0x22c144)[0x7f0e9fe2c144]
[fv-az740-439:03379] [19] /opt/hostedtoolcache/Python/3.10.13/x64/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x611d)[0x7f0e9fdb8efd]
[fv-az740-439:03379] [20] /opt/hostedtoolcache/Python/3.10.13/x64/lib/libpython3.10.so.1.0(+0x1b2006)[0x7f0e9fdb2006]
[fv-az740-439:03379] [21] /opt/hostedtoolcache/Python/3.10.13/x64/lib/libpython3.10.so.1.0(+0x154db1)[0x7f0e9fd54db1]
[fv-az740-439:03379] [22] /opt/hostedtoolcache/Python/3.10.13/x64/lib/libpython3.10.so.1.0(+0x22c144)[0x7f0e9fe2c144]
[fv-az740-439:03379] [23] /opt/hostedtoolcache/Python/3.10.13/x64/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x7491)[0x7f0e9fdba271]
[fv-az740-439:03379] [24] /opt/hostedtoolcache/Python/3.10.13/x64/lib/libpython3.10.so.1.0(+0x1b2006)[0x7f0e9fdb2006]
[fv-az740-439:03379] [25] /opt/hostedtoolcache/Python/3.10.13/x64/lib/libpython3.10.so.1.0(+0x154db1)[0x7f0e9fd54db1]
[fv-az740-439:03379] [26] /opt/hostedtoolcache/Python/3.10.13/x64/lib/libpython3.10.so.1.0(+0x22c144)[0x7f0e9fe2c144]
[fv-az740-439:03379] [27] /opt/hostedtoolcache/Python/3.10.13/x64/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x73d1)[0x7f0e9fdba1b1]
[fv-az740-439:03379] [28] /opt/hostedtoolcache/Python/3.10.13/x64/lib/libpython3.10.so.1.0(+0x1b2006)[0x7f0e9fdb2006]
[fv-az740-439:03379] [29] /opt/hostedtoolcache/Python/3.10.13/x64/lib/libpython3.10.so.1.0(+0x154db1)[0x7f0e9fd54db1]
[fv-az740-439:03379] *** End of error message ***
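
To make a trace like the one above easier to scan, the frames can be reduced to library and symbol names. This is only a minimal sketch and not part of the original report: the helper name `summarize_frames` and the regular expression are illustrative assumptions based on the line format shown above.

```python
import re

# Matches frame lines such as:
#   [host:pid] [ 3] /path/to/lib.so(symbol+0x21)[0x7f0e90495491]
# The symbol part is empty when the library has no exported name for the frame.
FRAME_RE = re.compile(
    r"\[\s*(?P<index>\d+)\]\s+(?P<path>[^\s(]+)\((?P<symbol>[^+)]*)\+0x[0-9a-f]+\)"
)


def summarize_frames(trace):
    """Return (frame index, library file name, symbol or None) for each frame line."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.search(line)
        if match is None:
            # Header and footer lines ("*** Process received signal ***") are skipped.
            continue
        library = match.group("path").rsplit("/", 1)[-1]
        symbol = match.group("symbol") or None
        frames.append((int(match.group("index")), library, symbol))
    return frames
```

Applied to the trace above, frame 3 reports the mangled name `_ZN3BBS15netpar_mindelayEd` in `libnrniv.so`, which demangles (for example with `c++filt`) to `BBS::netpar_mindelay(double)`.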

If needed, I can point you to some repos and commits to look at. It happens on Ubuntu with OpenMPI installed from apt and any pip-installable NEURON version (I've checked until 8.0.0).

The simplest test under which it occurred is in nrn-patch:

def test_smth(self):
  from patch import p

  p.finitialize(-70)

This goes through some internals that auto-detect whether we're running under MPI and make many of the odd ParallelContext calls you'd need to make in NEURON before anything parallel works. If you look at:

https://github.com/dbbs-lab/patch/actions/runs/7061832089/job/19224331875#step:7:76
https://github.com/dbbs-lab/patch/actions/runs/7061933373/job/19224647645#step:7:102

Then you have an exact reproducer of how importing mpi4py before NEURON resolves the issue.
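
The import-order workaround described here can be sketched as follows. This is an illustration under stated assumptions, not Patch's actual code: the function name `import_in_workaround_order` and the injectable `importer` parameter are hypothetical and exist only so the ordering can be checked without mpi4py or NEURON installed.

```python
import importlib


def import_in_workaround_order(importer=importlib.import_module):
    """Import mpi4py.MPI first, then neuron, and return both modules.

    Importing mpi4py.MPI initializes MPI by default, so MPI is already
    set up by the time NEURON is loaded.
    """
    mpi = importer("mpi4py.MPI")
    neuron = importer("neuron")
    return mpi, neuron
```

In a real script the same ordering is just `from mpi4py import MPI` placed above `from neuron import h`; it only applies when mpi4py is installed.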

@Helveg Helveg added the bug label Dec 13, 2023
@Helveg
Contributor Author

Helveg commented Dec 13, 2023

The exact test is:

    def test_stimulate(self):
        s = p.Section()
        pp = p.ExpSyn(s(0.5))
        stim = pp.stimulate(start=0, number=1)
        stim._connections[pp].weight[0] = 0.4
        r = s.record()
        p.finitialize(-70)
        p.continuerun(10)
        self.assertAlmostEqual(list(r)[-1], -68.0, delta=0.1)

but you can reduce it down to

    def test_stimulate(self):
        p.finitialize(-70)

and it will still segfault

@alkino
Member

alkino commented Dec 13, 2023

It would be helpful if you could provide a (full) file that reproduces the issue and the command you used.

@pramodk
Member

pramodk commented Dec 13, 2023

@Helveg : As Nico mentioned, we need help to reproduce the issue.

In pramodk/patch/pull/1, you can see my failed attempt to reproduce the segfault. I thought that importing neuron at the very top would be enough to reproduce the issue?

By the way, the only additional thing I added is either NEURON_INIT_MPI=1 or explicit MPI initialisation via h.nrnmpi_init().

I am running nrn/src/parallel/test0.py and nrn/src/parallel/test0.hoc to confirm that NEURON's MPI initialisation works in a standalone test.
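
The two initialisation routes mentioned above (the NEURON_INIT_MPI environment variable and h.nrnmpi_init()) can be sketched like this. The helper names `request_neuron_mpi_init` and `init_mpi_explicitly` are hypothetical and only for illustration; the sketch assumes the environment variable is set before `neuron` is imported and that `h` is the object obtained from `from neuron import h`.

```python
import os


def request_neuron_mpi_init(env=None):
    """Route 1: ask NEURON to initialise MPI when it is imported.

    Assumption: this must run before `import neuron` to have any effect.
    """
    env = os.environ if env is None else env
    env["NEURON_INIT_MPI"] = "1"
    return env


def init_mpi_explicitly(h):
    """Route 2: initialise MPI explicitly on an already imported `h`.

    Returns a ParallelContext created after the explicit init call.
    """
    h.nrnmpi_init()
    return h.ParallelContext()
```

Only one of the two routes should be needed in a given run; which one matches the failing setup is exactly what the reproducer has to establish.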

@Helveg
Contributor Author

Helveg commented Dec 14, 2023

I think it's the exact sequence in which Patch sets things up with or without mpi4py present. Perhaps it doesn't happen when the very first thing NEURON does is MPI init. I think you'll have to leave the test setup the way it is and run the one test, then uninstall mpi4py and rerun the test, perhaps like this:

      run: |
        cd tests
        mpiexec -n 2 coverage run --parallel-mode -m unittest discover -v -s . -p test_stimulate
        pip uninstall -y mpi4py
        mpiexec -n 2 coverage run --parallel-mode -m unittest discover -v -s . -p test_stimulate

I'll be working on this again in a few days; I can get you a better reproducer if this doesn't do the trick.

@Helveg
Contributor Author

Helveg commented Dec 14, 2023

I went ahead and created a reproducer:

https://github.com/Helveg/nrn-segfault/tree/main

With mpi4py installed, Patch's setup sequence works; without it, the same sequence segfaults. So essentially this difference causes the segfault:

https://github.com/dbbs-lab/patch/blob/c9e0633cd7a16979f633ec431f9304fa08a9bfaa/patch/interpreter.py#L310-L314

And using a debugger you should be able to follow the steps taken by the run function:

https://github.com/dbbs-lab/patch/blob/c9e0633cd7a16979f633ec431f9304fa08a9bfaa/patch/interpreter.py#L290-L292

Helveg added a commit to dbbs-lab/arborize that referenced this issue Dec 15, 2023
* switched to pyproject.toml

* define public api

* black

* bump glia

* fix workflow file

* fix workflow file pt 2

* fix main workflow deps

* drop 3.8

* run some tests separately to avoid segfault

see neuronsimulator/nrn#2641

* fix parallel spike assertions

* bump numpy dep for faster 3.11 tests
Helveg added a commit to dbbs-lab/bsb-neuron that referenced this issue Feb 14, 2024
* dump

* add multi sim and test

* fix merged pyproject

* fix numpy ints messing with NEURON

* fix NeuronPopulation typo and numpy int lookup

* improve arborized model type handling (still open issue)

* add chunked test

* fix multi CM/chunk transmapping + gid on instances

* validate multichunk test

* validate multi CM

* add ci

* fix ci

* use fixed arborize version that works without installing arbor

* bump deps

* avoid neuronsimulator/nrn#2641