openmpi fork() issue with python datareader #2388

Open
jvwilliams23 opened this issue Nov 17, 2023 · 0 comments

Hi,

I am trying to run the GAN tutorial on MNIST (I made some minor modifications for my system):

import argparse
import lbann
import lbann.launcher
from gan_model import build_model
from mnist_dataset import make_data_reader

mini_batch_size = 128
num_epochs = 100
job_name = "gan"

trainer = lbann.Trainer(mini_batch_size)
model = build_model(num_epochs)
data_reader = make_data_reader()
opt = lbann.Adam(learn_rate=1e-4, beta1=0., beta2=0.99, eps=1e-8)

kwargs = {
    "nodes": 1,
    "scheduler": "openmpi",
    "setup_only": True,
    "time_limit": 30,
}

lbann.run(trainer, model, data_reader, opt,
          job_name=job_name,
          **kwargs)

This generates the following batch script:

export IBV_FORK_SAFE=1
echo "Started at $(date)"
mpiexec -n 1 --map-by ppr:1:node -wdir /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/project/tutorials_lbann/gan/mnist/20231117_145903_gan_n1_ppn1 /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/lbann-latest/build_newompi3/install/bin/lbann --prototext=/lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/project/tutorials_lbann/gan/mnist/20231117_145903_gan_n1_ppn1/experiment.prototext
status=$?
echo "Finished at $(date)"
exit ${status}

Running it gives the error below, even though I had already added export IBV_FORK_SAFE=1 to the generated batch.sh script:

--------------------------------------------------------------------------
WARNING: There are more than one active ports on host 'sqg2b16', but the
default subnet GID prefix was detected on more than one of these
ports.  If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI.  This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.

Please see this FAQ entry for more details:

  http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_default_gid_prefix to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:          [[6305,1],0] (PID 56764)

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
****************************************************************
Caught signal 11 (SIGSEGV - invalid memory reference) on rank 0
Stack trace:
   0: lbann::stack_trace::get[abi:cxx11]()
   1: lbann::exception::exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
   2: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/lbann-latest/build_newompi3/install/lib64/liblbann.so.0.104.0(+0xc470a71) [0x2ad53e4f4a71] (could not find stack frame symbol)
   3: /usr/lib64/libpthread.so.0(+0xf5d0) [0x2ad58bdc35d0] (could not find stack frame symbol)
   4: std::_Hashtable<std::string, std::string, std::allocator<std::string>, std::__detail::_Identity, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, true, true> >::clear()
   5: google::protobuf::DescriptorPool::FindFileByName(std::string const&) const
   6: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/python3.7/site-packages/google/protobuf/pyext/_message.cpython-37m-x86_64-linux-gnu.so(+0xb8e7a) [0x2ad6193a9e7a] (could not find stack frame symbol)
   7: _PyMethodDef_RawFastCallKeywords (demangling failed)
   8: _PyMethodDescr_FastCallKeywords (demangling failed)
   9: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6dbb5) [0x2ad587abcbb5] (could not find stack frame symbol)
  10: _PyEval_EvalFrameDefault (demangling failed)
  11: _PyEval_EvalCodeWithName (demangling failed)
  12: PyEval_EvalCodeEx (demangling failed)
  13: PyEval_EvalCode (demangling failed)
  14: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x155c7e) [0x2ad587ba4c7e] (could not find stack frame symbol)
  15: _PyMethodDef_RawFastCallDict (demangling failed)
  16: _PyCFunction_FastCallDict (demangling failed)
  17: _PyEval_EvalFrameDefault (demangling failed)
  18: _PyEval_EvalCodeWithName (demangling failed)
  19: _PyFunction_FastCallKeywords (demangling failed)
  20: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
  21: _PyEval_EvalFrameDefault (demangling failed)
  22: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
  23: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
  24: _PyEval_EvalFrameDefault (demangling failed)
  25: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
  26: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
  27: _PyEval_EvalFrameDefault (demangling failed)
  28: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
  29: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
  30: _PyEval_EvalFrameDefault (demangling failed)
  31: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
  32: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x91fc9) [0x2ad587ae0fc9] (could not find stack frame symbol)
  33: _PyObject_CallMethodIdObjArgs (demangling failed)
  34: PyImport_ImportModuleLevelObject (demangling failed)
  35: _PyEval_EvalFrameDefault (demangling failed)
  36: _PyEval_EvalCodeWithName (demangling failed)
  37: PyEval_EvalCodeEx (demangling failed)
  38: PyEval_EvalCode (demangling failed)
  39: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x155c7e) [0x2ad587ba4c7e] (could not find stack frame symbol)
  40: _PyMethodDef_RawFastCallDict (demangling failed)
  41: _PyCFunction_FastCallDict (demangling failed)
  42: _PyEval_EvalFrameDefault (demangling failed)
  43: _PyEval_EvalCodeWithName (demangling failed)
  44: _PyFunction_FastCallKeywords (demangling failed)
  45: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
  46: _PyEval_EvalFrameDefault (demangling failed)
  47: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
  48: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
  49: _PyEval_EvalFrameDefault (demangling failed)
  50: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
  51: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
  52: _PyEval_EvalFrameDefault (demangling failed)
  53: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
  54: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
  55: _PyEval_EvalFrameDefault (demangling failed)
  56: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
  57: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x91fc9) [0x2ad587ae0fc9] (could not find stack frame symbol)
  58: _PyObject_CallMethodIdObjArgs (demangling failed)
  59: PyImport_ImportModuleLevelObject (demangling failed)
  60: _PyEval_EvalFrameDefault (demangling failed)
  61: _PyEval_EvalCodeWithName (demangling failed)
  62: PyEval_EvalCodeEx (demangling failed)
  63: PyEval_EvalCode (demangling failed)
  64: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x155c7e) [0x2ad587ba4c7e] (could not find stack frame symbol)
  65: _PyMethodDef_RawFastCallDict (demangling failed)
  66: _PyCFunction_FastCallDict (demangling failed)
  67: _PyEval_EvalFrameDefault (demangling failed)
  68: _PyEval_EvalCodeWithName (demangling failed)
  69: _PyFunction_FastCallKeywords (demangling failed)
  70: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
  71: _PyEval_EvalFrameDefault (demangling failed)
  72: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
  73: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
  74: _PyEval_EvalFrameDefault (demangling failed)
  75: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
  76: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
  77: _PyEval_EvalFrameDefault (demangling failed)
  78: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
  79: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
  80: _PyEval_EvalFrameDefault (demangling failed)
  81: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
  82: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x91fc9) [0x2ad587ae0fc9] (could not find stack frame symbol)
  83: _PyObject_CallMethodIdObjArgs (demangling failed)
  84: PyImport_ImportModuleLevelObject (demangling failed)
  85: _PyEval_EvalFrameDefault (demangling failed)
  86: _PyEval_EvalCodeWithName (demangling failed)
  87: PyEval_EvalCodeEx (demangling failed)
  88: PyEval_EvalCode (demangling failed)
  89: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x155c7e) [0x2ad587ba4c7e] (could not find stack frame symbol)
  90: _PyMethodDef_RawFastCallDict (demangling failed)
  91: _PyCFunction_FastCallDict (demangling failed)
  92: _PyEval_EvalFrameDefault (demangling failed)
  93: _PyEval_EvalCodeWithName (demangling failed)
  94: _PyFunction_FastCallKeywords (demangling failed)
  95: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
  96: _PyEval_EvalFrameDefault (demangling failed)
  97: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
  98: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
  99: _PyEval_EvalFrameDefault (demangling failed)
 100: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
 101: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
 102: _PyEval_EvalFrameDefault (demangling failed)
 103: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
 104: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
 105: _PyEval_EvalFrameDefault (demangling failed)
 106: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
 107: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x91fc9) [0x2ad587ae0fc9] (could not find stack frame symbol)
 108: _PyObject_CallMethodIdObjArgs (demangling failed)
 109: PyImport_ImportModuleLevelObject (demangling failed)
 110: _PyEval_EvalFrameDefault (demangling failed)
 111: _PyEval_EvalCodeWithName (demangling failed)
 112: PyEval_EvalCodeEx (demangling failed)
 113: PyEval_EvalCode (demangling failed)
 114: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x155c7e) [0x2ad587ba4c7e] (could not find stack frame symbol)
 115: _PyMethodDef_RawFastCallDict (demangling failed)
 116: _PyCFunction_FastCallDict (demangling failed)
 117: _PyEval_EvalFrameDefault (demangling failed)
 118: _PyEval_EvalCodeWithName (demangling failed)
 119: _PyFunction_FastCallKeywords (demangling failed)
 120: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
 121: _PyEval_EvalFrameDefault (demangling failed)
 122: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
 123: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
 124: _PyEval_EvalFrameDefault (demangling failed)
 125: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
 126: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
 127: _PyEval_EvalFrameDefault (demangling failed)
****************************************************************
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
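
For reference, the two warnings in the log above each name an MCA parameter that turns them off. A minimal sketch of setting them in batch.sh follows, assuming Open MPI's usual convention of reading MCA parameters from environment variables prefixed with OMPI_MCA_. This would only silence the warnings; it is not expected to change the SIGSEGV itself.

```shell
# Sketch only: silences the two Open MPI warnings shown in the log above.
# It does not address the segmentation fault.
export IBV_FORK_SAFE=1                                # already present in batch.sh
export OMPI_MCA_mpi_warn_on_fork=0                    # fork() warning
export OMPI_MCA_btl_openib_warn_default_gid_prefix=0  # default subnet GID prefix warning

# Print the settings so they are recorded in the job output.
env | grep -E '^(IBV_FORK_SAFE|OMPI_MCA_)' | sort
```

The same parameters can also be passed on the command line, for example mpiexec --mca mpi_warn_on_fork 0, instead of being exported.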

FYI, I built LBANN with CMake, using Open MPI 3.1.6, and I am running Python 3.7.
Any help resolving this error would be greatly appreciated.
