Get rid of the shared memory arrays : add a colfile mmap option #265

Open
jonwright opened this issue Mar 19, 2024 · 2 comments

In sinograms/properties.py and sinograms/point_by_point.py the code works via shared memory arrays.
This does not scale beyond one node, and it fails on Python 2.7.

For global read-only memory we could use mmap with numpy on an uncompressed HDF5 file (https://gist.github.com/maartenbreddels/09e1da79577151e5f7fec660c209f06e):

import mmap
import numpy as np

# dset is an open h5py.Dataset; this only works if the data is contiguous,
# uncompressed and stored inside this file
assert dset.chunks is None and dset.compression is None
assert not dset.is_virtual and dset.external is None
file = open(path, "rb")
mapping = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ)
data = np.frombuffer(mapping, dtype=dset.dtype, count=dset.size, offset=dset.id.get_offset()).reshape(dset.shape)

This may be useful for reducing some out-of-memory problems.

Another upgrade path could be looking into dask.dataframe for distributed processing (rough sketch below).
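
A rough sketch of how that could look, assuming the colfile columns are stored as 1-D datasets in an HDF5 group (the file name "peaks.h5", the group name "peaks" and the column names here are made up):

import h5py
import dask.array as da
import dask.dataframe as dd

hf = h5py.File("peaks.h5", "r")
names = ["sc", "fc", "omega"]
# wrap each hdf5 dataset as a lazy dask array, then stack into a dataframe
cols = [da.from_array(hf["peaks"][n], chunks=2**20) for n in names]
ddf = dd.from_dask_array(da.stack(cols, axis=1), columns=names)
print(ddf.omega.mean().compute())  # computed chunk-by-chunk, out of core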

jonwright commented May 21, 2024

To make some progress, try breaking this up into smaller tasks:

  • Point by point code: write an HDF5 colfile. Each worker process reads it during pool init (see the sketch below).
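
A minimal sketch of that pattern, assuming the colfile is written to HDF5 and re-read in each worker via the Pool initializer (the file name, the ImageD11.columnfile.colfile_from_hdf reader and the per-peak task are stand-ins):

import multiprocessing

_colfile = None  # one read-only copy per worker process

def init_worker(colfile_path):
    # runs once in each worker when the pool starts up
    global _colfile
    from ImageD11 import columnfile
    _colfile = columnfile.colfile_from_hdf(colfile_path)

def work(i):
    return _colfile.omega[i]  # placeholder per-peak computation

if __name__ == "__main__":
    with multiprocessing.Pool(4, initializer=init_worker,
                              initargs=("peaks.h5",)) as pool:
        print(pool.map(work, range(10)))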

Properties.py is more challenging:

  • Make lima_segmenter record the pixel -> peak labeling for both the cp and lm schemes.
  • Refactor lima_segmenter to write fewer files (e.g. one per process?).
  • Labels will be saved with the pixels.
  • The peaks2d properties arrays (s1, sI, scI, srI, id) are available during segmenting, to be saved with the sparse pixels.
  • Check the I/O speed and file size with and without compression for saving the pixel peaks. Pick something.
  • Find and save the overlaps. This is one 'peaksearch' per overlap dimension, with output of the form (peak_i, peak_j, score).
  • Determine the peaks3d labels across omega or dty.
  • Determine the peaks4d labels across the sinogram (see the sketch after this list).
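
For the last two items, a minimal sketch of one way to turn the saved (peak_i, peak_j, score) pairs into labels, using connected components on the overlap graph (the function name and the toy data are illustrative):

import numpy as np
from scipy import sparse
from scipy.sparse.csgraph import connected_components

def labels_from_overlaps(npeaks, peak_i, peak_j):
    # sparse adjacency matrix of the overlap graph
    adj = sparse.coo_matrix((np.ones(len(peak_i)), (peak_i, peak_j)),
                            shape=(npeaks, npeaks))
    ncomponents, labels = connected_components(adj, directed=False)
    return labels

# peaks 0-1-2 overlap across omega and peaks 3-4 overlap: two 3d peaks
print(labels_from_overlaps(5, np.array([0, 1, 3]), np.array([1, 2, 4])))
# -> [0 0 0 1 1]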

jonwright commented:

Note: multiprocessing + shared memory seems to be buggy. The remove-from-tracker monkeypatch does not work. Abandon it.

Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
  File "/cvmfs/hpc.esrf.fr/software/packages/linux/x86_64/jupyter-slurm/2023.10.7/envs/jupyter-slurm/lib/python3.11/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cvmfs/hpc.esrf.fr/software/packages/linux/x86_64/jupyter-slurm/2023.10.7/envs/jupyter-slurm/lib/python3.11/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
