Fix multiprocessing in face detection gallery example #7140

lagru · 2023-09-18T12:53:47Z

Description

This change is informed by several issues:

The previous version stacked delayed put a delayed call around a generator. Starting with dask 2023.9.2 this lead the traceback attached at the end of this message.

I opted to remove dask completely because even though I managed to get it working with the newer dask version, the task apparently isn't gaining much from multi-threading. Even with only two workers, CPU cores aren't at 100% on my machine. Using a multi-processing scheduler resulted in errors ("worker unexpectetly finished" or something like that). So instead I opted to use Python's default multiprocessing pool which speeds up the example nicely (from 33s to 9s on my machine).

Traceback (most recent call last):
  File "/home/lg/Res/scikit-image/doc/examples/applications/plot_haar_extraction_selection_classification.py", line 81, in <module>
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=150,
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lg/.local/lib/venv/skimagedev/lib/python3.11/site-packages/sklearn/utils/_param_validation.py", line 211, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lg/.local/lib/venv/skimagedev/lib/python3.11/site-packages/sklearn/model_selection/_split.py", line 2614, in train_test_split
    arrays = indexable(*arrays)
             ^^^^^^^^^^^^^^^^^^
  File "/home/lg/.local/lib/venv/skimagedev/lib/python3.11/site-packages/sklearn/utils/validation.py", line 455, in indexable
    check_consistent_length(*result)
  File "/home/lg/.local/lib/venv/skimagedev/lib/python3.11/site-packages/sklearn/utils/validation.py", line 406, in check_consistent_length
    lengths = [_num_samples(X) for X in arrays if X is not None]
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lg/.local/lib/venv/skimagedev/lib/python3.11/site-packages/sklearn/utils/validation.py", line 406, in <listcomp>
    lengths = [_num_samples(X) for X in arrays if X is not None]
               ^^^^^^^^^^^^^^^
  File "/home/lg/.local/lib/venv/skimagedev/lib/python3.11/site-packages/sklearn/utils/validation.py", line 347, in _num_samples
    raise TypeError(
TypeError: Singleton array array(<generator object <genexpr> at 0x7f5cee64d2a0>, dtype=object) cannot be considered a valid collection

See also our currently failing CI, e.g. https://github.com/scikit-image/scikit-image/actions/runs/6171455485/job/16883631460#step:5:2520.

Checklist

A descriptive but concise pull request title
Docstrings for all functions
Unit tests
A gallery example in ./doc/examples for new features
Contribution guide is followed

Release note

Summarize the introduced changes in the code block below in one or a few sentences. The
summary will be included in the next release notes automatically:

...

This change is informed by several issues: The previous version stacked `delayed` put a delayed call around a generator. Starting with dask 2023.9.2 this lead the traceback attached at the end of this message. I opted to remove dask completely because even though I managed to get it working with the newer dask version, the task apparently isn't gaining much from multi-threading. Even with only two workers, CPU cores aren't at 100% on my machine. Using a multi-processing scheduler resulted in errors ("worker unexpectetly finished" or something like that). So instead I opted to use Python's default multiprocessing pool which speeds up the example nicely (from 33s to 9s on my machine). Traceback (most recent call last): File "/home/lg/Res/scikit-image/doc/examples/applications/plot_haar_extraction_selection_classification.py", line 81, in <module> X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=150, [... shortened ...] File "/home/lg/.local/lib/venv/skimagedev/lib/python3.11/site-packages/sklearn/utils/validation.py", line 347, in _num_samples raise TypeError( TypeError: Singleton array array(<generator object <genexpr> at 0x7f5cee64d2a0>, dtype=object) cannot be considered a valid collection.

lagru · 2023-09-18T16:25:43Z

When trying to use dask.compute(*X, scheduler="processes") I get the following error:

RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

        To fix this issue, refer to the "Safe importing of main module"
        section in https://docs.python.org/3/library/multiprocessing.html

I am not sure how to add a main guard in context of the sphinx gallery example. It runs fine when I copy and run the content of plot_haar_extraction_selection_classification.py into an IPython console. 🤔

When running the plot_haar_extraction_selection_classification.py directly it fails with the error attached at the end. However, executing the file via sphinx gallery or its content in an IPython console seems to work. So test it in the CI. RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase. This probably means that you are not using fork to start your child processes and you have forgotten to use the proper idiom in the main module: if __name__ == '__main__': freeze_support() ... The "freeze_support()" line can be omitted if the program is not going to be frozen to produce an executable. To fix this issue, refer to the "Safe importing of main module" section in https://docs.python.org/3/library/multiprocessing.html

mkcor · 2023-09-19T12:48:42Z

Thank you so much for tackling this, @lagru! What made you go back from multiprocessing.pool to dask (between d774eeb and 0079f69)?

lagru · 2023-09-20T09:48:07Z

I am hoping to find a solution that allows for parallelism in this example because the long build time without even using all resources offends me personally. 😅 So I tried multiple approaches (therefore the going back and forth) locally and here:

Use multi-threading either with dask or ThreadPool. While this seems to work fine in all scenarios, the parallelized code doesn't seem to gain much from multi-threading. I played around with the number of workers and even with 2 it's slowed down compared to using a single worker.
So I tried the multi-processing route instead. But both lead to weird errors for which I don't fully grasp the reasons. Pasting the code into the console runs fine in both cases.

But with dask-based multi-processing I can't run the example directly without the RuntimeError described in 0079f69. It runs fine when executed indirectly through sphinx, however, this approach leads to the CI hanging forever depending on Python version and architecture.

With the Pool-based multi-processing (no dask) it works in all local cases but with the same behavior on the CI.

Not sure how to proceed. Right now the only approach that works everywhere is the single-threaded one. But then the example is better off just ripping out the references to dask and parallelism... For now I'll settle on making a hopefully temporary hotfix in another PR and keep this one open for investigation... (see #7141)

This is meant to be a temporary hotfix while [1] is being investigated. It naturally has a similar performance compared to the single-threaded dask solution. [1] scikit-image#7140

lagru · 2023-09-20T10:29:32Z

Aha, I just noticed that 0079f69 is actually passing fine everywhere except for

Some digging with git blame shows that this has been an issue before:

Build and test the docs in Azure #3873
Build and test the documentation in Azure Pipelines #3854 (comment)
Build and test the documentation in Azure Pipelines #3854 (comment)

In short, on Windows multiprocessing imports the full contents of the file when spawning extra processes. This leads to starting an infinite number of processes.
So, basically, the multiprocessing code has to be wrapped somehow.

I am also seeing this or similar looking behavior on Linux actually. What I don't understand about this explanation is, how this aspect is managed when running the example through sphinx or through IPython.

@soupault, do you remember the context for testing the gallery example in two different ways on Windows? Examples are run indirectly through sphinx, and directly by passing the files to Python? I guess we want to keep it like that... Though I am noticing that we only do so on Windows. TEST_EXAMPLES is disabled on Linux and MacOS.

This is a temporary hotfix while [1] is being investigated. Naturally, it has a similar performance compared to the single-threaded Dask solution. [1] #7140

soupault · 2023-09-22T16:33:35Z

how this aspect is managed when running the example through sphinx or through IPython.

Sounds like a good topic to investigate further :).

do you remember the context for testing the gallery example in two different ways on Windows? Examples are run indirectly through sphinx, and directly by passing the files to Python?

Oh, I am not really sure. It might have been connected to the way how documentation errors were handled by AppVeyor. Most likely, it was just me making an exact replica of the setup on Travis (Linux/MacOS) once we moved to the fast and unlimited Azure. Building documentation on Windows, at this point, sounds rather unnecessary.

github-actions · 2024-03-21T02:24:31Z

Hello scikit-image core devs! There hasn't been any activity on this PR for more than 180 days. I have marked it as "dormant" to make it easy to find.
To our contributors, thank you for your contribution and apologies if this contribution fell through the cracks! Hopefully this ping will help bring some fresh attention to the review. If you need help, you can always reach out on our forum
If you think that this PR is no longer relevant, you may close it, or we may do it at some point (either way, it will be done manually). If you think the PR is valuable but you no longer have the bandwidth to update it, please let us know, so that someone can take it over. 🙏

lagru added the 📄 type: Documentation Updates, fixes and additions to documentation label Sep 18, 2023

lagru changed the title ~~Use multiprocessing pool instead of dask in plot_haar_extraction_selection_classification.py~~ Use multiprocessing pool instead of dask in face detection gallery example Sep 18, 2023

lagru mentioned this pull request Sep 18, 2023

Infer object-dtype for if extra_property depends on mask size in regionprops_table #7136

Merged

3 tasks

lagru marked this pull request as draft September 18, 2023 16:44

lagru changed the title ~~Use multiprocessing pool instead of dask in face detection gallery example~~ Fix multiprocessing in face detection gallery example Sep 18, 2023

lagru mentioned this pull request Sep 20, 2023

Remove parallelism in face detection gallery example #7141

Merged

lagru mentioned this pull request Sep 20, 2023

Make phase_cross_correlation return ndarray when disambiguate=True #7112

Merged

mkcor pushed a commit that referenced this pull request Sep 21, 2023

Remove parallelism in face detection example (#7141)

b8c7f85

This is a temporary hotfix while [1] is being investigated. Naturally, it has a similar performance compared to the single-threaded Dask solution. [1] #7140

github-actions bot added the 😴 Dormant no recent activity label Mar 21, 2024

github-actions bot removed the 😴 Dormant no recent activity label Apr 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix multiprocessing in face detection gallery example #7140

Fix multiprocessing in face detection gallery example #7140

lagru commented Sep 18, 2023

lagru commented Sep 18, 2023 •

edited

mkcor commented Sep 19, 2023

lagru commented Sep 20, 2023 •

edited

lagru commented Sep 20, 2023

soupault commented Sep 22, 2023

github-actions bot commented Mar 21, 2024

Fix multiprocessing in face detection gallery example #7140

Are you sure you want to change the base?

Fix multiprocessing in face detection gallery example #7140

Conversation

lagru commented Sep 18, 2023

Description

Checklist

Release note

lagru commented Sep 18, 2023 • edited

mkcor commented Sep 19, 2023

lagru commented Sep 20, 2023 • edited

lagru commented Sep 20, 2023

soupault commented Sep 22, 2023

github-actions bot commented Mar 21, 2024

lagru commented Sep 18, 2023 •

edited

lagru commented Sep 20, 2023 •

edited