Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix multiprocessing in face detection gallery example #7140

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

lagru
Copy link
Member

@lagru lagru commented Sep 18, 2023

Description

This change is informed by several issues:

The previous version stacked delayed put a delayed call around a generator. Starting with dask 2023.9.2 this lead the traceback attached at the end of this message.

I opted to remove dask completely because even though I managed to get it working with the newer dask version, the task apparently isn't gaining much from multi-threading. Even with only two workers, CPU cores aren't at 100% on my machine. Using a multi-processing scheduler resulted in errors ("worker unexpectetly finished" or something like that). So instead I opted to use Python's default multiprocessing pool which speeds up the example nicely (from 33s to 9s on my machine).

Traceback (most recent call last):
  File "/home/lg/Res/scikit-image/doc/examples/applications/plot_haar_extraction_selection_classification.py", line 81, in <module>
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=150,
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lg/.local/lib/venv/skimagedev/lib/python3.11/site-packages/sklearn/utils/_param_validation.py", line 211, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lg/.local/lib/venv/skimagedev/lib/python3.11/site-packages/sklearn/model_selection/_split.py", line 2614, in train_test_split
    arrays = indexable(*arrays)
             ^^^^^^^^^^^^^^^^^^
  File "/home/lg/.local/lib/venv/skimagedev/lib/python3.11/site-packages/sklearn/utils/validation.py", line 455, in indexable
    check_consistent_length(*result)
  File "/home/lg/.local/lib/venv/skimagedev/lib/python3.11/site-packages/sklearn/utils/validation.py", line 406, in check_consistent_length
    lengths = [_num_samples(X) for X in arrays if X is not None]
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lg/.local/lib/venv/skimagedev/lib/python3.11/site-packages/sklearn/utils/validation.py", line 406, in <listcomp>
    lengths = [_num_samples(X) for X in arrays if X is not None]
               ^^^^^^^^^^^^^^^
  File "/home/lg/.local/lib/venv/skimagedev/lib/python3.11/site-packages/sklearn/utils/validation.py", line 347, in _num_samples
    raise TypeError(
TypeError: Singleton array array(<generator object <genexpr> at 0x7f5cee64d2a0>, dtype=object) cannot be considered a valid collection

See also our currently failing CI, e.g. https://github.com/scikit-image/scikit-image/actions/runs/6171455485/job/16883631460#step:5:2520.

Checklist

Release note

Summarize the introduced changes in the code block below in one or a few sentences. The
summary will be included in the next release notes automatically:

...

This change is informed by several issues:

The previous version stacked `delayed` put a delayed call around a
generator. Starting with dask 2023.9.2 this lead the traceback attached
at the end of this message.

I opted to remove dask completely because even though I managed to get
it working with the newer dask version, the task apparently isn't
gaining much from multi-threading. Even with only two workers, CPU cores
aren't at 100% on my machine. Using a multi-processing scheduler
resulted in errors ("worker unexpectetly finished" or something like
that). So instead I opted to use Python's default multiprocessing pool
which speeds up the example nicely (from 33s to 9s on my machine).

Traceback (most recent call last):
  File "/home/lg/Res/scikit-image/doc/examples/applications/plot_haar_extraction_selection_classification.py", line 81, in <module>
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=150,
  [... shortened ...]
  File "/home/lg/.local/lib/venv/skimagedev/lib/python3.11/site-packages/sklearn/utils/validation.py", line 347, in _num_samples
    raise TypeError(
TypeError: Singleton array array(<generator object <genexpr> at 0x7f5cee64d2a0>, dtype=object) cannot be considered a valid collection.
@lagru lagru added the 📄 type: Documentation Updates, fixes and additions to documentation label Sep 18, 2023
@lagru lagru changed the title Use multiprocessing pool instead of dask in plot_haar_extraction_selection_classification.py Use multiprocessing pool instead of dask in face detection gallery example Sep 18, 2023
@lagru
Copy link
Member Author

lagru commented Sep 18, 2023

When trying to use dask.compute(*X, scheduler="processes") I get the following error:

RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

        To fix this issue, refer to the "Safe importing of main module"
        section in https://docs.python.org/3/library/multiprocessing.html

I am not sure how to add a main guard in context of the sphinx gallery example. It runs fine when I copy and run the content of plot_haar_extraction_selection_classification.py into an IPython console. 🤔

@lagru lagru marked this pull request as draft September 18, 2023 16:44
When running the plot_haar_extraction_selection_classification.py
directly it fails with the error attached at the end. However, executing
the file via sphinx gallery or its content in an IPython console seems
to work. So test it in the CI.

RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

        To fix this issue, refer to the "Safe importing of main module"
        section in https://docs.python.org/3/library/multiprocessing.html
@lagru lagru changed the title Use multiprocessing pool instead of dask in face detection gallery example Fix multiprocessing in face detection gallery example Sep 18, 2023
@mkcor
Copy link
Member

mkcor commented Sep 19, 2023

Thank you so much for tackling this, @lagru! What made you go back from multiprocessing.pool to dask (between d774eeb and 0079f69)?

@lagru
Copy link
Member Author

lagru commented Sep 20, 2023

I am hoping to find a solution that allows for parallelism in this example because the long build time without even using all resources offends me personally. 😅 So I tried multiple approaches (therefore the going back and forth) locally and here:

  • Use multi-threading either with dask or ThreadPool. While this seems to work fine in all scenarios, the parallelized code doesn't seem to gain much from multi-threading. I played around with the number of workers and even with 2 it's slowed down compared to using a single worker.

  • So I tried the multi-processing route instead. But both lead to weird errors for which I don't fully grasp the reasons. Pasting the code into the console runs fine in both cases.

    But with dask-based multi-processing I can't run the example directly without the RuntimeError described in 0079f69. It runs fine when executed indirectly through sphinx, however, this approach leads to the CI hanging forever depending on Python version and architecture.

    With the Pool-based multi-processing (no dask) it works in all local cases but with the same behavior on the CI.

Not sure how to proceed. Right now the only approach that works everywhere is the single-threaded one. But then the example is better off just ripping out the references to dask and parallelism... For now I'll settle on making a hopefully temporary hotfix in another PR and keep this one open for investigation... (see #7141)

lagru added a commit to lagru/scikit-image that referenced this pull request Sep 20, 2023
This is meant to be a temporary hotfix while [1] is being investigated.
It naturally has a similar performance compared to the single-threaded
dask solution.

[1] scikit-image#7140
@lagru
Copy link
Member Author

lagru commented Sep 20, 2023

Aha, I just noticed that 0079f69 is actually passing fine everywhere except for

Some digging with git blame shows that this has been an issue before:

I am also seeing this or similar looking behavior on Linux actually. What I don't understand about this explanation is, how this aspect is managed when running the example through sphinx or through IPython.

@soupault, do you remember the context for testing the gallery example in two different ways on Windows? Examples are run indirectly through sphinx, and directly by passing the files to Python? I guess we want to keep it like that... Though I am noticing that we only do so on Windows. TEST_EXAMPLES is disabled on Linux and MacOS.

mkcor pushed a commit that referenced this pull request Sep 21, 2023
This is a temporary hotfix while [1] is being investigated.
Naturally, it has a similar performance compared to the single-threaded
Dask solution.

[1] #7140
@soupault
Copy link
Member

how this aspect is managed when running the example through sphinx or through IPython.

Sounds like a good topic to investigate further :).

do you remember the context for testing the gallery example in two different ways on Windows? Examples are run indirectly through sphinx, and directly by passing the files to Python?

Oh, I am not really sure. It might have been connected to the way how documentation errors were handled by AppVeyor. Most likely, it was just me making an exact replica of the setup on Travis (Linux/MacOS) once we moved to the fast and unlimited Azure. Building documentation on Windows, at this point, sounds rather unnecessary.

Copy link

Hello scikit-image core devs! There hasn't been any activity on this PR for more than 180 days. I have marked it as "dormant" to make it easy to find.
To our contributors, thank you for your contribution and apologies if this contribution fell through the cracks! Hopefully this ping will help bring some fresh attention to the review. If you need help, you can always reach out on our forum
If you think that this PR is no longer relevant, you may close it, or we may do it at some point (either way, it will be done manually). If you think the PR is valuable but you no longer have the bandwidth to update it, please let us know, so that someone can take it over. 🙏

@github-actions github-actions bot added the 😴 Dormant no recent activity label Mar 21, 2024
@github-actions github-actions bot removed the 😴 Dormant no recent activity label Apr 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
📄 type: Documentation Updates, fixes and additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants