Terminology: n_jobs vs num_workers vs ncpu etc. #4876

Open
emmanuelle opened this issue Jul 31, 2020 · 11 comments · May be fixed by #7302
Labels
📜 type: API Involves API change(s) 💬 Discussion

Comments

@emmanuelle
Member

Since we aim to accelerate some of our functions, one way to do this is through parallel computing. We already have one function (restoration.cycle_spin) that uses dask.delayed and several threads; there is also an attempt to parallelize segmentation.slic in #3120, which I plan to revive, and other functions will probably gain the option to use one or more jobs/workers (threads or processes) in the future. I'm therefore opening this issue so that we can decide on the best terminology for this parameter. Should it be

  • num_workers (as in cycle_spin, and as in dask)
  • n_jobs (as in scikit-learn and in joblib)
  • something else? (probably not; it's unfortunate enough that joblib and dask already use different conventions)

I think both are fine; the choice should be more about consistency with the rest of the ecosystem: do we want to be more consistent with our backend (dask), or with big brother scikit-learn?
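Whatever name wins, the parameter typically just threads through to a pool size. As a minimal sketch (using only the standard library; the function name mean_of_shifts and its num_workers spelling are illustrative, not scikit-image API), a function following the dask/cycle_spin convention might look like:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch: a function exposing a num_workers parameter that is
# forwarded to a thread pool, in the spirit of restoration.cycle_spin.
def mean_of_shifts(values, shifts, num_workers=None):
    """Average the sums of `values` under several shifts, computed in parallel.

    num_workers=None lets the executor pick a default, mirroring how
    cycle_spin and dask treat an unset worker count.
    """
    def shifted_sum(shift):
        return sum(v + shift for v in values)

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        totals = list(pool.map(shifted_sum, shifts))
    return sum(totals) / (len(shifts) * len(values))

print(mean_of_shifts([1.0, 2.0, 3.0], shifts=[0, 1], num_workers=2))
```

Renaming to workers or n_jobs would change nothing but the keyword, which is why the thread treats this as purely a naming (and ecosystem-consistency) question.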
@grlee77
Contributor

grlee77 commented Jul 31, 2020

I hate to say it, but there is a third contender: workers

SciPy tends to use just workers rather than num_workers (e.g. in the scipy.fft functions, differential_evolution and quad_vec). For FFTs, it also provides a context manager that can be used to control the default number of workers, which I think is nice; see scipy.fft.set_workers. Adding workers to more functions is on their roadmap.
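The set_workers pattern is easy to reproduce with the standard library alone. Below is a hedged sketch of how such a context manager could work internally (the names set_workers and get_workers here are illustrative stand-ins, not the actual scipy.fft implementation or any scikit-image API):

```python
import contextlib
import threading

# Thread-local storage so each thread sees its own default worker count.
_local = threading.local()

def get_workers():
    """Return the current default worker count (1 if nothing is set)."""
    return getattr(_local, "workers", 1)

@contextlib.contextmanager
def set_workers(workers):
    """Temporarily override the default worker count, restoring it on exit."""
    previous = get_workers()
    _local.workers = workers
    try:
        yield
    finally:
        _local.workers = previous

with set_workers(4):
    assert get_workers() == 4  # inside the block, the override applies
assert get_workers() == 1      # outside, the previous default is restored
```

The appeal of this design is that library functions can default to get_workers() instead of a hard-coded value, so callers can tune parallelism for a whole region of code without passing a keyword through every call.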

@grlee77
Contributor

grlee77 commented Jul 31, 2020

I think I actually like just workers the best, but don't have a problem with any of the three as long as we use it consistently within scikit-image!

@emmanuelle
Member Author

Right, workers also makes sense of course. And it's more a question of workers than of jobs, so for the sake of clarity I'd prefer either workers or num_workers (I'm not proposing n_workers, which would otherwise be a really good name, because we don't want to multiply the number of names in the ecosystem).

So, who is for workers and who is for num_workers?

@sciunto
Member

sciunto commented Aug 12, 2020

Right, workers also makes sense of course. And it's more a question of workers rather than jobs,

Jobbers? :)

@sciunto
Member

sciunto commented Aug 12, 2020

To be more serious, workers looks sufficient, but num_workers is more explicit: I immediately understand that I'm supposed to pass an integer.

Regarding n_* vs num_*, we have both in the library: n_dim, n_bins, n_inliers, n_tiles on one hand, and num_trials, num_shapes, num_peaks, num_channels on the other. A preference should be recorded in #2616.

@alexdesiqueira
Member

I'd vote to maintain the consistency throughout the packages: maybe workers is the way to go?

@stefanv
Member

stefanv commented Jan 19, 2024

To be consistent with the ecosystem, this requires some careful deliberation.

@thomasjpfan wrote up a good overview of the state of the ecosystem. I see he recommends workers as well.

scikit-learn is currently on n_jobs, and scipy on workers.

@lagru
Member

lagru commented Jan 22, 2024

Thanks for the link. To quote from that page:

We use SciPy’s workers parameter because it is more consistent in controlling the number of cores used. workers denotes any form of parallelism such as: multi-threading, multiprocessing, OpenMP threads, or pthreads.

This argument works just as well in favor of num_workers if you replace "SciPy" with "dask". num_workers seems the more expressive name to me (we are not passing actual workers, just their number). But workers is fine with me as well, and SciPy is probably closer to us than dask. I really don't care that much which solution we settle on, as long as we don't stall this again for a few years. 🤞

@lagru
Member

lagru commented Feb 26, 2024

How can we move this along with regard to #7302? I'm happy to involve the wider ecosystem, but how do I do that? According to New SPEC Proposals, this might be a good fit for a SPEC. I'll make a post in https://discuss.scientific-python.org/c/specs/ideas/9 if there are no objections.

@stefanv
Member

stefanv commented Feb 28, 2024

This does seem like exactly the kind of thing we need to agree on across projects, so +1.

@lagru
Member

lagru commented Mar 4, 2024

Posted this as a SPEC idea in Terminology for parameters controlling parallel computation.

7 participants