
Base Allreduce Algorithm Selection/Performance Issue #12550

Closed
mjwilkins18 opened this issue May 16, 2024 · 3 comments · Fixed by #12552

Comments

mjwilkins18 commented May 16, 2024

Background information

I was investigating an Allreduce performance issue at 64k processes with a 32 kB message size.

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v4.1.6 (reproduced w/ v5.0.3)

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Built from tarball

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

N/A

Please describe the system on which you are running

  • Operating system/version: RHEL 8.3
  • Computer hardware: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
  • Network type: Omni-Path

Details of the problem

I traced it back to this conditional in the redscat_allgather algorithm:

if (count < nprocs_pof2 || !ompi_op_is_commute(op)) {
    OPAL_OUTPUT((ompi_coll_base_framework.framework_output,
                 "coll:base:allreduce_intra_redscat_allgather: rank %d/%d "
                 "count %d switching to basic linear allreduce",
                 rank, comm_size, count));
    return ompi_coll_base_allreduce_intra_basic_linear(sbuf, rbuf, count, dtype,
                                                       op, comm, module);
}

The default dtype of the microbenchmark (OSU) is MPI_CHAR, so a 32 kB message gives count = 32768, which is smaller than nprocs_pof2 at 64k processes and therefore triggers this conditional. As a result, the call falls back to the basic linear algorithm, which performs very poorly at this scale. Because this check lives inside the redscat_allgather function itself, the fallback happens even if I force the algorithm selection using the environment variable.
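For concreteness, here is a minimal standalone sketch of the arithmetic behind the fallback, assuming nprocs_pof2 is the largest power of two not exceeding the communicator size (as in the redscat_allgather implementation); the helper below is local to the sketch, not an Open MPI function:

#include <stdio.h>

/* Local stand-in for the largest power of two <= comm_size. */
static int largest_pof2(int n)
{
    int p = 1;
    while (p <= n / 2) {
        p *= 2;
    }
    return p;
}

int main(void)
{
    int comm_size   = 65536;  /* 64k processes */
    int count       = 32768;  /* 32 kB message of MPI_CHAR -> count = 32768 */
    int nprocs_pof2 = largest_pof2(comm_size);  /* 65536 */

    if (count < nprocs_pof2) {
        printf("count (%d) < nprocs_pof2 (%d): redscat_allgather falls back "
               "to basic linear\n", count, nprocs_pof2);
    }
    return 0;
}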

I am happy to open a PR with a fix, but I am not sure what the preferred solution is. I understand the need to be careful with non-commutative operations, but the first half of the condition, which is what is causing my problem, seems like it should be removed from the base function and, if kept at all, replicated in the default algorithm selection logic. Alternatively, could we add an additional requirement that the count is below some threshold, to avoid these scaling issues?

bosilca (Member) commented May 16, 2024

The decision to call another algorithm under certain conditions is out of scope for a base function; technically that choice should be made by a higher-level decision function (such as ompi_coll_tuned_reduce_scatter_block_intra_dec_dynamic).
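For illustration, a decision-layer wrapper in that spirit might look roughly like the sketch below. This is only a hypothetical shape, not the code in #12552; the wrapper name and the exact cutoff condition are assumptions, while the two ompi_coll_base_* entry points are the functions referenced in this issue.

/* Hypothetical decision wrapper (illustrative sketch): choose the algorithm
 * here instead of inside the base redscat_allgather implementation.
 * Assumes the usual Open MPI coll/base internal headers are in scope. */
static int allreduce_decision_sketch(const void *sbuf, void *rbuf, int count,
                                     struct ompi_datatype_t *dtype,
                                     struct ompi_op_t *op,
                                     struct ompi_communicator_t *comm,
                                     mca_coll_base_module_t *module)
{
    int comm_size = ompi_comm_size(comm);

    /* Roughly: redscat_allgather needs a commutative op and at least one
     * element per (power-of-two) process; otherwise pick another algorithm
     * here, where the selection policy belongs. The count cutoff below is a
     * rough stand-in for the nprocs_pof2 check in the base function. */
    if (count < comm_size || !ompi_op_is_commute(op)) {
        return ompi_coll_base_allreduce_intra_basic_linear(sbuf, rbuf, count,
                                                           dtype, op, comm,
                                                           module);
    }
    return ompi_coll_base_allreduce_intra_redscat_allgather(sbuf, rbuf, count,
                                                            dtype, op, comm,
                                                            module);
}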

wenduwan (Contributor) commented

@mjwilkins18 Curious if you could try out #12552 and see if that helps?

mjwilkins18 (Author) commented

@wenduwan Apologies, yes that PR fixes my issue, thanks!
