
Use recommended/max team size functions in Cuda ParallelFor and Reduce constructors #6891

Conversation

@tcclevenger (Contributor) commented on Mar 20, 2024:

Fixes #6814. The previous computation of m_team_size in the Cuda ParallelFor() and ParallelReduce() constructors was intended to match policy.team_size_recommended(), but it was missing the extra scratch space allocation. It is easiest to call team_size_recommended() directly; this matches exactly what the HIP backend does.

Also, we were manually re-implementing policy.team_size_max() to verify that m_team_size was not too large in ParallelFor(). Use team_size_max() instead.
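As a concrete sketch of the pattern this moves to (not the verbatim Kokkos source; arg_functor is an assumed name, while m_team_size, arg_policy, and the query functions follow the diff excerpts quoted below):

// Let the policy answer both questions instead of re-deriving
// the CUDA block size by hand inside the constructor.
m_team_size = m_team_size >= 0
                  ? m_team_size
                  : arg_policy.team_size_recommended(arg_functor,
                                                     ParallelForTag());

// Validate against the policy's maximum rather than recomputing it:
if (m_team_size > arg_policy.team_size_max(arg_functor, ParallelForTag()))
  Kokkos::Impl::throw_runtime_exception(
      "Kokkos::ParallelFor<Cuda>: requested team size exceeds team_size_max");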

@tcclevenger force-pushed the recommended_team_size_in_cuda_parallel_constructor branch from 552e471 to 3f1f294 on March 20, 2024, 22:10.
@tcclevenger changed the title from "Use recommended team size in Cuda ParallelFor and Reduce" to "Use recommended/max team size functions in Cuda ParallelFor and Reduce constructors" on Mar 20, 2024.
A review thread was opened on these diff lines:

m_policy.thread_scratch_size(0)) /
    m_vector_size;
m_team_size = m_team_size >= 0 ? m_team_size
                               : arg_policy.team_size_recommended(

A contributor commented:

The changes look good overall! (I'm always happier with code reuse 👍)

However, I wonder about one thing that this PR will change.

I think

space().impl_internal_space_instance(), attr, f,
(size_t)impl_vector_length(),
(size_t)team_scratch_size(0) + 2 * sizeof(double),
(size_t)thread_scratch_size(0) + sizeof(double));
will set the team and thread scratch sizes regardless of whether it is a parallel for or a parallel reduce.

But I guess that for a parallel for, there is no need for the tiny additional scratch size.

The same contributor followed up:

A possible fix would be to modify the recommended team size function along these lines:

template <class FunctorType, class ParallelTagType>
  int team_size_recommended(const FunctorType& f, const ParallelTagType&) const {
    using closure_type =
        Impl::ParallelFor<FunctorType, TeamPolicy<Properties...>>;
    // Only a parallel_reduce needs the extra per-team/per-thread scratch:
    constexpr bool is_reduce =
        std::is_same_v<ParallelTagType, Kokkos::ParallelReduceTag>;
    cudaFuncAttributes attr =
        CudaParallelLaunch<closure_type, typename traits::launch_bounds>::
            get_cuda_func_attributes(space().cuda_device());
    const int block_size =
        Kokkos::Impl::cuda_get_opt_block_size<FunctorType,
                                              typename traits::launch_bounds>(
            space().impl_internal_space_instance(), attr, f,
            (size_t)impl_vector_length(),
            (size_t)team_scratch_size(0) +
                (is_reduce ? 2 * sizeof(double) : size_t(0)),
            (size_t)thread_scratch_size(0) +
                (is_reduce ? sizeof(double) : size_t(0)));
    return block_size / impl_vector_length();
  }
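With such a tag-dispatched query (hypothetical usage, assuming the standard Kokkos work tags), the two launch paths would then ask for different sizes:

int ts_for    = policy.team_size_recommended(f, Kokkos::ParallelForTag());
int ts_reduce = policy.team_size_recommended(f, Kokkos::ParallelReduceTag());

so only the reduce path would pay for the extra reduction scratch.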

Another contributor replied:

Even a TeamPolicy parallel_for can use a nested parallel_reduce, parallel_scan, team_reduce, or team_scan. The scratch allocation here is independent of the allocation needed for the outer-level parallel_reduce.
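To make that point concrete, here is a minimal standalone sketch (not from the PR; the view and kernel names are invented): an outer parallel_for over a TeamPolicy whose body performs a team-level parallel_reduce, which is exactly the kind of nested reduction this scratch supports.

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    using policy_t = Kokkos::TeamPolicy<>;
    using member_t = policy_t::member_type;
    Kokkos::View<double*> sums("sums", 100);

    // Outer dispatch is a parallel_for, yet each team still reduces.
    Kokkos::parallel_for(
        "outer_for", policy_t(100, Kokkos::AUTO),
        KOKKOS_LAMBDA(const member_t& team) {
          double team_sum = 0.0;
          // Nested team-level reduction inside the outer parallel_for:
          Kokkos::parallel_reduce(
              Kokkos::TeamThreadRange(team, 32),
              [&](int i, double& lsum) { lsum += double(i); }, team_sum);
          if (team.team_rank() == 0) sums(team.league_rank()) = team_sum;
        });
  }
  Kokkos::finalize();
}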

A member replied:
What Daniel said.

@romintomasetti (Contributor) left a review:

One major concern...

A member commented on the same diff lines:
What Daniel said.

@crtrott merged commit 55c5757 into kokkos:develop on Apr 8, 2024 (30 of 33 checks passed).
@ndellingwood (Contributor) commented:

Does this merit a changelog entry for 4.4 (#6914)?

@masterleinad (Contributor) replied:

done

Merging this pull request closed: [Bug] Parallel for on Cuda not calling team_size_recommended (#6814).