Use recommended/max team size functions in Cuda ParallelFor and Reduce constructors #6891
Conversation
Previous computation was intended to match team_size_recommended, but was missing the extra scratch space allocation.
Force-pushed from 552e471 to 3f1f294
                             m_policy.thread_scratch_size(0)) /
                             m_vector_size;
    m_team_size = m_team_size >= 0 ? m_team_size
                                   : arg_policy.team_size_recommended(
The changes look good overall! (I'm always happier with code reuse 👍)
However, I wonder about one thing that this PR will change.
I think team_size_recommended will now also account for the extra scratch that is only needed for reductions:

kokkos/core/src/Cuda/Kokkos_Cuda_Parallel_Team.hpp, lines 144 to 147 in 9fff1e0:

          space().impl_internal_space_instance(), attr, f,
          (size_t)impl_vector_length(),
          (size_t)team_scratch_size(0) + 2 * sizeof(double),
          (size_t)thread_scratch_size(0) + sizeof(double));

But I guess that for a parallel_for, there is no need for the tiny additional scratch size.
A possible fix would be to modify the recommended team size function along these lines:
template <class FunctorType, class ParallelTagType>
int team_size_recommended(const FunctorType& f, const ParallelTagType&) const {
  using closure_type =
      Impl::ParallelFor<FunctorType, TeamPolicy<Properties...>>;
  cudaFuncAttributes attr =
      CudaParallelLaunch<closure_type, typename traits::launch_bounds>::
          get_cuda_func_attributes(space().cuda_device());
  // Dispatch on the parallel pattern tag: only a reduce needs the extra
  // reduction scratch (2 * sizeof(double) per team, sizeof(double) per thread).
  constexpr bool is_reduce =
      std::is_same_v<ParallelTagType, ParallelReduceTag>;
  const int block_size =
      Kokkos::Impl::cuda_get_opt_block_size<FunctorType,
                                            typename traits::launch_bounds>(
          space().impl_internal_space_instance(), attr, f,
          (size_t)impl_vector_length(),
          (size_t)team_scratch_size(0) + (is_reduce ? 2 * sizeof(double) : 0),
          (size_t)thread_scratch_size(0) + (is_reduce ? sizeof(double) : 0));
  return block_size / impl_vector_length();
}
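For reference, a minimal sketch of how callers would query the tag-dispatched overloads, assuming the standard Kokkos ParallelForTag/ParallelReduceTag tag types (functor and league_size are illustrative names, not from this PR):

// Hypothetical usage sketch: query the recommended team size per pattern.
Kokkos::TeamPolicy<> policy(league_size, Kokkos::AUTO);
int ts_for    = policy.team_size_recommended(functor, Kokkos::ParallelForTag());
int ts_reduce = policy.team_size_recommended(functor, Kokkos::ParallelReduceTag());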
Even a TeamPolicy parallel_for can use nested parallel_reduce, parallel_scan, team_reduce, or team_scan. The scratch allocation here is independent of the allocation needed for the outer-level parallel_reduce.
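To illustrate the point, a minimal sketch (league_size and n are illustrative names, not from this PR) of a TeamPolicy parallel_for that performs a nested team-level reduction, which still relies on the team's reduction scratch even though the outer pattern is a for:

using team_member = Kokkos::TeamPolicy<>::member_type;
Kokkos::parallel_for(
    "outer_for", Kokkos::TeamPolicy<>(league_size, Kokkos::AUTO),
    KOKKOS_LAMBDA(const team_member& team) {
      double team_sum = 0.0;
      // Nested reduction over the threads of the team: this needs the
      // team-level reduction scratch even inside a parallel_for.
      Kokkos::parallel_reduce(
          Kokkos::TeamThreadRange(team, n),
          [=](const int i, double& lsum) { lsum += 1.0 * i; }, team_sum);
    });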
What Daniel said.
One major concern...
                             m_policy.thread_scratch_size(0)) /
                             m_vector_size;
    m_team_size = m_team_size >= 0 ? m_team_size
                                   : arg_policy.team_size_recommended(
What Daniel said.
Does this merit a changelog entry for 4.4 #6914?
done
Fixes #6814. The previous computation of m_team_size in the Cuda ParallelFor() and ParallelReduce() constructors was intended to match policy.team_size_recommended(), but was missing the extra scratch space allocation. It is easiest to use team_size_recommended() instead; this matches exactly what is done for the HIP backend. Also, we were manually implementing policy.team_size_max() to verify that m_team_size was not too large in ParallelFor(). Use team_size_max() instead.
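A paraphrased sketch of the resulting constructor logic described above (variable names follow the diff snippets, but this is not the verbatim Kokkos source):

// Sketch: take the user's team size if given, else the policy's recommendation.
m_team_size = m_team_size >= 0
                  ? m_team_size
                  : arg_policy.team_size_recommended(arg_functor,
                                                     ParallelForTag());
// Replace the hand-rolled max-team-size check with the policy query.
if (m_team_size > arg_policy.team_size_max(arg_functor, ParallelForTag()))
  Kokkos::Impl::throw_runtime_exception(
      "Kokkos::ParallelFor<Cuda, TeamPolicy> requested too large team size.");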