
Use recommended/max team size functions in Cuda ParallelFor and Reduce constructors #6891

Conversation

@tcclevenger (Contributor) commented on Mar 20, 2024:

Fixes #6814. The previous computation of m_team_size in the Cuda ParallelFor() and ParallelReduce() constructors was intended to match policy.team_size_recommended(), but it was missing the extra scratch space allocation. It is easiest to call team_size_recommended() directly; this matches exactly what the HIP backend does.

Also, we were manually re-implementing policy.team_size_max() to verify that m_team_size was not too large in ParallelFor(). Use team_size_max() instead.
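As a concrete sketch of the pattern this moves to (not the verbatim Kokkos source; arg_functor is an assumed name, while m_team_size, arg_policy, and the query functions follow the diff excerpts quoted below):

// Let the policy answer both questions instead of re-deriving
// the CUDA block size by hand inside the constructor.
m_team_size = m_team_size >= 0
                  ? m_team_size
                  : arg_policy.team_size_recommended(arg_functor,
                                                     ParallelForTag());

// Validate against the policy's maximum rather than recomputing it:
if (m_team_size > arg_policy.team_size_max(arg_functor, ParallelForTag()))
  Kokkos::Impl::throw_runtime_exception(
      "Kokkos::ParallelFor<Cuda>: requested team size exceeds team_size_max");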

@tcclevenger force-pushed the recommended_team_size_in_cuda_parallel_constructor branch from 552e471 to 3f1f294 on March 20, 2024, 22:10.
@tcclevenger changed the title from "Use recommended team size in Cuda ParallelFor and Reduce" to "Use recommended/max team size functions in Cuda ParallelFor and Reduce constructors" on Mar 20, 2024.
A review thread was opened on these diff lines:

m_policy.thread_scratch_size(0)) /
    m_vector_size;
m_team_size = m_team_size >= 0 ? m_team_size
                               : arg_policy.team_size_recommended(

A contributor commented:

The changes look good overall! (I'm always happier with code reuse 👍)

However, I wonder about one thing that this PR will change.

I think

space().impl_internal_space_instance(), attr, f,
(size_t)impl_vector_length(),
(size_t)team_scratch_size(0) + 2 * sizeof(double),
(size_t)thread_scratch_size(0) + sizeof(double));
will set the team and thread scratch sizes regardless of whether it is a parallel for or a parallel reduce.

But I guess that for a parallel for, there is no need for the tiny additional scratch size.

The same contributor followed up:

A possible fix would be to modify the recommended team size function along these lines:

template <class FunctorType, class ParallelTagType>
  int team_size_recommended(const FunctorType& f, const ParallelTagType&) const {
    using closure_type =
        Impl::ParallelFor<FunctorType, TeamPolicy<Properties...>>;
    // Only a parallel_reduce needs the extra per-team/per-thread scratch:
    constexpr bool is_reduce =
        std::is_same_v<ParallelTagType, Kokkos::ParallelReduceTag>;
    cudaFuncAttributes attr =
        CudaParallelLaunch<closure_type, typename traits::launch_bounds>::
            get_cuda_func_attributes(space().cuda_device());
    const int block_size =
        Kokkos::Impl::cuda_get_opt_block_size<FunctorType,
                                              typename traits::launch_bounds>(
            space().impl_internal_space_instance(), attr, f,
            (size_t)impl_vector_length(),
            (size_t)team_scratch_size(0) +
                (is_reduce ? 2 * sizeof(double) : size_t(0)),
            (size_t)thread_scratch_size(0) +
                (is_reduce ? sizeof(double) : size_t(0)));
    return block_size / impl_vector_length();
  }
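With such a tag-dispatched query (hypothetical usage, assuming the standard Kokkos work tags), the two launch paths would then ask for different sizes:

int ts_for    = policy.team_size_recommended(f, Kokkos::ParallelForTag());
int ts_reduce = policy.team_size_recommended(f, Kokkos::ParallelReduceTag());

so only the reduce path would pay for the extra reduction scratch.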

Another contributor replied:

Even a TeamPolicy parallel_for can use a nested parallel_reduce, parallel_scan, team_reduce, or team_scan. The scratch allocation here is independent of the allocation needed for the outer-level parallel_reduce.
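To make that point concrete, here is a minimal standalone sketch (not from the PR; the view and kernel names are invented): an outer parallel_for over a TeamPolicy whose body performs a team-level parallel_reduce, which is exactly the kind of nested reduction this scratch supports.

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    using policy_t = Kokkos::TeamPolicy<>;
    using member_t = policy_t::member_type;
    Kokkos::View<double*> sums("sums", 100);

    // Outer dispatch is a parallel_for, yet each team still reduces.
    Kokkos::parallel_for(
        "outer_for", policy_t(100, Kokkos::AUTO),
        KOKKOS_LAMBDA(const member_t& team) {
          double team_sum = 0.0;
          // Nested team-level reduction inside the outer parallel_for:
          Kokkos::parallel_reduce(
              Kokkos::TeamThreadRange(team, 32),
              [&](int i, double& lsum) { lsum += double(i); }, team_sum);
          if (team.team_rank() == 0) sums(team.league_rank()) = team_sum;
        });
  }
  Kokkos::finalize();
}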

A member replied:
What Daniel said.

@romintomasetti (Contributor) left a review:

One major concern...

A member commented on the same diff lines:
What Daniel said.

@crtrott merged commit 55c5757 into kokkos:develop on Apr 8, 2024 (30 of 33 checks passed).
@ndellingwood (Contributor) commented:

Does this merit a changelog entry for 4.4 (#6914)?

@masterleinad (Contributor) replied:

done

Merging this pull request closed: [Bug] Parallel for on Cuda not calling team_size_recommended (#6814).