Profiling: add mark_kernel_static_info #6844

cwpearson · 2024-02-28T19:29:55Z

Adds a mark_kernel_static_info interface to Kokkos Profiling. This interface takes a kernel ID returned from e.g. begin_parallel_for and associates compile-time static information about the kernel with that parallel region. 512 bytes are reserved for static information, but only one field, functor_size, is currently implemented.

Kokkos::parallel_for, parallel_reduce, and parallel_scan call this function when profiling is enabled. It is called before scratch allocation profiling in parallel_reduce.
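
As a rough illustration of the layout described above, the sketch below reserves a fixed 512-byte record and uses only its first field. The names `StaticInfoSketch`, `make_static_info`, and `ExampleFunctor` are hypothetical and are not taken from this PR. The exact struct layout is not shown in the description, so the placement of the padding is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the static-info record described in this PR.
// Assumption: the 512 reserved bytes include the implemented field.
constexpr std::size_t kReservedBytes = 512;

struct StaticInfoSketch {
  std::uint64_t functor_size;  // the only field implemented so far
  // Unused space, kept so fields can be added later without changing the size.
  unsigned char padding[kReservedBytes - sizeof(std::uint64_t)];
};

static_assert(sizeof(StaticInfoSketch) == kReservedBytes,
              "record must stay at the reserved size");

// Fill the record for a given functor type. Everything recorded here is
// known at compile time.
template <typename Functor>
StaticInfoSketch make_static_info() {
  StaticInfoSketch info{};  // zero-initialize, including the padding
  info.functor_size = sizeof(Functor);
  return info;
}

// A toy functor that captures two doubles and an int.
struct ExampleFunctor {
  double a;
  double b;
  int n;
  void operator()(int) const {}
};

// Small usage check: the recorded size matches sizeof for the functor type.
inline void check_static_info_sketch() {
  const StaticInfoSketch info = make_static_info<ExampleFunctor>();
  assert(info.functor_size == sizeof(ExampleFunctor));
}
```

Keeping the record at a fixed size means a tool compiled against an older definition can still read the fields it knows about if more are added later.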

cwpearson · 2024-02-28T19:30:36Z

Corresponding tools PR: kokkos/kokkos-tools#242

cwpearson · 2024-02-28T19:52:53Z

Is it okay to leave #define KOKKOSP_INTERFACE_VERSION 20211015 or does it need to be incremented?

dalg24 · 2024-02-29T00:02:07Z

Are you aware of kokkos/kokkos-tools#238?

vlkale · 2024-02-29T14:54:55Z

> Are you aware of kokkos/kokkos-tools#238?

Thanks @dalg24, I was going to point this out as I went through this. I think it is related.

vlkale · 2024-02-29T17:07:21Z

> Is it okay to leave #define KOKKOSP_INTERFACE_VERSION 20211015 or does it need to be incremented?

Looking through this and the other code files here and in the Kokkos Tools repo, I think leaving this value as is for the #define should be fine.

Also, all the CI tests here and in the Kokkos Tools PR have passed, so at least this hasn't caused a problem there.

cwpearson · 2024-02-29T19:57:05Z

The downside to this approach is that all static information that Core wants to pass to tools has to be produced at the same time (or the info struct has to be passed around a bit), and after the kernel launch.

Should I refactor this so that there is a single function templated on the Functor type that serves as one place where all information for the static profiler is generated?

That would basically replace these three lines in impl/Kokkos_Tools_Generic, but it would be a single point to extend with any future static information:

    Kokkos::Tools::KernelStaticInfo info;
    info.functor_size = sizeof(FunctorType);
    Kokkos::Tools::markKernelStaticInfo(kpID, info);

cwpearson · 2024-03-12T15:28:36Z

I've tested this out with Sparta, Parthenon, and Mini-EM and it works fine with all of them.

masterleinad

The implementation looks reasonable to me but I don't fully understand why we need to forward the functor size to tools. Can you provide some examples?

core/unit_test/tools/TestToolsInitialization.cpp

masterleinad · 2024-03-12T15:34:16Z

core/src/impl/Kokkos_Profiling.hpp

+/**
+ * Convenience wrapper around kokkosp_mark_kernel_static_info
+ *
+ * Consider using markKernelStaticInfo<Functor>(kernelID) instead
+ */
+void markKernelStaticInfo(uint64_t kernelID, const KernelStaticInfo& info);
+
+/**
+ * Take a kernelID produced by e.g. beginParallelFor
+ * and associate compile-time information about Functor with it
+ *
+ * Arguments:
+ *
+ * kernelID: An ID for a parallel loop registered with e.g. beginParallelFor
+ */
+template <typename Functor>
+void markKernelStaticInfo(uint64_t kernelID) {
+  Kokkos::Tools::KernelStaticInfo info;
+  info.functor_size = sizeof(Functor);
+  markKernelStaticInfo(kernelID, info);
+}
+


Do we really need both overloads? Can't we just inline the first one into the second one?

The first one does what a lot of the other profiling code does: it calls invoke_kokkosp_callback, a function template that is defined in Kokkos_Profiling.cpp and has no declaration available in this header. I could move the entire implementation of that function into this header and do it the way you're suggesting, if you prefer.

vlkale · 2024-03-14T20:01:38Z

> The implementation looks reasonable to me but I don't fully understand why we need to forward the functor size to tools. Can you provide some examples?

@masterleinad

I think my comment here can help answer this: kokkos/kokkos-tools#242 (comment).

If you want to know why this PR forwards functor size specifically, rather than any other (static or dynamic) information about a Kokkos kernel during application execution, then ask @cwpearson. We can add other possibly useful information to this, but I was thinking of using functor size as a starting point, since it is useful for @cwpearson and he has experimented with it and used it in his application.

cwpearson · 2024-03-19T16:01:04Z

> The implementation looks reasonable to me but I don't fully understand why we need to forward the functor size to tools. Can you provide some examples?

The only use case I currently have is so Core developers can gather information about how large the functors in applications actually are to guide efforts in designing or optimizing kernel launch mechanisms.

Here's an example of the information gathered from the Kokkos-enabled open-source LANL ATS Benchmarks (GitHub repo):

Parthenon

Functor Size	Execution Count	Name
48	18832	refinement_package.cpp::98::FirstDerivative
112	126	pr_loops.hpp::127::ProlongationRestrictionLoop
24	89	boundary_communication.cpp::263::SetBounds
80	49	burgers_package.cpp::155::CalculateDerived
48	49	boundary_communication.cpp::93::SendBoundBufs
136	40	burgers_package.cpp::309::CalculateFluxes
136	40	burgers_package.cpp::238::CalculateFluxes
32	40	flux_correction.cpp::70::LoadAndSendFluxCorrections
24	40	flux_correction.cpp::166::SetFluxCorrections
56	25	burgers::EstimateTimestep
104	16	MassHistory

Sparta

Functor Size	Execution Count	Name
72	400	N9SPARTA_NS8ExclScanIN6Kokkos6OpenMPEEE
1856	213	N9SPARTA_NS14ParticleKokkosE/N9SPARTA_NS15TagParticleSortILi1ELi0EEE
39224	200	N9SPARTA_NS12UpdateKokkosE/N9SPARTA_NS13TagUpdateMoveILi2ELi1ELi0ELin1EEE
5536	200	N9SPARTA_NS16CollideVSSKokkosE/N9SPARTA_NS23TagCollideCollisionsOneILi1ELin1EEE
3600	200	N9SPARTA_NS17FixEmitFaceKokkosE/N9SPARTA_NS22TagFixEmitFace_ninsertE
3600	200	N9SPARTA_NS17FixEmitFaceKokkosE/N9SPARTA_NS27TagFixEmitFace_perform_taskE
1856	200	N9SPARTA_NS14ParticleKokkosE/N9SPARTA_NS28TagParticleCompressReactionsE
464	200	ZN9SPARTA_NS17FixEmitFaceKokkos12perform_taskEvEUliE_
48	119	Kokkos::ViewCopy-1D
80	42	Kokkos::ViewCopy-2D
32	12	Kokkos::ViewFill-1D
16	8	Kokkos::Impl::host_space_deepcopy_double
112	6	Kokkos::ViewCopy-3D
5536	5	N9SPARTA_NS16CollideVSSKokkosE/N9SPARTA_NS21TagCollideResetVremaxE
...	...	...

mini-em

Functor Size	Execution Count	Name
224	10753152	N9Intrepid24Impl22Basis_HGRAD_TET_C1_FEM7FunctorIN6Kokkos11DynRankViewIdJNS3_12LayoutStrideENS3_6OpenMPEEEES7_LNS_9EOperatorE1EEE
48	3816344	Kokkos::ViewCopy-1D
224	1538305	N9Intrepid24Impl22Basis_HGRAD_TET_C1_FEM7FunctorIN6Kokkos11DynRankViewIdJNS3_12LayoutStrideENS3_6OpenMPEEEES7_LNS_9EOperatorE0EEE
112	311040	N6panzer13GlobalIndexer19CopyCellLIDsFunctorIN6Kokkos4ViewIPPiJNS2_11LayoutRightENS2_6OpenMPEEEEEE
152	230400	ZN6panzer28GatherSolution_BlockedTpetraINS_6Traits8ResidualES1_dixN6Tpetra12KokkosCompat23KokkosDeviceWrapperNodeIN6Kokkos6OpenMPENS6_9HostSpaceEEEE14evaluateFieldsERKNS_7WorksetEEUlRKiE_
664	115200	Panzer_Integrator_BasisTimesVector<0>
144	115200	N6panzer17V_MultiplyFunctorILi2EdEE
1184	76800	N6panzer9SumStaticINS_6Traits8ResidualES1_NS_4CellENS_5BASISEvEE/N6panzer9SumStaticINS_6Traits8ResidualES1_NS_4CellENS_5BASISEvE12NoScalarsTagE
288	76800	ZN6panzer10DotProductINS_6Traits8ResidualES1_E14evaluateFieldsERKNS_7WorksetEEUliE_
264	76800	IntegratorScalar
144	76800	DOF: B_face (panzer::Traits::Residual)
144	76800	DOF: E_edge (panzer::Traits::Residual)
144	76800	ZN6panzer29ScatterResidual_BlockedTpetraINS_6Traits8ResidualES1_ixN6Tpetra12KokkosCompat23KokkosDeviceWrapperNodeIN6Kokkos6OpenMPENS6_9HostSpaceEEEE14evaluateFieldsERKNS_7WorksetEEUlRKiE_
32	65266	Kokkos::ViewFill-1D
...	...	...
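
A tool could build tables like these by remembering the name passed when a kernel begins and joining it with the size reported for the same kernel ID. The sketch below is a self-contained model of that bookkeeping, not the code of the kp_functor_size tool. The class and method names are hypothetical. In a real tool the two entry points would presumably be driven by the begin-kernel callback and by the static-info callback this PR adds, whose exact signature is not shown here.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical bookkeeping for a functor-size tool.
class FunctorSizeTable {
 public:
  // Called when a kernel begins: remember which name belongs to this ID.
  void on_begin_kernel(std::uint64_t kernel_id, const std::string& name) {
    pending_names_[kernel_id] = name;
  }

  // Called when static info arrives for a kernel ID: count one execution
  // under the (name, functor size) pair.
  void on_static_info(std::uint64_t kernel_id, std::uint64_t functor_size) {
    auto it = pending_names_.find(kernel_id);
    if (it == pending_names_.end()) return;  // unknown ID: nothing to attribute
    counts_[std::make_pair(it->second, functor_size)] += 1;
    pending_names_.erase(it);  // assumption: each ID reports static info once
  }

  // Number of executions recorded for a (name, functor size) pair.
  std::uint64_t execution_count(const std::string& name,
                                std::uint64_t functor_size) const {
    auto it = counts_.find(std::make_pair(name, functor_size));
    return it == counts_.end() ? 0 : it->second;
  }

 private:
  std::map<std::uint64_t, std::string> pending_names_;
  std::map<std::pair<std::string, std::uint64_t>, std::uint64_t> counts_;
};

// Small usage check with two launches of the same kernel name.
inline void check_functor_size_table() {
  FunctorSizeTable table;
  table.on_begin_kernel(1, "Kokkos::ViewCopy-1D");
  table.on_static_info(1, 48);
  table.on_begin_kernel(2, "Kokkos::ViewCopy-1D");
  table.on_static_info(2, 48);
  assert(table.execution_count("Kokkos::ViewCopy-1D", 48) == 2);
}
```

Keying the counts on both name and size keeps two kernels that share a label but differ in functor type on separate rows.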

masterleinad

I'm generally fine with the direction of this pull request as long as we consider this feature experimental, so that we can change what we store. The problem is that the callback can't define what is to be captured; we have to decide that internally, which limits flexibility.

cwpearson · 2024-03-22T20:21:38Z

Would you prefer a separate profiling interface function for each separate piece of information we might want to capture? That's a relatively easy change (and how I envisioned it originally).

vlkale · 2024-03-26T16:43:33Z

> Would you prefer a separate profiling interface function for each separate piece of information we might want to capture? That's a relatively easy change (and how I envisioned it originally).

I think a separate profiling interface function is OK. Yes, this information you are gathering requires Kokkos Tools to hook into Kokkos core.

I think you would add it as a function in the Kokkos Tools_ToolsProgrammingInterface struct. Note that the only other function there is the Kokkos Tools tool-invoked fence function.
There is plenty of space for other functions (about 63 slots). I think we should think carefully about which other functions belong there. I don't know if that is what you had in mind when you mentioned the tool programming interface, but that is how I would approach this.
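
The "spare slots" idea can be sketched as a struct of function pointers padded to a fixed number of entries, so that a new function takes over one padding slot without changing the struct's size. All names below are hypothetical. The total of 64 slots is an assumption derived from the "about 63 slots" estimate above, and the sketch is not the actual Kokkos definition.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of a fixed-size table of functions handed to a tool.
using GenericFn = void (*)();
using FenceFn = void (*)(std::uint32_t device_id);
using FunctorSizeFn = void (*)(std::uint64_t kernel_id, std::uint64_t size);

// Assumption: one slot in use plus about 63 spare ones.
constexpr std::size_t kTotalSlots = 64;

// Current shape: one real function, the rest is padding.
struct InterfaceV1 {
  FenceFn fence;
  GenericFn padding[kTotalSlots - 1];
};

// A later revision takes one padding slot for a new function.
struct InterfaceV2 {
  FenceFn fence;
  FunctorSizeFn report_functor_size;  // hypothetical addition
  GenericFn padding[kTotalSlots - 2];
};

// On platforms where all function pointers share one size (the usual case),
// the struct size and the position of existing members do not change.
static_assert(sizeof(InterfaceV1) == sizeof(InterfaceV2),
              "adding a function must not change the struct size");
static_assert(offsetof(InterfaceV1, fence) == offsetof(InterfaceV2, fence),
              "existing members must keep their offsets");

// Small usage check: the table occupies exactly kTotalSlots pointers.
inline void check_interface_sketch() {
  assert(sizeof(InterfaceV1) == kTotalSlots * sizeof(GenericFn));
}
```

Because each new function consumes a slot permanently, the budget of spare entries is a reason to be selective about what is added.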

cwpearson requested review from masterleinad, crtrott and vlkale February 28, 2024 19:29

cwpearson self-assigned this Feb 28, 2024

cwpearson mentioned this pull request Feb 28, 2024

Add kp_functor_size: print parallel functor sizes kokkos/kokkos-tools#242

Open

Profiling: mark_kernel_static_info: no anonymous structs

666e446

Profiling: mark_kernel_static_info: Single point for gathering static info

32855a7

masterleinad reviewed Mar 12, 2024

masterleinad reviewed Mar 19, 2024

TestToolsInitialization.cpp: comment typo

f814d21
