-
Notifications
You must be signed in to change notification settings - Fork 406
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Profiling: add mark_kernel_static_info #6844
base: develop
Are you sure you want to change the base?
Profiling: add mark_kernel_static_info #6844
Conversation
Adds a mark_kernel_static_info interface to Kokkos Profiling. This interface takes a kernel ID returned from e.g. begin_parallel_for and associates compile-time static information about the kernel with that parallel region. There are 512 bytes reserved for static information, only one field, functor_size, is currently implemented. Kokkos::parallel_for, parallel_reduce, and parallel_scan call this function when profiling is enabled. It is called before scratch allocation profiling in parallel_reduce.
Corresponding tools PR: kokkos/kokkos-tools#242 |
Is it okay to leave |
Are you aware about kokkos/kokkos-tools#238 ? |
Thanks @dalg24 I was thinking to point this out as I was going through this. I think it is related. |
Looking through this and the other code files here and in the Kokkos Tools repo, I think leaving this value as is for the Also, all the CI tests here and in the Kokkos Tools PR have passed, so at least this hasn't caused a problem there. |
The downside to this approach is that all static information that Core wants to pass to tools has to be produced at the same time (or the info struct has to be passed around a bit), and after the kernel launch. Should I refactor this so that there is a single function templated on the That would basically replace these three lines in Kokkos::Tools::KernelStaticInfo info;
info.functor_size = sizeof(FunctorType);
Kokkos::Tools::markKernelStaticInfo(kpID, info); |
I've tested this out with Sparta, Parthenon, and Mini-EM and it works fine with all of them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation looks reasonable to me but I don't fully understand why we need to forward the functor size to tools. Can you provide some examples?
/** | ||
* Convenience wrapper around kokkosp_mark_kernel_static_info | ||
* | ||
* Consider using markKernelStaticInfo<Functor>(kernelID) instead | ||
*/ | ||
void markKernelStaticInfo(uint64_t kernelID, const KernelStaticInfo& info); | ||
|
||
/** | ||
* Take a kernelID produced by e.g. beginParallelFor | ||
* and associate compile-time information about Functor with it | ||
* | ||
* Arguments: | ||
* | ||
* kernelID: An ID for a parallel loop registered with e.g. beginParallelFor | ||
*/ | ||
template <typename Functor> | ||
void markKernelStaticInfo(uint64_t kernelID) { | ||
Kokkos::Tools::KernelStaticInfo info; | ||
info.functor_size = sizeof(Functor); | ||
markKernelStaticInfo(kernelID, info); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need both overloads? Can't we just inline the first one into the second one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
first one does what a lot of the other profiling code does and calls invoke_kokkosp_callback
which is a function template defined in Kokkos_Profiling.cpp
and a declaration is not available in this header. I could move the entire implementation of that function into this header and do it the way you're suggesting if you prefer.
I think my comment here can help answer this: kokkos/kokkos-tools#242 (comment). If you want to know about why specifically functor size in this PR, and not any other (static or dynamic) information of a Kokkos kernel during a Kokkos application execution, then ask @cwpearson. We can put in other possibly useful information in this, but I was thinking just using functor size as a starting point since it is useful for @cwpearson and that he has experimented with and used it in his application. |
The only use case I currently have is so Core developers can gather information about how large the functors in applications actually are to guide efforts in designing or optimizing kernel launch mechanisms. Here's an example of the gathered information from the Kokkos-enabled open-source LANL ATS Benchmarks (github repo) Parthenon
Sparta
mini-em
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm generally fine with the direction of this pull request as long as we consider this feature as experimental so that we can change what we store. The problem is that the callback can't define what is to be captured but we have to do that internally which limits the flexibility.
Would you prefer a separate profiling interface function for each separate piece of information we might want to capture? That's a relatively easy change (and how I envisioned it originally). |
I think a separate profiling interface function is OK. Yes, this information you are gathering requires Kokkos Tools to hook into Kokkos core. I think you would add it as a function |
Adds a mark_kernel_static_info interface to Kokkos Profiling. This interface takes a kernel ID returned from e.g. begin_parallel_for and associates compile-time static information about the kernel with that parallel region. There are 512 bytes reserved for static information, only one field, functor_size, is currently implemented.
Kokkos::parallel_for, parallel_reduce, and parallel_scan call this function when profiling is enabled. It is called before scratch allocation profiling in parallel_reduce.