New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for CUDA unified memory architectures i.e. Grace Hopper #6823
base: develop
Are you sure you want to change the base?
Conversation
a5cee1f
to
b79efac
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also have a CI build that has Kokkos_ENABLE_IMPL_CUDA_EMULATE_UNIFIED_MEMORY=ON
b79efac
to
694585d
Compare
#define GTEST_SKIP_IF_UNIFIED_MEMORY_SPACE \ | ||
GTEST_SKIP() << "skipping since unified memory requires additional fences"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this only check for CudaSpace
?
#define GTEST_SKIP_IF_UNIFIED_MEMORY_SPACE \ | |
GTEST_SKIP() << "skipping since unified memory requires additional fences"; | |
#define GTEST_SKIP_IF_UNIFIED_MEMORY_SPACE \ | |
if constexpr (std::is_same_v<typename TEST_EXECSPACE::memory_space, \ | |
Kokkos::CudaSpace>) | |
GTEST_SKIP() << "skipping since unified memory requires additional fences"; |
Where exactly do the extra fences come from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see that this fails when we fence consistently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping
f2b11ca
to
6f820d3
Compare
core/src/Kokkos_Macros.hpp
Outdated
#if defined(KOKKOS_ENABLE_IMPL_CUDA_EMULATE_UNIFIED_MEMORY) | ||
#define KOKKOS_ENABLE_IMPL_CUDA_UNIFIED_MEMORY | ||
#endif | ||
#if defined(KOKKOS_ARCH_ARMV9_GRACE) && defined(KOKKOS_ARCH_HOPPER90) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need both? Isn't checking for KOKKOS_ARCH_HOPPER90
sufficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think H100 + x86_64 machines have the same KOKKOS_ARCH_HOPPER90
flag set and I am not sure all have HMM enabled.
0a9ff6c
to
8746494
Compare
@@ -184,6 +184,24 @@ void *impl_allocate_common(const int device_id, | |||
cudaError_t error_code = cudaSuccess; | |||
#ifndef CUDART_VERSION | |||
#error CUDART_VERSION undefined! | |||
#elif defined(KOKKOS_ENABLE_IMPL_CUDA_EMULATE_UNIFIED_MEMORY) | |||
// This is inteded to simulate Grace-Hopper like behavior |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// This is inteded to simulate Grace-Hopper like behavior | |
// This is intended to simulate Grace-Hopper-like behavior |
#define GTEST_SKIP_IF_UNIFIED_MEMORY_SPACE \ | ||
GTEST_SKIP() << "skipping since unified memory requires additional fences"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping
core/src/Cuda/Kokkos_CudaSpace.cpp
Outdated
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaDeviceSynchronize()); | ||
#elif defined(KOKKOS_ENABLE_IMPL_CUDA_UNIFIED_MEMORY) | ||
// This is intended for Grace-Hopper (and future unified memory architectures) | ||
// The idea is to use host allocator and then adivce to keep it in HBM on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// The idea is to use host allocator and then adivce to keep it in HBM on | |
// The idea is to use a host allocator and then advise to keep it in HBM on the |
core/src/Cuda/Kokkos_CudaSpace.cpp
Outdated
#else | ||
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaSetDevice(m_device)); | ||
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaFree(arg_alloc_ptr)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happened to this branch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't that the stuff in 367-369?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the fallback branch for when KOKKOS_ENABLE_IMPL_CUDA_MALLOC_ASYNC
is false or we have a Cuda version less than 11.2. With the changes here we don't free the memory in this case anymore AFAICT.
// TODO: enable the following when we are sure it is the right thing to do | ||
//#if defined(KOKKOS_ARCH_ARMV9_GRACE) && defined(KOKKOS_ARCH_HOPPER90) | ||
//#define KOKKOS_ENABLE_IMPL_CUDA_UNIFIED_MEMORY | ||
//#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we only care about emulating for now? Or do we want to enable this before merging after testing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a cmake option for this so you can enable it explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I still think we should test this in at least one CI build before merging.
8746494
to
3dd44b6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please address the open conversations in https://github.com/kokkos/kokkos/pull/6823/files#r1515292936 and https://github.com/kokkos/kokkos/pull/6823/files#r1492679604 and fix typos.
We are still waiting on confirmation that this works at all properly, which may require CUDA 12.4 and Drivers 550 |
This is in support of Grace Hopper making, CudaSpace host accessible. I also added an emulation mode to run on other CUDA architectures, by making the cudaMalloc wrapper call cudaMallocManaged. Kokkos_ENABLE_IMPL_CUDA_EMULATE_UNIFIED_MEMORY is the option A new macro KOKKOS_ENABLE_IMPL_CUDA_UNIFIED_MEMORY will be defined if both Grace and Hopper are enabled.
Co-authored-by: Damien L-G <dalg24+github@gmail.com> Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
Do not call this function for buffer of size 0.
3dd44b6
to
62ba653
Compare
This PR makes
CudaSpace
host accessible for Grace Hopper. As a consequence functions such ascreate_mirror_view
will not create extra host allocations. To make this happen, I did add Grace as an architecture option as ARMV9_GRACE (following the previous ARMV8_THUNDERX2 scheme).I also added an emulation option for other CUDA based systems which one can enable with
-DKokkos_ENABLE_IMPL_CUDA_EMULATE_UNIFIED_MEMORY
.In that case cudaMalloc is replaces with cudaMallocManaged.