Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for CUDA unified memory architectures i.e. Grace Hopper #6823

Open
wants to merge 8 commits into
base: develop
Choose a base branch
from

Conversation

crtrott
Copy link
Member

@crtrott crtrott commented Feb 16, 2024

This PR makes CudaSpace host accessible for Grace Hopper. As a consequence functions such as create_mirror_view will not create extra host allocations. To make this happen, I did add Grace as an architecture option as ARMV9_GRACE (following the previous ARMV8_THUNDERX2 scheme).

I also added an emulation option for other CUDA based systems which one can enable with
-DKokkos_ENABLE_IMPL_CUDA_EMULATE_UNIFIED_MEMORY.
In that case cudaMalloc is replaces with cudaMallocManaged.

Copy link
Contributor

@masterleinad masterleinad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also have a CI build that has Kokkos_ENABLE_IMPL_CUDA_EMULATE_UNIFIED_MEMORY=ON

Makefile.kokkos Outdated Show resolved Hide resolved
Makefile.kokkos Outdated Show resolved Hide resolved
core/src/Cuda/Kokkos_CudaSpace.hpp Show resolved Hide resolved
core/src/Cuda/Kokkos_Cuda_Instance.hpp Outdated Show resolved Hide resolved
core/src/Kokkos_Macros.hpp Outdated Show resolved Hide resolved
core/unit_test/cuda/TestCuda_Spaces.cpp Show resolved Hide resolved
containers/unit_tests/TestWithoutInitializing.hpp Outdated Show resolved Hide resolved
cmake/kokkos_enable_devices.cmake Outdated Show resolved Hide resolved
cmake/kokkos_enable_devices.cmake Outdated Show resolved Hide resolved
#define GTEST_SKIP_IF_UNIFIED_MEMORY_SPACE \
GTEST_SKIP() << "skipping since unified memory requires additional fences";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this only check for CudaSpace?

Suggested change
#define GTEST_SKIP_IF_UNIFIED_MEMORY_SPACE \
GTEST_SKIP() << "skipping since unified memory requires additional fences";
#define GTEST_SKIP_IF_UNIFIED_MEMORY_SPACE \
if constexpr (std::is_same_v<typename TEST_EXECSPACE::memory_space, \
Kokkos::CudaSpace>)
GTEST_SKIP() << "skipping since unified memory requires additional fences";

Where exactly do the extra fences come from?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see that this fails when we fence consistently.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping

cmake/kokkos_enable_devices.cmake Outdated Show resolved Hide resolved
core/src/Cuda/Kokkos_CudaSpace.cpp Outdated Show resolved Hide resolved
core/src/Cuda/Kokkos_CudaSpace.hpp Show resolved Hide resolved
core/src/Cuda/Kokkos_Cuda_Instance.hpp Outdated Show resolved Hide resolved
#if defined(KOKKOS_ENABLE_IMPL_CUDA_EMULATE_UNIFIED_MEMORY)
#define KOKKOS_ENABLE_IMPL_CUDA_UNIFIED_MEMORY
#endif
#if defined(KOKKOS_ARCH_ARMV9_GRACE) && defined(KOKKOS_ARCH_HOPPER90)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need both? Isn't checking for KOKKOS_ARCH_HOPPER90 sufficient?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think H100 + x86_64 machines have the same KOKKOS_ARCH_HOPPER90 flag set and I am not sure all have HMM enabled.

@crtrott crtrott force-pushed the add-cuda-unified-memory-arch branch 2 times, most recently from 0a9ff6c to 8746494 Compare March 5, 2024 06:04
@@ -184,6 +184,24 @@ void *impl_allocate_common(const int device_id,
cudaError_t error_code = cudaSuccess;
#ifndef CUDART_VERSION
#error CUDART_VERSION undefined!
#elif defined(KOKKOS_ENABLE_IMPL_CUDA_EMULATE_UNIFIED_MEMORY)
// This is inteded to simulate Grace-Hopper like behavior
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// This is inteded to simulate Grace-Hopper like behavior
// This is intended to simulate Grace-Hopper-like behavior

#define GTEST_SKIP_IF_UNIFIED_MEMORY_SPACE \
GTEST_SKIP() << "skipping since unified memory requires additional fences";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping

KOKKOS_IMPL_CUDA_SAFE_CALL(cudaDeviceSynchronize());
#elif defined(KOKKOS_ENABLE_IMPL_CUDA_UNIFIED_MEMORY)
// This is intended for Grace-Hopper (and future unified memory architectures)
// The idea is to use host allocator and then adivce to keep it in HBM on
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// The idea is to use host allocator and then adivce to keep it in HBM on
// The idea is to use a host allocator and then advise to keep it in HBM on the

Comment on lines 359 to 361
#else
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaSetDevice(m_device));
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaFree(arg_alloc_ptr));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happened to this branch?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't that the stuff in 367-369?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the fallback branch for when KOKKOS_ENABLE_IMPL_CUDA_MALLOC_ASYNC is false or we have a Cuda version less than 11.2. With the changes here we don't free the memory in this case anymore AFAICT.

core/src/Cuda/Kokkos_CudaSpace.hpp Show resolved Hide resolved
Comment on lines +540 to +543
// TODO: enable the following when we are sure it is the right thing to do
//#if defined(KOKKOS_ARCH_ARMV9_GRACE) && defined(KOKKOS_ARCH_HOPPER90)
//#define KOKKOS_ENABLE_IMPL_CUDA_UNIFIED_MEMORY
//#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we only care about emulating for now? Or do we want to enable this before merging after testing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a cmake option for this so you can enable it explicitly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I still think we should test this in at least one CI build before merging.

@crtrott crtrott force-pushed the add-cuda-unified-memory-arch branch from 8746494 to 3dd44b6 Compare March 7, 2024 18:04
Copy link
Contributor

@masterleinad masterleinad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@crtrott
Copy link
Member Author

crtrott commented Mar 14, 2024

We are still waiting on confirmation that this works at all properly, which may require CUDA 12.4 and Drivers 550

crtrott and others added 7 commits April 29, 2024 13:27
This is in support of Grace Hopper making, CudaSpace host accessible.
I also added an emulation mode to run on other CUDA architectures,
by making the cudaMalloc wrapper call cudaMallocManaged.

Kokkos_ENABLE_IMPL_CUDA_EMULATE_UNIFIED_MEMORY is the option

A new macro KOKKOS_ENABLE_IMPL_CUDA_UNIFIED_MEMORY will be defined
if both Grace and Hopper are enabled.
Co-authored-by: Damien L-G <dalg24+github@gmail.com>
Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
Do not call this function for buffer of size 0.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants