Add support for CUDA unified memory architectures i.e. Grace Hopper #6823

crtrott · 2024-02-16T00:56:15Z

This PR makes CudaSpace host accessible for Grace Hopper. As a consequence, functions such as create_mirror_view will not create extra host allocations. To make this happen, I added Grace as an architecture option named ARMV9_GRACE (following the previous ARMV8_THUNDERX2 scheme).

I also added an emulation option for other CUDA-based systems, which one can enable with
-DKokkos_ENABLE_IMPL_CUDA_EMULATE_UNIFIED_MEMORY.
In that case cudaMalloc is replaced with cudaMallocManaged.

cmake/kokkos_arch.cmake

cmake/kokkos_enable_options.cmake

masterleinad

We should also have a CI build that has Kokkos_ENABLE_IMPL_CUDA_EMULATE_UNIFIED_MEMORY=ON

Makefile.kokkos

core/src/Cuda/Kokkos_CudaSpace.hpp

core/src/Cuda/Kokkos_Cuda_Instance.hpp

core/src/Kokkos_Macros.hpp

core/unit_test/cuda/TestCuda_Spaces.cpp

containers/unit_tests/TestWithoutInitializing.hpp

cmake/kokkos_enable_devices.cmake

masterleinad · 2024-02-16T16:00:11Z

containers/unit_tests/TestWithoutInitializing.hpp

+#define GTEST_SKIP_IF_UNIFIED_MEMORY_SPACE \
+  GTEST_SKIP() << "skipping since unified memory requires additional fences";


Shouldn't this only check for CudaSpace?

Suggested change

-#define GTEST_SKIP_IF_UNIFIED_MEMORY_SPACE \
-  GTEST_SKIP() << "skipping since unified memory requires additional fences";
+#define GTEST_SKIP_IF_UNIFIED_MEMORY_SPACE                            \
+  if constexpr (std::is_same_v<typename TEST_EXECSPACE::memory_space, \
+                               Kokkos::CudaSpace>)                    \
+    GTEST_SKIP() << "skipping since unified memory requires additional fences";

Where exactly do the extra fences come from?

I can see that this fails when we fence consistently.
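For readers following the suggestion above, here is a minimal sketch of the compile-time check it relies on. It does not use Kokkos or GoogleTest: `CudaSpaceTag`, `HostSpaceTag`, `FakeCudaExec`, `FakeHostExec`, and `should_skip_for_unified_memory` are hypothetical stand-ins introduced only for illustration, and the real macro would call `GTEST_SKIP()` instead of returning a flag.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-ins for memory spaces (not the real Kokkos types).
struct CudaSpaceTag {};
struct HostSpaceTag {};

// Hypothetical execution spaces exposing a memory_space alias, mirroring how
// TEST_EXECSPACE::memory_space is used in the suggested macro.
struct FakeCudaExec {
  using memory_space = CudaSpaceTag;
};
struct FakeHostExec {
  using memory_space = HostSpaceTag;
};

// Returns true only when the execution space's memory space is the CUDA
// stand-in, so tests for every other space keep running.
template <class ExecSpace>
constexpr bool should_skip_for_unified_memory() {
  if constexpr (std::is_same_v<typename ExecSpace::memory_space,
                               CudaSpaceTag>) {
    return true;
  } else {
    return false;
  }
}

// The decision is made at compile time.
static_assert(should_skip_for_unified_memory<FakeCudaExec>());
static_assert(!should_skip_for_unified_memory<FakeHostExec>());

// Small run-time demo exercising both branches.
inline void demo_skip_check() {
  assert(should_skip_for_unified_memory<FakeCudaExec>());
  assert(!should_skip_for_unified_memory<FakeHostExec>());
}
```

Because the condition is evaluated with `if constexpr`, the skip only applies to the CUDA memory space and has no effect on other backends.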

cmake/kokkos_enable_devices.cmake

core/src/Cuda/Kokkos_CudaSpace.cpp

core/src/Cuda/Kokkos_CudaSpace.hpp

core/src/Cuda/Kokkos_Cuda_Instance.hpp

cmake/KokkosCore_config.h.in

masterleinad · 2024-02-23T16:01:02Z

core/src/Kokkos_Macros.hpp

+#if defined(KOKKOS_ENABLE_IMPL_CUDA_EMULATE_UNIFIED_MEMORY)
+#define KOKKOS_ENABLE_IMPL_CUDA_UNIFIED_MEMORY
+#endif
+#if defined(KOKKOS_ARCH_ARMV9_GRACE) && defined(KOKKOS_ARCH_HOPPER90)


Do we really need both? Isn't checking for KOKKOS_ARCH_HOPPER90 sufficient?

I think H100 + x86_64 machines have the same KOKKOS_ARCH_HOPPER90 flag set and I am not sure all have HMM enabled.
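To make the point of this thread concrete, the following sketch models the macro logic from the hunk above as a truth table, as it stood at this point in the review. `unified_memory_enabled` is a hypothetical helper, not a Kokkos function; its three parameters stand in for `KOKKOS_ENABLE_IMPL_CUDA_EMULATE_UNIFIED_MEMORY`, `KOKKOS_ARCH_ARMV9_GRACE`, and `KOKKOS_ARCH_HOPPER90`.

```cpp
#include <cassert>

// Hypothetical helper (not part of Kokkos): models when the unified-memory
// macro would be defined under the quoted hunk.
//   emulate  ~ KOKKOS_ENABLE_IMPL_CUDA_EMULATE_UNIFIED_MEMORY
//   grace    ~ KOKKOS_ARCH_ARMV9_GRACE
//   hopper90 ~ KOKKOS_ARCH_HOPPER90
constexpr bool unified_memory_enabled(bool emulate, bool grace,
                                      bool hopper90) {
  return emulate || (grace && hopper90);
}

// Grace CPU together with a Hopper GPU: enabled.
static_assert(unified_memory_enabled(false, true, true), "Grace Hopper");
// H100 paired with an x86_64 host: Hopper alone does not enable it.
static_assert(!unified_memory_enabled(false, false, true), "x86_64 + H100");
// Emulation option: enabled regardless of the architecture flags.
static_assert(unified_memory_enabled(true, false, false), "emulation");

// Small run-time demo of the same three cases.
inline void demo_unified_memory_gate() {
  assert(unified_memory_enabled(false, true, true));
  assert(!unified_memory_enabled(false, false, true));
  assert(unified_memory_enabled(true, false, false));
}
```

The second case is the one raised in the reply: checking only the Hopper flag would also switch the feature on for H100 systems with an x86_64 host.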

masterleinad · 2024-03-06T13:33:40Z

core/src/Cuda/Kokkos_CudaSpace.cpp

@@ -184,6 +184,24 @@ void *impl_allocate_common(const int device_id,
  cudaError_t error_code = cudaSuccess;
 #ifndef CUDART_VERSION
 #error CUDART_VERSION undefined!
+#elif defined(KOKKOS_ENABLE_IMPL_CUDA_EMULATE_UNIFIED_MEMORY)
+  // This is inteded to simulate Grace-Hopper like behavior


Suggested change

-  // This is inteded to simulate Grace-Hopper like behavior
+  // This is intended to simulate Grace-Hopper-like behavior

masterleinad · 2024-03-06T13:33:46Z

containers/unit_tests/TestWithoutInitializing.hpp

+#define GTEST_SKIP_IF_UNIFIED_MEMORY_SPACE \
+  GTEST_SKIP() << "skipping since unified memory requires additional fences";


masterleinad · 2024-03-06T13:34:11Z

core/src/Cuda/Kokkos_CudaSpace.cpp

+  KOKKOS_IMPL_CUDA_SAFE_CALL(cudaDeviceSynchronize());
+#elif defined(KOKKOS_ENABLE_IMPL_CUDA_UNIFIED_MEMORY)
+  // This is intended for Grace-Hopper (and future unified memory architectures)
+  // The idea is to use host allocator and then adivce to keep it in HBM on


Suggested change

-  // The idea is to use host allocator and then adivce to keep it in HBM on
+  // The idea is to use a host allocator and then advise to keep it in HBM on the

masterleinad · 2024-03-06T13:36:10Z

core/src/Cuda/Kokkos_CudaSpace.cpp

-#else
-    KOKKOS_IMPL_CUDA_SAFE_CALL(cudaSetDevice(m_device));
-    KOKKOS_IMPL_CUDA_SAFE_CALL(cudaFree(arg_alloc_ptr));


What happened to this branch?

isn't that the stuff in 367-369?

This is the fallback branch for when KOKKOS_ENABLE_IMPL_CUDA_MALLOC_ASYNC is false or we have a Cuda version less than 11.2. With the changes here we don't free the memory in this case anymore AFAICT.
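The concern can be illustrated with a small stand-alone sketch of a preprocessor chain in which every path, including the final `#else`, releases the memory. Nothing here is the actual Kokkos code: `DEMO_UNIFIED_MEMORY`, `DEMO_MALLOC_ASYNC`, `demo_deallocate`, and `g_free_calls` are hypothetical names, and `std::free` stands in for the CUDA calls.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical feature switches standing in for the real configuration
// macros. Both are left undefined so the fallback path is the one compiled.
// #define DEMO_UNIFIED_MEMORY
// #define DEMO_MALLOC_ASYNC

int g_free_calls = 0;  // counts how many times memory was released

// Stand-in for a deallocation routine: each preprocessor path must end in a
// release, otherwise the configuration that reaches #else leaks.
inline void demo_deallocate(void* ptr) {
#if defined(DEMO_UNIFIED_MEMORY)
  std::free(ptr);  // stand-in for the unified-memory release path
  ++g_free_calls;
#elif defined(DEMO_MALLOC_ASYNC)
  std::free(ptr);  // stand-in for the stream-ordered release path
  ++g_free_calls;
#else
  std::free(ptr);  // fallback: stand-in for cudaSetDevice + cudaFree
  ++g_free_calls;
#endif
}

// Small demo: with no switch defined, the fallback must still free once.
inline void demo_fallback_frees() {
  void* p = std::malloc(16);
  assert(p != nullptr);
  const int before = g_free_calls;
  demo_deallocate(p);
  assert(g_free_calls == before + 1);
}
```

If the `#else` branch were dropped, as the removed lines in the hunk suggest, a build with neither switch defined would compile to an empty function and the allocation would never be released.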

core/src/Cuda/Kokkos_CudaSpace.hpp

masterleinad · 2024-03-06T13:41:56Z

core/src/Kokkos_Macros.hpp

+// TODO: enable the following when we are sure it is the right thing to do
+//#if defined(KOKKOS_ARCH_ARMV9_GRACE) && defined(KOKKOS_ARCH_HOPPER90)
+//#define KOKKOS_ENABLE_IMPL_CUDA_UNIFIED_MEMORY
+//#endif


So we only care about emulating for now? Or do we want to enable this before merging after testing?

I added a cmake option for this so you can enable it explicitly.

OK. I still think we should test this in at least one CI build before merging.

masterleinad

Please address the open conversations in https://github.com/kokkos/kokkos/pull/6823/files#r1515292936 and https://github.com/kokkos/kokkos/pull/6823/files#r1492679604 and fix typos.

crtrott · 2024-03-14T22:08:37Z

We are still waiting on confirmation that this works properly at all, which may require CUDA 12.4 and driver version 550.

This is in support of Grace Hopper, making CudaSpace host accessible. I also added an emulation mode to run on other CUDA architectures by making the cudaMalloc wrapper call cudaMallocManaged; Kokkos_ENABLE_IMPL_CUDA_EMULATE_UNIFIED_MEMORY is the option. A new macro KOKKOS_ENABLE_IMPL_CUDA_UNIFIED_MEMORY will be defined if both Grace and Hopper are enabled.

Co-authored-by: Damien L-G <dalg24+github@gmail.com> Co-authored-by: Daniel Arndt <arndtd@ornl.gov>

Do not call this function for buffer of size 0.

crtrott force-pushed the add-cuda-unified-memory-arch branch from a5cee1f to b79efac Compare February 16, 2024 01:01

masterleinad reviewed Feb 16, 2024

cmake/kokkos_arch.cmake (outdated, resolved)

masterleinad reviewed Feb 16, 2024

cmake/kokkos_enable_options.cmake (outdated, resolved)

masterleinad reviewed Feb 16, 2024

dalg24 reviewed Feb 16, 2024

crtrott force-pushed the add-cuda-unified-memory-arch branch from b79efac to 694585d Compare February 16, 2024 16:05

masterleinad reviewed Feb 16, 2024

crtrott force-pushed the add-cuda-unified-memory-arch branch from f2b11ca to 6f820d3 Compare February 16, 2024 17:18

cedricchevalier19 reviewed Feb 23, 2024

cmake/KokkosCore_config.h.in (resolved)

masterleinad reviewed Feb 23, 2024

ajpowelsnl mentioned this pull request Feb 28, 2024

Release Themes for 2024 #6804

Open

crtrott force-pushed the add-cuda-unified-memory-arch branch 2 times, most recently from 0a9ff6c to 8746494 Compare March 5, 2024 06:04

masterleinad reviewed Mar 6, 2024

crtrott force-pushed the add-cuda-unified-memory-arch branch from 8746494 to 3dd44b6 Compare March 7, 2024 18:04

masterleinad requested changes Mar 11, 2024

crtrott and others added 7 commits April 29, 2024 13:27

Add Grace CPU architecture

cec0183

Fix tests for Unified Memory option being on

da17c27

Fix tests for Unified Memory option being on

c1c714b

Apply suggestions from code review

e781313

Co-authored-by: Damien L-G <dalg24+github@gmail.com> Co-authored-by: Daniel Arndt <arndtd@ornl.gov>

Print configuration for UNIFIED_MEMORY

0f25d2b

fix: CudaMemAdvise for Grace-Hopper

62ba653

Do not call this function for buffer of size 0.

cedricchevalier19 force-pushed the add-cuda-unified-memory-arch branch from 3dd44b6 to 62ba653 Compare May 14, 2024 09:16

Fix format and wording

c1b7572
