`kokkos_malloc` should accept an execution space instance #6918

romintomasetti · 2024-04-04T14:42:23Z

Summary

Kokkos::kokkos_malloc can be used to allocate memory in a memory space, allowing the memory to be tracked by Kokkos (among other benefits, such as attaching a label).

However, it does not seem possible to pass an execution space instance, which would allow stream-ordered allocation.

Currently, the code looks like:

kokkos/core/src/Kokkos_Core.hpp

Lines 152 to 165 in a833fb0

    
    template <class Space = Kokkos::DefaultExecutionSpace::memory_space>
    inline void* kokkos_malloc(const std::string& arg_alloc_label,
                               const size_t arg_alloc_size) {
      using MemorySpace = typename Space::memory_space;
      return Impl::SharedAllocationRecord<MemorySpace>::allocate_tracked(
          MemorySpace(), arg_alloc_label, arg_alloc_size);
    }

    template <class Space = Kokkos::DefaultExecutionSpace::memory_space>
    inline void* kokkos_malloc(const size_t arg_alloc_size) {
      using MemorySpace = typename Space::memory_space;
      return Impl::SharedAllocationRecord<MemorySpace>::allocate_tracked(
          MemorySpace(), "no-label", arg_alloc_size);
    }

Actions

Add overloads that allow an execution space instance to be provided.
Rename Kokkos::kokkos_malloc to Kokkos::malloc?

Joint work with @maartenarnst, while thinking about the requirements for a copy-to-device helper that helps users generate the vtable on device.


romintomasetti · 2024-04-04T14:43:36Z

@dalg24 @masterleinad @crtrott What do you think?

masterleinad · 2024-04-04T17:44:04Z

Do you have a good motivation for using kokkos_malloc over a Kokkos::View allocation? How much does it matter to allocate on the correct instance? Note that we are fencing when allocating anyway to make sure the allocation has happened when the call returns.
The problem with kokkos_malloc is that some metadata is lost. Thus, Kokkos::View works in a multi-GPU setup out of the box while kokkos_malloc doesn't. Of course, you also have to manage the allocation manually.
From a consistency point of view, it might make sense to add these overloads, but in most situations I would discourage people from using these raw allocation mechanisms. Note that you can already call the memory spaces' allocate member functions directly.

romintomasetti · 2024-04-05T08:39:07Z

Motivation - Runtime polymorphism on device

> Do you have a good motivation for using kokkos_malloc over a Kokkos::View allocation?

The motivation is mostly related to creating vtables on device to allow for dynamic polymorphism. Just a few references:

Basically, a generic code that generates the vtable could look like the following (if you follow the presentation of @vbrunini):

#include <memory>

#include <Kokkos_Core.hpp>

/// A custom deleter that can be used in e.g. @c std::unique_ptr or @c std::shared_ptr.
/// Inspired by V. Brunini.
template <typename device_type>
struct DeviceDeleter
{
    template <typename T>
    void operator()(T* ptr) const
    {
        Kokkos::parallel_for(Kokkos::RangePolicy<typename device_type::execution_space>(0, 1),
                             KOKKOS_LAMBDA (const int /* */) {ptr->~T();});

        Kokkos::kokkos_free<typename device_type::memory_space>(ptr);
    }
};

/// Copy a host object to device with a placement new calling the copy constructor, thereby creating the @c vtable on device.
/// Inspired by V. Brunini.
template <
    typename Derived,
    typename device_type,
    typename smart_ptr_t = std::shared_ptr<Derived>
>
smart_ptr_t copy_to_device(const typename device_type::execution_space& space, const Derived& derived)
{
    auto* ptr = static_cast<Derived*>(Kokkos::kokkos_malloc<typename device_type::memory_space>(sizeof(Derived)));

    Kokkos::parallel_for(Kokkos::RangePolicy<typename device_type::execution_space>(space, 0, 1),
                         KOKKOS_LAMBDA (const int /* */) {new (ptr) Derived(derived);});

    // DeviceDeleter is parameterized on the device type (no name `execution_space` is in scope here).
    return smart_ptr_t(ptr, DeviceDeleter<device_type>());
}
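To make the pattern above easier to try without a device, here is a minimal host-only sketch of the same idea: allocate raw storage, copy-construct in place so the vtable pointer is written where the object lives, and let a custom deleter run the destructor before releasing the storage. This is an analogy, not Kokkos code: `RawDeleter`, `copy_to_raw_storage`, `Shape`, and `Square` are hypothetical names, `std::malloc`/`std::free` stand in for `Kokkos::kokkos_malloc`/`Kokkos::kokkos_free`, and the single-iteration `parallel_for` is replaced by a direct call.

```cpp
#include <cstdlib>
#include <memory>
#include <new>

// Host-only stand-in for the device deleter (hypothetical name): destroy the
// object explicitly, then release the raw storage.
struct RawDeleter {
    template <typename T>
    void operator()(T* ptr) const {
        ptr->~T();       // on device, this call would run inside a kernel
        std::free(ptr);  // stand-in for Kokkos::kokkos_free
    }
};

// Hypothetical helper: allocate raw storage, then copy-construct in place.
template <typename Derived>
std::shared_ptr<Derived> copy_to_raw_storage(const Derived& derived) {
    void* raw = std::malloc(sizeof(Derived));  // stand-in for kokkos_malloc
    if (raw == nullptr) throw std::bad_alloc();
    Derived* ptr = nullptr;
    try {
        // The copy constructor writes the vtable pointer into the new storage.
        ptr = new (raw) Derived(derived);
    } catch (...) {
        std::free(raw);
        throw;
    }
    return std::shared_ptr<Derived>(ptr, RawDeleter());
}

// Small polymorphic hierarchy used only for illustration.
struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Square : Shape {
    explicit Square(double s) : side(s) {}
    double area() const override { return side * side; }
    double side;
};
```

For instance, `std::shared_ptr<Shape> s = copy_to_raw_storage(Square(3.0));` followed by `s->area()` dispatches to `Square::area`, and the deleter still sees the original `Square*` when the last owner goes away.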

If you use a rank-0 view, as @masterleinad suggested, you might end up with code like this:

/// Same as @ref copy_to_device, but using a rank-0 @c Kokkos::View.
template <typename Derived, typename device_type, typename view_t = Kokkos::View<Derived, device_type>>
view_t copy_to_device_in_view(const typename device_type::execution_space& space, const Derived& derived)
{
    view_t copied(Kokkos::view_alloc(space, "label"));//, Kokkos::WithoutInitializing));

    Kokkos::parallel_for(Kokkos::RangePolicy<typename device_type::execution_space>(space, 0, 1),
                         KOKKOS_LAMBDA (const int /* */) {new (copied.data()) Derived(derived);});

    return copied;
}

It is indeed shorter to use a rank-0 view for that purpose.

However, we could not use Kokkos::WithoutInitializing when instantiating the rank-0 view (to skip the default construction that precedes our placement new). Indeed, if you pass Kokkos::WithoutInitializing, the destructor is never called. We think the culprit lies somewhere around these lines:

kokkos/core/src/impl/Kokkos_ViewMapping.hpp

Lines 3038 to 3050 in 4b90930

    
    //  Only initialize if the allocation is non-zero.
    //  May be zero if one of the dimensions is zero.
    if constexpr (alloc_prop::initialize)
      if (alloc_size) {
        // Assume destruction is only required when construction is requested.
        // The ViewValueFunctor has both value construction and destruction
        // operators.
        record->m_destroy = std::move(functor);

        // Construct values
        record->m_destroy.construct_shared_allocation();
      }
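A host-only sketch of what we suspect is happening, under the assumption that the quoted branch is the only place where destruction gets registered. This is not Kokkos code: `Record`, `Tracked`, and `destructor_calls_after_release` are hypothetical names, and the record's callback plays the role of `m_destroy`.

```cpp
#include <cstddef>
#include <functional>
#include <new>

// Hypothetical stand-in for an allocation record: it owns raw storage and an
// optional destroy callback that runs when the record is released.
struct Record {
    alignas(std::max_align_t) unsigned char storage[64];
    std::function<void()> destroy;  // empty means "nothing to destroy"
    ~Record() {
        if (destroy) destroy();
    }
};

// Value type that counts how many times its destructor runs.
struct Tracked {
    explicit Tracked(int* counter) : destructor_calls(counter) {}
    ~Tracked() { ++*destructor_calls; }
    int* destructor_calls;
};

// Mirrors the quoted branch: destruction is registered only when
// initialization is requested.
inline int destructor_calls_after_release(bool initialize) {
    int calls = 0;
    {
        Record record;
        if (initialize) {
            // "Initializing" path: construct a value and register destruction.
            Tracked* value = new (record.storage) Tracked(&calls);
            record.destroy = [value] { value->~Tracked(); };
        }
        // User-side placement new over the same storage, as in
        // copy_to_device_in_view.
        new (record.storage) Tracked(&calls);
    }  // record released here
    return calls;
}
```

In this sketch, `destructor_calls_after_release(true)` reports one destructor call, while `destructor_calls_after_release(false)` reports none: the object placed by the user is never destroyed when initialization is skipped.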

Allocating on the right stream

> How much does it matter to allocate on the correct instance? Note that we are fencing when allocating anyway to make sure the allocation has happened when the call returns.

I guess the motivation is this: if you can't pass an execution space instance to Kokkos::kokkos_malloc, the allocation is placed on the stream attached to the memory space instance that Kokkos::kokkos_malloc default-constructs, and that stream is inherited from the default execution space instance. Using Kokkos::kokkos_malloc might therefore leave the user waiting until the work already enqueued on the default instance's stream is done.

> The problem with kokkos_malloc is that some metadata is lost. Thus, Kokkos::View works in a multi-GPU setup out of the box while kokkos_malloc doesn't.

I'm curious which metadata is lost 😉
I'm guessing the reason it would not work in a multi-GPU context for now is that, as I mentioned above, the default memory space instance inherits the stream and device ID of the default execution space instance. Just a guess 😄

> Of course, you also have to manage the allocation manually. From a consistency point of view, it might make sense to add these overloads, but in most situations I would discourage people from using these raw allocation mechanisms. Note that you can already call the memory spaces' allocate member functions directly.

Avoid manual management: 💯% agreed!

ajpowelsnl added the Enhancement and Question labels on Apr 22, 2024
