kokkos_malloc should accept an execution space instance #6918

Open
romintomasetti opened this issue Apr 4, 2024 · 3 comments
Labels
Enhancement Improve existing capability; will potentially require voting Question For Kokkos internal and external contributors and users

Comments

@romintomasetti
Contributor

Summary

Kokkos::kokkos_malloc can be used to allocate memory in a memory space, allowing the memory to be tracked by Kokkos (among other benefits, such as attaching a label).

However, it does not seem possible to pass an execution space instance, which would allow stream-ordered allocation.

Currently, the code looks like:

template <class Space = Kokkos::DefaultExecutionSpace::memory_space>
inline void* kokkos_malloc(const std::string& arg_alloc_label,
                           const size_t arg_alloc_size) {
  using MemorySpace = typename Space::memory_space;
  return Impl::SharedAllocationRecord<MemorySpace>::allocate_tracked(
      MemorySpace(), arg_alloc_label, arg_alloc_size);
}

template <class Space = Kokkos::DefaultExecutionSpace::memory_space>
inline void* kokkos_malloc(const size_t arg_alloc_size) {
  using MemorySpace = typename Space::memory_space;
  return Impl::SharedAllocationRecord<MemorySpace>::allocate_tracked(
      MemorySpace(), "no-label", arg_alloc_size);
}

Actions

  1. Add overloads that allow an execution space instance to be provided.
  2. Rename Kokkos::kokkos_malloc to Kokkos::malloc?
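
To make Action 1 concrete, here is a minimal host-only sketch of the overload shape being asked for. It deliberately does not use Kokkos: `sketch::ExecInstance`, `sketch::MemSpace`, and `sketch::malloc_sketch` are hypothetical stand-ins invented for this illustration, and `stream_id` is only an assumed way of telling instances apart. The point is only that the new overload forwards the caller's instance instead of falling back to a default-constructed one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>

namespace sketch {

// Hypothetical stand-in for an execution space instance (e.g. one wrapping a stream).
struct ExecInstance {
    int stream_id = 0;
};

// Hypothetical stand-in for a memory space. It remembers which instance was
// used for the most recent allocation so the forwarding can be observed.
struct MemSpace {
    using memory_space    = MemSpace;
    using execution_space = ExecInstance;

    static int& last_stream() {
        static int id = -1;
        return id;
    }

    void* allocate(const ExecInstance& exec, const std::string& /*label*/,
                   std::size_t size) const {
        last_stream() = exec.stream_id;
        return std::malloc(size);
    }

    void deallocate(void* ptr) const { std::free(ptr); }
};

// Shape of the existing overload: no instance, so a default one is used.
template <class Space = MemSpace>
void* malloc_sketch(const std::string& label, std::size_t size) {
    using MemorySpace = typename Space::memory_space;
    return MemorySpace().allocate(typename MemorySpace::execution_space{}, label, size);
}

// Shape of the requested overload: the caller's instance is forwarded.
template <class Space = MemSpace>
void* malloc_sketch(const typename Space::memory_space::execution_space& exec,
                    const std::string& label, std::size_t size) {
    using MemorySpace = typename Space::memory_space;
    return MemorySpace().allocate(exec, label, size);
}

// Small demonstration: the allocation is attributed to the instance passed in.
inline void demo() {
    void* ptr = malloc_sketch(ExecInstance{7}, "buffer", 64);
    assert(ptr != nullptr);
    assert(MemSpace::last_stream() == 7);
    MemSpace().deallocate(ptr);
}

}  // namespace sketch
```

Whether the real overloads should take the instance as the first argument, and how they would interact with tracked allocations, is left open here.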

Joint work with @maartenarnst while thinking about the requirements for a copy-to-device helper that helps users generate the vtable on device.

@romintomasetti
Contributor Author

@dalg24 @masterleinad @crtrott What do you think?

@masterleinad
Contributor

Do you have a good motivation for using kokkos_malloc over a Kokkos::View allocation? How much does it matter to allocate on the correct instance? Note that we are fencing when allocating anyway to make sure the allocation has happened when the call returns.
The problem with kokkos_malloc is that some metadata is lost. Thus, Kokkos::View works in a multi-GPU setup out of the box while kokkos_malloc doesn't. Of course, you also have to manage the allocation manually.
From a consistency point of view, it might make sense to add these overloads, but in most situations I would discourage people from using these raw allocation mechanisms. Note that you can already call the memory spaces' allocate member functions directly.

@romintomasetti
Contributor Author

Motivation - Runtime polymorphism on device

Do you have a good motivation for using kokkos_malloc over a Kokkos::View allocation?

The motivation is mostly related to creating vtables on device to allow for dynamic polymorphism. Just a few references:

Basically, a generic code that generates the vtable could look like the following (if you follow the presentation of @vbrunini):

/// A custom deleter that can be used in e.g. @c std::unique_ptr or @c std::shared_ptr.
/// Inspired by V. Brunini.
template <typename device_type>
struct DeviceDeleter
{
    template <typename T>
    void operator()(T* ptr) const
    {
        Kokkos::parallel_for(Kokkos::RangePolicy<typename device_type::execution_space>(0, 1),
                             KOKKOS_LAMBDA (const int /* */) {ptr->~T();});

        Kokkos::kokkos_free<typename device_type::memory_space>(ptr);
    }
};

/// Copy a host object to device with a placement new calling the copy constructor, thereby creating the @c vtable on device.
/// Inspired by V. Brunini.
template <
    typename Derived,
    typename device_type,
    typename smart_ptr_t = std::shared_ptr<Derived>
>
smart_ptr_t copy_to_device(const typename device_type::execution_space& space, const Derived& derived)
{
    auto* ptr = static_cast<Derived*>(Kokkos::kokkos_malloc<typename device_type::memory_space>(sizeof(Derived)));

    Kokkos::parallel_for(Kokkos::RangePolicy<typename device_type::execution_space>(space, 0, 1),
                         KOKKOS_LAMBDA (const int /* */) {new (ptr) Derived(derived);});

    return smart_ptr_t(ptr, DeviceDeleter<device_type>());
}
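
For readers who want to try the allocate, placement-new, custom-deleter pattern without a device, here is a host-only analogue. It is a sketch under stated assumptions, not the helper discussed above: `std::malloc` and `std::free` stand in for the Kokkos allocation calls, there is no kernel launch, and `Shape`, `Square`, `RawDeleter`, and `copy_to_raw_storage` are hypothetical names made up for this example.

```cpp
#include <cassert>
#include <cstdlib>
#include <memory>
#include <new>

// Hypothetical polymorphic hierarchy used only for this illustration.
struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Square : Shape {
    explicit Square(double side) : side_(side) {}
    double area() const override { return side_ * side_; }
    double side_;
};

// Host analogue of the deleter: run the destructor explicitly, then
// release the raw storage that was obtained separately.
struct RawDeleter {
    template <typename T>
    void operator()(T* ptr) const {
        ptr->~T();
        std::free(ptr);
    }
};

// Host analogue of the copy helper: allocate raw bytes, then copy-construct
// the object in place so that its vtable pointer is set up where it lives.
template <typename Derived>
std::shared_ptr<Derived> copy_to_raw_storage(const Derived& derived) {
    void* raw = std::malloc(sizeof(Derived));
    if (raw == nullptr) throw std::bad_alloc();
    Derived* ptr = new (raw) Derived(derived);
    return std::shared_ptr<Derived>(ptr, RawDeleter());
}

// Small demonstration: virtual dispatch works through the base class.
inline void demo_copy() {
    auto square = copy_to_raw_storage(Square(3.0));
    const Shape& base = *square;
    assert(base.area() == 9.0);
}
```

On device, the construction and destruction steps have to run inside a kernel, which is why the original helper wraps them in a one-iteration parallel_for.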

If you use a rank-0 view as @masterleinad suggested, you might have code like this:

/// Same as @ref copy_to_device, but using a rank-0 @c Kokkos::View.
template <typename Derived, typename device_type, typename view_t = Kokkos::View<Derived, device_type>>
view_t copy_to_device_in_view(const typename device_type::execution_space& space, const Derived& derived)
{
    view_t copied(Kokkos::view_alloc(space, "label"));//, Kokkos::WithoutInitializing));

    Kokkos::parallel_for(Kokkos::RangePolicy<typename device_type::execution_space>(space, 0, 1),
                         KOKKOS_LAMBDA (const int /* */) {new (copied.data()) Derived(derived);});

    return copied;
}

It is indeed shorter to use a rank-0 view for that purpose.

However, we could not use Kokkos::WithoutInitializing when instantiating the rank-0 view (to skip the default construction before we do the placement new). Indeed, if you pass Kokkos::WithoutInitializing, the destructor is not called. We think the culprit lies somewhere around these lines:

// Only initialize if the allocation is non-zero.
// May be zero if one of the dimensions is zero.
if constexpr (alloc_prop::initialize)
  if (alloc_size) {
    // Assume destruction is only required when construction is requested.
    // The ViewValueFunctor has both value construction and destruction
    // operators.
    record->m_destroy = std::move(functor);
    // Construct values
    record->m_destroy.construct_shared_allocation();
  }
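
The effect of tying destruction to construction can be reproduced with a small host-only mock. This is a sketch of the logic as we read it from the excerpt, not the Kokkos implementation: `MockRecord`, `Counted`, `allocate_mock`, and `destructor_calls` are hypothetical names, and the `initialize` flag is an assumed stand-in for `alloc_prop::initialize`.

```cpp
#include <cassert>
#include <cstdlib>
#include <functional>
#include <new>

// Counts destructor invocations of Counted (hypothetical helper for this sketch).
int destroyed_count = 0;

struct Counted {
    ~Counted() { ++destroyed_count; }
};

// Hypothetical miniature of an allocation record: the destroy callback
// stays empty unless construction was requested at allocation time.
struct MockRecord {
    void* data = nullptr;
    std::function<void(void*)> destroy;

    ~MockRecord() {
        if (destroy) destroy(data);
        std::free(data);
    }
};

template <typename T>
void allocate_mock(MockRecord& record, bool initialize) {
    record.data = std::malloc(sizeof(T));
    if (record.data == nullptr) throw std::bad_alloc();
    if (initialize) {
        // Destruction is only registered together with construction.
        record.destroy = [](void* ptr) { static_cast<T*>(ptr)->~T(); };
        new (record.data) T();
    }
}

// Returns how many destructors ran over the lifetime of one record.
inline int destructor_calls(bool initialize) {
    destroyed_count = 0;
    {
        MockRecord record;
        allocate_mock<Counted>(record, initialize);
        // Without initialization, the user constructs the object afterwards,
        // but nothing ever registers the matching destructor call.
        if (!initialize) new (record.data) Counted();
    }
    return destroyed_count;
}
```

In this mock, `destructor_calls(true)` reports one destructor call and `destructor_calls(false)` reports none, which matches the behaviour described above for the rank-0 view.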

Allocating on the right stream

How much does it matter to allocate on the correct instance? Note that we are fencing when allocating anyway to make sure the allocation has happened when the call returns.

I guess the motivation is that, if you can't pass an execution space instance to Kokkos::kokkos_malloc, the allocation is placed on the stream attached to the memory space instance that is default-constructed inside Kokkos::kokkos_malloc, and that stream is inherited from the default execution space instance. Therefore, using Kokkos::kokkos_malloc might leave the user waiting until the work on the stream attached to the default space instance is done.

The problem with kokkos_malloc is that some metadata is lost. Thus, Kokkos::View works in a multi-GPU setup out of the box while kokkos_malloc doesn't.

  • I'm curious which metadata are lost 😉
  • I'm guessing the reason it would not work in a multi-GPU context for now is that, as I mentioned above, the default memory space instance inherits the stream and device ID of the default execution space instance. Just a guess 😄

Of course, you also have to manage the allocation manually. From a consistency point of view, it might make sense to add these overloads, but in most situations I would discourage people from using these raw allocation mechanisms. Note that you can already call the memory spaces' allocate member functions directly.

  • Avoid manual management: 💯% agreed!

@ajpowelsnl ajpowelsnl added Enhancement Improve existing capability; will potentially require voting Question For Kokkos internal and external contributors and users labels Apr 22, 2024

3 participants