Can we use systolic arrays in this implementation? #1404

Open
chsasank opened this issue Mar 13, 2024 · 9 comments
Labels
discussion General discussion about something

Comments

@chsasank
Contributor

Congrats on your release!

I am wondering if your implementation allows me to use systolic arrays (tensor cores, XMX engines, or matrix cores, depending on the GPU vendor)? Intel's implementation has this extension to support these across vendors.

I think support for these is very important for the performance of matrix multiplication and highly quantized implementations.

@chsasank chsasank added the discussion label Mar 13, 2024
@illuhad
Collaborator

illuhad commented Mar 13, 2024

I don't think we have a "nice interface" for this functionality, but it should be possible to call backend builtins for this purpose if the backends have them. How exactly this would look depends on the compilation flow you want to use.

I have not yet looked at the Intel extension in depth, but in my experience Intel extensions are not always designed with the goal of being implementable by other SYCL implementations. So we might need a different interface.

If you want to work on building this "nice interface", I can provide further guidance :)

@chsasank
Contributor Author

chsasank commented Mar 13, 2024 via email

@illuhad
Collaborator

illuhad commented Mar 13, 2024

That's great :)

Firstly, be aware that AdaptiveCpp supports multiple compilation flows, as outlined here: https://github.com/AdaptiveCpp/AdaptiveCpp/blob/develop/doc/compilation.md

The most important one is the generic single-pass compiler, which is our default and most powerful compiler. So it would probably be a good idea to focus on that one initially.

For the generic compiler, in the best case, we won't need any new compiler magic.
The way that the generic compiler handles backend-specific things at the moment is to have a unified builtin interface that is implemented for the various backends in backend-specific bitcode libraries. The higher-level C++ in the SYCL headers can then just call into this builtin interface.

For example, this file defines the available math builtins: https://github.com/AdaptiveCpp/AdaptiveCpp/blob/develop/include/hipSYCL/sycl/libkernel/sscp/builtins/math.hpp
And this is then implemented in the math.cpp files for the various backends in the subdirectories here:
https://github.com/AdaptiveCpp/AdaptiveCpp/tree/develop/src/libkernel/sscp
Inside the *.cpp files, you can then use target-specific compiler builtins or even inline assembly as needed.
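
For illustration, here is a minimal sketch of that declaration/implementation split. The names are made up for the example and are not the actual builtin names in the repository:

// Hypothetical unified builtin declaration, shared by all backends (name illustrative).
extern "C" float __example_sscp_fma_f32(float a, float b, float c);

// Hypothetical backend implementation, living in one of the per-backend bitcode libraries.
// A real backend would use its target-specific compiler builtin or inline assembly here;
// a generic clang builtin stands in for it in this sketch.
extern "C" float __example_sscp_fma_f32(float a, float b, float c) {
  return __builtin_fmaf(a, b, c);
}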

Ideally we could implement systolic arrays similarly. This would require first coming up with a suitable, unified builtin interface API. For this purpose, it might be useful to do a small survey of how systolic arrays work on the various backends - what the compiler builtins/instructions look like, which assumptions they make, etc. It might also be useful to look at how DPC++ maps this to hardware.

Once we have the builtin API, we could then add implementations for the backends and, lastly, a pretty C++ interface in include/hipSYCL/sycl which calls the builtins.
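
To make that concrete, a very rough sketch of what such a builtin surface might look like; every name, shape and signature below is hypothetical and only illustrates the load/mad/store split, not an actual AdaptiveCpp API:

// Hypothetical unified matrix builtins (illustrative names, one fixed 16x16x16 fp16->fp32 shape).
// Each backend's bitcode library would lower these to tensor core / MFMA / XMX instructions.
extern "C" {
// Opaque per-work-item fragment storage; the actual size/layout would be backend-defined.
struct __example_mma_fragment { unsigned storage[8]; };

void __example_mma_load_a_f16_16x16x16(__example_mma_fragment *frag,
                                       const unsigned short *src, int stride);
void __example_mma_load_b_f16_16x16x16(__example_mma_fragment *frag,
                                       const unsigned short *src, int stride);
void __example_mma_mad_f32_16x16x16(__example_mma_fragment *d,
                                    const __example_mma_fragment *a,
                                    const __example_mma_fragment *b,
                                    const __example_mma_fragment *c);
void __example_mma_store_f32_16x16x16(float *dst,
                                      const __example_mma_fragment *frag, int stride);
}

The pretty C++ layer in the SYCL headers would then wrap these in a typed fragment class and free functions, analogous to how the math builtins are wrapped today.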

There are probably some nitty-gritty details down the line that we would need to discuss at some point. For example, what exactly should happen if some hardware does not support this functionality natively? DPC++ seems to throw an exception in that case, but it might be more useful to fall back to a software implementation in order to simplify the development of portable code. But I think initially we should not overcomplicate things and can just assume that the hardware has this functionality. We need to start somewhere :)

We might also need some runtime bits (e.g. to query hardware capabilities and expose that information to users), but that would similarly not be critical initially, I think.
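
For the runtime side, the eventual usage could look something like the standard SYCL aspect query; the aspect name below does not exist and is purely illustrative:

// Hypothetical capability query (the aspect is made up; nothing like it exists yet):
//
//   sycl::queue q;
//   if (q.get_device().has(sycl::aspect::ext_example_joint_matrix)) {
//     // launch the tensor-core / MFMA / XMX path
//   } else {
//     // launch the vector-instruction software fallback
//   }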

@chsasank
Contributor Author

chsasank commented Mar 13, 2024

Let me start with the easy bit:

Ideally we could implement systolic arrays similarly. This would require first coming up with a suitable, unified builtin interface API. For this purpose, it might be useful to do a small survey of how systolic arrays work on the various backends - what the compiler builtins/instructions look like, which assumptions they make, etc. It might also be useful to look at how DPC++ maps this to hardware.

Nvidia tensor core programming:

nvcuda::wmma::load_matrix_sync(a_frag, matrix_mma_a_mptr, lda);
nvcuda::wmma::load_matrix_sync(b_frag, matrix_mma_b_mptr, ldb);

// Perform the matrix multiplication
nvcuda::wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);

AMD example:

// Load the inputs
rocwmma::load_matrix_sync(fragA, a + (cRow * lda + i), lda);
rocwmma::load_matrix_sync(fragB, b + (i + cCol * ldb), ldb);

// Matrix multiply - accumulate using MFMA units
rocwmma::mma_sync(fragAcc, fragA, fragB, fragAcc);

Intel example:

joint_matrix_load(
    sg, sub_a,
    accA.template get_multi_ptr<sycl::access::decorated::no>() +
        (sg_startx * TM) * K + k * TK,
    K);
joint_matrix_load(
    sg, sub_b,
    accB.template get_multi_ptr<sycl::access::decorated::no>() +
        (k * TK / 2) * (N * 2) + sg_starty / SG_SZ * TN * 2,
    N * 2);
joint_matrix_mad(sg, sub_c, sub_a, sub_b, sub_c);

Generally speaking, the interface is fairly standard across all three vendors. I have only dived deep into Intel's version, so I can explain how that one works. In each execution unit, there is support for both vector instructions and matrix instructions, and they share the register file. We load small matrices (A = 8x16, B = 16x8 => C = 8x8) into registers and do a dot-product-accumulate. This can then be used to implement many high-level DNN-related kernels.
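
For context, here is a minimal sketch of how these load/mma/store primitives typically compose into a tiled GEMM. It uses the public CUDA WMMA API from the Nvidia example above (not any AdaptiveCpp code) and assumes row-major A/C, column-major B, 16x16x16 half/float tiles, dimensions divisible by 16, one warp per block and an (N/16, M/16) grid:

#include <mma.h>
using namespace nvcuda;

// One warp computes one 16x16 tile of C = A * B (requires a Volta-or-newer GPU).
__global__ void wmma_tile_gemm(const half *a, const half *b, float *c,
                               int M, int N, int K) {
  int tile_row = blockIdx.y * 16;  // top-left corner of this warp's C tile
  int tile_col = blockIdx.x * 16;

  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;
  wmma::fill_fragment(acc_frag, 0.0f);  // zero the accumulator tile

  for (int k = 0; k < K; k += 16) {
    wmma::load_matrix_sync(a_frag, a + tile_row * K + k, K);  // A is row-major, lda = K
    wmma::load_matrix_sync(b_frag, b + tile_col * K + k, K);  // B is col-major, ldb = K
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);       // acc += A_tile * B_tile
  }
  wmma::store_matrix_sync(c + tile_row * N + tile_col, acc_frag, N, wmma::mem_row_major);
}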

it might be more useful to fall back to a software implementation in order to simplify the development of portable code

I 100% agree. In case systolic arrays are missing, we should implement a reasonably high-performance version using vector instructions. The BLIS people call this a microkernel, and here's how it looks for a CPU: actual version, dumbed-down version.
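
For reference, here is a dumbed-down scalar sketch of such a microkernel (my own illustration: an MR x NR register tile updated with one rank-1 update per iteration over packed A/B panels; a real fallback would vectorize this and pick MR/NR per target):

// C[MR x NR] += A_panel (MR x k, packed so each step's MR-column is contiguous)
//             * B_panel (k x NR, packed so each step's NR-row is contiguous).
template <int MR, int NR>
void microkernel(int k, const float *a_panel, const float *b_panel,
                 float *c, int ldc) {
  float acc[MR][NR] = {};                  // accumulator tile, intended to stay in registers
  for (int p = 0; p < k; ++p)              // one rank-1 update per p
    for (int i = 0; i < MR; ++i)
      for (int j = 0; j < NR; ++j)
        acc[i][j] += a_panel[p * MR + i] * b_panel[p * NR + j];
  for (int i = 0; i < MR; ++i)             // accumulate the tile back into C
    for (int j = 0; j < NR; ++j)
      c[i * ldc + j] += acc[i][j];
}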

I will look into the DPC++ implementation and report back on the details from here.

@chsasank
Contributor Author

Here's some information about the DPC++ implementation. Let's look at joint_matrix_mad from matrix-unified.hpp:

joint_matrix_mad(
    Group,
    joint_matrix<Group, Td, use::accumulator, M, N,
                 sycl::ext::oneapi::experimental::matrix::layout::dynamic> &D,
    const joint_matrix<Group, Ta, use::a, M, K, LayoutA> &A,
    const joint_matrix<Group, Tb, use::b, K, N, LayoutB> &B,
    const joint_matrix<Group, Tc, use::accumulator, M, N,
                       sycl::ext::oneapi::experimental::matrix::layout::dynamic>
        &C) {
...
#if defined(__NVPTX__)
  if constexpr (std::is_same<Ta, Tb>::value) {
    sycl::ext::oneapi::detail::joint_matrix_mad_cuda<Ta, Tc, Td, M, K, N,
                                                     LayoutA, LayoutB>(
        D.matrix_impl, A.matrix_impl, B.matrix_impl, C.matrix_impl);
  }
...
#elif defined(__HIP_PLATFORM_AMD_MFMA__)
  if constexpr (std::is_same<Ta, Tb>::value) {
    sycl::ext::oneapi::detail::joint_matrix_mad_hip<Ta, Tc, M, K, N, LayoutA,
                                                    LayoutB>(
        D.matrix_impl, A.matrix_impl, B.matrix_impl, C.matrix_impl);
...
#else
  if constexpr (std::is_same<Ta, uint16_t>::value &&
                std::is_same<Tb, uint16_t>::value &&
                std::is_same<Tc, float>::value)
    D.spvm = __spirv_JointMatrixMadINTEL(A.spvm, B.spvm, C.spvm);
...

Nvidia's version (joint_matrix_mad_cuda) is here. It uses a builtin like this:

__imma_m16n16k16_mma_s8(ptrD, ptrA, ptrB, ptrC,
                                get_layout_pair_id<LayoutA, LayoutB>(), 0);

AMD's version (joint_matrix_mad_hip) is here. It uses a builtin like this:

auto result = __builtin_amdgcn_mfma_f32_16x16x16f16(
          *reinterpret_cast<const float16x4 *>(&A.wi_marray),
          *reinterpret_cast<const float16x4 *>(&B.wi_marray),
          *reinterpret_cast<const floatx4 *>(&C.wi_marray), 0, 0, 0);
      std::memcpy(&D.wi_marray, &result, 4 * sizeof(float));

Intel's SPIR-V version is defined in the OpenCL SPIR-V ops.

    D.spvm = __spirv_JointMatrixMadINTEL(A.spvm, B.spvm, C.spvm);

So, it looks like all three versions use builtins. Does that mean we don't have to do any compiler shenanigans?

@illuhad
Collaborator

illuhad commented Mar 13, 2024

Thanks for the investigation!

So, it looks like all three versions use builtins. Does that mean we don't have to do any compiler shenanigans?

Yep, looks like it :)

Now we need to figure out which kind of interface will be useful to cover all three.

@chsasank
Contributor Author

Our interface needs to:

  1. Create sub-matrices/fragments to hold the data
  2. Provide a load instruction to load data from an array pointer into these fragments
  3. Provide a mad instruction to do a matrix multiply-add on the sub-matrices/fragments
  4. Provide a store instruction to retrieve the data from the sub-matrices
  5. Maybe provide a fill instruction to fill sub-matrices with constants

Different GPUs support different sub-matrix sizes, dtypes and memory layouts, so we need to figure out a way to minimize code duplication there.
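
To illustrate points 1-5, here is a very rough sketch of what such a templated surface could look like; everything below (names, template parameters, signatures) is hypothetical and only shows how templating over element type, tile shape and layout might keep duplication down:

#include <cstddef>

// Hypothetical user-facing types and free functions (illustrative only).
enum class matrix_use { a, b, accumulator };
enum class matrix_layout { row_major, col_major, packed /* e.g. Intel's VNNI */ };

// 1. Fragment type: opaque per-work-item storage provided by the builtin layer.
template <class T, matrix_use Use, int Rows, int Cols,
          matrix_layout Layout = matrix_layout::row_major>
struct sub_matrix;

// 2./4./5. Load, store and fill, expressed over a work group `g`.
template <class Group, class T, matrix_use Use, int Rows, int Cols, matrix_layout Layout>
void sub_matrix_load(Group g, sub_matrix<T, Use, Rows, Cols, Layout> &m,
                     const T *src, std::size_t stride);
template <class Group, class T, int Rows, int Cols>
void sub_matrix_store(Group g, const sub_matrix<T, matrix_use::accumulator, Rows, Cols> &m,
                      T *dst, std::size_t stride);
template <class Group, class T, matrix_use Use, int Rows, int Cols, matrix_layout Layout>
void sub_matrix_fill(Group g, sub_matrix<T, Use, Rows, Cols, Layout> &m, T value);

// 3. Matrix multiply-add: D = A * B + C.
template <class Group, class Td, class Ta, class Tb, class Tc,
          int M, int N, int K, matrix_layout LA, matrix_layout LB>
void sub_matrix_mad(Group g,
                    sub_matrix<Td, matrix_use::accumulator, M, N> &d,
                    const sub_matrix<Ta, matrix_use::a, M, K, LA> &a,
                    const sub_matrix<Tb, matrix_use::b, K, N, LB> &b,
                    const sub_matrix<Tc, matrix_use::accumulator, M, N> &c);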

@illuhad
Collaborator

illuhad commented Mar 18, 2024

Thanks for the analysis :)

I think it might be a good idea to start with the data types that the interface is supposed to operate on.

Do we need an opaque data type for the matrices? Can backends agree on a matrix type size at least?
For example, half is currently effectively represented as uint16 in the builtin interface. Builtin implementations then bitcast this to _Float16 or whatever type works internally for the given LLVM backend, as not all backends support _Float16.
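
For illustration, here is roughly what that bitcast pattern looks like on the implementation side; the builtin name is made up, and the snippet assumes a compiler that provides _Float16:

// Half travels through the unified builtin interface as a 16-bit integer;
// the backend implementation reinterprets it to a native half type internally.
extern "C" unsigned short __example_half_add(unsigned short a, unsigned short b) {
  _Float16 fa, fb;
  __builtin_memcpy(&fa, &a, sizeof fa);   // bitcast uint16 -> _Float16
  __builtin_memcpy(&fb, &b, sizeof fb);
  _Float16 fr = fa + fb;
  unsigned short r;
  __builtin_memcpy(&r, &fr, sizeof r);    // bitcast back for the uniform interface
  return r;
}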

@chsasank
Contributor Author

Sorry for the late reply, I have been traveling.

Can backends agree on a matrix type size at least?

I think so - at least according to Intel's extension. All the supported combinations are documented here. We do need to allow for different layouts: column major, row major, and a weird one called VNNI for Intel.
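
For reference, VNNI for 16-bit element types interleaves pairs of adjacent rows of B, which is why the Intel example above indexes B with a stride of N * 2. A small repack sketch (my own illustration, assuming even K):

#include <cstdint>

// Repack a row-major K x N matrix of 16-bit elements into VNNI layout:
// element (k, n) moves to dst[(k / 2) * (N * 2) + n * 2 + (k % 2)].
void pack_vnni_16bit(const std::uint16_t *src, std::uint16_t *dst, int K, int N) {
  for (int k = 0; k < K; ++k)
    for (int n = 0; n < N; ++n)
      dst[(k / 2) * (N * 2) + n * 2 + (k % 2)] = src[k * N + n];
}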

There might be some data types that are supported by some of the GPUs but not by built-in C++ data types, like int4/int2.
