Can we use systolic arrays in this implementation? #1404

Open
chsasank opened this issue Mar 13, 2024 · 9 comments
Labels
discussion General discussion about something

Comments

@chsasank
Contributor

Congrats on your release!

I am wondering if your implementation allows me to use systolic arrays (tensor cores, XMX engines, or matrix cores, depending on the GPU vendor)? Intel's implementation has this extension to support these across vendors.

I think support for these is very important for the performance of matrix multiplication and highly quantized implementations.

@chsasank chsasank added the discussion label Mar 13, 2024
@illuhad
Collaborator

illuhad commented Mar 13, 2024

I don't think we have a "nice interface" for this functionality, but it should be possible to call backend builtins for this purpose if the backends have them. How exactly this would look depends on the compilation flow you want to use.

I have not yet looked at the Intel extension in depth, but in my experience Intel extensions are not always designed with the goal of being implementable by other SYCL implementations. So we might need a different interface.

If you want to work on building this "nice interface", I can provide further guidance :)

@chsasank
Contributor Author

chsasank commented Mar 13, 2024 via email

@illuhad
Collaborator

illuhad commented Mar 13, 2024

That's great :)

Firstly, be aware that AdaptiveCpp supports multiple compilation flows, as outlined here: https://github.com/AdaptiveCpp/AdaptiveCpp/blob/develop/doc/compilation.md

The most important one is the generic single-pass compiler, which is our default and most powerful compiler. So it would probably be a good idea to focus on that one initially.

For the generic compiler, in the best case, we won't need any new compiler magic.
The way that the generic compiler handles backend-specific things at the moment is to have a unified builtin interface that is implemented for the various backends in backend-specific bitcode libraries. The higher-level C++ in the SYCL headers can then just call into this builtin interface.

For example, this file defines the available math builtins: https://github.com/AdaptiveCpp/AdaptiveCpp/blob/develop/include/hipSYCL/sycl/libkernel/sscp/builtins/math.hpp
And this is then implemented in the math.cpp files for the various backends in the subdirectories here:
https://github.com/AdaptiveCpp/AdaptiveCpp/tree/develop/src/libkernel/sscp
Inside the *.cpp files, you can then use target-specific compiler builtins or even inline assembly as needed.
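
For illustration, here is a minimal sketch of that declaration/implementation split. The names are made up for the example and are not the actual builtin names in the repository:

// Hypothetical unified builtin declaration, shared by all backends (name illustrative).
extern "C" float __example_sscp_fma_f32(float a, float b, float c);

// Hypothetical backend implementation, living in one of the per-backend bitcode libraries.
// A real backend would use its target-specific compiler builtin or inline assembly here;
// a generic clang builtin stands in for it in this sketch.
extern "C" float __example_sscp_fma_f32(float a, float b, float c) {
  return __builtin_fmaf(a, b, c);
}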

Ideally we could implement systolic arrays similarly. This would require first coming up with a suitable, unified builtin interface API. For this purpose, it might be useful to do a small survey of how systolic arrays work on the various backends - what the compiler builtins/instructions look like, which assumptions they make, etc. It might also be useful to look at how DPC++ maps this to hardware.

Once we have the builtin API, we could then add implementations for the backends and, lastly, a pretty C++ interface in include/hipSYCL/sycl which calls the builtins.
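
To make that concrete, a very rough sketch of what such a builtin surface might look like; every name, shape and signature below is hypothetical and only illustrates the load/mad/store split, not an actual AdaptiveCpp API:

// Hypothetical unified matrix builtins (illustrative names, one fixed 16x16x16 fp16->fp32 shape).
// Each backend's bitcode library would lower these to tensor core / MFMA / XMX instructions.
extern "C" {
// Opaque per-work-item fragment storage; the actual size/layout would be backend-defined.
struct __example_mma_fragment { unsigned storage[8]; };

void __example_mma_load_a_f16_16x16x16(__example_mma_fragment *frag,
                                       const unsigned short *src, int stride);
void __example_mma_load_b_f16_16x16x16(__example_mma_fragment *frag,
                                       const unsigned short *src, int stride);
void __example_mma_mad_f32_16x16x16(__example_mma_fragment *d,
                                    const __example_mma_fragment *a,
                                    const __example_mma_fragment *b,
                                    const __example_mma_fragment *c);
void __example_mma_store_f32_16x16x16(float *dst,
                                      const __example_mma_fragment *frag, int stride);
}

The pretty C++ layer in the SYCL headers would then wrap these in a typed fragment class and free functions, analogous to how the math builtins are wrapped today.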

There are probably some nitty-gritty details down the line that we would need to discuss at some point. For example, what exactly should happen if some hardware does not support this functionality natively? DPC++ seems to throw an exception in that case, but it might be more useful to fall back to a software implementation in order to simplify the development of portable code. But I think initially we should not overcomplicate things and can just assume that the hardware has this functionality. We need to start somewhere :)

We might also need some runtime bits (e.g. to query hardware capabilities and expose that information to users), but that would similarly not be critical initially, I think.
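
For the runtime side, the eventual usage could look something like the standard SYCL aspect query; the aspect name below does not exist and is purely illustrative:

// Hypothetical capability query (the aspect is made up; nothing like it exists yet):
//
//   sycl::queue q;
//   if (q.get_device().has(sycl::aspect::ext_example_joint_matrix)) {
//     // launch the tensor-core / MFMA / XMX path
//   } else {
//     // launch the vector-instruction software fallback
//   }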

@chsasank
Contributor Author

chsasank commented Mar 13, 2024

Let me start with the easy bit:

Ideally we could implement systolic arrays similarly. This would require first coming up with a suitable, unified builtin interface API. For this purpose, it might be useful to do a small survey of how systolic arrays work on the various backends - what the compiler builtins/instructions look like, which assumptions they make, etc. It might also be useful to look at how DPC++ maps this to hardware.

Nvidia tensor core programming:

nvcuda::wmma::load_matrix_sync(a_frag, matrix_mma_a_mptr, lda);
nvcuda::wmma::load_matrix_sync(b_frag, matrix_mma_b_mptr, ldb);

// Perform the matrix multiplication
nvcuda::wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);

AMD example:

// Load the inputs
rocwmma::load_matrix_sync(fragA, a + (cRow * lda + i), lda);
rocwmma::load_matrix_sync(fragB, b + (i + cCol * ldb), ldb);

// Matrix multiply - accumulate using MFMA units
rocwmma::mma_sync(fragAcc, fragA, fragB, fragAcc);

Intel example:

joint_matrix_load(
    sg, sub_a,
    accA.template get_multi_ptr<sycl::access::decorated::no>() +
        (sg_startx * TM) * K + k * TK,
    K);
joint_matrix_load(
    sg, sub_b,
    accB.template get_multi_ptr<sycl::access::decorated::no>() +
        (k * TK / 2) * (N * 2) + sg_starty / SG_SZ * TN * 2,
    N * 2);
joint_matrix_mad(sg, sub_c, sub_a, sub_b, sub_c);

Generally speaking, the interface is fairly standard across all three vendors. I have only dived deep into Intel's version, so I can explain how that one works. In each execution unit, there is support for both vector instructions and matrix instructions, and they share the register file. We load small matrices (A = 8x16, B = 16x8 => C = 8x8) into registers and do a dot-product-accumulate. This can then be used to implement many high-level DNN-related kernels.
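
For context, here is a minimal sketch of how these load/mma/store primitives typically compose into a tiled GEMM. It uses the public CUDA WMMA API from the Nvidia example above (not any AdaptiveCpp code) and assumes row-major A/C, column-major B, 16x16x16 half/float tiles, dimensions divisible by 16, one warp per block and an (N/16, M/16) grid:

#include <mma.h>
using namespace nvcuda;

// One warp computes one 16x16 tile of C = A * B (requires a Volta-or-newer GPU).
__global__ void wmma_tile_gemm(const half *a, const half *b, float *c,
                               int M, int N, int K) {
  int tile_row = blockIdx.y * 16;  // top-left corner of this warp's C tile
  int tile_col = blockIdx.x * 16;

  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;
  wmma::fill_fragment(acc_frag, 0.0f);  // zero the accumulator tile

  for (int k = 0; k < K; k += 16) {
    wmma::load_matrix_sync(a_frag, a + tile_row * K + k, K);  // A is row-major, lda = K
    wmma::load_matrix_sync(b_frag, b + tile_col * K + k, K);  // B is col-major, ldb = K
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);       // acc += A_tile * B_tile
  }
  wmma::store_matrix_sync(c + tile_row * N + tile_col, acc_frag, N, wmma::mem_row_major);
}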

it might be more useful to fall back to a software implementation in order to simplify the development of portable code

I 100% agree. In case systolic arrays are missing, we should implement a reasonably high-performance version using vector instructions. The BLIS people call this a microkernel, and here's how it looks for a CPU: actual version, dumbed-down version.
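
For reference, here is a dumbed-down scalar sketch of such a microkernel (my own illustration: an MR x NR register tile updated with one rank-1 update per iteration over packed A/B panels; a real fallback would vectorize this and pick MR/NR per target):

// C[MR x NR] += A_panel (MR x k, packed so each step's MR-column is contiguous)
//             * B_panel (k x NR, packed so each step's NR-row is contiguous).
template <int MR, int NR>
void microkernel(int k, const float *a_panel, const float *b_panel,
                 float *c, int ldc) {
  float acc[MR][NR] = {};                  // accumulator tile, intended to stay in registers
  for (int p = 0; p < k; ++p)              // one rank-1 update per p
    for (int i = 0; i < MR; ++i)
      for (int j = 0; j < NR; ++j)
        acc[i][j] += a_panel[p * MR + i] * b_panel[p * NR + j];
  for (int i = 0; i < MR; ++i)             // accumulate the tile back into C
    for (int j = 0; j < NR; ++j)
      c[i * ldc + j] += acc[i][j];
}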

I will look into the DPC++ implementation and report back on the details from here.

@chsasank
Contributor Author

Here's some information about the DPC++ implementation. Let's look at joint_matrix_mad from matrix-unified.hpp:

joint_matrix_mad(
    Group,
    joint_matrix<Group, Td, use::accumulator, M, N,
                 sycl::ext::oneapi::experimental::matrix::layout::dynamic> &D,
    const joint_matrix<Group, Ta, use::a, M, K, LayoutA> &A,
    const joint_matrix<Group, Tb, use::b, K, N, LayoutB> &B,
    const joint_matrix<Group, Tc, use::accumulator, M, N,
                       sycl::ext::oneapi::experimental::matrix::layout::dynamic>
        &C) {
...
#if defined(__NVPTX__)
  if constexpr (std::is_same<Ta, Tb>::value) {
    sycl::ext::oneapi::detail::joint_matrix_mad_cuda<Ta, Tc, Td, M, K, N,
                                                     LayoutA, LayoutB>(
        D.matrix_impl, A.matrix_impl, B.matrix_impl, C.matrix_impl);
  }
...
#elif defined(__HIP_PLATFORM_AMD_MFMA__)
  if constexpr (std::is_same<Ta, Tb>::value) {
    sycl::ext::oneapi::detail::joint_matrix_mad_hip<Ta, Tc, M, K, N, LayoutA,
                                                    LayoutB>(
        D.matrix_impl, A.matrix_impl, B.matrix_impl, C.matrix_impl);
...
#else
  if constexpr (std::is_same<Ta, uint16_t>::value &&
                std::is_same<Tb, uint16_t>::value &&
                std::is_same<Tc, float>::value)
    D.spvm = __spirv_JointMatrixMadINTEL(A.spvm, B.spvm, C.spvm);
...

Nvidia's version (joint_matrix_mad_cuda) is here. It uses a builtin like this:

__imma_m16n16k16_mma_s8(ptrD, ptrA, ptrB, ptrC,
                                get_layout_pair_id<LayoutA, LayoutB>(), 0);

AMD's version (joint_matrix_mad_hip) is here. It uses a builtin like this:

auto result = __builtin_amdgcn_mfma_f32_16x16x16f16(
          *reinterpret_cast<const float16x4 *>(&A.wi_marray),
          *reinterpret_cast<const float16x4 *>(&B.wi_marray),
          *reinterpret_cast<const floatx4 *>(&C.wi_marray), 0, 0, 0);
      std::memcpy(&D.wi_marray, &result, 4 * sizeof(float));

Intel's SPIR-V version is defined in the OpenCL SPIR-V ops.

    D.spvm = __spirv_JointMatrixMadINTEL(A.spvm, B.spvm, C.spvm);

So, it looks like all three versions use builtins. Does that mean we don't have to do any compiler shenanigans?

@illuhad
Collaborator

illuhad commented Mar 13, 2024

Thanks for the investigation!

So, it looks like all three versions use builtins. Does that mean we don't have to do any compiler shenanigans?

Yep, looks like it :)

Now we need to figure out which kind of interface will be useful to cover all three.

@chsasank
Contributor Author

Our interface needs to:

  1. Create sub-matrices/fragments to hold the data
  2. Provide a load instruction to load data from an array pointer into these fragments
  3. Provide a mad instruction to do a matrix multiply-add on the sub-matrices/fragments
  4. Provide a store instruction to retrieve the data from the sub-matrices
  5. Maybe provide a fill instruction to fill sub-matrices with constants

Different GPUs support different sub-matrix sizes, dtypes and memory layouts, so we need to figure out a way to minimize code duplication there.
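
To illustrate points 1-5, here is a very rough sketch of what such a templated surface could look like; everything below (names, template parameters, signatures) is hypothetical and only shows how templating over element type, tile shape and layout might keep duplication down:

#include <cstddef>

// Hypothetical user-facing types and free functions (illustrative only).
enum class matrix_use { a, b, accumulator };
enum class matrix_layout { row_major, col_major, packed /* e.g. Intel's VNNI */ };

// 1. Fragment type: opaque per-work-item storage provided by the builtin layer.
template <class T, matrix_use Use, int Rows, int Cols,
          matrix_layout Layout = matrix_layout::row_major>
struct sub_matrix;

// 2./4./5. Load, store and fill, expressed over a work group `g`.
template <class Group, class T, matrix_use Use, int Rows, int Cols, matrix_layout Layout>
void sub_matrix_load(Group g, sub_matrix<T, Use, Rows, Cols, Layout> &m,
                     const T *src, std::size_t stride);
template <class Group, class T, int Rows, int Cols>
void sub_matrix_store(Group g, const sub_matrix<T, matrix_use::accumulator, Rows, Cols> &m,
                      T *dst, std::size_t stride);
template <class Group, class T, matrix_use Use, int Rows, int Cols, matrix_layout Layout>
void sub_matrix_fill(Group g, sub_matrix<T, Use, Rows, Cols, Layout> &m, T value);

// 3. Matrix multiply-add: D = A * B + C.
template <class Group, class Td, class Ta, class Tb, class Tc,
          int M, int N, int K, matrix_layout LA, matrix_layout LB>
void sub_matrix_mad(Group g,
                    sub_matrix<Td, matrix_use::accumulator, M, N> &d,
                    const sub_matrix<Ta, matrix_use::a, M, K, LA> &a,
                    const sub_matrix<Tb, matrix_use::b, K, N, LB> &b,
                    const sub_matrix<Tc, matrix_use::accumulator, M, N> &c);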

@illuhad
Collaborator

illuhad commented Mar 18, 2024

Thanks for the analysis :)

I think it might be a good idea to start with the data types that the interface is supposed to operate on.

Do we need an opaque data type for the matrices? Can backends agree on a matrix type size at least?
For example, half is currently effectively represented as uint16 in the builtin interface. Builtin implementations then bitcast this to _Float16 or whatever type works internally for the given LLVM backend, as not all backends support _Float16.
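
For illustration, here is roughly what that bitcast pattern looks like on the implementation side; the builtin name is made up, and the snippet assumes a compiler that provides _Float16:

// Half travels through the unified builtin interface as a 16-bit integer;
// the backend implementation reinterprets it to a native half type internally.
extern "C" unsigned short __example_half_add(unsigned short a, unsigned short b) {
  _Float16 fa, fb;
  __builtin_memcpy(&fa, &a, sizeof fa);   // bitcast uint16 -> _Float16
  __builtin_memcpy(&fb, &b, sizeof fb);
  _Float16 fr = fa + fb;
  unsigned short r;
  __builtin_memcpy(&r, &fr, sizeof r);    // bitcast back for the uniform interface
  return r;
}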

@chsasank
Contributor Author

Sorry for the late reply, I have been traveling.

Can backends agree on a matrix type size at least?

I think so - at least according to Intel's extension. All the supported combinations are documented here. We do need to allow for different layouts: column major, row major, and a weird one called VNNI for Intel.
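
For reference, VNNI for 16-bit element types interleaves pairs of adjacent rows of B, which is why the Intel example above indexes B with a stride of N * 2. A small repack sketch (my own illustration, assuming even K):

#include <cstdint>

// Repack a row-major K x N matrix of 16-bit elements into VNNI layout:
// element (k, n) moves to dst[(k / 2) * (N * 2) + n * 2 + (k % 2)].
void pack_vnni_16bit(const std::uint16_t *src, std::uint16_t *dst, int K, int N) {
  for (int k = 0; k < K; ++k)
    for (int n = 0; n < N; ++n)
      dst[(k / 2) * (N * 2) + n * 2 + (k % 2)] = src[k * N + n];
}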

There might be some data types that are supported by some of the GPUs but not by built-in C++ data types, like int4/int2.
