Add skew-symmetric BLAS operations #805

devinamatthews · 2024-04-26T22:59:09Z

This PR adds a number of level-2 and level-3 skew-symmetric (and skew-hermitian) BLAS operations, defining the essential operations of a "Skew-BLAS" interface. These operations have been added as full 1st-class citizens of the BLIS API complete with testsuite and mixed-precision/mixed-domain support (level-3 only).

Add a `negsc` level-0 scalar operation, which negates a scalar (both real and imaginary parts).

- Add `bli_?negs/bli_?negris` to negate a scalar.
- Add `bli_?setr0s` to zero out only the real part of a complex scalar.
- Add (void) to silence unused variable warnings in several level-0 macros.
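As a rough illustration of what these level-0 macros do, here is a minimal sketch for a single-precision complex scalar. The type `cplx_f` and the macros `NEGS`, `SETR0S`, and `SETI0S_REAL` are hypothetical stand-ins and are not the actual BLIS definitions.

```c
#include <assert.h>

/* Hypothetical complex type with separate real and imaginary fields. */
typedef struct { float real; float imag; } cplx_f;

/* Negate both parts of a complex scalar in place (in the spirit of bli_?negs). */
#define NEGS(x)   do { (x).real = -(x).real; (x).imag = -(x).imag; } while (0)

/* Zero only the real part of a complex scalar (in the spirit of bli_?setr0s). */
#define SETR0S(x) do { (x).real = 0.0f; } while (0)

/* A real-domain variant of an "imaginary part" macro has nothing to do.
   Casting the argument to void marks it as used, so the compiler does not
   warn about an unused variable at the call site. */
#define SETI0S_REAL(x) do { (void)(x); } while (0)
```

The `(void)` cast is the usual idiom for a macro that must accept an argument it never touches.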

Add `mkskewsymm` and `mkskewherm` operations to explicitly skew-symmetrize or skew-hermitize a matrix. For a skew-symmetric matrix, the diagonal is explicitly set to zero, while for a skew-hermitian matrix only the real part of the diagonal is set to zero.
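A minimal sketch of the skew-symmetrize step for a real, column-major matrix. The function name `mkskewsymm_lower` and the choice of the strictly lower triangle as the stored region are assumptions made for this example; it is not the BLIS implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Skew-symmetrize an n x n column-major matrix in place, treating the
   strictly lower triangle as the stored region (an assumption of this
   sketch): a(j,i) := -a(i,j) for i > j, and a(i,i) := 0. */
void mkskewsymm_lower(size_t n, double *a, size_t lda)
{
    assert(lda >= n);
    for (size_t j = 0; j < n; ++j) {
        a[j + j * lda] = 0.0;                 /* diagonal must vanish */
        for (size_t i = j + 1; i < n; ++i)
            a[j + i * lda] = -a[i + j * lda]; /* mirror with a sign flip */
    }
}
```

A skew-hermitian version would additionally conjugate the mirrored element and zero only the real part of each diagonal entry, leaving the imaginary part alone.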

Add `BLIS_SKEW_SYMMETRIC` and `BLIS_SKEW_HERMITIAN` matrix structures along with associated helper functions and macros. Note that this requires increasing the number of bits used to represent a `struc_t` in the `obj_t::info` member. A compile-time check has also been added to guard against accidental bit overflow in the future.
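The kind of compile-time check described here can be sketched as follows. The enum values, field offset, and helper names (`struc_sketch_t`, `STRUC_SHIFT`, `STRUC_NBITS`, `set_struc`, `get_struc`) are all hypothetical; the real layout of `obj_t::info` differs.

```c
#include <assert.h>

/* Hypothetical structure enumeration, for illustration only. */
typedef enum {
    STRUC_GENERAL = 0,
    STRUC_HERMITIAN,
    STRUC_SYMMETRIC,
    STRUC_TRIANGULAR,
    STRUC_SKEW_HERMITIAN,
    STRUC_SKEW_SYMMETRIC,
    STRUC_COUNT            /* number of structures; keep last */
} struc_sketch_t;

#define STRUC_SHIFT 27u    /* hypothetical position of the field in info */
#define STRUC_NBITS 3u     /* four values fit in 2 bits; six need 3 */
#define STRUC_MASK  (((1u << STRUC_NBITS) - 1u) << STRUC_SHIFT)

/* Fail the build if a future structure no longer fits in the field. */
_Static_assert(STRUC_COUNT <= (1u << STRUC_NBITS),
               "structure values overflow the bits reserved in info");

/* Store a structure value in the bitfield without touching other bits. */
unsigned set_struc(unsigned info, struc_sketch_t s)
{
    return (info & ~STRUC_MASK) | ((unsigned)s << STRUC_SHIFT);
}

/* Read the structure value back out of the bitfield. */
struc_sketch_t get_struc(unsigned info)
{
    return (struc_sketch_t)((info & STRUC_MASK) >> STRUC_SHIFT);
}
```

Tying the assertion to a trailing `STRUC_COUNT` enumerator means that adding a seventh, eighth, or ninth structure is checked automatically, with no second constant to keep in sync.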

Add a `setrd` level-1d operation, which sets only the real part of a matrix diagonal to the given value.

The function signature for `dotxaxpyf` has been changed to allow different `alpha` values for the dot and axpy sub-problems. This is needed to support skew-symmetric operations, in which A and A^T differ by more than just conjugation.
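To make the two-`alpha` idea concrete, here is a real-domain reference sketch of a fused dot/axpy kernel. The function name `dotxaxpyf_sketch`, the parameter name `alphax`, and the simplified argument list (unit strides, no conjugation, no context) are assumptions for illustration and do not reproduce the actual microkernel signature.

```c
#include <assert.h>
#include <stddef.h>

/* Real-domain reference sketch (column-major A, unit strides):
     y := beta * y + alphaw * A^T * w
     z :=        z + alphax * A   * x
   A is m x b; w and z have length m; x and y have length b. */
void dotxaxpyf_sketch(size_t m, size_t b,
                      double alphaw, double alphax,
                      const double *a, size_t lda,
                      const double *w, const double *x,
                      double beta, double *y, double *z)
{
    assert(lda >= m);
    for (size_t j = 0; j < b; ++j) {
        const double *aj = a + j * lda;       /* column j of A */
        double dot = 0.0;
        for (size_t i = 0; i < m; ++i) {
            dot  += aj[i] * w[i];             /* dot sub-problem  */
            z[i] += alphax * x[j] * aj[i];    /* axpy sub-problem */
        }
        y[j] = beta * y[j] + alphaw * dot;
    }
}
```

With a single shared `alpha`, one pass over a stored block can serve both the block and its transpose. Passing `alphax = -alphaw` (or the reverse) is what would let the same pass serve a block and its negated transpose, which is the skew-symmetric case.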

Add `skmv` (skew-symmetric matrix times vector), `shmv` (skew-hermitian matrix times vector), `skr2` (skew-symmetric rank-2 update), and `shr2` (skew-hermitian rank-2 update) operations. Note that a rank-1 skew-symmetric update is not possible, and a rank-1 skew-hermitian update is not particularly useful.
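A minimal sketch of a skew-symmetric matrix-vector product that reads only the strictly lower triangle. The name `skmv_lower_sketch` and its argument list are hypothetical; the real `skmv` takes BLIS's usual uplo, conjugation, and stride arguments.

```c
#include <assert.h>
#include <stddef.h>

/* y := beta * y + alpha * A * x for a real skew-symmetric A (column-major),
   reading only the strictly lower triangle. The diagonal is taken to be
   zero and the upper triangle is implied by a(j,i) = -a(i,j). */
void skmv_lower_sketch(size_t n, double alpha,
                       const double *a, size_t lda,
                       const double *x, double beta, double *y)
{
    assert(lda >= n);
    for (size_t i = 0; i < n; ++i)
        y[i] *= beta;
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = j + 1; i < n; ++i) {
            double aij = a[i + j * lda];
            y[i] += alpha * aij * x[j];   /* stored element a(i,j)            */
            y[j] -= alpha * aij * x[i];   /* implied element a(j,i) = -a(i,j) */
        }
    }
}
```

The rank-1 remark follows from the same symmetry argument: `x * x^T` is always symmetric, so a nonzero update of that form can never be skew-symmetric, whereas `x * y^T - y * x^T` always is.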

The reference packing kernels have been updated to support skew-symmetric and skew-hermitian matrix structures. No updates are needed to the dense reference packing kernel (`bli_?packm_ckx_<arch>_ref`) or to any optimized packing kernels, since `bli_?packm_struc_cxk` handles the negation of the unstored region by modifying `kappa`.
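The `kappa` trick can be sketched element by element as below. The function `pack_skew_panel_sketch` is hypothetical; presumably the real structured packing code works on whole stored and unstored regions and hands each to the dense kernel with `kappa` or `-kappa`, rather than branching per element as this sketch does.

```c
#include <assert.h>
#include <stddef.h>

/* Pack the m x k panel whose top-left corner is (i0, j0) of an n x n real
   skew-symmetric matrix (column-major, strictly lower triangle stored).
   The packed panel p is m x k, column-major, leading dimension m, and is
   scaled by kappa. */
void pack_skew_panel_sketch(size_t m, size_t k, size_t i0, size_t j0,
                            double kappa, const double *a, size_t lda,
                            double *p)
{
    assert(lda >= i0 + m && lda >= j0 + k);
    for (size_t jj = 0; jj < k; ++jj) {
        for (size_t ii = 0; ii < m; ++ii) {
            size_t i = i0 + ii, j = j0 + jj;
            double v;
            if (i > j)      v =  kappa * a[i + j * lda]; /* stored region             */
            else if (i < j) v = -kappa * a[j + i * lda]; /* unstored: reflect, -kappa */
            else            v = 0.0;                     /* skew-symmetric diagonal   */
            p[ii + jj * m] = v;
        }
    }
}
```

Because the sign is folded into the scaling factor, a dense packing routine that already accepts a `kappa` argument needs no knowledge of the skew structure.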

Add `skmm` (skew-symmetric matrix times dense matrix), `shmm` (skew-hermitian matrix times dense matrix), `skr2k` (skew-symmetric rank-2k update), and `shr2k` (skew-hermitian rank-2k update) operations. Note that a rank-k skew-symmetric update is not possible, and a rank-k skew-hermitian update is not particularly useful.
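The two notes about rank-k updates can be checked by transposing each candidate update. The forms below are the natural analogues of `syrk`, `syr2k`, and `herk`; the exact scaling conventions used by `skr2k` and `shr2k` in this PR are an assumption here.

```latex
\begin{aligned}
(\alpha A A^T)^T &= \alpha A A^T
  && \text{symmetric, so it is skew-symmetric only when it is zero} \\
(\alpha A B^T - \alpha B A^T)^T &= -(\alpha A B^T - \alpha B A^T)
  && \text{skew-symmetric for every } \alpha, A, B \\
(\alpha A A^H)^H &= \bar{\alpha} A A^H = -\alpha A A^H
  \iff \alpha \in i\mathbb{R}
  && \text{skew-hermitian only for purely imaginary } \alpha \\
(\alpha A B^H - \bar{\alpha} B A^H)^H &= -(\alpha A B^H - \bar{\alpha} B A^H)
  && \text{skew-hermitian for every } \alpha, A, B
\end{aligned}
```

The third line suggests why a rank-k skew-hermitian update adds little: it is just $i$ times an ordinary hermitian rank-k update.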

[ci skip]

devinamatthews · 2024-04-26T23:02:27Z

@myeh01 @nick-knight @Aaron-Hutchinson can the SiFive team please review commit b986782? I had to delve into the RISC-V assembly there and I'm only ~80% sure I did it right.

devinamatthews · 2024-04-26T23:03:06Z

@fgvanzee again the Travis CI build failed to trigger...

fgvanzee · 2024-04-26T23:05:03Z

> @fgvanzee again the Travis CI build failed to trigger...

I don't remember if there was anything we could do to fix it on our end. Might be that we just have to wait and then make a dummy commit to try to trigger again?

devinamatthews · 2024-04-26T23:06:57Z

I don't remember either but it is annoying

devinamatthews · 2024-04-26T23:10:56Z

There we go

nick-knight · 2024-04-26T23:11:34Z

@devinamatthews Confirming I got your message. It looks like the register allocation in dotxaxpyf changed: thanks for taking a stab at it. I think @myeh01 wrote this one, I'll ask him to review it. (Michael, it looks like what's changed is the microkernel now has separate alpha values for the "dot" and "axpy" parts.)

devinamatthews · 2024-04-27T04:47:41Z

Travis CI failed for x280 so I guess I did do something wrong.

myeh01 · 2024-04-27T08:02:52Z

Running the testsuite, it looks like s and d are passing while c and z are failing. For c and z, when I change the register allocation from fa4, fa5 to fa0, fa1 for alphaw, it passes the dotxaxpyf tests locally, so I think Devin modified the code correctly. My guess is the inline assembly I wrote is not generating the code I want it to (maybe the compiler is overwriting some floating-point registers in between asm blocks). I'll need some time to study the objdump to see what's going on.

myeh01 · 2024-04-29T10:57:40Z

After looking at the objdump, it looks like the compiler is using fa4 and fa5 in some branches involving floating point comparisons. To prevent this issue from reappearing, I would like to go back through all the inline assembly files I wrote and replace any manually allocated floating point (and integer) registers with C variables. I should be able to get it done this week. What would be the best way to get these changes merged into the repo? Should I submit a PR to this branch or master?

nick-knight · 2024-04-29T17:56:23Z

@myeh01 A quicker fix might be to add clobbers. We should be using these anyway whenever we use inline asm with explicit register allocation, whether X-, F-, or V-registers.

Going the other direction, I think we might be better off using generic C for all the scalar stuff, and only using inline asm for the vector stuff (when intrinsics don't suffice). I don't think we'll lose much in performance, and it would make the code much more maintainable and retargetable.

myeh01 · 2024-04-29T20:11:40Z

@nick-knight I originally tried just adding the output register to the clobber list of any floating-point load, e.g.

__asm__(FLT_LOAD "fa5, %1(%0)" : : "r"(alphaw), "I"(FLT_SIZE) : "fa5");

But the compiler still uses fa5 for floating-point comparisons. Does the clobber only apply to that block of inline asm? Or maybe I did this incorrectly and should add clobbers in other places as well. There's also this, but then we would have to use C variables anyways.

Edit: Yeah, I think replacing the scalar stuff with generic C would be more robust. I started doing this for some parts of cdotxaxpyf and it wasn't too hard.

Edit 2: After looking through some of the code again, I think I would also like to replace some inline asm code with intrinsics.

nick-knight · 2024-04-29T20:38:41Z

> Does the clobber only apply to that block of inline asm?

Correct. If you don't want the compiler to overwrite fa5 between this asm statement and a subsequent one, you'll have to merge them into the same statement. This tends to snowball, necessitating rewriting conditionals and loops in assembly, etc.
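A small sketch of the difference between a clobber and an operand bound to a C variable. The function names `load_alpha` and `scale` are hypothetical, the `flw` path assumes a RISC-V target with the F extension, and other targets fall back to plain C so the example still builds.

```c
#include <assert.h>

/* Load a float through inline asm, letting the compiler pick the register.
   The "=f" output binds the result to the C variable v, so the compiler
   tracks the value and keeps it intact for as long as v is in use, even
   across later statements. */
float load_alpha(const float *p)
{
    float v;
#if defined(__riscv) && defined(__riscv_flen)
    __asm__("flw %0, %1" : "=f"(v) : "m"(*p));
#else
    v = *p;   /* portable fallback so the sketch builds on other targets */
#endif
    return v;
}

/* By contrast, a clobber such as
       __asm__("flw fa5, 0(%0)" : : "r"(p) : "fa5");
   only states that this one statement overwrites fa5. Nothing stops the
   compiler from reusing fa5 before a later asm statement reads it. */

float scale(const float *alpha, float x)
{
    float a = load_alpha(alpha);
    if (x < 0.0f)   /* a comparison the compiler may lower with any free F-register */
        x = -x;
    return a * x;   /* a is still valid here because it is a C variable */
}
```

Since the value lives in a C variable, the comparison in `scale` cannot silently destroy it, which is the failure mode described for `fa4` and `fa5` above.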

> There's also this, but then we would have to use C variables anyways.

Correct.

> I think replacing the scalar stuff with generic C would be more robust.

I agree.

Our code still has lurking risks related to our explicit allocation of V-registers: we are trusting the compiler not to generate any vector code between each pair of asm statements with a RAW dependence through a V-register. Hardening this will probably snowball in the manner described above, so I don't propose we attempt to tackle it here.

myeh01 · 2024-04-29T21:54:50Z

@devinamatthews How would you like to proceed? There are a few short-term solutions we discussed above. Longer-term, I'd like to rewrite the inline assembly files to be more robust (probably using intrinsics where it won't significantly impact performance). Please let me know what I can do to help.

nick-knight · 2024-04-29T22:31:52Z

One perspective is that this ukernel interface change exposed a bug in SiFive's implementation of the legacy ukernel interface. To proceed, the defective ukernel implementation could be deleted in this PR --- reverting to a generic implementation --- and then reintroduced, upgraded and corrected, in a subsequent PR. This might be the cleanest way forward.

devinamatthews · 2024-04-29T22:45:43Z

I'm not in a huge rush to merge. If it takes say a month or less to fix it properly then I can wait. Otherwise yes we could revert to generic and fix later. This wouldn't require deleting anything, just commenting out the kernel registration.
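The "comment out the registration" idea can be sketched with a toy dispatch table. Everything here (`kernel_table`, `dot_ref`, `dot_opt`, `init_reference`, `init_target`) is hypothetical and only mimics the pattern of initializing with reference kernels and then overriding selected entries; it is not BLIS's context API.

```c
#include <assert.h>
#include <stddef.h>

typedef double (*dot_fn)(size_t n, const double *x, const double *y);

/* Portable reference kernel. */
double dot_ref(size_t n, const double *x, const double *y)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += x[i] * y[i];
    return s;
}

/* Stand-in for a target-specific kernel that is currently suspect. */
double dot_opt(size_t n, const double *x, const double *y)
{
    return dot_ref(n, x, y);
}

typedef struct { dot_fn dot; } kernel_table;

void init_reference(kernel_table *t) { t->dot = dot_ref; }

void init_target(kernel_table *t)
{
    init_reference(t);        /* start from the reference kernels */
    /* Registration disabled: with this line commented out, the table keeps
       the reference kernel while the rest of the configuration is intact. */
    /* t->dot = dot_opt; */
}
```

Nothing is deleted in this scheme, so re-enabling the optimized kernel later is a one-line change.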

myeh01 · 2024-04-30T00:08:07Z

Mind if we sync up in a week or two? I'll start working on it this week and hopefully by then I'll have a sense of how much more time it would take.

devinamatthews · 2024-04-30T04:00:53Z

Sure.

myeh01 · 2024-05-10T23:11:39Z

@devinamatthews I'm steadily working through cleaning up all the kernels, but I don't think I'll be able to finish it in the next two weeks. I'm also trying to balance this work with other projects I need to work on, so it may take a few more weeks. It may be best to follow Nick's suggestion and temporarily disable the ~~sifive_x280 config~~ failing microkernel (sorry, misread Nick's comment from earlier) until the cleanup is complete.

nick-knight · 2024-05-10T23:14:37Z

> It may be best to follow Nick's suggestion and temporarily disable the sifive_x280 config until the clean up is complete.

I don't propose disabling the whole configuration, just removing the one ukernel that's causing issues. IIUC, this will cause BLIS to default to a generic/reference implementation.

devinamatthews · 2024-05-14T19:47:58Z

@myeh01 We've decided not to include this PR in the next release so there's not much time pressure.

devinamatthews added 10 commits April 25, 2024 21:45

Add a negsc level-0 scalar operation.

6740b1f

This operation negates a scalar (both real and imaginary parts).

Add/modify level-0 scalar macros.

9c0cec8

- Add `bli_?negs/bli_?negris` to negate a scalar.
- Add `bli_?setr0s` to zero out only the real part of a complex scalar.
- Add (void) to silence unused variable warnings in several level-0 macros.

Add a setrd level-1d operation.

3a52b71

This operation sets only the real part of a matrix diagonal to the given value.

Change dotaxpyf microkernel signature.

b986782

The function signature for dotaxpyf has been changed to allow different `alpha` values for the dot and axpy sub-problems. This is needed to support skew-symmetric operations which differ in more than just conjugation of A and A^T.

Typo fix in configure

06829d0

[ci skip]

devinamatthews requested a review from fgvanzee April 26, 2024 23:02

Trigger CI build

129ce04
