Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RISCV -mno-strict-align generates problem instructions for some targets #88029

Open
sh1boot opened this issue Apr 8, 2024 · 35 comments
Open

Comments

@sh1boot
Copy link

sh1boot commented Apr 8, 2024

When passed --target=riscv64-linux-gnu -march=rv64gcv -mno-strict-align, clang-18 and up generate vle64 and vse64 instructions for the following code despite receiving byte pointers:

__attribute__((noinline))
void vector_memcpy(uint8_t* d, uint8_t const* p) {
  __builtin_memcpy(d, p, 16);
}

https://godbolt.org/z/eb4oqM3Ts

If the pointer is unaligned then my Kendryte K230 board (C908) will raise a bus error trying to execute this.

However, -mno-strict-align offers big performance gains for some scalar code, where unaligned access is supported just fine. There seems to be no way to specify that vector code should respect alignments while scalar code can do unaligned access.

Scalar code "seems to work" when you cast a byte pointer to a wider type when it's unaligned, but there are other issues with that and most portable code prefers memcpy() with the expectation that the compiler will optimise it correctly.

@llvmbot
Copy link
Collaborator

llvmbot commented Apr 8, 2024

@llvm/issue-subscribers-backend-risc-v

Author: S H (sh1boot)

When passed ` --target=riscv64-linux-gnu -march=rv64gcv -mno-strict-align`, clang-18 and up generate `vle64` and `vse64` instructions for the following code despite receiving byte pointers:
__attribute__((noinline))
void vector_memcpy(uint8_t* d, uint8_t const* p) {
  __builtin_memcpy(d, p, 16);
}

https://godbolt.org/z/eb4oqM3Ts

If the pointer is unaligned then my Kendryte K230 board (C908) will raise a bus error trying to execute this.

However, -mno-strict-align offers big performance gains for some scalar code, where unaligned access is supported just fine. There seems to be no way to specify that vector code should respect alignments while scalar code can do unaligned access.

Scalar code "seems to work" when you cast a byte pointer to a wider type when it's unaligned, but there are other issues with that and most portable code prefers memcpy() with the expectation that the compiler will optimise it correctly.

@topperc
Copy link
Collaborator

topperc commented Apr 8, 2024

Is unaligned scalar really supported in hardware on C908, or is it emulated via a trap and the trap handler doesn't know how to emulate vector?

@sh1boot
Copy link
Author

sh1boot commented Apr 9, 2024

Unaligned scalar loads definitely make my test code run much faster on C908. If it is emulated then it's emulated so efficiently that it's still better than using extra application code to avoid un-aligned loads.

@wangpc-pp
Copy link
Contributor

wangpc-pp commented Apr 9, 2024

Is it the case that unaligned scalar is supported while unaligned vector is not supported?
#73971 cc @preames
I don't remember the details, maybe t-head guys can help to answer this. @zixuan-wu @rjiejie

@sh1boot
Copy link
Author

sh1boot commented Apr 10, 2024

Well on the k230 SOC, unaligned scalar memory access does work and does not have a significant performance impact. It is much faster than loading bytes and assembling them into a word in application code. Enabling this optimisation has a big impact on performance.

Meanwhile, vle8 works at arbitrary memory offsets, and vle64 raises a bus error if given an odd memory address.

Is it supposed to be configurable in the kernel? I don't know. Somebody is shipping kernels configured this way anyway.

Clang-17 currently generates much faster code than clang-18 and clang-19 because the latter cannot emit unaligned scalar access safely.

@appujee
Copy link
Contributor

appujee commented Apr 10, 2024

For llvm, we should have a way to differentiate fast unaligned access support for both scalars and vectors. The current -mno-strict-align is confusing and resulting in bug like this. See google/android-riscv64 for the context.

@asb
Copy link
Contributor

asb commented Apr 11, 2024

Just summarising my understanding of things so people can point out if I have it wrong or disagree:

  • Zicclsm was defined is a mandatory extension in the RVA20 profile and is explicitly specified as indicating support for misaligned load/store for both scalar and vector memory operations. They might execute slowly of course.
    • While the success of RISC-V profiles remains to be seen, I think it's fair to consider that an application that subsets RVA20 is going to be fairly niche going forwards. We should moderate our effort in terms of defining compiler flags etc based on this.
  • Separately there's the issue of whether those loads/stores are fast or not. There was quite a lot of back and forth on this here [RFC] Preprocessor definition to specify data alignment strict(ness) riscv-non-isa/riscv-c-api-doc#32 (which also documented the recommendation that the compiler maps -mno-strict-align as meaning unaligned access is fast) that but it looks like we never really considered the answer being different for vector vs scalar.

It would be useful to know if there is other hardware where scalar misaligned access are fast, but vector slow. If it's basically unheard of, that might indicate different solutions in terms of command-line interface vs if it's likely to happen quite a lot. In the latter case, perhaps we got the interface for riscv-non-isa/riscv-c-api-doc#32 wrong.

@appujee
Copy link
Contributor

appujee commented Apr 11, 2024

Zicclsm was defined is a mandatory extension in the RVA20 profile and is explicitly specified as indicating support for misaligned load/store for both scalar and vector memory operations.

Makes sense to not have compiler flags as the misaligned access support for both scalar and vectors is clearly required for the RVA20 profile. Thanks for pointing to the reference.

@sh1boot
Copy link
Author

sh1boot commented Apr 12, 2024

I believe I tried Zicclsm as part of -march when I was trying to figure out how to enable unaligned codegen (noting problems with uneven handling of -mno-strict-align), and it didn't help; but since I want sure I understood the description I didn't raise a bug.

If that is the same thing then I guess C908 might be considered an RVA20 compliant device if something emulated (or enabled) unaligned vector memory ops by default?

But I think RVA20 isn't telling us if the operation is fast enough to use, right? Just that it shouldn't give a bus error. That's what the command line argument would be for iff C908 became a device that met the required spec or if there were others like it?

@MaoHan001
Copy link

MaoHan001 commented Apr 15, 2024

I believe I tried Zicclsm as part of -march when I was trying to figure out how to enable unaligned codegen (noting problems with uneven handling of -mno-strict-align), and it didn't help; but since I want sure I understood the description I didn't raise a bug.

If that is the same thing then I guess C908 might be considered an RVA20 compliant device if something emulated (or enabled) unaligned vector memory ops by default?

But I think RVA20 isn't telling us if the operation is fast enough to use, right? Just that it shouldn't give a bus error. That's what the command line argument would be for iff C908 became a device that met the required spec or if there were others like it?

C908 supports general unaligned access for vectors, but not element unaligned access in vectors, which should comply with the relevant spec requirements:
https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#8-vector-memory-alignment-constraints
Therefore, in our optimizations for memory load/store, we generally handle unaligned logic using elements of length 8, and larger elements are employed when aligned.
Referring to arch/riscv/kernel/traps_misaligned.c, the kernel we are currently using does not seem to handle unaligned vectors.
https://gitee.com/thead-yocto/kernel/blob/master/arch/riscv/kernel/traps_misaligned.c#L240
Based on my impression that vector misalignment is purely handled by hardware, and its performance should be faster compared to regular instructions.

@lukel97
Copy link
Contributor

lukel97 commented Apr 15, 2024

Is it the case that unaligned scalar is supported while unaligned vector is not supported? #73971 cc @preames I don't remember the details, maybe t-head guys can help to answer this. @zixuan-wu @rjiejie

Before #73971 -mno-unaligned-access enabled both unaligned scalar and vector accesses anyway, so that was just an internal-facing change

@MaoHan001
Copy link

MaoHan001 commented Apr 16, 2024

https://lists.riscv.org/g/sig-vector/topic/questions_about_zicclsm/105531858?p=,,,20,0,0,0::recentpostdate/sticky,,,20,2,0,105531858
Based on the above thread, the description of "misalign" for vector in Zicclsm should be consistent with "8. Vector Memory Alignment Constraints" without making additional requirements for element misalignment.

@topperc
Copy link
Collaborator

topperc commented Apr 16, 2024

https://lists.riscv.org/g/sig-vector/topic/questions_about_zicclsm/105531858?p=,,,20,0,0,0::recentpostdate/sticky,,,20,2,0,105531858 Based on the above thread, the description of "misalign" for vector in Zicclsm should be consistent with "8. Vector Memory Alignment Constraints" without making additional requirements for element misalignment.

This post says the opposite https://lists.riscv.org/g/tech-unprivileged/topic/ar_minutes_2024_1_9/103776876?p=,,,20,0,0,0::recentpostdate/sticky,,,20,2,0,103776876,previd%3D1709213912495550035,nextid%3D1705361818666677767&previd=1709213912495550035&nextid=1705361818666677767

  • There was a discussion on whether the current profile text actually
    currently mandates vector misaligned support, and whether it should
    be mandated. The conclusion was that the current Zicclsm spec does
    mandate vector misaligned support, and that this is the correct
    thing to do. Note to be added to clarify.

@MaoHan001
Copy link

https://lists.riscv.org/g/sig-vector/topic/questions_about_zicclsm/105531858?p=,,,20,0,0,0::recentpostdate/sticky,,,20,2,0,105531858 Based on the above thread, the description of "misalign" for vector in Zicclsm should be consistent with "8. Vector Memory Alignment Constraints" without making additional requirements for element misalignment.

This post says the opposite https://lists.riscv.org/g/tech-unprivileged/topic/ar_minutes_2024_1_9/103776876?p=,,,20,0,0,0::recentpostdate/sticky,,,20,2,0,103776876,previd%3D1709213912495550035,nextid%3D1705361818666677767&previd=1709213912495550035&nextid=1705361818666677767

  • There was a discussion on whether the current profile text actually
    currently mandates vector misaligned support, and whether it should
    be mandated. The conclusion was that the current Zicclsm spec does
    mandate vector misaligned support, and that this is the correct
    thing to do. Note to be added to clarify.

This post does not seem to say an opposite argument; it only mentions the hope to clarify the definition of Zicclsm within the RVA23, as the current description does not appear very accurate. In the RVA23, the vector will be mandated, so it is expected by default to support misalignment for vectors as well.
Beside, the profile does not seem to involve whether the misalignment for vectors is regarding the entire vector register or the elements (which logically is the content of the vector spec).
Referring to the profile state table at https://docs.google.com/spreadsheets/d/1A40dfm0nnn2-tgKIhdi3UYQ1GBr8iRiV2edFowvgp7E/edit#gid=1157775000, both RVA22 and RVA20 are currently ratified, where the requirement for vectors is optional. Meanwhile, Ziccif is still marked as not ratified. It seems strange to proceed with an implementation based on a description that is not yet clear, rather than ratified specifications.

@topperc
Copy link
Collaborator

topperc commented Apr 16, 2024

Beside, the profile does not seem to involve whether the misalignment for vectors is regarding the entire vector register or the elements (which logically is the content of the vector spec).

A vector that is element aligned is not considered "misaligned" according to the vector spec. There wouldn't be any need to mention vector in the description of Zicclsm in the profile spec if it wasn't meant to refer to the element being misaligned.

@QJtaibai
Copy link

Zicclsm is not an arch extension in the full sense of the word. It is still called an extension name, but it really is just putting a name to a requirement that mandates what otherwise is optional behavior in an extension (the base ISA extensions in this case). Extension names like this get ratified in conjunction with ratification of the ISA profile that introduces/defines them.
Therefore, as long as the ' Vector Memory Alignment Constraints' definition in vector spec is followed, it is considered to have already followed the Zicclsm extension.
That is to say 'If an element accessed by a vector memory instruction is not naturally aligned to the size of the element, either the element is transferred successfully or an address misaligned exception is raised on that element'.

@topperc
Copy link
Collaborator

topperc commented Apr 16, 2024

Zicclsm is not an arch extension in the full sense of the word. It is still called an extension name, but it really is just putting a name to a requirement that mandates what otherwise is optional behavior in an extension (the base ISA extensions in this case). Extension names like this get ratified in conjunction with ratification of the ISA profile that introduces/defines them. Therefore, as long as the ' Vector Memory Alignment Constraints' definition in vector spec is followed, it is considered to have already followed the Zicclsm extension. That is to say 'If an element accessed by a vector memory instruction is not naturally aligned to the size of the element, either the element is transferred successfully or an address misaligned exception is raised on that element'.

My understanding has been that Zicclsm mandates that the element is transferred successfully in all cases and there is never a misaligned exception.

@asb
Copy link
Contributor

asb commented Apr 16, 2024

I just wanted to report back from the discussion at the end of last week in the RISC-V LLVM sync-up call.

  • As has also been highlighted here, we previously did have separate features for misaligned scalar and vector alignment, but merged them to match the frontend option as there weren't any users at the time requiring different behaviour.
  • There's no objection to reintroducing the split between scalar and vector misaligned access support, though to be usable this would likely require work to define what the frontend command-line options should be and also what changes are needed to the defines in riscv-c-api-doc.

This all seems very doable, but would need someone supporting one of these platforms to push it forwards - unfortunately no-one on the call was supporting such a platform, and while I could claim it's on someone's todo list, realistically there's enough other things on everyone's list it's not likely to happen until someone else comes in with patches. It would also be good to better characterise exactly which target devices have fast scalar but slow vector unaligned accesses.

topperc added a commit to topperc/llvm-project that referenced this issue Apr 16, 2024
…e backend.

This is largely a revert of commit e817966.

As llvm#88029 shows, there exists hardware that only supports unaligned scalar.

I'm leaving how this gets exposed to the clang interface to a future patch.
topperc added a commit that referenced this issue Apr 16, 2024
…e backend. (#88954)

This is largely a revert of commit
e817966.

As #88029 shows, there exists hardware that only supports unaligned
scalar.

I'm leaving how this gets exposed to the clang interface to a future
patch.
@topperc
Copy link
Collaborator

topperc commented Apr 18, 2024

Does hwprobe report that unaligned accesses are fast on this CPU?

sepehrst pushed a commit to spsforks/android-bionic-libc that referenced this issue Apr 22, 2024
hwprobe test may be insufficient to guarantee fast (or even supported) unaligned access.

Bug: google/android-riscv64#142
Bug: llvm/llvm-project#88029

Change-Id: Ib673c5b752da8630296926e5ec7f59f41b686016
hubot pushed a commit to aosp-mirror/platform_bionic that referenced this issue Apr 24, 2024
hwprobe test may be insufficient to guarantee fast (or even supported) unaligned access.

Bug: google/android-riscv64#142
Bug: llvm/llvm-project#88029

Change-Id: Ib673c5b752da8630296926e5ec7f59f41b686016
hubot pushed a commit to aosp-mirror/platform_bionic that referenced this issue Apr 26, 2024
hwprobe test may be insufficient to guarantee fast (or even supported) unaligned access.
Test case based on: llvm/llvm-project#88029

Previous commit got reverted due to compiler errors(b/336800888).
Not sure why the errors were not detected in pre-submit builds.

Bug: google/android-riscv64#142

Change-Id: If1c4150701298c0f351baa9ce1870509a00c250a
@appujee
Copy link
Contributor

appujee commented May 14, 2024

Does hwprobe report that unaligned accesses are fast on this CPU?

@sh1boot can you provide the info?

topperc added a commit to topperc/llvm-project that referenced this issue May 14, 2024
…e backend. (llvm#88954)

This is largely a revert of commit
e817966.

As llvm#88029 shows, there exists hardware that only supports unaligned
scalar.

I'm leaving how this gets exposed to the clang interface to a future
patch.
@sh1boot
Copy link
Author

sh1boot commented May 15, 2024

It says Function not implemented. I'll need to go digging around to see what my options are for a viable repo with a more modern kernel.

@sh1boot
Copy link
Author

sh1boot commented May 17, 2024

Does hwprobe report that unaligned accesses are fast on this CPU?

@cyyself, you seem to be actively maintaining kernel patches for k230 on a modern kernel. Can you see what this query returns?

@cyyself
Copy link

cyyself commented May 18, 2024

Does hwprobe report that unaligned accesses are fast on this CPU?

@cyyself, you seem to be actively maintaining kernel patches for k230 on a modern kernel. Can you see what this query returns?

I write https://github.com/cyyself/hwprobe to have a hwprobe example. The output on k230 is:

[cyy@archlinux hwprobe]$ ./hwprobe
MVENDORID: 5b7
MARCHID: 8000000009140d00
MIMPID: 50000
BASE_BEHAVIOR: IMA_
IMA_EXT_0: FD_C_V_Zba_Zbb_Zbs_Zbc_
CPUPERF_0: MISALIGNED-FAST_
ZICBOZ_BLOCK_SIZE: 64 Bytes
[cyy@archlinux hwprobe]$ dmesg | grep aligned
[    0.163736] cpu0: Ratio of byte access time to unaligned word access is 6.32, unaligned accesses are fast

I think the unaligned access probe through the hwprobe syscall on the newer kernel is OK.

@topperc
Copy link
Collaborator

topperc commented May 18, 2024

Does hwprobe report that unaligned accesses are fast on this CPU?

@cyyself, you seem to be actively maintaining kernel patches for k230 on a modern kernel. Can you see what this query returns?

I write https://github.com/cyyself/hwprobe to have a hwprobe example. The output on k230 is:

[cyy@archlinux hwprobe]$ ./hwprobe
MVENDORID: 5b7
MARCHID: 8000000009140d00
MIMPID: 50000
BASE_BEHAVIOR: IMA_
IMA_EXT_0: FD_C_V_Zba_Zbb_Zbs_Zbc_
CPUPERF_0: MISALIGNED-FAST_
ZICBOZ_BLOCK_SIZE: 64 Bytes
[cyy@archlinux hwprobe]$ dmesg | grep aligned
[    0.163736] cpu0: Ratio of byte access time to unaligned word access is 6.32, unaligned accesses are fast

I think the unaligned access probe through the hwprobe syscall on the newer kernel is OK.

hwprobe documentation does not say whether the unaligned access check is for scalar or scalar+vector. Based on this result, the implementation is only checking scalar. So I think the hwprobe documentation should be updated to say that explicitly if that what it is going to check.

@cyyself
Copy link

cyyself commented May 19, 2024

hwprobe documentation does not say whether the unaligned access check is for scalar or scalar+vector. Based on this result, the implementation is only checking scalar. So I think the hwprobe documentation should be updated to say that explicitly if that what it is going to check.

Someone did this: https://lore.kernel.org/linux-riscv/CAJgzZorn5anPH8dVPqvjVWmLKqTi5bkLDR=FH-ZAcdXFnNe8Eg@mail.gmail.com/

@topperc
Copy link
Collaborator

topperc commented May 19, 2024

hwprobe documentation does not say whether the unaligned access check is for scalar or scalar+vector. Based on this result, the implementation is only checking scalar. So I think the hwprobe documentation should be updated to say that explicitly if that what it is going to check.

Someone did this: https://lore.kernel.org/linux-riscv/CAJgzZorn5anPH8dVPqvjVWmLKqTi5bkLDR=FH-ZAcdXFnNe8Eg@mail.gmail.com/

Thanks I guess whatever I found while searching was old.

@appujee
Copy link
Contributor

appujee commented May 22, 2024

How does vectorizer deal with this issue? cc: @alexey-bataev @nikolaypanchenko

@nikolaypanchenko
Copy link
Contributor

How does vectorizer deal with this issue? cc: @alexey-bataev @nikolaypanchenko

It relies on TTI to tell if misaligned vector load/store is legal or not. If misaligned access can be done via set of instructions, it's expected TTI will return cost of this sequence.
For instance, for unmasked misaligned vle64 can be emulated via vle8 with 8*VL.

However, I don't think we do have anything to notify compiler that hw supports misaligned access.

@topperc
Copy link
Collaborator

topperc commented May 22, 2024

However, I don't think we do have anything to notify compiler that hw supports misaligned access.

-mno-strict-align should tell TTI that misaligned accesses are fine. I'll double check.

@nikolaypanchenko
Copy link
Contributor

-mno-strict-align should tell TTI that misaligned accesses are fine. I'll double check.

Shouldn't it be driven by extension ? Otherwise compiled binary won't be portable

@topperc
Copy link
Collaborator

topperc commented May 22, 2024

-mno-strict-align should tell TTI that misaligned accesses are fine. I'll double check.

Shouldn't it be driven by extension ? Otherwise compiled binary won't be portable

It's a messy area. The only related extension defined is Zicclsm, but that only says it won't crash. I wouldn't want to vectorize under only the guarantees from that extension.

@topperc
Copy link
Collaborator

topperc commented May 22, 2024

The -Xclang -target-feature -Xclang +unaligned-scalar-mem from clang 17 should work on LLVM 18.1.6 as a temporary solution until we resolve what clang options should be.

@cyyself
Copy link

cyyself commented May 24, 2024

When passed --target=riscv64-linux-gnu -march=rv64gcv -mno-strict-align, clang-18 and up generate vle64 and vse64 instructions for the following code despite receiving byte pointers:

__attribute__((noinline))
void vector_memcpy(uint8_t* d, uint8_t const* p) {
  __builtin_memcpy(d, p, 16);
}

https://godbolt.org/z/eb4oqM3Ts

If the pointer is unaligned then my Kendryte K230 board (C908) will raise a bus error trying to execute this.

Just note that the bus error is cause: 0x4 in CSR.scause, which is load misaligned:

[986636.930796] a.out[39527]: unhandled signal 7 code 0x1 at 0x0000002ae1ab16fc in a.out[2ae1ab1000+1000]
[986636.940135] CPU: 0 PID: 39527 Comm: a.out Not tainted 6.8.0+ #6
[986636.946143] Hardware name: Canaan Kendryte K230 (DT)
[986636.951192] epc : 0000002ae1ab16fc ra : 0000002ae1ab1750 sp : 0000003fc00c3960
[986636.958499]  gp : 0000002ae1ab3818 tp : 0000003f903903c0 t0 : fffffffffd40322d
[986636.965805]  t1 : 0000003f90290a8e t2 : 0000000003f500c8 s0 : 0000002ae1ab30a1
[986636.973111]  s1 : 0000002ae1ab3021 a0 : 0000002ae1ab30a1 a1 : 0000002ae1ab3021
[986636.980417]  a2 : 0000002ae1ab42a0 a3 : 0000000000000000 a4 : 0000000000000000
[986636.987722]  a5 : fffffffffbad2a84 a6 : 6561327830203a73 a7 : 0000000000000040
[986636.995029]  s2 : 0000000000000000 s3 : 0000002ae1ab2dd8 s4 : 0000002ae1ab1706
[986637.002334]  s5 : 0000003fc00c3b18 s6 : 0000002ae1ab2dd8 s7 : 0000003f903bcd10
[986637.009639]  s8 : 0000003f903bd008 s9 : 0000002acfd18394 s10: 0000000000000000
[986637.016945]  s11: 0000002acfd18390 t3 : 000000000000002d t4 : 0000000000000000
[986637.024250]  t5 : 0000003f903be260 t6 : 0000003f9038f528
[986637.029646] status: 8000000200006620 badaddr: 0000002ae1ab3021 cause: 0000000000000004
[986637.037653] Code: 4785 0023 00f4 60a2 6402 0141 8082 bf69 7057 cd88 (f407) 0205

We may need to implement the misaligned vector load/store emulation on kernel / SBI to handle this.

@palmer-dabbelt
Copy link
Contributor

negge pointed me to this, there's a GCC patch with some ongoing discussion as well: https://inbox.sourceware.org/gcc-patches/mhng-69b2d28b-6f08-4560-9120-1f8efdd89051@palmer-ri-x1c9/T/#m59240e12852641aa7221ab2d926a5ec640fb0b42

@palmer-dabbelt
Copy link
Contributor

-mno-strict-align should tell TTI that misaligned accesses are fine. I'll double check.

Shouldn't it be driven by extension ? Otherwise compiled binary won't be portable

It's a messy area. The only related extension defined is Zicclsm, but that only says it won't crash. I wouldn't want to vectorize under only the guarantees from that extension.

That's likely the way we'll go in GCC as well. If the implementation is trapping to M-mode then we're almost certainly better off generating different code sequences, whether that's scalar or smaller-aligned vector accesses. So the "it doesn't SIGILL in userspace" extensions don't really buy us anything here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests