RISCV -mno-strict-align generates problem instructions for some targets (#88029)
@llvm/issue-subscribers-backend-risc-v Author: S H (sh1boot)
When passed `--target=riscv64-linux-gnu -march=rv64gcv -mno-strict-align`, clang-18 and up generate `vle64` and `vse64` instructions for the following code despite receiving byte pointers:

```c
__attribute__((noinline))
void vector_memcpy(uint8_t* d, uint8_t const* p) {
    __builtin_memcpy(d, p, 16);
}
```

https://godbolt.org/z/eb4oqM3Ts

If the pointer is unaligned then my Kendryte K230 board (C908) will raise a bus error trying to execute this. However, `-mno-strict-align` offers big performance gains for some scalar code, where unaligned access is supported just fine. There seems to be no way to specify that vector code should respect alignment while scalar code can do unaligned access. Scalar code "seems to work" when you cast a byte pointer to a wider type when it's unaligned, but there are other issues with that, and most portable code prefers `memcpy()` with the expectation that the compiler will optimise it correctly.
Is unaligned scalar really supported in hardware on C908, or is it emulated via a trap whose handler doesn't know how to emulate vector?
Unaligned scalar loads definitely make my test code run much faster on C908. If it is emulated, then it's emulated so efficiently that it's still better than using extra application code to avoid unaligned loads.
Is it the case that unaligned scalar is supported while unaligned vector is not?
Well, on the K230 SoC, unaligned scalar memory access does work and does not have a significant performance impact; it is much faster than loading bytes and assembling them into a word in application code, and enabling this optimisation has a big impact on performance. Meanwhile, vle8 works at arbitrary memory offsets, while vle64 raises a bus error if given an odd memory address. Is it supposed to be configurable in the kernel? I don't know; somebody is shipping kernels configured this way anyway. Clang-17 currently generates much faster code than clang-18 and clang-19, because the latter cannot emit unaligned scalar accesses safely.
For LLVM, we should have a way to differentiate fast unaligned access support for scalars and vectors. The current -mno-strict-align is confusing and results in bugs like this. See google/android-riscv64 for the context.
Just summarising my understanding of things so people can point out if I have it wrong or disagree:
It would be useful to know if there is other hardware where scalar misaligned accesses are fast but vector ones are slow. If it's basically unheard of, that might point to different command-line interface solutions than if it's likely to happen quite a lot. In the latter case, perhaps we got the interface for riscv-non-isa/riscv-c-api-doc#32 wrong.
It makes sense not to have compiler flags, as misaligned access support for both scalars and vectors is clearly required by the RVA20 profile. Thanks for pointing to the reference.
I believe I tried Zicclsm as part of If that is the same thing then I guess C908 might be considered an RVA20-compliant device if something emulated (or enabled) unaligned vector memory ops by default? But I think RVA20 isn't telling us whether the operation is fast enough to use, right? Just that it shouldn't give a bus error. That's what the command-line argument would be for iff C908 became a device that met the required spec, or if there were others like it?
C908 supports general unaligned access for vectors, but not element-unaligned access within vectors, which should comply with the relevant spec requirements:
Before #73971
https://lists.riscv.org/g/sig-vector/topic/questions_about_zicclsm/105531858?p=,,,20,0,0,0::recentpostdate/sticky,,,20,2,0,105531858
This post says the opposite: https://lists.riscv.org/g/tech-unprivileged/topic/ar_minutes_2024_1_9/103776876?p=,,,20,0,0,0::recentpostdate/sticky,,,20,2,0,103776876,previd%3D1709213912495550035,nextid%3D1705361818666677767&previd=1709213912495550035&nextid=1705361818666677767
This post does not seem to make an opposing argument; it only mentions the hope to clarify the definition of Zicclsm within RVA23, as the current description does not appear very accurate. In RVA23 the vector extension will be mandated, so it is expected by default to support misalignment for vectors as well.
A vector access that is element-aligned is not considered "misaligned" according to the vector spec. There would be no need to mention vector in the description of Zicclsm in the profile spec if it wasn't meant to refer to the element being misaligned.
Zicclsm is not an arch extension in the full sense of the word. It is still called an extension name, but it really is just putting a name to a requirement that mandates what otherwise is optional behavior in an extension (the base ISA extensions in this case). Extension names like this get ratified in conjunction with ratification of the ISA profile that introduces/defines them. |
My understanding has been that Zicclsm mandates that the element is transferred successfully in all cases and there is never a misaligned exception.
I just wanted to report back from the discussion at the end of last week in the RISC-V LLVM sync-up call.
This all seems very doable, but it would need someone supporting one of these platforms to push it forwards. Unfortunately no-one on the call was supporting such a platform, and while I could claim it's on someone's todo list, realistically there are enough other things on everyone's list that it's not likely to happen until someone comes in with patches. It would also be good to better characterise exactly which target devices have fast scalar but slow vector unaligned accesses.
…e backend. This is largely a revert of commit e817966. As llvm#88029 shows, there exists hardware that only supports unaligned scalar. I'm leaving how this gets exposed to the clang interface to a future patch.
Does hwprobe report that unaligned accesses are fast on this CPU?
hwprobe test may be insufficient to guarantee fast (or even supported) unaligned access. Bug: google/android-riscv64#142 Bug: llvm/llvm-project#88029 Change-Id: Ib673c5b752da8630296926e5ec7f59f41b686016
hwprobe test may be insufficient to guarantee fast (or even supported) unaligned access. Test case based on: llvm/llvm-project#88029 Previous commit got reverted due to compiler errors (b/336800888). Not sure why the errors were not detected in pre-submit builds. Bug: google/android-riscv64#142 Change-Id: If1c4150701298c0f351baa9ce1870509a00c250a
@sh1boot can you provide the info?
…e backend. (llvm#88954) This is largely a revert of commit e817966. As llvm#88029 shows, there exists hardware that only supports unaligned scalar. I'm leaving how this gets exposed to the clang interface to a future patch.
It says
@cyyself, you seem to be actively maintaining kernel patches for k230 on a modern kernel. Can you see what this query returns?
I wrote https://github.com/cyyself/hwprobe as a hwprobe example. The output on k230 is:

```
[cyy@archlinux hwprobe]$ ./hwprobe
MVENDORID: 5b7
MARCHID: 8000000009140d00
MIMPID: 50000
BASE_BEHAVIOR: IMA_
IMA_EXT_0: FD_C_V_Zba_Zbb_Zbs_Zbc_
CPUPERF_0: MISALIGNED-FAST_
ZICBOZ_BLOCK_SIZE: 64 Bytes
[cyy@archlinux hwprobe]$ dmesg | grep aligned
[    0.163736] cpu0: Ratio of byte access time to unaligned word access is 6.32, unaligned accesses are fast
```

I think the unaligned access probe through the hwprobe syscall on the newer kernel is OK.
The hwprobe documentation does not say whether the unaligned access check covers scalar or scalar+vector accesses. Based on this result, the implementation is only checking scalar. So the hwprobe documentation should be updated to say that explicitly, if that is what it is going to check.
Someone did this: https://lore.kernel.org/linux-riscv/CAJgzZorn5anPH8dVPqvjVWmLKqTi5bkLDR=FH-ZAcdXFnNe8Eg@mail.gmail.com/
Thanks, I guess whatever I found while searching was old.
How does the vectorizer deal with this issue? cc: @alexey-bataev @nikolaypanchenko
It relies on TTI to tell whether a misaligned vector load/store is legal. If misaligned access can be done via a set of instructions, TTI is expected to return the cost of that sequence. However, I don't think we have anything to notify the compiler that the hardware supports misaligned access.
-mno-strict-align should tell TTI that misaligned accesses are fine. I'll double check.
Shouldn't it be driven by an extension? Otherwise the compiled binary won't be portable.
It's a messy area. The only related extension defined is Zicclsm, but that only says it won't crash. I wouldn't want to vectorize under only the guarantees from that extension.
Just note that the bus error is:

```
[986636.930796] a.out[39527]: unhandled signal 7 code 0x1 at 0x0000002ae1ab16fc in a.out[2ae1ab1000+1000]
[986636.940135] CPU: 0 PID: 39527 Comm: a.out Not tainted 6.8.0+ #6
[986636.946143] Hardware name: Canaan Kendryte K230 (DT)
[986636.951192] epc : 0000002ae1ab16fc ra : 0000002ae1ab1750 sp : 0000003fc00c3960
[986636.958499] gp : 0000002ae1ab3818 tp : 0000003f903903c0 t0 : fffffffffd40322d
[986636.965805] t1 : 0000003f90290a8e t2 : 0000000003f500c8 s0 : 0000002ae1ab30a1
[986636.973111] s1 : 0000002ae1ab3021 a0 : 0000002ae1ab30a1 a1 : 0000002ae1ab3021
[986636.980417] a2 : 0000002ae1ab42a0 a3 : 0000000000000000 a4 : 0000000000000000
[986636.987722] a5 : fffffffffbad2a84 a6 : 6561327830203a73 a7 : 0000000000000040
[986636.995029] s2 : 0000000000000000 s3 : 0000002ae1ab2dd8 s4 : 0000002ae1ab1706
[986637.002334] s5 : 0000003fc00c3b18 s6 : 0000002ae1ab2dd8 s7 : 0000003f903bcd10
[986637.009639] s8 : 0000003f903bd008 s9 : 0000002acfd18394 s10: 0000000000000000
[986637.016945] s11: 0000002acfd18390 t3 : 000000000000002d t4 : 0000000000000000
[986637.024250] t5 : 0000003f903be260 t6 : 0000003f9038f528
[986637.029646] status: 8000000200006620 badaddr: 0000002ae1ab3021 cause: 0000000000000004
[986637.037653] Code: 4785 0023 00f4 60a2 6402 0141 8082 bf69 7057 cd88 (f407) 0205
```

We may need to implement misaligned vector load/store emulation in the kernel / SBI to handle this.
negge pointed me to this; there's a GCC patch with some ongoing discussion as well: https://inbox.sourceware.org/gcc-patches/mhng-69b2d28b-6f08-4560-9120-1f8efdd89051@palmer-ri-x1c9/T/#m59240e12852641aa7221ab2d926a5ec640fb0b42
That's likely the way we'll go in GCC as well. If the implementation is trapping to M-mode then we're almost certainly better off generating different code sequences, whether that's scalar or smaller-aligned vector accesses. So the "it doesn't SIGILL in userspace" extensions don't really buy us anything here.