So I added <string.h> again. #9
base: master
Conversation
Do not unroll loops manually! The compiler is perfectly capable of doing that, but in most cases it will not, because on modern CPUs manual unrolling kills performance! Cache coherence and compact code matter more than the cost of a loop's jumps. In fact, a very regular execution pattern, like the loops found in matrix calculations, will quickly bootstrap the branch predictor. Also, very simple inner loops that act on a small number of elements give the compiler the opportunity to vectorize the calculation (which modern compilers will do).

Regarding mat4x4_identity: your version has 25% memory access overhead: 16 zero writes + 4 one writes. While this will very likely happen within L1 cache, it's still a memory access. Memory accesses are worse than spending additional cycles. BTW, the code could be rewritten (without the ternary operator), but given the instruction set it will have to translate to something similar anyway, so I'd not bother with it.
Hmm, I hadn't thought of compiler optimization. I guess the for-loop unrolling is pretty silly. I love the ternary operator, so that's definitely not why I changed the code. I'm not sure I understand why you don't favor fewer cycles, but here's a lazy benchmark I made for 1 billion consecutive calls to mat4x4_identity. Anyway, thanks for the lesson!

nested for-loop mat4x4_identity(): 74.503780s
Because when it comes to memory-access-related operations, cycles are cheaper than potential memory arbitration or address-generation interlocks. On architectures with vast L1 and L2 caches the memory access will be cheaper, yes. But on architectures without such efficient caches things start to look very different. Now, given your simple test, the nested for-loop + conditional variant has a performance loss of (74.5/14.95)/1e9 = 4.9e-7% per call on your machine, and my money would be on the conditional being the culprit. I'll have to test how this thing performs on something like a Cortex-M4.
I'd be interested to know. I'm testing on x86 64-bit architecture.

1E9 mat4x4_identity() calls:
1E9 vec3_add() calls:
1E9 vec4_mul_inner() calls:
Could you please specify your build options?
[Edit] Using gcc 4.8.1. Your version does work faster than mine with the "-O3" compiler flag. I guess I had to learn about optimization flags eventually. Thanks for your patience. You can close this issue if you want.
Fixed by #40 |
Take a look at what I did in mat4x4_dup() and mat4x4_identity(). Maybe I should have just modified those and attempted to merge, but there seem to be a lot of unnecessary for-loops in the library.
I can remove those changes if you want, but it could be faster. We have the technology!