So I added <string.h> again. #9
base: master
Conversation
Do not unroll loops manually! The compiler is perfectly capable of doing that, but in most cases it will not, because on modern CPUs manual unrolling kills performance! Cache coherence and compact code matter more than the cost of a loop's jumps. In fact, a very regular execution pattern, like the loops found in matrix calculations, will quickly bootstrap the branch predictor. Also, very simple inner loops that act on a small number of elements give the compiler the opportunity to vectorize the calculation (which modern compilers will do).

Regarding mat4x4_identity: your version has 25% memory access overhead: 16 zero writes + 4 one writes. While this will very likely happen within L1 cache, it's still a memory access. Memory accesses are worse than spending additional cycles. BTW, the code could be rewritten (without the ternary operator), but given the instruction set it will have to translate to something similar anyway, so I'd not bother with it.
Hmm, I hadn't thought of compiler optimization. I guess the for-loop unrolling is pretty silly. I love the ternary operator, so that's definitely not why I changed the code. I'm not sure I understand why you don't favor fewer cycles, but here's a lazy benchmark I made for 1 billion consecutive calls to mat4x4_identity. Anyway, thanks for the lesson!

nested for-loop mat4x4_identity(): 74.503780s
Because when it comes to memory-access-related operations, cycles are cheaper than potential memory arbitration or address-generation interlocks. On architectures with vast L1 and L2 caches the memory access will be cheaper, yes. But on architectures without such efficient caches things start to look very different. Now, given your simple test, the nested for-loop + conditional variant has a performance loss of (74.5/14.95)/1e9 = 4.9e-7% per call on your machine, and my money would be on the conditional being the culprit. I'll have to test how this thing performs on something like a Cortex-M4.
I'd be interested to know. I'm testing on x86 64-bit architecture.

1E9 mat4x4_identity() calls:
1E9 vec3_add() calls:
1E9 vec4_mul_inner() calls:
Could you please specify your build options?
[Edit] Using gcc 4.8.1. Your version does work faster than mine with the "-O3" compiler flag. I guess I had to learn about optimization flags eventually. Thanks for your patience. You can close this issue if you want.
Fixed by #40 |
Take a look at what I did in mat4x4_dup() and mat4x4_identity(). Maybe I should have just modified those and attempted to merge, but there seem to be a lot of unnecessary for-loops in the library.
I can remove those changes if you want, but it could be faster. We have the technology!