
So I added <string.h> again. #9

Open · wants to merge 3 commits into master
Conversation

@ghost commented Jan 22, 2014

Take a look at what I did in mat4x4_dup() and mat4x4_identity(). Maybe I should have just modified those and attempted to merge, but there seem to be a lot of unnecessary for-loops in the library.

I can remove those changes if you want, but it could be faster. We have the technology!

@datenwolf (Owner)

Do not unroll loops manually! The compiler is perfectly capable of doing that, but in most cases it will not, because on modern CPUs unrolling loops can kill performance: cache coherence and compact code matter more than the cost of a loop's jumps. In fact, a very regular execution pattern, like the loops found in matrix calculations, will quickly train the branch predictor.

Also, very simple inner loops acting on a small, fixed number of elements give the compiler the opportunity to vectorize the calculation (which modern compilers will do).

Regarding mat4x4_identity: your version has 25% memory-access overhead, 16 zero writes plus 4 one writes instead of 16 writes total. While this will very likely stay within the L1 cache, it is still memory access, and memory accesses are worse than spending a few additional cycles. BTW, the code could be rewritten without the ternary operator, but given the instruction set it will have to translate to something similar anyway, so I'd not bother with it:

M[i][j] = (float)(i==j)

@ghost (Author) commented Jan 22, 2014

Hmm, I hadn't thought about compiler optimization. I guess the for-loop unrolling is pretty silly.

I love the ternary operator, so that's definitely not why I changed the code. I'm not sure I understand why you don't favor fewer cycles, but here's a lazy benchmark I made: 1 billion consecutive calls to mat4x4_identity. Anyway, thanks for the lesson!

nested for-loop mat4x4_identity(): 74.503780s
memset mat4x4_identity(): 14.950094s

@datenwolf (Owner)

Because when it comes to memory-access-bound operations, cycles are cheaper than a potential memory arbitration or address-generation interlock. On architectures with vast L1 and L2 caches the memory access will be cheaper, yes. But on architectures without such efficient caches, things start to look very different.

Now, given your simple test, the nested for-loop + conditional variant costs roughly (74.5 − 14.95) s / 1e9 ≈ 60 ns extra per call on your machine, and my money would be on the conditional being the culprit.

I'll have to test how this thing performs on something like a Cortex-M4.

@ghost (Author) commented Jan 22, 2014

I'd be interested to know. I'm testing on x86 64-bit architecture.

1E9 mat4x4_identity() calls:
nested for-loop: 72.967231s
memset: 12.297672s
unrolled for-loop: 25.577239s

1E9 vec3_add() calls:
for-loop: 28.239913s
unrolled: 10.524108s

1E9 vec4_mul_inner() calls:
for-loop: 38.277205s
unrolled: 16.387929s

@datenwolf (Owner)

Could you please specify your build options?

@ghost (Author) commented Jan 23, 2014

[Edit] Using gcc 4.8.1. Your version does run faster than mine with the compiler flag "-O3". I guess I had to learn about optimization flags eventually. Thanks for your patience.

You can close this issue if you want.

@iTitus commented Jan 31, 2021

Fixed by #40
