
Don't send commas to stage 2, avoid clmul in most cases #2049

Draft · wants to merge 28 commits into base: master
Conversation

@jkeiser (Member) commented Aug 15, 2023
The algorithm detects all missing/extra separator errors in stage 1, and then doesn't send commas.

borrow_out = result >= value1;
return result;
#else
return __builtin_subcll(value1, value2, borrow, &borrow);
Review comment (Member):

At a glance, it looks like __builtin_subcll is LLVM specific?

It might be worth guarding its usage:
https://gcc.gnu.org/onlinedocs/cpp/_005f_005fhas_005fbuiltin.html
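A guarded version along those lines might look like the following sketch. It assumes only that `__has_builtin` is available (or stubbed out for compilers that lack it), and falls back to the manual variant from this thread when `__builtin_subcll` is absent:

```cpp
#include <cstdint>

#ifndef __has_builtin
#define __has_builtin(x) 0  // compilers without __has_builtin take the fallback path
#endif

using borrow_t = unsigned long long;

uint64_t subtract_borrow(const uint64_t value1, const uint64_t value2, borrow_t& borrow) noexcept {
#if __has_builtin(__builtin_subcll)
  // Clang (and recent GCC) borrow-propagating subtract builtin.
  return __builtin_subcll(value1, value2, borrow, &borrow);
#else
  // Manual fallback, matching the implementation in this PR.
  uint64_t result = value1 - value2 - borrow;
  borrow = result >= value1;
  return result;
#endif
}
```

Both paths agree for the cases exercised here; the manual comparison has a quirk when `value2 + borrow == 0` (discussed further down the thread), which callers avoid.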

It might also be worth examining alternatives:

https://godbolt.org/z/1WT9nPv6M

#include <cstdint>

using borrow_t = unsigned long long;

uint64_t subtract_borrow(const uint64_t value1, const uint64_t value2, borrow_t& borrow) noexcept {
   return __builtin_subcll(value1, value2, borrow, &borrow);
}

uint64_t subtract_borrow_manual(const uint64_t value1, const uint64_t value2, borrow_t& borrow) noexcept {
  uint64_t result = value1 - value2 - borrow;
  borrow = result >= value1;
  return result;
}
#if defined(_M_X64) || defined(__amd64__)
#include <x86intrin.h>

// visual studio has _subborrow_u64 in <intrin.h>
// https://learn.microsoft.com/en-us/cpp/intrinsics/x64-amd64-intrinsics-list?view=msvc-170
//
uint64_t subtract_borrow_intel(const uint64_t value1, const uint64_t value2, uint8_t& borrow) {
    uint64_t result;
    borrow = _subborrow_u64(borrow, value1, value2, (unsigned long long *)&result);
    return result;
}
#endif

@jkeiser (Member, Author) replied:

It's entirely possible; I constructed these as analogues of the add_overflow() implementation for the given architecture (i.e. pulling from the same libraries and using the same #ifdefs).

@jkeiser (Member, Author) commented Aug 30, 2023

@lemire I made some more variants in this Godbolt. Just based on manual inspection of the assembly, if I had to choose a subtract_borrow implementation, I would choose the clang one, because:

  • The manual version is a tiny bit longer (though neither seems particularly bad).
  • __builtin_usubll_overflow(value1, value2 + borrow) produces significantly shorter code than __builtin_subcll(value1, value2, borrow). This is very strange.
  • Storing the overflow in a bool is universally shorter, partly because bools are guaranteed to be 0 or 1 and can therefore be set directly from a flag, and partly because flags can be moved quickly into bools but not into 64-bit values.
  • Other than that, the builtins are slightly shorter than manual, but really not by very much.
uint64_t subtract_borrow_using_overflow_bool(const uint64_t value1, const uint64_t value2, bool& borrow) {
  unsigned long long result;
  borrow = __builtin_usubll_overflow(value1, value2 + borrow, &result);
  return result;
}

uint64_t subtract_borrow_intel_bool(const uint64_t value1, const uint64_t value2, bool& borrow) {
    unsigned long long result;
    borrow = _subborrow_u64(borrow, value1, value2, (unsigned long long *)&result);
    return result;
}

uint64_t subtract_borrow_manual_bool(const uint64_t value1, const uint64_t value2, bool& borrow) noexcept {
  uint64_t result = value1 - value2 - borrow;
  borrow = result >= value1;
  return result;
}
subtract_borrow_using_overflow_bool(unsigned long, unsigned long, bool&):
        # result = value1 - value2 - overflow
        mov     rax, rdi # value1
        movzx   ecx, byte ptr [rdx] # overflow
        add     rcx, rsi # overflow + value2
        sub     rax, rcx # value1 - (overflow + value2)

        # overflow = result >= value1
        setb    byte ptr [rdx]

subtract_borrow_intel_bool(unsigned long, unsigned long, bool&):
        # result = value1 - value2 - overflow
        mov     rax, rdi # value1
        movzx   ecx, byte ptr [rdx] # overflow
        add     cl, -1 # cl = overflow - 1 ???
        sbb     rax, rsi # value1 - value2 - overflow

        # overflow = result >= value1
        setb    byte ptr [rdx]

subtract_borrow_manual_bool(unsigned long, unsigned long, bool&):
        # result = value1 - (value2 + overflow)
        mov     rax, rdi # value1
        movzx   ecx, byte ptr [rdx] # overflow
        add     rcx, rsi # overflow + value2
        sub     rax, rcx # value1 - (value2 + overflow)

        # overflow = (result >= value1)
        cmp     rax, rdi
        setae   byte ptr [rdx]

@jkeiser (Member, Author) commented Aug 30, 2023

Added some comments in the assembly for easier following.

Bottom line on Ice Lake, once you use bool overflow:

  1. value1 - value2 - borrow; overflow = result >= value1 is the best. It produces the same number of instructions as the others, but two of them exist purely to compute the overflow; only 3 instructions (with lower latency) are required to compute the result.
  2. __builtin_usubll_overflow(value1, value2 + borrow) is the second best, with the same number of instructions but a longer chain to calculate the result.
  3. _subborrow_u64(borrow, value1, value2) is the worst, acting similarly to __builtin_usubll_overflow but using SBB, which runs on only 2 ports instead of SUB's 4.

Of course, running these in a performance test is the only way to know for sure, since the processor does some minor JIT-ish activities :)
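Before timing them, a quick agreement check between the two portable candidates can rule out semantic surprises. This is a sketch, assuming GCC/Clang's `__builtin_usubll_overflow`; the helper name `variants_agree` is illustrative, not from the PR:

```cpp
#include <cstdint>

// The two portable candidates from the thread, side by side.
uint64_t subtract_borrow_using_overflow_bool(uint64_t value1, uint64_t value2, bool& borrow) {
  unsigned long long result;
  borrow = __builtin_usubll_overflow(value1, value2 + borrow, &result);
  return result;
}

uint64_t subtract_borrow_manual_bool(uint64_t value1, uint64_t value2, bool& borrow) noexcept {
  uint64_t result = value1 - value2 - borrow;
  // Note: when value2 + borrow == 0, result == value1 and this comparison
  // reports a spurious borrow; the uses in this thread avoid that case.
  borrow = result >= value1;
  return result;
}

// Returns true when both variants agree on result and borrow-out.
bool variants_agree(uint64_t value1, uint64_t value2, bool borrow_in) {
  bool b1 = borrow_in, b2 = borrow_in;
  uint64_t r1 = subtract_borrow_using_overflow_bool(value1, value2, b1);
  uint64_t r2 = subtract_borrow_manual_bool(value1, value2, b2);
  return r1 == r2 && b1 == b2;
}
```

With agreement established on the inputs that matter, a cycle-level benchmark (as done below) is the real tiebreaker.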

@jkeiser (Member, Author) commented Aug 30, 2023

On ARM, it looks like __builtin_usubll_overflow(value1, value2 + borrow) is the winner, with __builtin_subcll once again producing more instructions, and the manual version producing a lot more instructions.

@jkeiser (Member, Author) commented Aug 30, 2023

And with GCC 13.2 on Intel, all the variants look much the same. _subborrow_u64(0, value1, value2 + borrow) wins by avoiding an extra AND; _subborrow_u64(borrow, value1, value2) loses because of SBB, as with clang.

subtract_borrow_using_overflow_bool(unsigned long, unsigned long, bool&):
        movzx   ecx, BYTE PTR [rdx]
        mov     rax, rdi
        add     rcx, rsi
        sub     rax, rcx

        setb    BYTE PTR [rdx]
        and     BYTE PTR [rdx], 1

subtract_borrow_manual_bool(unsigned long, unsigned long, bool&):
        movzx   ecx, BYTE PTR [rdx]
        mov     rax, rdi
        sub     rax, rsi
        sub     rax, rcx

        cmp     rax, rdi
        setnb   BYTE PTR [rdx]

subtract_borrow_intel_bool(unsigned long, unsigned long, bool&):
        movzx   ecx, BYTE PTR [rdx]
        mov     rax, rdi
        add     cl, -1
        sbb     rax, rsi

        setc    BYTE PTR [rdx]

subtract_borrow_using_overflow_intel_bool(unsigned long, unsigned long, bool&):
        movzx   ecx, BYTE PTR [rdx]
        mov     rax, rdi
        add     rcx, rsi
        sub     rax, rcx

        setb    BYTE PTR [rdx]

@jkeiser (Member, Author) commented Aug 30, 2023

Changing subtract_borrow to the manual version on Ice Lake brings stage 1 down from 93.2353 -> 93.0503 instructions/block and 28.4788 -> 27.7167 cycles/block, a nearly 3% speed improvement. I'll see if it can rescue the speculative string parser, as well.
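As a standalone sanity check on the manual version (illustrative names, not simdjson code), the borrow chains correctly across 64-bit limbs as long as the subtrahend plus incoming borrow is nonzero:

```cpp
#include <cstdint>

// The "manual version" referred to above.
uint64_t subtract_borrow_manual_bool(uint64_t value1, uint64_t value2, bool& borrow) noexcept {
  uint64_t result = value1 - value2 - borrow;
  borrow = result >= value1;
  return result;
}

// Computes {hi,lo} = 2^64 - 1 via two chained limb subtracts.
// Returns the borrow out of the top limb (expected: false).
bool chain_demo(uint64_t& lo, uint64_t& hi) {
  bool borrow = false;
  lo = subtract_borrow_manual_bool(0, 1, borrow);  // 0 - 1: wraps, sets the borrow
  hi = subtract_borrow_manual_bool(1, 0, borrow);  // 1 - 0 - 1 = 0, clears the borrow
  return borrow;
}
```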
