
Don't send commas to stage 2, avoid clmul in most cases #2049

Draft · wants to merge 28 commits into base: master
Conversation

@jkeiser (Member) commented Aug 15, 2023
The algorithm detects all missing/extra separator errors in stage 1, and then doesn't send commas.

borrow_out = result >= value1;
return result;
#else
return __builtin_subcll(value1, value2, borrow, &borrow);
Review comment (Member):

At a glance, it looks like __builtin_subcll is LLVM specific?

It might be worth guarding its usage:
https://gcc.gnu.org/onlinedocs/cpp/_005f_005fhas_005fbuiltin.html
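A guarded version along those lines might look like the following sketch. It assumes only that `__has_builtin` is available (or stubbed out for compilers that lack it), and falls back to the manual variant from this thread when `__builtin_subcll` is absent:

```cpp
#include <cstdint>

#ifndef __has_builtin
#define __has_builtin(x) 0  // compilers without __has_builtin take the fallback path
#endif

using borrow_t = unsigned long long;

uint64_t subtract_borrow(const uint64_t value1, const uint64_t value2, borrow_t& borrow) noexcept {
#if __has_builtin(__builtin_subcll)
  // Clang (and recent GCC) borrow-propagating subtract builtin.
  return __builtin_subcll(value1, value2, borrow, &borrow);
#else
  // Manual fallback, matching the implementation in this PR.
  uint64_t result = value1 - value2 - borrow;
  borrow = result >= value1;
  return result;
#endif
}
```

Both paths agree for the cases exercised here; the manual comparison has a quirk when `value2 + borrow == 0` (discussed further down the thread), which callers avoid.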

It might also be worth examining alternatives:

https://godbolt.org/z/1WT9nPv6M

#include <cstdint>

using borrow_t = unsigned long long;

uint64_t subtract_borrow(const uint64_t value1, const uint64_t value2, borrow_t& borrow) noexcept {
   return __builtin_subcll(value1, value2, borrow, &borrow);
}

uint64_t subtract_borrow_manual(const uint64_t value1, const uint64_t value2, borrow_t& borrow) noexcept {
  uint64_t result = value1 - value2 - borrow;
  borrow = result >= value1;
  return result;
}
#if defined(_M_X64) || defined(__amd64__)
#include <x86intrin.h>

// visual studio has _subborrow_u64 in <intrin.h>
// https://learn.microsoft.com/en-us/cpp/intrinsics/x64-amd64-intrinsics-list?view=msvc-170
//
uint64_t subtract_borrow_intel(const uint64_t value1, const uint64_t value2, uint8_t& borrow) {
    uint64_t result;
    borrow = _subborrow_u64(borrow, value1, value2, (unsigned long long *)&result);
    return result;
}
#endif

@jkeiser (Member, Author) replied:

It's entirely possible; I constructed these as analogues of the add_overflow() implementation for the given architecture (i.e. pulling from the same libraries and using the same #ifdefs).

@jkeiser (Member, Author) commented Aug 30, 2023

@lemire I made some more variants in this Godbolt. Just based on manual inspection of the assembly, if I had to choose a subtract_borrow implementation, I would choose the clang one, because:

  • The manual version is a tiny bit longer (though neither seems particularly bad).
  • __builtin_usubll_overflow(value1, value2 + borrow) produces significantly shorter code than __builtin_subcll(value1, value2, borrow). This is very strange.
  • Storing the overflow in a bool is universally shorter, partly because bools are guaranteed to be 0 or 1 and can therefore be set directly from a flag, and partly because flags can be moved quickly into bools but not into 64-bit values.
  • Other than that, the builtins are slightly shorter than manual, but really not by very much.
uint64_t subtract_borrow_using_overflow_bool(const uint64_t value1, const uint64_t value2, bool& borrow) {
  unsigned long long result;
  borrow = __builtin_usubll_overflow(value1, value2 + borrow, &result);
  return result;
}

uint64_t subtract_borrow_intel_bool(const uint64_t value1, const uint64_t value2, bool& borrow) {
    unsigned long long result;
    borrow = _subborrow_u64(borrow, value1, value2, (unsigned long long *)&result);
    return result;
}

uint64_t subtract_borrow_manual_bool(const uint64_t value1, const uint64_t value2, bool& borrow) noexcept {
  uint64_t result = value1 - value2 - borrow;
  borrow = result >= value1;
  return result;
}
subtract_borrow_using_overflow_bool(unsigned long, unsigned long, bool&):
        # result = value1 - value2 - overflow
        mov     rax, rdi # value1
        movzx   ecx, byte ptr [rdx] # overflow
        add     rcx, rsi # overflow + value2
        sub     rax, rcx # value1 - (overflow + value2)

        # overflow = result >= value1
        setb    byte ptr [rdx]

subtract_borrow_intel_bool(unsigned long, unsigned long, bool&):
        # result = value1 - value2 - overflow
        mov     rax, rdi # value1
        movzx   ecx, byte ptr [rdx] # overflow
        add     cl, -1 # cl = overflow - 1 ???
        sbb     rax, rsi # value1 - value2 - overflow

        # overflow = result >= value1
        setb    byte ptr [rdx]

subtract_borrow_manual_bool(unsigned long, unsigned long, bool&):
        # result = value1 - (value2 + overflow)
        mov     rax, rdi # value1
        movzx   ecx, byte ptr [rdx] # overflow
        add     rcx, rsi # overflow + value2
        sub     rax, rcx # value1 - (value2 + overflow)

        # overflow = (result >= value1)
        cmp     rax, rdi
        setae   byte ptr [rdx]

@jkeiser (Member, Author) commented Aug 30, 2023

Added some comments in the assembly for easier following.

Bottom line on Ice Lake, once you use bool overflow:

  1. value1 - value2 - borrow; overflow = result >= value1 is the best. It produces the same number of instructions as the others, but two of them exist purely to compute the overflow; only 3 instructions (with lower latency) are required to compute the result.
  2. __builtin_usubll_overflow(value1, value2 + borrow) is the second best, with the same number of instructions but a longer chain to calculate the result.
  3. _subborrow_u64(borrow, value1, value2) is the worst, acting similarly to __builtin_usubll_overflow but using SBB, which runs on only 2 ports instead of SUB's 4.

Of course, running these in a performance test is the only way to know for sure, since the processor does some minor JIT-ish activities :)
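Before timing them, a quick agreement check between the two portable candidates can rule out semantic surprises. This is a sketch, assuming GCC/Clang's `__builtin_usubll_overflow`; the helper name `variants_agree` is illustrative, not from the PR:

```cpp
#include <cstdint>

// The two portable candidates from the thread, side by side.
uint64_t subtract_borrow_using_overflow_bool(uint64_t value1, uint64_t value2, bool& borrow) {
  unsigned long long result;
  borrow = __builtin_usubll_overflow(value1, value2 + borrow, &result);
  return result;
}

uint64_t subtract_borrow_manual_bool(uint64_t value1, uint64_t value2, bool& borrow) noexcept {
  uint64_t result = value1 - value2 - borrow;
  // Note: when value2 + borrow == 0, result == value1 and this comparison
  // reports a spurious borrow; the uses in this thread avoid that case.
  borrow = result >= value1;
  return result;
}

// Returns true when both variants agree on result and borrow-out.
bool variants_agree(uint64_t value1, uint64_t value2, bool borrow_in) {
  bool b1 = borrow_in, b2 = borrow_in;
  uint64_t r1 = subtract_borrow_using_overflow_bool(value1, value2, b1);
  uint64_t r2 = subtract_borrow_manual_bool(value1, value2, b2);
  return r1 == r2 && b1 == b2;
}
```

With agreement established on the inputs that matter, a cycle-level benchmark (as done below) is the real tiebreaker.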

@jkeiser (Member, Author) commented Aug 30, 2023

On ARM, it looks like __builtin_usubll_overflow(value1, value2 + borrow) is the winner, with __builtin_subcll once again producing more instructions, and the manual version producing a lot more instructions.

@jkeiser (Member, Author) commented Aug 30, 2023

And with GCC 13.2 on Intel, all the variants look much the same. _subborrow_u64(0, value1, value2 + borrow) wins by avoiding an extra AND; _subborrow_u64(borrow, value1, value2) loses because of SBB, as with clang.

subtract_borrow_using_overflow_bool(unsigned long, unsigned long, bool&):
        movzx   ecx, BYTE PTR [rdx]
        mov     rax, rdi
        add     rcx, rsi
        sub     rax, rcx

        setb    BYTE PTR [rdx]
        and     BYTE PTR [rdx], 1

subtract_borrow_manual_bool(unsigned long, unsigned long, bool&):
        movzx   ecx, BYTE PTR [rdx]
        mov     rax, rdi
        sub     rax, rsi
        sub     rax, rcx

        cmp     rax, rdi
        setnb   BYTE PTR [rdx]

subtract_borrow_intel_bool(unsigned long, unsigned long, bool&):
        movzx   ecx, BYTE PTR [rdx]
        mov     rax, rdi
        add     cl, -1
        sbb     rax, rsi

        setc    BYTE PTR [rdx]

subtract_borrow_using_overflow_intel_bool(unsigned long, unsigned long, bool&):
        movzx   ecx, BYTE PTR [rdx]
        mov     rax, rdi
        add     rcx, rsi
        sub     rax, rcx

        setb    BYTE PTR [rdx]

@jkeiser (Member, Author) commented Aug 30, 2023

Changing subtract_borrow to the manual version on Ice Lake brings stage 1 down from 93.2353 -> 93.0503 instructions/block and 28.4788 -> 27.7167 cycles/block, a nearly 3% speed improvement. I'll see if it can rescue the speculative string parser, as well.
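As a standalone sanity check on the manual version (illustrative names, not simdjson code), the borrow chains correctly across 64-bit limbs as long as the subtrahend plus incoming borrow is nonzero:

```cpp
#include <cstdint>

// The "manual version" referred to above.
uint64_t subtract_borrow_manual_bool(uint64_t value1, uint64_t value2, bool& borrow) noexcept {
  uint64_t result = value1 - value2 - borrow;
  borrow = result >= value1;
  return result;
}

// Computes {hi,lo} = 2^64 - 1 via two chained limb subtracts.
// Returns the borrow out of the top limb (expected: false).
bool chain_demo(uint64_t& lo, uint64_t& hi) {
  bool borrow = false;
  lo = subtract_borrow_manual_bool(0, 1, borrow);  // 0 - 1: wraps, sets the borrow
  hi = subtract_borrow_manual_bool(1, 0, borrow);  // 1 - 0 - 1 = 0, clears the borrow
  return borrow;
}
```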
