AVX512F instructions #3

Open
manodeep opened this issue Apr 4, 2018 · 6 comments

manodeep commented Apr 4, 2018

Hi,

First of all - thanks for creating (and open-sourcing) this SWIFT code! Looks great!

I was looking through the SIMD wrappers for AVX512F in vector.h and noticed a few wrappers that refer to non-existent intrinsics (at least in AVX512F) or could be implemented more directly. In particular, vec_and maps to _mm512_and_ps, which does not exist in AVX512F (at least according to the Intel Intrinsics Guide). From the looks of it, the and/or operations now only apply to masks and not to the packed data types themselves.
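For what it's worth, under plain AVX512F a bitwise AND of two __m512 registers can still be expressed by round-tripping through the integer domain - a minimal, untested sketch (the function name is just illustrative):

#include <immintrin.h>

/* Sketch: bitwise AND of two __m512 vectors using only AVX512F, by
   reinterpreting the lanes as 32-bit integers, ANDing, and casting back. */
static inline __m512 and_ps_avx512f(const __m512 a, const __m512 b)
{
    return _mm512_castsi512_ps(
        _mm512_and_epi32(_mm512_castps_si512(a), _mm512_castps_si512(b)));
}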

I also saw that vec_fabs is implemented via two intrinsics -- is the new _mm512_abs_ps intrinsic too slow?
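As far as I can tell, both variants should amount to clearing the sign bit, so I would not expect a measurable difference - a rough sketch of the two forms (illustrative names, not the vector.h macros):

#include <immintrin.h>

/* Sketch: absolute value by masking off the sign bit in the integer domain
   (the two-intrinsic form). */
static inline __m512 fabs_two_step(const __m512 x)
{
    const __m512i sign_mask = _mm512_set1_epi32(0x7FFFFFFF);
    return _mm512_castsi512_ps(
        _mm512_and_epi32(_mm512_castps_si512(x), sign_mask));
}

/* Sketch: the same operation via the dedicated AVX512F intrinsic. */
static inline __m512 fabs_one_step(const __m512 x)
{
    return _mm512_abs_ps(x);
}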

I am also curious - I do not see any references to any mask(z)_load. I found those masks quite useful for staying in SIMD mode and eliminating the serial part of the code (dealing with remainder loops for array lengths not divisible by the SIMD width).

Once again, the performance gains look awesome!

@JBorrow added the labels "help wanted" and "question" on Apr 4, 2018

gonnet commented Apr 4, 2018

Hi Manodeep,

Thanks for your feedback!

The macros in vector.h were originally written for SSE2 and subsequently extended to AVX/AVX2/AVX512/AltiVec, mostly via copy-paste, so any nonexistent intrinsics would only get caught if we actually tried using them - and I don't think we use that operation anywhere.

Regarding the vec_fabs macro, I think that's @james-s-willis's code; I'll let him comment on it :)

Cheers, Pedro

@james-s-willis

Hi @manodeep,

First of all thanks for the support!

Regarding the vec_and wrapper, you are correct: _mm512_and_ps doesn't exist. That wrapper is no longer used anywhere and was never used for AVX512, so we need to remove it. We mainly use vec_and_mask, which maps to _mm512_maskz_mov_ps.
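For reference, _mm512_maskz_mov_ps behaves like a select against zero: lanes whose mask bit is set keep their value, the rest are zeroed. A minimal sketch (illustrative only, not the exact vector.h macro):

#include <immintrin.h>

/* Sketch: "and with a mask" expressed as a zero-masked move - lanes selected
   by the mask keep the value of a, unselected lanes become 0.0f. */
static inline __m512 and_with_mask(const __mmask16 mask, const __m512 a)
{
    return _mm512_maskz_mov_ps(mask, a);
}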

vec_fabs should map to _mm512_abs_ps, we will change that.

Masked loads with mask(z)_load sound interesting. We have not looked at using those for remainder loops, but we will now. In your examples, do you set the mask to all-ones for the iterations where a full SIMD width of elements remains (so that the instruction behaves like a normal load), and then set it appropriately for the remainder iterations?

Also, how do you support this functionality with AVX and AVX2, where I am guessing these instructions are not available?

Thanks,

James


manodeep commented Apr 4, 2018

Here's how my SIMD intrinsics work with AVX512F masked loads

Copy-pasting the effective code (note that single and double precision are supported with the following):

/* Stuff in headers */
#include <stdint.h>
#include <immintrin.h>

/* Remainder masks: entry i has the lowest i bits set (entry 0 is the
   all-ones mask). Indexed by the number of leftover elements. */
const uint16_t masks_per_misalignment_value_float[] = {
0b1111111111111111,
0b0000000000000001,
0b0000000000000011,
0b0000000000000111,
0b0000000000001111,
0b0000000000011111,
0b0000000000111111,
0b0000000001111111,
0b0000000011111111,
0b0000000111111111,
0b0000001111111111,
0b0000011111111111,
0b0000111111111111,
0b0001111111111111,
0b0011111111111111,
0b0111111111111111};

const uint8_t masks_per_misalignment_value_double[] = {
0b11111111, 
0b00000001,
0b00000011,
0b00000111,
0b00001111,
0b00011111,
0b00111111,
0b01111111};


#ifdef DOUBLE_PREC
/* calculate in doubles */
#define DOUBLE  double
#define AVX512_NVEC  8
#define AVX512_FLOATS  __m512d
#define AVX512_MASK  __mmask8
#define masks_per_misalignment_value_DOUBLE  masks_per_misalignment_value_double
#define AVX512_MASKZ_LOAD_FLOATS_UNALIGNED(MASK, X)    _mm512_maskz_loadu_pd(MASK, X)
#else
/* calculate with floats */
#define DOUBLE float
#define AVX512_NVEC  16
#define AVX512_FLOATS  __m512
#define AVX512_MASK  __mmask16
#define masks_per_misalignment_value_DOUBLE  masks_per_misalignment_value_float
#define AVX512_MASKZ_LOAD_FLOATS_UNALIGNED(MASK, X) _mm512_maskz_loadu_ps(MASK, X)
#endif

/* end of stuff in headers */


/* Begin kernel code */
for(int64_t j=n_off;j<N1;j+=AVX512_NVEC) {
    /* Use the full mask while at least AVX512_NVEC elements remain,
       otherwise the partial mask covering the N1-j leftover elements */
    AVX512_MASK m_mask_left = (N1 - j) >= AVX512_NVEC ? ~0:masks_per_misalignment_value_DOUBLE[N1-j];
    /* Perform a mask load -> does not touch any memory not explicitly set via mask */
    const AVX512_FLOATS m_x1 = AVX512_MASKZ_LOAD_FLOATS_UNALIGNED(m_mask_left, localx1);
...
}

Of course, such masked loads are not supported by AVX(2). You can mimic them by implementing partial loads based on the remainder count; see, for instance, the partial loads implemented in the vectorclass library by Agner Fog.
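For illustration, a partial load of the leftover elements could look something like this under AVX/AVX2 (an untested sketch, not the vectorclass code; the integer compare needs AVX2, and under plain AVX the per-lane mask could instead come from a small lookup table):

#include <immintrin.h>

/* Sketch: load the first nleft floats from x and zero the remaining lanes.
   AVX's _mm256_maskload_ps takes a per-lane vector mask: lanes whose mask
   element has the sign bit set are loaded, the rest are zeroed. */
static inline __m256 partial_load_avx2(const float *x, const int nleft)
{
    const __m256i lane_ids = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    const __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(nleft), lane_ids);
    return _mm256_maskload_ps(x, mask);
}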


manodeep commented Apr 4, 2018

Another set of AVX512F intrinsics that might be helpful for you could be _mm512_mask(z)_compress_p(s/d), followed by _mm512_mask_reduce_add_p(s/d) (Intel compilers only) for a horizontal sum across the vector register.
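Roughly, the usage I have in mind is of this shape (an untested sketch; note that _mm512_mask_reduce_add_ps is a compiler-provided sequence rather than a single instruction, and historically only icc shipped it):

#include <immintrin.h>

/* Sketch: left-pack the lanes selected by the mask into the low lanes of the
   result, zeroing the rest. */
static inline __m512 left_pack(const __mmask16 mask, const __m512 v)
{
    return _mm512_maskz_compress_ps(mask, v);
}

/* Sketch: horizontal sum over only the lanes selected by the mask. */
static inline float masked_horizontal_sum(const __mmask16 mask, const __m512 v)
{
    return _mm512_mask_reduce_add_ps(mask, v);
}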

@james-s-willis

We could make use of masked loads in our code, but we also want to support the AVX/AVX2 instruction sets. I will look at how Agner Fog implements partial loads.

We use _mm512_mask_compressstoreu_ps to left-pack vectors and _mm512_reduce_add_ps for horizontal adds, but have never made use of _mm512_mask(z)_compress_p(s/d) and _mm512_mask_reduce_add_p(s/d). They could be useful to us, though.
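For reference, a compress-store-based left-pack is roughly of this shape (a simplified sketch of how the intrinsic is typically used, not our actual code):

#include <immintrin.h>

/* Sketch: write the lanes of v selected by the mask contiguously to dst and
   return how many lanes were written (the popcount of the mask). */
static inline int left_pack_store(float *dst, const __mmask16 mask, const __m512 v)
{
    _mm512_mask_compressstoreu_ps(dst, mask, v);
    return _mm_popcnt_u32((unsigned int)mask);
}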


manodeep commented Apr 5, 2018

AFAICS, the _mm512_reduce_add_ps operations expand to a combination of multiple instructions, so it is unclear to me that an unrolled loop (since the trip count is fixed) would be much slower. It didn't make much of a difference in my case, and I opted for portability (as in, compilers other than icc) over a slight loss of performance.
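Something along these lines is what I mean by the portable fallback (a sketch, not my exact code):

#include <immintrin.h>

/* Sketch: portable horizontal sum of a __m512 - store the lanes and add them
   with a fixed trip-count loop that the compiler can fully unroll. */
static inline float horizontal_sum_ps(const __m512 v)
{
    float lanes[16];
    _mm512_storeu_ps(lanes, v);
    float sum = 0.0f;
    for (int i = 0; i < 16; i++) {
        sum += lanes[i];
    }
    return sum;
}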
