Support gather for different sizes of types on data and indices #751

Open
Yuhta opened this issue May 20, 2022 · 2 comments
Comments

@Yuhta
Contributor

Yuhta commented May 20, 2022

We have recently been using xsimd to make our Velox query evaluation engine portable. One of the gaps we found is that xsimd does not support gather when the data and index types have different sizes. For example, we gather int64 data with int32 indices, which can be implemented on AVX2 using __m256i as the data register and __m128i as the index register. Is there a way to solve this? You can refer to our implementation for some ideas. If you agree with our approach, we can even help integrate the implementation into xsimd.
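
For reference, a minimal sketch of the raw AVX2 operation this corresponds to; the wrapper name gather_int64_int32 is purely illustrative (not from xsimd or Velox), _mm256_i32gather_epi64 is the underlying intrinsic:

#include <immintrin.h>
#include <cstdint>

// Illustrative wrapper: gather 4 x int64 values with 4 x int32 indices on AVX2,
// i.e. a 256-bit data register addressed by a 128-bit index register.
inline __m256i gather_int64_int32(const int64_t* base, __m128i indices)
{
    // scale = sizeof(int64_t): the indices are element offsets, not byte offsets.
    return _mm256_i32gather_epi64(reinterpret_cast<const long long*>(base),
                                  indices, sizeof(int64_t));
}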

In our project we implemented a HalfBatch type that resolves to a batch backed by __m128i on AVX2, and we use it for the index registers. The details can be found here: https://github.com/facebookincubator/velox/blob/main/velox/common/base/SimdUtil.h#L76-L132

Our gather and maskGather implementation: https://github.com/facebookincubator/velox/blob/main/velox/common/base/SimdUtil.h#L134-L268

I am happy to answer any questions you have. And thank you for creating this library; it really helps us rewrite our SIMD code in a portable and readable manner.

@serge-sans-paille
Contributor

I'm not sure about the HalfBatch, but if I were to implement it in xsimd, I would make it a type adaptor, something like

xsimd::half_batch<B>::type

instead of introducing new batch types. It would map batch<float, avx2> to batch<float, sse4.2>.

But the fact that it doesn't have any specialization for SSE disturbs me.
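
A minimal sketch of what such a type adaptor could look like; half_batch here is a hypothetical trait, not existing xsimd API (only batch, avx2, and sse4_2 are existing xsimd names):

#include <xsimd/xsimd.hpp>

// Hypothetical adaptor: map a batch to the batch of the same value type
// on the architecture with half the register width.
template <class B>
struct half_batch;

// e.g. batch<T, avx2> -> batch<T, sse4_2>
template <class T>
struct half_batch<xsimd::batch<T, xsimd::avx2>>
{
    using type = xsimd::batch<T, xsimd::sse4_2>;
};

template <class B>
using half_batch_t = typename half_batch<B>::type;

// usage: half_batch_t<xsimd::batch<float, xsimd::avx2>> is xsimd::batch<float, xsimd::sse4_2>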

@amyspark
Contributor

This is what I did to hand-optimize two cases we use at Krita:

// gather: handmade conversions
// Gather doubles with 32-bit indices, then narrow the result to float.
template <class A, class V, detail::enable_sized_integral_t<V, 4> = 0>
inline batch<float, A> gather(batch<float, A> const&, double const* src,
                              batch<V, A> const& index,
                              requires_arch<avx2>) noexcept
{
    // Split the 8 x int32 index register into two 4 x int32 halves and gather 4 doubles with each.
    const batch<double, A> low(_mm256_i32gather_pd(src, _mm256_castsi256_si128(index.data), sizeof(double)));
    const batch<double, A> high(_mm256_i32gather_pd(src, _mm256_extractf128_si256(index.data, 1), sizeof(double)));
    // Narrow each 4 x double half to 4 x float and merge the two 128-bit halves into one 256-bit batch.
    return detail::merge_sse(_mm256_cvtpd_ps(low.data), _mm256_cvtpd_ps(high.data));
}

// Gather doubles with 32-bit indices, then convert the result to int32.
template <class A, class V, detail::enable_sized_integral_t<V, 4> = 0>
inline batch<int32_t, A> gather(batch<int32_t, A> const&, double const* src,
                                batch<V, A> const& index,
                                requires_arch<avx2>) noexcept
{
    const batch<double, A> low(_mm256_i32gather_pd(src, _mm256_castsi256_si128(index.data), sizeof(double)));
    const batch<double, A> high(_mm256_i32gather_pd(src, _mm256_extractf128_si256(index.data, 1), sizeof(double)));
    return detail::merge_sse(_mm256_cvtpd_epi32(low.data), _mm256_cvtpd_epi32(high.data));
}

Instead of using separate batch types, I would suggest using SFINAE on the size of the index batch.
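
A minimal sketch of such an SFINAE constraint, assuming a hypothetical free function gather_half_index (not existing xsimd API) and restricting to the 64-bit-data / 32-bit-index case on AVX2:

#include <immintrin.h>
#include <xsimd/xsimd.hpp>
#include <cstdint>
#include <type_traits>

// Hypothetical helper: this overload participates in resolution only when the
// index element type is half the width of the data element type (here, 32-bit
// indices addressing 64-bit integer data).
template <class T, class V,
          typename std::enable_if<std::is_integral<T>::value && std::is_integral<V>::value
                                      && sizeof(T) == 8 && sizeof(V) == 4,
                                  int>::type = 0>
xsimd::batch<T, xsimd::avx2> gather_half_index(T const* src,
                                               xsimd::batch<V, xsimd::avx2> const& index)
{
    // Only the low 4 x int32 lanes of the index register are needed to address
    // the 4 x 64-bit lanes of the 256-bit result.
    __m128i idx = _mm256_castsi256_si128(index.data);
    return xsimd::batch<T, xsimd::avx2>(
        _mm256_i32gather_epi64(reinterpret_cast<long long const*>(src), idx, sizeof(int64_t)));
}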
