How to convert two __m256d to one __m512d using intrinsics

Zekun_Y_ · 10-23-2016

Hi,

I need to convert two __m256d variables to one __m512d variable.

For example, if __m256d vA holds {0,1,2,3} and __m256d vB holds {4,5,6,7}, I want to convert vA and vB to a __m512d vC that holds {0,1,2,3,4,5,6,7}.

Is there any efficient way to do this using AVX512 intrinsics?

Thank you!

McCalpinJohn · 10-24-2016

The _mm512_mask_shuffle_f64x2 intrinsic generates the VSHUFF64x2 instruction, which can do what you want.

The intrinsic expects __m512d inputs, but it should be possible to cast the __m256d inputs to __m512d types in the argument list for the intrinsic function.
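A minimal sketch of the shuffle approach, for illustration only. It uses the unmasked _mm512_shuffle_f64x2 rather than the masked variant named above, since no masking is needed here. The function name concat_shuffle and the pointer-based interface are hypothetical, and the target attribute is a GCC/Clang assumption so the file builds without extra flags.

```c
#include <immintrin.h>

/* GCC/Clang-specific assumption: enable AVX-512F for this function only.
   With other compilers, drop the attribute and enable AVX-512F via flags. */
__attribute__((target("avx512f")))
void concat_shuffle(const double *a4, const double *b4, double *out8)
{
    __m256d vA = _mm256_loadu_pd(a4);   /* {a0,a1,a2,a3} */
    __m256d vB = _mm256_loadu_pd(b4);   /* {b0,b1,b2,b3} */

    /* These casts generate no instructions; the upper 256 bits of each
       widened value are undefined. */
    __m512d wideA = _mm512_castpd256_pd512(vA);
    __m512d wideB = _mm512_castpd256_pd512(vB);

    /* imm8 = 0x44 = 0b01000100:
       result 128-bit lanes 0,1 come from lanes 0,1 of the first source,
       result 128-bit lanes 2,3 come from lanes 0,1 of the second source.
       The undefined upper halves are never selected. */
    __m512d vC = _mm512_shuffle_f64x2(wideA, wideB, 0x44);

    _mm512_storeu_pd(out8, vC);         /* {a0..a3,b0..b3} */
}
```

Calling concat_shuffle with {0,1,2,3} and {4,5,6,7} should store {0,1,2,3,4,5,6,7} on a CPU with AVX-512F.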

andysem · 10-24-2016

You can use _mm512_castpd256_pd512 and _mm512_castpd_si512 to convert one of the arguments to __m512i and then _mm512_inserti64x4 to insert the other argument into the high half of the __m512i and lastly use _mm512_castsi512_pd to cast back to __m512d.
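The sequence described above might look like the following sketch. The function name concat_insert_int_domain is hypothetical. One extra cast, _mm256_castpd_si256, is added because _mm512_inserti64x4 takes a __m256i as the value to insert. The target attribute is a GCC/Clang assumption.

```c
#include <immintrin.h>

/* GCC/Clang-specific assumption: enable AVX-512F for this function only. */
__attribute__((target("avx512f")))
void concat_insert_int_domain(const double *a4, const double *b4, double *out8)
{
    __m256d vA = _mm256_loadu_pd(a4);
    __m256d vB = _mm256_loadu_pd(b4);

    /* Widen vA to 512 bits (upper half undefined), then reinterpret the
       bits as integer data. Neither cast generates an instruction. */
    __m512i lo = _mm512_castpd_si512(_mm512_castpd256_pd512(vA));

    /* Reinterpret vB as integer data so it matches the insert's operand type. */
    __m256i hi = _mm256_castpd_si256(vB);

    /* Write vB into the upper 256 bits (index 1); the lower half keeps vA. */
    __m512i joined = _mm512_inserti64x4(lo, hi, 1);

    /* Reinterpret back to packed doubles and store. */
    _mm512_storeu_pd(out8, _mm512_castsi512_pd(joined));
}
```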

I recommend using the Intrinsics Guide (https://software.intel.com/sites/landingpage/IntrinsicsGuide) to find the needed intrinsics.

areid2 · 11-23-2016

For the second (insert) solution you shouldn't have to cast to integer data types, which might have a small performance penalty. You can just use:

__m256d a;
__m256d b;
__m512d c = _mm512_insertf64x4(_mm512_castpd256_pd512(a), b, 1);

Unless I misunderstand your notation, I don't think that you actually need to shuffle the vector elements. In this case the insert solution seems slightly better than the shuffle solution.
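For completeness, here is the insert one-liner wrapped in a small function that loads and stores through pointers, as a sketch. The name concat_insert and the pointer interface are hypothetical, and the target attribute is a GCC/Clang assumption; with other compilers, enable AVX-512F through flags instead.

```c
#include <immintrin.h>

/* GCC/Clang-specific assumption: enable AVX-512F for this function only. */
__attribute__((target("avx512f")))
void concat_insert(const double *a4, const double *b4, double *out8)
{
    __m256d a = _mm256_loadu_pd(a4);
    __m256d b = _mm256_loadu_pd(b4);

    /* The cast puts a in the lower 256 bits and leaves the upper 256 bits
       undefined; the insert then overwrites that upper half (index 1) with b. */
    __m512d c = _mm512_insertf64x4(_mm512_castpd256_pd512(a), b, 1);

    _mm512_storeu_pd(out8, c);
}
```

With a = {0,1,2,3} and b = {4,5,6,7}, out8 should end up as {0,1,2,3,4,5,6,7}. Everything stays in the floating-point domain, so no integer casts are involved.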

andysem · 11-23-2016

areid wrote:

For the second (insert) solution you shouldn't have to cast to integer data types, which might have a small performance penalty.

Casting is a no-op; it has no penalty. Domain transition could potentially add some penalty, but I don't think current architectures add it. But you are right, I missed _mm512_insertf64x4, which would make the code cleaner. It's interesting that plain AVX-512F provides a 256-bit insert only for pd (packed double) data and not for ps (the ps counterpart, _mm512_insertf32x8, requires AVX512DQ); without that extension you would still have to do casts in that case.
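As an illustration of the casts needed for ps data, here is a sketch that joins two __m256 values by routing them through the pd insert. It assumes only AVX-512F is available. The function name concat_ps is hypothetical, and the target attribute is a GCC/Clang assumption.

```c
#include <immintrin.h>

/* GCC/Clang-specific assumption: enable AVX-512F for this function only. */
__attribute__((target("avx512f")))
void concat_ps(const float *a8, const float *b8, float *out16)
{
    __m256 vA = _mm256_loadu_ps(a8);
    __m256 vB = _mm256_loadu_ps(b8);

    /* Widen vA (upper half undefined) and reinterpret both inputs as packed
       doubles. These casts only change the type, not the bits. */
    __m512d lo = _mm512_castps_pd(_mm512_castps256_ps512(vA));
    __m256d hi = _mm256_castps_pd(vB);

    /* Insert vB's 256 bits into the upper half, then view the result as floats. */
    __m512 vC = _mm512_castpd_ps(_mm512_insertf64x4(lo, hi, 1));

    _mm512_storeu_ps(out16, vC);
}
```

Since the insert moves whole 256-bit halves, the element type it is declared with does not affect the result.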

areid2 · 11-23-2016

Yeah, it was the _mm512_inserti64x4 that I thought might have a small penalty mixing with floating point instructions. I wasn't sure if that instruction would be handled by a different execution unit in some hardware.