Zekun_Y_
Beginner
224 Views

How to convert two __m256d to one __m512d using intrinsics

Hi,

I need to convert two __m256d variables to one __m512d variable. 

For example, if __m256d vA holds {0,1,2,3} and __m256d vB holds {4,5,6,7}, I want to convert vA and vB to a __m512d vC that holds {0,1,2,3,4,5,6,7}.

Is there any efficient way to do this using AVX512 intrinsics?

Thank you!

5 Replies
McCalpinJohn
Black Belt
224 Views

The _mm512_shuffle_f64x2 intrinsic (or its masked variant, _mm512_mask_shuffle_f64x2) generates the VSHUFF64x2 instruction, which can do what you want.

The intrinsic expects __m512d inputs, but it should be possible to cast the __m256d inputs to __m512d (for example with _mm512_castpd256_pd512) in the argument list for the intrinsic function.
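
A minimal sketch of the shuffle approach, assuming GCC or Clang on x86-64. The helper names (concat_shuffle, concat_shuffle_avx512) and the scalar fallback are illustrative and not from this thread. It uses the unmasked _mm512_shuffle_f64x2 form, with an immediate of 0x44 to pick 128-bit lanes 0 and 1 from each source:

```c
#include <immintrin.h>

/* Hypothetical helper: AVX-512F path, compiled for that target only. */
__attribute__((target("avx512f")))
static void concat_shuffle_avx512(const double *lo, const double *hi, double *out)
{
    __m256d vA = _mm256_loadu_pd(lo);
    __m256d vB = _mm256_loadu_pd(hi);

    /* Widen without moving data; the upper 256 bits are undefined. */
    __m512d a = _mm512_castpd256_pd512(vA);
    __m512d b = _mm512_castpd256_pd512(vB);

    /* imm8 = 0x44 (binary 01 00 01 00): result lanes 0,1 are lanes 0,1 of a,
       result lanes 2,3 are lanes 0,1 of b. */
    __m512d vC = _mm512_shuffle_f64x2(a, b, 0x44);
    _mm512_storeu_pd(out, vC);
}

/* Hypothetical wrapper: returns 1 if the AVX-512 path ran, 0 if the
   scalar fallback was used because the CPU lacks AVX-512F. */
int concat_shuffle(const double lo[4], const double hi[4], double out[8])
{
    if (__builtin_cpu_supports("avx512f")) {
        concat_shuffle_avx512(lo, hi, out);
        return 1;
    }
    for (int i = 0; i < 4; ++i) {
        out[i] = lo[i];
        out[i + 4] = hi[i];
    }
    return 0;
}
```

Only the defined lower halves of the cast inputs are selected, so the undefined upper bits never reach the result.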

andysem
New Contributor III
224 Views

You can use _mm512_castpd256_pd512 and _mm512_castpd_si512 to convert one of the arguments to __m512i, then _mm512_inserti64x4 to insert the other argument (cast to __m256i with _mm256_castpd_si256) into the high half of the __m512i, and finally _mm512_castsi512_pd to cast back to __m512d.

I recommend using the Intrinsics Guide (https://software.intel.com/sites/landingpage/IntrinsicsGuide) to find the needed intrinsics.

 

areid2
New Contributor I
224 Views

For the second (insert) solution you shouldn't have to cast to integer data types, which might have a small performance penalty. You can just use:

__m256d a;
__m256d b;
__m512d c = _mm512_insertf64x4(_mm512_castpd256_pd512(a), b, 1);

Unless I misunderstand your notation, I don't think that you actually need to shuffle the vector elements. In this case the insert solution seems slightly better than the shuffle solution.
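
For reference, a self-contained sketch built around the insert one-liner above, assuming GCC or Clang on x86-64. The function names (concat_insert, concat_insert_avx512) and the scalar fallback are illustrative additions, not part of the original post:

```c
#include <immintrin.h>

/* Hypothetical helper: AVX-512F path, compiled for that target only. */
__attribute__((target("avx512f")))
static void concat_insert_avx512(const double *lo, const double *hi, double *out)
{
    __m256d a = _mm256_loadu_pd(lo);
    __m256d b = _mm256_loadu_pd(hi);

    /* The cast puts a in bits 0..255 (upper half undefined); the insert
       with index 1 then overwrites bits 256..511 with b. */
    __m512d c = _mm512_insertf64x4(_mm512_castpd256_pd512(a), b, 1);
    _mm512_storeu_pd(out, c);
}

/* Hypothetical wrapper: returns 1 if the AVX-512 path ran, 0 if the
   scalar fallback was used because the CPU lacks AVX-512F. */
int concat_insert(const double lo[4], const double hi[4], double out[8])
{
    if (__builtin_cpu_supports("avx512f")) {
        concat_insert_avx512(lo, hi, out);
        return 1;
    }
    for (int i = 0; i < 4; ++i) {
        out[i] = lo[i];
        out[i + 4] = hi[i];
    }
    return 0;
}
```

Calling concat_insert with {0,1,2,3} and {4,5,6,7} fills out with the eight values in order, matching the layout asked for in the question.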

andysem
New Contributor III
224 Views

areid wrote:

For the second (insert) solution you shouldn't have to cast to integer data types, which might have a small performance penalty.

Casting is a no-op; it has no penalty. A domain transition could potentially add some penalty, but I don't think current architectures add it. But you are right, I missed _mm512_insertf64x4, which would make the code cleaner. It's interesting that in AVX-512F this intrinsic is provided only for pd (packed double) data and not for ps (the ps counterpart, _mm512_insertf32x8, requires AVX-512DQ); without it you would still have to do casts in that case.
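
A sketch of what the casts look like for single-precision data when only AVX-512F is assumed: the two __m256 halves are reinterpreted as pd, joined with _mm512_insertf64x4, and reinterpreted back. It assumes GCC or Clang on x86-64; the function names (concat_ps, concat_ps_avx512) and the scalar fallback are illustrative. Where AVX-512DQ is available, _mm512_insertf32x8 should do the same job without the casts.

```c
#include <immintrin.h>

/* Hypothetical helper: AVX-512F path, compiled for that target only. */
__attribute__((target("avx512f")))
static void concat_ps_avx512(const float *lo, const float *hi, float *out)
{
    __m256 a = _mm256_loadu_ps(lo);
    __m256 b = _mm256_loadu_ps(hi);

    /* Reinterpret as packed double; these casts generate no instructions. */
    __m512d a_pd = _mm512_castps_pd(_mm512_castps256_ps512(a));
    __m256d b_pd = _mm256_castps_pd(b);

    /* Insert b into the upper 256 bits, then reinterpret back to float. */
    __m512 c = _mm512_castpd_ps(_mm512_insertf64x4(a_pd, b_pd, 1));
    _mm512_storeu_ps(out, c);
}

/* Hypothetical wrapper: returns 1 if the AVX-512 path ran, 0 if the
   scalar fallback was used because the CPU lacks AVX-512F. */
int concat_ps(const float lo[8], const float hi[8], float out[16])
{
    if (__builtin_cpu_supports("avx512f")) {
        concat_ps_avx512(lo, hi, out);
        return 1;
    }
    for (int i = 0; i < 8; ++i) {
        out[i] = lo[i];
        out[i + 8] = hi[i];
    }
    return 0;
}
```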

areid2
New Contributor I
224 Views

Yeah, it was the _mm512_inserti64x4 that I thought might have a small penalty when mixed with floating-point instructions. I wasn't sure if that instruction would be handled by a different execution unit in some hardware.
