Hi,
I need to convert two __m256d variables to one __m512d variable.
For example, if __m256d vA holds {0,1,2,3} and __m256d vB holds {4,5,6,7}, I want to convert vA and vB to a __m512d vC which holds {0,1,2,3,4,5,6,7}.
Is there any efficient way to do this using AVX512 intrinsics?
Thank you!
- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
The _mm512_shuffle_f64x2 intrinsic (like its masked form, _mm512_mask_shuffle_f64x2) generates the VSHUFF64X2 instruction, which can do what you want.
The intrinsic expects __m512d inputs, but it should be possible to cast the __m256d inputs to __m512d with _mm512_castpd256_pd512 in the argument list for the intrinsic function.
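As a minimal sketch of the shuffle approach, assuming GCC or Clang on x86-64: the helper name concat_shuffle and the pointer-based interface are made up for illustration, and the target attribute is only there so the function compiles without -mavx512f. The immediate 0x44 picks 128-bit lanes 0 and 1 of the first operand for the low half of the result and lanes 0 and 1 of the second operand for the high half, so the undefined upper halves produced by the casts are never read.

```c
#include <immintrin.h>

/* Hypothetical helper: concatenates lo[0..3] and hi[0..3] into out[0..7]
   using VSHUFF64X2. Requires a CPU with AVX-512F at run time. */
__attribute__((target("avx512f")))
static void concat_shuffle(const double *lo, const double *hi, double *out)
{
    /* Widening casts are free; the upper 256 bits are undefined. */
    __m512d a = _mm512_castpd256_pd512(_mm256_loadu_pd(lo));
    __m512d b = _mm512_castpd256_pd512(_mm256_loadu_pd(hi));

    /* imm8 = 0x44 = 01 00 01 00b:
       result lanes 0,1 <- a lanes 0,1; result lanes 2,3 <- b lanes 0,1. */
    __m512d c = _mm512_shuffle_f64x2(a, b, 0x44);

    _mm512_storeu_pd(out, c);
}
```

Called with lo = {0,1,2,3} and hi = {4,5,6,7}, this should leave {0,1,2,3,4,5,6,7} in out.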
You can use _mm512_castpd256_pd512 and _mm512_castpd_si512 to convert one of the arguments to __m512i, then _mm512_inserti64x4 to insert the other argument into the high half of the __m512i, and lastly _mm512_castsi512_pd to cast back to __m512d.
I recommend using the Intel Intrinsics Guide (https://software.intel.com/sites/landingpage/IntrinsicsGuide) to find the needed intrinsics.
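The cast-and-insert sequence described above might look roughly like the following. This is a sketch under a few assumptions: the helper name concat_inserti is hypothetical, the target attribute assumes GCC or Clang, and the second argument is converted with _mm256_castpd_si256 because _mm512_inserti64x4 takes a __m256i, a step the description above leaves implicit.

```c
#include <immintrin.h>

/* Hypothetical helper: concatenates lo[0..3] and hi[0..3] into out[0..7]
   by going through the integer vector types. Requires AVX-512F at run time. */
__attribute__((target("avx512f")))
static void concat_inserti(const double *lo, const double *hi, double *out)
{
    __m256d a = _mm256_loadu_pd(lo);
    __m256d b = _mm256_loadu_pd(hi);

    /* Widen a to 512 bits, then reinterpret as integers (both casts are no-ops). */
    __m512i wide = _mm512_castpd_si512(_mm512_castpd256_pd512(a));

    /* Insert b, reinterpreted as __m256i, into the high 256 bits (index 1). */
    __m512i both = _mm512_inserti64x4(wide, _mm256_castpd_si256(b), 1);

    /* Reinterpret the result as packed doubles again. */
    _mm512_storeu_pd(out, _mm512_castsi512_pd(both));
}
```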
For the second (insert) solution you shouldn't have to cast to integer data types, which might have a small performance penalty. You can just use:
__m256d a;
__m256d b;
__m512d c = _mm512_insertf64x4(_mm512_castpd256_pd512(a), b, 1);
Unless I misunderstand your notation, I don't think that you actually need to shuffle the vector elements. In this case the insert solution seems slightly better than the shuffle solution.
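For anyone who wants to try the insert solution end to end, here is a small sketch that wraps it in a function with loads and a store. The name concat_insert and the array-based interface are illustrative only, and the target attribute assumes GCC or Clang so that no extra compiler flag is needed.

```c
#include <immintrin.h>

/* Hypothetical helper: concatenates lo[0..3] and hi[0..3] into out[0..7]
   with a single VINSERTF64X4. Requires AVX-512F at run time. */
__attribute__((target("avx512f")))
static void concat_insert(const double *lo, const double *hi, double *out)
{
    __m256d a = _mm256_loadu_pd(lo);   /* becomes elements 0..3 */
    __m256d b = _mm256_loadu_pd(hi);   /* becomes elements 4..7 */

    /* The cast leaves the upper 256 bits undefined; the insert with
       index 1 then overwrites exactly those bits with b. */
    __m512d c = _mm512_insertf64x4(_mm512_castpd256_pd512(a), b, 1);

    _mm512_storeu_pd(out, c);
}
```

Since the function needs AVX-512F when it runs, a caller on unknown hardware would probably want to guard it, for example with __builtin_cpu_supports("avx512f") on GCC or Clang.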
areid wrote:
For the second (insert) solution you shouldn't have to cast to integer data types, which might have a small performance penalty.
Casting is a no-op; it has no penalty. A domain transition could potentially add some penalty, but I don't think current architectures add one. But you are right, I missed _mm512_insertf64x4, which would make the code cleaner. It's interesting that AVX-512F provides this 256-bit insert only for pd (packed double) data and not for ps (the ps form, _mm512_insertf32x8, requires AVX-512DQ); with AVX-512F alone you would still have to do casts in that case.
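To illustrate the ps case, here is a sketch that joins two __m256 vectors using only AVX-512F by reinterpreting them as packed doubles around _mm512_insertf64x4. The helper name concat_ps is hypothetical and the target attribute assumes GCC or Clang. On hardware with AVX-512DQ, _mm512_insertf32x8 should make the casts unnecessary.

```c
#include <immintrin.h>

/* Hypothetical helper: concatenates lo[0..7] and hi[0..7] into out[0..15].
   Only AVX-512F is required; all casts are bit reinterpretations. */
__attribute__((target("avx512f")))
static void concat_ps(const float *lo, const float *hi, float *out)
{
    __m256 a = _mm256_loadu_ps(lo);
    __m256 b = _mm256_loadu_ps(hi);

    /* Widen a to 512 bits and view it as packed doubles. */
    __m512d wide = _mm512_castps_pd(_mm512_castps256_ps512(a));

    /* Insert b (viewed as 4 doubles) into the high 256 bits. */
    __m512d both = _mm512_insertf64x4(wide, _mm256_castps_pd(b), 1);

    /* View the result as 16 packed floats again. */
    _mm512_storeu_ps(out, _mm512_castpd_ps(both));
}
```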
Yeah, it was _mm512_inserti64x4 that I thought might carry a small penalty when mixed with floating-point instructions. I wasn't sure if that instruction would be handled by a different execution unit on some hardware.