Hi all,
It seems to me that the swizzle instructions of the KNC intrinsics (https://software.intel.com/sites/landingpage/IntrinsicsGuide/) are not efficient enough for moving, shuffling, unpacking and permuting data, compared to AVX and SSE.
I am porting code from SSE to KNC (working with float data only) to run it on an Intel Xeon Phi coprocessor, but there seems to be no instruction for, e.g., permuting all 16 floats of a __m512 register. Because of this gap in KNC, it seems I can't port the code the way I want.
Any help with this, please?
Thanks.
Please pardon the delayed reply. We overlooked this post earlier. The information from our Intrinsic Developer regarding your question is:
KNC has the VPERMD instruction, which permutes the 16 elements of a 512-bit vector. The corresponding intrinsic is:
__m512i _mm512_permutevar_epi32 (__m512i idx, __m512i a)
It operates on int32 vectors (__m512i), but it is easy to treat them as float32 using the "cast" intrinsics _mm512_castsi512_ps and _mm512_castps_si512.
Please let us know if you have additional questions.
Not surprisingly, the VPERMD instruction on Xeon Phi pays a bit of a latency penalty in order to provide complete generality -- the references I have seen report a six-cycle latency for this instruction.
For systems that support multiple vector instructions per cycle, the best performance for vector "rearrangement" operations may be obtained using a combination of register-to-register instructions and reloading data from memory with a different offset. This is discussed in the context of the Sandy Bridge AVX implementation in Section 11.11 of the Intel Optimization Reference Manual (document 248966-030). Most of these optimizations require that the data motion be known at compile time, but that is often the case (e.g., transpositions).