
Swizzle data in KNC is not efficient enough

Hien_P_
Beginner

Hi all,

It seems to me that the swizzle instructions in the KNC intrinsics (https://software.intel.com/sites/landingpage/IntrinsicsGuide/) are not efficient enough for moving, shuffling, unpacking and permuting data, compared to AVX and SSE.

I am porting code from SSE to KNC (working with float data only) so that it runs on the Intel Xeon Phi coprocessor, but I cannot find an instruction for operations such as permuting all 16 floats of an __m512 register. Because of this, it seems to me that I can't port the code the way I want.

Any help with this, please?

Thanks.

Kevin_D_Intel
Employee

Please pardon the delayed reply. We overlooked this post earlier. The information from our Intrinsic Developer regarding your question is:

KNC has the VPERMD instruction, which permutes the 16 elements of a 512-bit vector. The corresponding intrinsic is:

     __m512i _mm512_permutevar_epi32 (__m512i idx, __m512i a)

It operates on int32 vectors (__m512i), but it is easy to apply it to float32 data using the "cast" intrinsics, such as _mm512_castsi512_ps and _mm512_castps_si512.
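As a rough illustration of the cast-and-permute pattern described above, here is a minimal sketch. The helper name permute16_ps and the staging buffers are assumptions made for this example and are not part of any Intel API. The intrinsic path is compiled only when __MIC__ is defined (Intel compiler targeting KNC); on any other target, a scalar loop that models the same semantics, dst[i] = src[idx[i] & 15], is used instead.

```c
#include <assert.h>
#include <string.h>
#ifdef __MIC__
#include <immintrin.h>
#endif

/* Hypothetical helper (not an Intel API): dst[i] = src[idx[i] & 15], i = 0..15.
   dst may alias src, because the result is built in a temporary first. */
static void permute16_ps(float dst[16], const float src[16], const int idx[16])
{
    assert(dst && src && idx);
#ifdef __MIC__
    /* KNC path. The staging buffers are an assumption of this sketch:
       _mm512_load_ps / _mm512_load_epi32 need 64-byte-aligned addresses. */
    __attribute__((aligned(64))) float fbuf[16];
    __attribute__((aligned(64))) int   ibuf[16];
    memcpy(fbuf, src, sizeof fbuf);
    memcpy(ibuf, idx, sizeof ibuf);

    __m512  v    = _mm512_load_ps(fbuf);
    __m512i vidx = _mm512_load_epi32(ibuf);
    /* Reinterpret the floats as int32 lanes, permute, reinterpret back.
       The casts only change the type; they do not convert values. */
    __m512i p = _mm512_permutevar_epi32(vidx, _mm512_castps_si512(v));
    _mm512_store_ps(fbuf, _mm512_castsi512_ps(p));
    memcpy(dst, fbuf, sizeof fbuf);
#else
    /* Scalar reference model of the same permutation for non-KNC targets. */
    float tmp[16];
    for (int i = 0; i < 16; ++i)
        tmp[i] = src[idx[i] & 15];
    memcpy(dst, tmp, sizeof tmp);
#endif
}
```

For example, filling idx with 15, 14, ..., 0 reverses the 16 floats. In real code the data would normally already live in 64-byte-aligned arrays, so the memcpy staging could be dropped.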

Please let us know if you have additional questions.

McCalpinJohn
Honored Contributor III

Not surprisingly, the VPERMD instruction on Xeon Phi pays a bit of a latency penalty in order to provide complete generality -- the references I have seen report a six-cycle latency for this instruction.

For systems that support multiple vector instructions per cycle, the best performance for vector "rearrangement" operations may be obtained using a combination of register-to-register instructions and reloading data from memory with a different offset. This is discussed in the context of the Sandy Bridge AVX implementation in Section 11.11 of the Intel Optimization Reference Manual (document 248966-030). Most of these optimizations require that the data motion be known at compile time, but that is often the case (e.g., transpositions).
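To make the "reload with a different offset" idea concrete, here is a small sketch using SSE, since that is what the original code was written for. It is only an illustration under assumptions: the function name adjacent_sum4 is made up for this example, and KNC handles unaligned loads differently, so this does not carry over to KNC unchanged. The shifted neighbour vector a[1..4] comes from a second, overlapping load rather than from shuffling the first register.

```c
#include <assert.h>
#ifdef __SSE__
#include <xmmintrin.h>
#endif

/* Hypothetical helper: out[j] = a[j] + a[j+1] for j = 0..3.
   Reads a[0..4], so the input must hold at least 5 floats. */
static void adjacent_sum4(float out[4], const float a[5])
{
    assert(out && a);
#ifdef __SSE__
    __m128 lo = _mm_loadu_ps(a);      /* a[0..3] */
    __m128 hi = _mm_loadu_ps(a + 1);  /* a[1..4]: reloaded at offset +1, no shuffle needed */
    _mm_storeu_ps(out, _mm_add_ps(lo, hi));
#else
    /* Scalar fallback with the same result on targets without SSE. */
    for (int j = 0; j < 4; ++j)
        out[j] = a[j] + a[j + 1];
#endif
}
```

Whether the extra load is cheaper than a register shuffle depends on the microarchitecture and on how busy the load and shuffle ports already are, so this is worth measuring rather than assuming.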
