Software Archive
Read-only legacy content

Phi seems to not fully support AVX-512? Any way to do a matrix transpose?

Hao_L_
Beginner

I found in past topics that _mm512_unpacklo_* is not supported on Phi. In my own implementation, _mm512_permute* and _mm512_shuffle* also seem to be unsupported. So far, every matrix-transpose implementation in past posts seems to use the _mm512_swizzle_* and _mm512_blend_* intrinsics. However, using these two operations requires about twice as many element movements, which seems inefficient. Are there any other options for doing a matrix transpose?

 

Sunny_G_Intel
Employee

Hello Hao,

Yes, you are correct: _mm512_unpacklo_* is part of the Intel(R) AVX-512 instruction set and is not supported on the current-generation Intel(R) Xeon Phi(TM) x100 coprocessors. It will be supported on future Intel architectures. You can find some related material on matrix transposition on Intel Xeon Phi coprocessors in the following post:

https://software.intel.com/en-us/forums/topic/391162

Thanks,

 

McCalpinJohn
Honored Contributor III

The low-level instructions should only have a significant impact on the performance of transposition operations for fairly small arrays.   Once the data set size exceeds the L2 cache, the performance will depend much more on how the implementation is blocked for cache and TLB re-use.

For L1- or L2-contained transposes where core performance matters, you should be able to combine the swizzle functionality available to Xeon Phi load instructions with register-to-register moves to permute groups of 4 items so that they land in the correct slot for a blend operation. The VBLEND instructions can also include swizzles, which might help reduce the instruction count. The VPERMD instruction can then be used for the data-lane shifts that cross groups of 4 items.

There is no way to avoid reloading the data a lot of times.  The trick is to find the balance between reloading the data and permuting it in registers.

Hao_L_
Beginner

Thank you Sunny, that post is also useful to me. But at the current stage, my main concern with the matrix transpose is at the instruction level, say a 16x16 matrix that can be loaded directly into the VPU. How to do this efficiently is my main problem. I think John caught my question. :)

 

 

Hao_L_
Beginner

Thank you John! That addresses exactly my question! But I still have some questions about the instruction set and transpose operations.

First, about instruction count. Using swizzle with blend can permute items, but it takes one more data movement compared with using unpacklo/unpackhi, right? In past posts people were concerned about instruction counts, but it is the number of data movements that influences performance, right? When you say VBLEND can include swizzles, do you mean something like _mm512_mask_blend_epi64(mask, item0, _mm512_swizzle_epi64(item1, SWIZ_ENUM))? In that case the number of data movements does not seem to go down; we just reduce the number of code lines.

Second question, about VPERMD: on Phi I only find one such intrinsic, _mm512_permutevar_epi32; epi64 and double items are not supported. So how can I permute items across groups of 4? I got stuck at this step when trying to transpose an 8x8 matrix.

Third question: currently I write intrinsics based on the API documentation here: https://software.intel.com/sites/landingpage/IntrinsicsGuide, but I found it is not precise. For example, _mm512_max_epi64() is not listed in the KNC category, but it works on Phi. Could you point me to a better resource?

Actually, my current solution for the 8x8 int64 matrix transpose directly employs a lookup table via the _mm512_i64extgather_epi64 instruction: I read the address of the right item directly from the array. This way we only need 8 gather instructions. It works, but I have not tested its performance against the swizzle-based solution, and I do not know whether this instruction is much slower than an _mm512_load. I will try to test it later.

Looking forward to your reply!

 

McCalpinJohn
Honored Contributor III

The Xeon Phi permute instruction is defined for 32-bit variables, but you can just move them in pairs for 64-bit data.

I have not looked at the tradeoffs of using VLOADUNPACKLO and VLOADUNPACKHI for parts of the transpose step, but since they can move data across the whole width of the vector register, it makes sense that they would be helpful.

The GATHER instruction on Xeon Phi is really a "gather step" instruction.  The instruction is only guaranteed to load one element per cycle, so it needs to be in a loop that repeats until all elements have been gathered.  The number of steps will only be less than the total number of elements if multiple indices point to data in the same cache line.  In the transpose case this will never happen, so it will almost certainly take 8 iterations to load 8 data items.  It does put them in the right place in the vector register, so it might end up being the fastest approach.

With the "mainstream" Xeon processors, much of the optimization is related to trying to use the multiple functional units in parallel.  You can't really do this on Xeon Phi, since it is limited to one vector instruction per cycle.   For data that is in the L2 cache or beyond, you can dual-issue vector instructions with vector prefetches.  (Despite the name, "vector prefetches" on Xeon Phi are not vector instructions -- they just move data into the cache, so they don't need access to the vector registers.)

I use the Xeon Phi ISA manual instead of the intrinsics manual to look at instruction options, but with the large number of swizzle and conversion options, I find the intrinsics to be much easier to use than assembly language.  It looks like there is a copy of the ISA manual at https://software.intel.com/sites/default/files/forum/278102/327364001en.pdf
