Transpose for 4x4 Matrix (double)

Shaquille_W_ · ‎10-21-2017

Hello, everyone,

I have a problem transposing a 4x4 matrix of double-precision data.

I previously used _MM_SHUFFLE_PS to transpose float data, but that only handles 4x4 single-precision data.

I couldn't find a similar macro for 4x4 double-precision data.

For now, I implement the transpose with AVX instructions as follows:

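    ; assuming ymm0..ymm3 hold the rows a, b, c, d on entry
    vunpcklpd   ymm4, ymm0, ymm1            ; a0 b0 a2 b2
    vunpckhpd   ymm5, ymm0, ymm1            ; a1 b1 a3 b3
    vunpcklpd   ymm6, ymm2, ymm3            ; c0 d0 c2 d2
    vunpckhpd   ymm7, ymm2, ymm3            ; c1 d1 c3 d3
    vperm2f128  ymm0, ymm4, ymm6, 020H      ; a0 b0 c0 d0
    vperm2f128  ymm1, ymm5, ymm7, 020H      ; a1 b1 c1 d1
    vperm2f128  ymm2, ymm4, ymm6, 031H      ; a2 b2 c2 d2
    vperm2f128  ymm3, ymm5, ymm7, 031H      ; a3 b3 c3 d3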

The instructions above produce the correct transpose, but I found that vperm2f128 performs very poorly, so my transpose is very slow!

So I want to know how to optimize the transpose for doubles.

Would anyone be willing to tell me how to do this?
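
For anyone who prefers C, the sequence in the question can be written with AVX intrinsics roughly as below. This is only a sketch under a few assumptions that are not in the original post: the matrix is stored row-major in memory, the helper name transpose4x4_pd and the AVX_FN macro are made up for illustration, and unaligned loads and stores are used so that no alignment is required.

```c
#include <immintrin.h>

/* Illustrative macro (an assumption, not from the thread): lets GCC/Clang
   build this function without -mavx. With other compilers, enable AVX
   through the compiler's own option instead. */
#if defined(__GNUC__) || defined(__clang__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

/* Hypothetical helper: transpose a row-major 4x4 matrix of doubles.
   src and dst each point to 16 doubles and must not overlap. */
AVX_FN static void transpose4x4_pd(const double *src, double *dst)
{
    __m256d r0 = _mm256_loadu_pd(src + 0);    /* a0 a1 a2 a3 */
    __m256d r1 = _mm256_loadu_pd(src + 4);    /* b0 b1 b2 b3 */
    __m256d r2 = _mm256_loadu_pd(src + 8);    /* c0 c1 c2 c3 */
    __m256d r3 = _mm256_loadu_pd(src + 12);   /* d0 d1 d2 d3 */

    /* In-lane interleave, same as vunpcklpd / vunpckhpd. */
    __m256d t0 = _mm256_unpacklo_pd(r0, r1);  /* a0 b0 a2 b2 */
    __m256d t1 = _mm256_unpackhi_pd(r0, r1);  /* a1 b1 a3 b3 */
    __m256d t2 = _mm256_unpacklo_pd(r2, r3);  /* c0 d0 c2 d2 */
    __m256d t3 = _mm256_unpackhi_pd(r2, r3);  /* c1 d1 c3 d3 */

    /* Cross-lane step, same as vperm2f128 with 0x20 / 0x31. */
    _mm256_storeu_pd(dst + 0,  _mm256_permute2f128_pd(t0, t2, 0x20));
    _mm256_storeu_pd(dst + 4,  _mm256_permute2f128_pd(t1, t3, 0x20));
    _mm256_storeu_pd(dst + 8,  _mm256_permute2f128_pd(t0, t2, 0x31));
    _mm256_storeu_pd(dst + 12, _mm256_permute2f128_pd(t1, t3, 0x31));
}
```

Calling transpose4x4_pd(m, t) on a double m[16] should leave t[c*4 + r] equal to m[r*4 + c] for every row r and column c.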

McCalpinJohn · ‎10-23-2017

This general topic is discussed in detail in the "Intel 64 and IA-32 Architectures Optimization Reference Manual" (Intel document 248966-037, July 2017). Section 12.11 talks about transposes in particular, and how to reduce the dependence on "Port 5". There is additional discussion in sections 12.16 and 13.8.
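
As a rough illustration of the kind of change that can take work off the shuffle port, one option is to do the 128-bit lane placement while loading, so that only the four in-lane unpacks remain as shuffles. The sketch below is my own and is not copied from the manual. It assumes the matrix is read from memory in row-major order, the names transpose4x4_pd_loads and AVX_FN are hypothetical, and I have not measured it, so whether it is actually faster will depend on the microarchitecture and on how the compiler folds the loads into vinsertf128.

```c
#include <immintrin.h>

/* Illustrative macro (an assumption): lets GCC/Clang build this function
   without -mavx. With other compilers, enable AVX via compiler options. */
#if defined(__GNUC__) || defined(__clang__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

/* Hypothetical helper: build one 256-bit vector from two 128-bit loads.
   lo goes into the lower lane, hi into the upper lane. */
AVX_FN static __m256d load_2x128(const double *lo, const double *hi)
{
    __m256d v = _mm256_castpd128_pd256(_mm_loadu_pd(lo));
    return _mm256_insertf128_pd(v, _mm_loadu_pd(hi), 1);
}

/* Hypothetical helper: transpose a row-major 4x4 matrix of doubles using
   half-row loads plus four unpacks, with no vperm2f128.
   src and dst each point to 16 doubles and must not overlap. */
AVX_FN static void transpose4x4_pd_loads(const double *src, double *dst)
{
    __m256d t0 = load_2x128(src + 0, src + 8);    /* a0 a1 | c0 c1 */
    __m256d t1 = load_2x128(src + 4, src + 12);   /* b0 b1 | d0 d1 */
    __m256d t2 = load_2x128(src + 2, src + 10);   /* a2 a3 | c2 c3 */
    __m256d t3 = load_2x128(src + 6, src + 14);   /* b2 b3 | d2 d3 */

    _mm256_storeu_pd(dst + 0,  _mm256_unpacklo_pd(t0, t1)); /* a0 b0 c0 d0 */
    _mm256_storeu_pd(dst + 4,  _mm256_unpackhi_pd(t0, t1)); /* a1 b1 c1 d1 */
    _mm256_storeu_pd(dst + 8,  _mm256_unpacklo_pd(t2, t3)); /* a2 b2 c2 d2 */
    _mm256_storeu_pd(dst + 12, _mm256_unpackhi_pd(t2, t3)); /* a3 b3 c3 d3 */
}
```

The trade-off is eight 128-bit loads instead of four 256-bit loads, so this only makes sense when the rows come from memory anyway. If the rows are already in registers, the original unpack plus vperm2f128 form avoids the extra memory traffic.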