Intel® ISA Extensions

Transpose for a 4x4 Matrix (double)


Hello, everyone.

I have a problem transposing a 4x4 matrix of double-precision data.

I previously used _MM_SHUFFLE_PS to transpose float data, but that only handles 4x4 single-precision data.

I could not find a similar macro for 4x4 double-precision data.

For now, I implement the transpose with AVX instructions as follows:

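    ; Input:  ymm0..ymm3 hold rows a, b, c, d. Output: ymm0..ymm3 hold the transposed rows.
    vunpcklpd   ymm4, ymm0, ymm1          ; ymm4 = a0 b0 a2 b2
    vunpckhpd   ymm5, ymm0, ymm1          ; ymm5 = a1 b1 a3 b3
    vunpcklpd   ymm6, ymm2, ymm3          ; ymm6 = c0 d0 c2 d2
    vunpckhpd   ymm7, ymm2, ymm3          ; ymm7 = c1 d1 c3 d3
    vperm2f128  ymm0, ymm4, ymm6, 020H    ; ymm0 = a0 b0 c0 d0 (low halves)
    vperm2f128  ymm1, ymm5, ymm7, 020H    ; ymm1 = a1 b1 c1 d1 (low halves)
    vperm2f128  ymm2, ymm4, ymm6, 031H    ; ymm2 = a2 b2 c2 d2 (high halves)
    vperm2f128  ymm3, ymm5, ymm7, 031H    ; ymm3 = a3 b3 c3 d3 (high halves)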
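For anyone who prefers C, here is a minimal sketch of the same eight-instruction sequence written with AVX intrinsics. The function name transpose4x4_pd and the row-major double[16] layout are assumptions made for illustration. The target attribute is GCC/Clang-specific; with other compilers, drop it and enable AVX through the compiler options instead.

```c
#include <immintrin.h>

/* Hypothetical helper: transpose a row-major 4x4 matrix of doubles.
   src and dst each point to 16 doubles; unaligned access is allowed.
   All loads happen before any store, so src == dst also works. */
__attribute__((target("avx")))
void transpose4x4_pd(const double *src, double *dst)
{
    __m256d a = _mm256_loadu_pd(src + 0);    /* a0 a1 a2 a3 */
    __m256d b = _mm256_loadu_pd(src + 4);    /* b0 b1 b2 b3 */
    __m256d c = _mm256_loadu_pd(src + 8);    /* c0 c1 c2 c3 */
    __m256d d = _mm256_loadu_pd(src + 12);   /* d0 d1 d2 d3 */

    /* Interleave pairs of rows within each 128-bit lane. */
    __m256d t0 = _mm256_unpacklo_pd(a, b);   /* a0 b0 a2 b2 */
    __m256d t1 = _mm256_unpackhi_pd(a, b);   /* a1 b1 a3 b3 */
    __m256d t2 = _mm256_unpacklo_pd(c, d);   /* c0 d0 c2 d2 */
    __m256d t3 = _mm256_unpackhi_pd(c, d);   /* c1 d1 c3 d3 */

    /* Combine the low (0x20) or high (0x31) 128-bit halves. */
    _mm256_storeu_pd(dst + 0,  _mm256_permute2f128_pd(t0, t2, 0x20)); /* a0 b0 c0 d0 */
    _mm256_storeu_pd(dst + 4,  _mm256_permute2f128_pd(t1, t3, 0x20)); /* a1 b1 c1 d1 */
    _mm256_storeu_pd(dst + 8,  _mm256_permute2f128_pd(t0, t2, 0x31)); /* a2 b2 c2 d2 */
    _mm256_storeu_pd(dst + 12, _mm256_permute2f128_pd(t1, t3, 0x31)); /* a3 b3 c3 d3 */
}
```

Calling transpose4x4_pd(m, t) on a double m[16] should leave t[4*r + c] equal to m[4*c + r] for every row r and column c.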

This gives a correct transpose, but I found that vperm2f128 performs very poorly, so my transpose is very slow.

How can I optimize the transpose for double-precision data?

Could anyone tell me how to do this?

1 Reply

This general topic is discussed in detail in the "Intel 64 and IA-32 Architectures Optimization Reference Manual" (Intel document 248966-037, July 2017). Section 12.11 covers transposes in particular, including how to reduce the dependence on "Port 5". There is additional discussion in Sections 12.16 and 13.8.
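As a rough illustration of reducing the shuffle count, one option when the matrix is read from memory is to build each 256-bit register from two 128-bit loads with vinsertf128 and then do only the in-lane unpacks. The sketch below is written for this thread and is not code taken from the manual. The function name transpose4x4_pd_insert and the row-major double[16] layout are assumptions, and the target attribute is GCC/Clang-specific. It needs four unpack shuffles and no vperm2f128. On many Intel cores the memory-source form of vinsertf128 does not occupy the shuffle port, but whether the compiler folds the load into the insert, and whether this version is actually faster, should be checked by measurement on the target CPU.

```c
#include <immintrin.h>

/* Hypothetical helper: transpose a row-major 4x4 matrix of doubles,
   using 128-bit loads plus inserts instead of cross-lane permutes.
   All loads happen before any store, so src == dst also works. */
__attribute__((target("avx")))
void transpose4x4_pd_insert(const double *src, double *dst)
{
    /* Low lane comes from rows a/b, high lane from rows c/d. */
    __m256d r0 = _mm256_insertf128_pd(
        _mm256_castpd128_pd256(_mm_loadu_pd(src + 0)),
        _mm_loadu_pd(src + 8), 1);            /* a0 a1 c0 c1 */
    __m256d r1 = _mm256_insertf128_pd(
        _mm256_castpd128_pd256(_mm_loadu_pd(src + 4)),
        _mm_loadu_pd(src + 12), 1);           /* b0 b1 d0 d1 */
    __m256d r2 = _mm256_insertf128_pd(
        _mm256_castpd128_pd256(_mm_loadu_pd(src + 2)),
        _mm_loadu_pd(src + 10), 1);           /* a2 a3 c2 c3 */
    __m256d r3 = _mm256_insertf128_pd(
        _mm256_castpd128_pd256(_mm_loadu_pd(src + 6)),
        _mm_loadu_pd(src + 14), 1);           /* b2 b3 d2 d3 */

    /* In-lane unpacks finish the transpose. */
    _mm256_storeu_pd(dst + 0,  _mm256_unpacklo_pd(r0, r1)); /* a0 b0 c0 d0 */
    _mm256_storeu_pd(dst + 4,  _mm256_unpackhi_pd(r0, r1)); /* a1 b1 c1 d1 */
    _mm256_storeu_pd(dst + 8,  _mm256_unpacklo_pd(r2, r3)); /* a2 b2 c2 d2 */
    _mm256_storeu_pd(dst + 12, _mm256_unpackhi_pd(r2, r3)); /* a3 b3 c3 d3 */
}
```

The trade-off is eight narrower loads instead of four full-width ones, so this only makes sense when the load ports have spare capacity and the data is not already sitting in registers.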