Hello, everyone.
I have a problem transposing a 4x4 matrix of double-precision data.
I used _MM_SHUFFLE_PS to transpose float data before, but that works on 4x4 single-precision data.
I didn't find a similar macro for 4x4 double-precision data.
Now I use AVX instructions to implement the transpose as follows:
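; assuming ymm0..ymm3 hold rows a, b, c, d (elements listed low to high)
vunpcklpd ymm4, ymm0, ymm1          ; ymm4 = a0 b0 a2 b2
vunpckhpd ymm5, ymm0, ymm1          ; ymm5 = a1 b1 a3 b3
vunpcklpd ymm6, ymm2, ymm3          ; ymm6 = c0 d0 c2 d2
vunpckhpd ymm7, ymm2, ymm3          ; ymm7 = c1 d1 c3 d3
vperm2f128 ymm0, ymm4, ymm6, 020H   ; ymm0 = a0 b0 c0 d0 (low halves)
vperm2f128 ymm1, ymm5, ymm7, 020H   ; ymm1 = a1 b1 c1 d1 (low halves)
vperm2f128 ymm2, ymm4, ymm6, 031H   ; ymm2 = a2 b2 c2 d2 (high halves)
vperm2f128 ymm3, ymm5, ymm7, 031H   ; ymm3 = a3 b3 c3 d3 (high halves)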
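For anyone who prefers intrinsics, here is a rough C equivalent of the same unpack-then-permute sequence. It is only a sketch: the function name transpose4x4_pd, the AVX_TARGET macro, and the row-major memory layout are my assumptions and are not part of the original code, which works on registers that are already loaded.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang. The attribute lets the function use AVX even if
   the file is not compiled with -mavx. Other compilers need an AVX flag. */
#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* Hypothetical helper: transpose a row-major 4x4 matrix of doubles.
   src and dst each point to 16 doubles. All loads happen before the
   first store, so dst may be the same buffer as src. */
AVX_TARGET
void transpose4x4_pd(const double *src, double *dst)
{
    __m256d r0 = _mm256_loadu_pd(src + 0);    /* a0 a1 a2 a3 */
    __m256d r1 = _mm256_loadu_pd(src + 4);    /* b0 b1 b2 b3 */
    __m256d r2 = _mm256_loadu_pd(src + 8);    /* c0 c1 c2 c3 */
    __m256d r3 = _mm256_loadu_pd(src + 12);   /* d0 d1 d2 d3 */

    /* Interleave pairs of rows inside each 128-bit lane (vunpcklpd/vunpckhpd). */
    __m256d t0 = _mm256_unpacklo_pd(r0, r1);  /* a0 b0 a2 b2 */
    __m256d t1 = _mm256_unpackhi_pd(r0, r1);  /* a1 b1 a3 b3 */
    __m256d t2 = _mm256_unpacklo_pd(r2, r3);  /* c0 d0 c2 d2 */
    __m256d t3 = _mm256_unpackhi_pd(r2, r3);  /* c1 d1 c3 d3 */

    /* Combine 128-bit halves across lanes (vperm2f128):
       0x20 picks the low halves, 0x31 picks the high halves. */
    _mm256_storeu_pd(dst + 0,  _mm256_permute2f128_pd(t0, t2, 0x20)); /* a0 b0 c0 d0 */
    _mm256_storeu_pd(dst + 4,  _mm256_permute2f128_pd(t1, t3, 0x20)); /* a1 b1 c1 d1 */
    _mm256_storeu_pd(dst + 8,  _mm256_permute2f128_pd(t0, t2, 0x31)); /* a2 b2 c2 d2 */
    _mm256_storeu_pd(dst + 12, _mm256_permute2f128_pd(t1, t3, 0x31)); /* a3 b3 c3 d3 */
}
```

Calling transpose4x4_pd(m, m) on a 16-element array should transpose it in place. The unaligned load/store intrinsics are used so the sketch does not assume 32-byte alignment.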
The instructions above do produce the transpose, but I found that vperm2f128 is very inefficient, so my transpose is very slow!
So I want to know how to optimize the transpose for double.
Would anyone be willing to tell me how to do it?
This general topic is discussed in detail in the "Intel 64 and IA-32 Architectures Optimization Reference Manual" (Intel document 248966-037, July 2017). Section 12.11 talks about transposes in particular, and how to reduce the dependence on "Port 5". There is additional discussion in sections 12.16 and 13.8.
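As a rough illustration of the kind of restructuring that reduces Port 5 pressure, here is one common variant for the case where the matrix starts in memory. I have not checked it against the manual's listings, so treat it as a sketch under my own assumptions: the function name transpose4x4_pd_insert, the AVX_TARGET macro, and the row-major layout are made up for this example. The idea is to build each register from two 128-bit loads with vinsertf128, so the cross-lane vperm2f128 step disappears and only four in-lane unpacks remain. As I understand it, the memory-source form of vinsertf128 does not need Port 5 on many Intel cores, but whether the compiler folds the load into the insert, and whether this is actually faster, has to be measured on your target.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang. The attribute lets the function use AVX even if
   the file is not compiled with -mavx. Other compilers need an AVX flag. */
#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* Hypothetical helper: load two 128-bit halves into one 256-bit register.
   lo goes to the low lane, hi to the high lane (vmovupd + vinsertf128). */
AVX_TARGET
static __m256d load_2x128(const double *lo, const double *hi)
{
    __m256d v = _mm256_castpd128_pd256(_mm_loadu_pd(lo));
    return _mm256_insertf128_pd(v, _mm_loadu_pd(hi), 1);
}

/* Hypothetical helper: transpose a row-major 4x4 matrix of doubles
   without vperm2f128. src and dst each point to 16 doubles. All loads
   happen before the first store, so dst may equal src. */
AVX_TARGET
void transpose4x4_pd_insert(const double *src, double *dst)
{
    /* Rows are a = src[0..3], b = src[4..7], c = src[8..11], d = src[12..15]. */
    __m256d t0 = load_2x128(src + 0, src + 8);    /* a0 a1 | c0 c1 */
    __m256d t1 = load_2x128(src + 4, src + 12);   /* b0 b1 | d0 d1 */
    __m256d t2 = load_2x128(src + 2, src + 10);   /* a2 a3 | c2 c3 */
    __m256d t3 = load_2x128(src + 6, src + 14);   /* b2 b3 | d2 d3 */

    /* The in-lane unpacks now produce finished output rows directly. */
    _mm256_storeu_pd(dst + 0,  _mm256_unpacklo_pd(t0, t1));  /* a0 b0 c0 d0 */
    _mm256_storeu_pd(dst + 4,  _mm256_unpackhi_pd(t0, t1));  /* a1 b1 c1 d1 */
    _mm256_storeu_pd(dst + 8,  _mm256_unpacklo_pd(t2, t3));  /* a2 b2 c2 d2 */
    _mm256_storeu_pd(dst + 12, _mm256_unpackhi_pd(t2, t3));  /* a3 b3 c3 d3 */
}
```

This only helps when the rows come from memory. If the four rows are already in ymm registers, as in the question, the loads are not available to absorb the cross-lane movement and the original sequence may still be the reasonable choice.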