For algorithms that require a significant number of SIMD shuffles, a performance penalty can occur on both the Sandy Bridge and Haswell microarchitectures, as far as I am aware, because the shuffles put significant pressure on port 5, where they are executed. I pored through the Intel 64 and IA-32 Architectures Optimization Reference Manual and stumbled upon a few techniques that can alleviate this issue. They are found in section 11.11.2, "Design Algorithm With Fewer Shuffles", and section 11.11.3, "Perform Basic Shuffles on Load Ports".
My question is, how do you write your intrinsics so as to:
a) Implement the 8x8 matrix transpose from section 11.11.2 so that it uses vinsertf128 on the two load ports of the SNB uarch, bearing in mind that the intrinsic for vinsertf128 does not take a memory operand as an argument. Would this do the trick on icpc 16.0?
__m256d v0 = _mm256_castpd128_pd256(_mm_load_pd((double *)&data[pos].var));
v0 = _mm256_insertf128_pd(v0, _mm_load_pd((double *)&data[pos].var), 1);
Here, for instance, I am loading a 128-bit pair of double-precision values and then using a second load to insert another 128-bit pair into the upper lane, essentially removing the need for a vperm2f128. Is this the intended way of doing it?
b) Section 11.11.3 mentions the ability to perform basic shuffles on the load ports themselves by using the vmovsldup/vmovshdup instructions. Since our codes are double precision, I presume this only leaves me with vmovddup (_mm256_movedup_pd). The documentation mentions that this works if the source is in memory. Would this do the trick?
Or does it only work in single precision? I have looked at the generated assembly, and I noticed that the dups are issued as separate instructions after the data has been loaded, so I do not think they were actually executed on the load port during the load.
I would be very grateful if someone could give me some hints on the above, as I am dealing with severe port 5 pressure on both Sandy Bridge and Haswell and am trying to find a way of alleviating it. One avenue would be to perform some of these shuffles on the two load ports and then use blends where possible, given their higher throughput.
One last comment: I have tried using 128-bit AVX on Sandy Bridge, because of the 128-bit bandwidth from L1D per load port, but I saw performance drop significantly compared to the 256-bit loads.
Thank you in advance for your time.
I have had mixed luck with using intrinsics to generate specific instructions. Intrinsics are pretty clearly not intended to completely define an instruction's inputs and outputs -- e.g., you use the _mm512_load_pd intrinsic to load a memory location into an __m512d variable and then use that variable as an input to the _mm512_fmadd_pd intrinsic, rather than being able to specify a memory address for the FMA instruction directly.
The compiler usually does what I expect when there are only a few intrinsics in the loop, but when the number gets large I find that it will often completely rewrite my code (including splitting FMAs into separate multiply and add instructions).
There are three more levels of control:
Thank you. That makes perfect sense in terms of not being able to use intrinsics as a 1-to-1 replacement for pure assembly instructions. I thought it might have been possible. I guess my best bet is to write it as an inline assembly function and see whether I can get any benefit from it.