I'm optimising some inner loop code where a couple of 1D arrays (all containing doubles) are being multiplied together, the result multiplied by a double constant and accumulated in another array:
    traceArray += ((double) 2.0)*zWtArrayj*zWt;
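For context, the surrounding loop presumably looks something like the sketch below. The function name, loop bounds, and the choice of which operand is indexed are my assumptions, since only the accumulation statement is shown; the snippet treats zWtArrayj as the loop-invariant scalar (as described later in the question) and zWt as the per-iteration array element.

```cpp
#include <cstddef>

// Hypothetical reconstruction of the inner loop: zWtArrayj is a
// loop-invariant scalar, zWt is read per iteration, and traceArray
// accumulates the scaled products.
void accumulate(double* traceArray, const double* zWt,
                double zWtArrayj, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        traceArray[i] += 2.0 * zWtArrayj * zWt[i];
    }
}
```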
I'm compiling with the Intel 16 compiler targeting an Ivy Bridge CPU. The compiler optimisation report produced from these compiler switches:
says that reference zWt has aligned access and F64 has unaligned access in this loop. Neither traceArray nor pevArray is mentioned as aligned or unaligned.
Examining the loop assembler with the Intel Vectorisation Advisor shows the loop is unrolled 4x, and vmulpd and vaddpd instructions are generated for ymm registers. However, there are also vinsertf128x instructions (operating on 256-bit ymm registers) and vmovupdx instructions operating on 128-bit xmm registers rather than the full 256-bit ymm registers available with AVX.
I'm working my way through the code to convert the buffers from Blitz arrays to C arrays and align them on 32-byte boundaries. I'm also about to test on Haswell CPUs, which may see FMA instructions generated.
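For the 32-byte alignment step, one option is posix_memalign (POSIX; _mm_malloc/_mm_free is the Intel-specific equivalent, and C++17's aligned operator new also works). A minimal sketch, with a helper name of my own choosing:

```cpp
#include <cstdlib>
#include <cstddef>

// Allocate n doubles on a 32-byte boundary so AVX loads/stores can be
// issued as aligned 256-bit accesses. The returned pointer must be
// released with free().
double* alloc_aligned32(std::size_t n) {
    void* p = nullptr;
    // Alignment must be a power of two and a multiple of sizeof(void*);
    // 32 satisfies both and matches the ymm register width.
    if (posix_memalign(&p, 32, n * sizeof(double)) != 0) {
        return nullptr;
    }
    return static_cast<double*>(p);
}
```

The allocation alone isn't enough: the compiler still has to be told (or prove) that the pointers it sees in the loop are aligned, which is what the alignment assertions discussed below are for.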
I'm new to this level of analysis of AVX instructions so I'd be grateful if someone could help with some queries:
a/ Should all relevant arrays be described as unaligned or aligned by the optimisation reporting?
b/ What would the F64 unaligned access be referring to? I wondered if it refers to intermediate buffering of the multiplication of the double constant by the loop invariant zWtArrayj.
c/ Should there be vinsertf128x and vmovupdx instructions?
d/ As the constant and zWtArrayj are both loop invariant, it would be possible to store their product in a 256-bit register for all multiplications during this loop. Should I expect this? How would the compiler be expected to handle constants and loop invariants with respect to vectorisation?
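On the last question: the compiler will usually hoist the invariant product itself and broadcast it into a ymm register once before the vectorised body, but you can make that explicit in the source. A sketch with assumed names:

```cpp
#include <cstddef>

// Hoist the loop-invariant product out of the loop explicitly. The
// compiler then only needs one scalar multiply and one broadcast
// (e.g. vbroadcastsd) before the vectorised loop body.
void accumulate_hoisted(double* traceArray, const double* zWt,
                        double zWtArrayj, std::size_t n) {
    const double scale = 2.0 * zWtArrayj;  // computed once per loop
    for (std::size_t i = 0; i < n; ++i) {
        traceArray[i] += scale * zWt[i];
    }
}
```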
Unaligned memory access is normally split for the Ivy Bridge target. If you can assert alignment with __assume_aligned or #pragma vector aligned, the compiler may not split them. I've noticed the split moves may be faster even on Haswell. One might think the multiply by 2 here could be done more efficiently by an add.
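Those alignment assertions look like this with the Intel compiler (a sketch; __assume_aligned and #pragma vector aligned are Intel-compiler-specific, so they're guarded here, and the function name is mine):

```cpp
#include <cstddef>

// Promise the Intel compiler that both pointers are 32-byte aligned,
// so it can emit aligned 256-bit loads/stores instead of split moves.
// With a non-Intel compiler the guards reduce this to a plain loop.
void accumulate_aligned(double* traceArray, const double* zWt,
                        double zWtArrayj, std::size_t n) {
#ifdef __INTEL_COMPILER
    __assume_aligned(traceArray, 32);
    __assume_aligned(zWt, 32);
#pragma vector aligned  // every memory reference in the loop is aligned
#endif
    for (std::size_t i = 0; i < n; ++i) {
        traceArray[i] += 2.0 * zWtArrayj * zWt[i];
    }
}
```

Note that #pragma vector aligned is a promise, not a request: if any array in the loop is actually misaligned at runtime, the aligned moves will fault or silently misbehave, so only add it after the 32-byte allocation change is in place.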