ICPC 13.0.2 generates scalar load instead of packed load

Paul_S_ · 01-15-2014

Hi all,

I'm a little puzzled by the assembly code generated for this small piece of Cilk Plus (array notation) code:

void gemv(const float *restrict A[4], const float *restrict x, float *restrict y) {
    __assume_aligned(y, 32);
    __assume_aligned(x, 32);
    __assume_aligned(A, 32);
    y[0:4]  = A[0:4][0] * x[0];
    y[0:4] += A[0:4][1] * x[1];
    y[0:4] += A[0:4][2] * x[2];
    y[0:4] += A[0:4][3] * x[3];
}
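
For readers less familiar with array notation, here is a plain C sketch of what the kernel above computes, assuming the usual reading of y[0:4] = A[0:4][j] * x[j] as "for each i in 0..3, y[i] = A[i][j] * x[j]". The name gemv_ref is hypothetical and the alignment hints are left out; it is only meant as a scalar reference, not as the code the compiler sees.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference: y = A * x for a 4x4 matrix given as four row pointers.
   gemv_ref is an illustrative name, not part of the original code. */
void gemv_ref(const float *const A[4], const float *x, float *y)
{
    assert(A != NULL && x != NULL && y != NULL);

    /* First statement of the kernel: y[0:4] = A[0:4][0] * x[0] */
    for (size_t i = 0; i < 4; ++i)
        y[i] = A[i][0] * x[0];

    /* Remaining three statements: y[0:4] += A[0:4][j] * x[j] for j = 1..3 */
    for (size_t j = 1; j < 4; ++j)
        for (size_t i = 0; i < 4; ++i)
            y[i] += A[i][j] * x[j];
}
```

Written this way, each y[i] is the dot product of row i of A with x, which matches the row-wise dot-product form mentioned below.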

Looking at the generated assembly code:

- The compiler changes the algorithm so that it uses the vdpps instruction (most likely because of the strided access pattern of A).
- The loads for A are okay (only four packed loads). However, the loads and stores for x and y are inefficient: the compiler issues four scalar loads/stores each for x and y. More precisely, here is the generated sequence of scalar loads for x:

vmovss xmm0, DWORD PTR [rsi]
vmovss xmm1, DWORD PTR [4+rsi]
vmovss xmm2, DWORD PTR [8+rsi]
vmovss xmm3, DWORD PTR [12+rsi]
vunpcklps xmm4, xmm0, xmm1
vunpcklps xmm5, xmm2, xmm3
vmovlhps xmm12, xmm4, xmm5

(This code was generated with the Intel icpc 13.0.2 compiler with the following flags: "-xAVX -O3 -S -masm=intel -restrict")
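
To make the comparison concrete, here is a small C sketch using SSE intrinsics from xmmintrin.h that mirrors the seven-instruction sequence above and checks it against a single unaligned packed load. The function names are hypothetical, and the instruction named in each comment is the one a compiler would typically pick for that intrinsic, not a guarantee.

```c
#include <assert.h>
#include <string.h>
#include <xmmintrin.h>

/* Mirrors the listing: four scalar loads merged by two unpacks and a movlhps. */
static __m128 load4_scalar(const float *x)
{
    __m128 x0 = _mm_load_ss(x + 0);        /* movss:    {x[0], 0, 0, 0} */
    __m128 x1 = _mm_load_ss(x + 1);        /* movss:    {x[1], 0, 0, 0} */
    __m128 x2 = _mm_load_ss(x + 2);        /* movss:    {x[2], 0, 0, 0} */
    __m128 x3 = _mm_load_ss(x + 3);        /* movss:    {x[3], 0, 0, 0} */
    __m128 lo = _mm_unpacklo_ps(x0, x1);   /* unpcklps: {x[0], x[1], 0, 0} */
    __m128 hi = _mm_unpacklo_ps(x2, x3);   /* unpcklps: {x[2], x[3], 0, 0} */
    return _mm_movelh_ps(lo, hi);          /* movlhps:  {x[0], x[1], x[2], x[3]} */
}

/* The single packed load asked about in the question (movups). */
static __m128 load4_packed(const float *x)
{
    return _mm_loadu_ps(x);
}

/* Returns 1 when both variants produce bit-identical registers. */
static int loads_agree(const float *x)
{
    __m128 a = load4_scalar(x);
    __m128 b = load4_packed(x);
    return memcmp(&a, &b, sizeof a) == 0;
}

/* Aborts if the two variants ever disagree for the given input. */
static void check_loads(const float *x)
{
    assert(loads_agree(x));
}
```

Both functions leave the same four floats in the register, so the scalar sequence adds instructions without changing the result.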

My question is the following:

Why does the compiler generate this sequence of instructions instead of issuing a single vmovups?

Thank you.
Best,
Paul