ICPC 13.0.2 generates scalar load instead of packed load

Hi all,

I'm a little puzzled about the generated assembly code for this little piece of Cilk code:

void gemv(const float* restrict A[4], const float *restrict x, float * restrict y){
    __assume_aligned(y, 32);
    __assume_aligned(x, 32);
    __assume_aligned(A, 32);
    y[0:4]  = A[0:4][0] * x[0];
    y[0:4] += A[0:4][1] * x[1];
    y[0:4] += A[0:4][2] * x[2];
    y[0:4] += A[0:4][3] * x[3];
}
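
In case the array notation is unfamiliar: each statement above updates all four elements of y at once, so the function computes y = A * x for a 4x4 matrix stored as four row pointers. A plain C equivalent, as a minimal sketch (the name gemv_ref is made up here, and the restrict qualifiers and alignment hints are left out):

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference for the array-notation kernel: y = A * x (4x4).
   A is an array of four row pointers, so A[i][j] is row i, column j. */
void gemv_ref(const float *A[4], const float *x, float *y)
{
    assert(A != NULL && x != NULL && y != NULL);
    for (int i = 0; i < 4; ++i) {
        float acc = A[i][0] * x[0];    /* y[0:4]  = A[0:4][0] * x[0] */
        for (int j = 1; j < 4; ++j)
            acc += A[i][j] * x[j];     /* y[0:4] += A[0:4][j] * x[j] */
        y[i] = acc;
    }
}
```

Written this way, it is easy to see that x[0..3] and y[0..3] are each four contiguous floats.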

Looking at the generated assembly code:

- The compiler changes the algorithm such that it uses the vdpps instruction (most likely due to the bad access pattern of A).
- Loads for A are okay (only four packed loads). However, the loads and stores for x and y are quite bad. The compiler issues four scalar loads/stores for both x and y. More precisely, here is a sequence of the generated scalar loads for x:

vmovss    xmm0, DWORD PTR [rsi]                         
vmovss    xmm1, DWORD PTR [4+rsi]                       
vmovss    xmm2, DWORD PTR [8+rsi]                       
vmovss    xmm3, DWORD PTR [12+rsi]
vunpcklps xmm4, xmm0, xmm1                              
vunpcklps xmm5, xmm2, xmm3                              
vmovlhps  xmm12, xmm4, xmm5

(This code was generated with the Intel icpc 13.0.2 compiler with the following flags: "-xAVX -O3 -S -masm=intel -restrict")

My question is the following: 

Why does the compiler generate this sequence of instructions instead of issuing a single vmovups?
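
For comparison, this is roughly what I would expect, written by hand with SSE intrinsics. It is only a sketch and not compiler output: the name gemv_packed is made up, it uses unaligned loads so it does not depend on the alignment hints, it does not use vdpps, and there is a scalar fallback in case the intrinsics header is not available.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__) || defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>
#define GEMV_HAVE_SSE 1
#endif

/* y = A * x (4x4) with one packed load for x and one packed store for y. */
void gemv_packed(const float *A[4], const float *x, float *y)
{
    assert(A != NULL && x != NULL && y != NULL);
#ifdef GEMV_HAVE_SSE
    __m128 vx = _mm_loadu_ps(x);                 /* single packed load of x[0..3] */

    /* Element-wise products: p_i holds A[i][j] * x[j] for j = 0..3. */
    __m128 p0 = _mm_mul_ps(_mm_loadu_ps(A[0]), vx);
    __m128 p1 = _mm_mul_ps(_mm_loadu_ps(A[1]), vx);
    __m128 p2 = _mm_mul_ps(_mm_loadu_ps(A[2]), vx);
    __m128 p3 = _mm_mul_ps(_mm_loadu_ps(A[3]), vx);

    /* After the transpose, p_j holds A[i][j] * x[j] for i = 0..3,
       so adding the four vectors yields all four dot products. */
    _MM_TRANSPOSE4_PS(p0, p1, p2, p3);
    __m128 vy = _mm_add_ps(_mm_add_ps(p0, p1), _mm_add_ps(p2, p3));

    _mm_storeu_ps(y, vy);                        /* single packed store of y[0..3] */
#else
    /* Portable fallback when SSE intrinsics are not available. */
    for (int i = 0; i < 4; ++i)
        y[i] = A[i][0] * x[0] + A[i][1] * x[1]
             + A[i][2] * x[2] + A[i][3] * x[3];
#endif
}
```

The point is just that x needs one load and y one store; the arithmetic in between could equally be done with vdpps as the compiler does.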

Thank you.
