Intel® ISA Extensions

ICPC 13.0.2 generates scalar load instead of packed load


Hi all,

I'm a little puzzled by the assembly code generated for this small piece of Cilk Plus (array notation) code:

void gemv(const float *restrict A[4], const float *restrict x, float *restrict y){
    __assume_aligned(y, 32);
    __assume_aligned(x, 32);
    __assume_aligned(A, 32);
    y[0:4]  = A[0:4][0] * x[0];
    y[0:4] += A[0:4][1] * x[1];
    y[0:4] += A[0:4][2] * x[2];
    y[0:4] += A[0:4][3] * x[3];
}
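
For reference, the four array-notation statements above should be equivalent to the plain C loops below. This is only a sketch to make the access pattern explicit: the name gemv_ref is hypothetical, and the restrict qualifiers and alignment hints are left out. Note that each A[0:4][j] section reads one element from each of the four rows, so it is a strided (column) access rather than a contiguous one.

```c
#include <stddef.h>

/* Hypothetical scalar reference for the array-notation kernel:
 * y[i] = sum over j of A[i][j] * x[j], for a 4x4 matrix stored as
 * four row pointers. Accumulation order matches the original
 * statements (column 0 first, then columns 1..3). */
void gemv_ref(const float *const *A, const float *x, float *y) {
    for (size_t i = 0; i < 4; ++i) {
        float acc = A[i][0] * x[0];      /* y[0:4]  = A[0:4][0] * x[0] */
        for (size_t j = 1; j < 4; ++j)
            acc += A[i][j] * x[j];       /* y[0:4] += A[0:4][j] * x[j] */
        y[i] = acc;
    }
}
```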

Looking at the generated assembly code:

- The compiler restructures the computation so that it uses the vdpps instruction (most likely because of the strided access pattern on A).
- The loads for A are okay (only four packed loads). However, the loads and stores for x and y are quite bad: the compiler issues four scalar loads/stores for each of x and y. More precisely, here is the sequence of scalar loads generated for x, followed by the shuffles that pack them into one register:

vmovss    xmm0, DWORD PTR [rsi]                         
vmovss    xmm1, DWORD PTR [4+rsi]                       
vmovss    xmm2, DWORD PTR [8+rsi]                       
vmovss    xmm3, DWORD PTR [12+rsi]
vunpcklps xmm4, xmm0, xmm1                              
vunpcklps xmm5, xmm2, xmm3                              
vmovlhps  xmm12, xmm4, xmm5

(This code was generated with the Intel icpc 13.0.2 compiler with the following flags: "-xAVX -O3 -S -masm=intel -restrict")

My question is the following: 

Why does the compiler generate this sequence of instructions instead of issuing a single vmovups?
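
For comparison, the packed access I was expecting can be written explicitly with SSE intrinsics. This is a minimal sketch and not what the compiler emitted: the names gemv4_sse and the use of unaligned loads are my assumptions, and it uses a 4x4 transpose plus broadcast rather than vdpps. _mm_loadu_ps typically compiles to a single movups (vmovups when targeting AVX), so x needs one packed load and y one packed store.

```c
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_shuffle_ps, _MM_TRANSPOSE4_PS */

/* Hypothetical hand-written 4x4 gemv: y = A * x, with A given as four
 * row pointers. Mirrors the column-times-scalar form of the original
 * array-notation code. */
void gemv4_sse(const float *const *A, const float *x, float *y) {
    /* One packed load for all of x[0..3]. */
    __m128 vx = _mm_loadu_ps(x);

    /* Four packed row loads, then transpose so c0..c3 hold the columns. */
    __m128 c0 = _mm_loadu_ps(A[0]);
    __m128 c1 = _mm_loadu_ps(A[1]);
    __m128 c2 = _mm_loadu_ps(A[2]);
    __m128 c3 = _mm_loadu_ps(A[3]);
    _MM_TRANSPOSE4_PS(c0, c1, c2, c3);

    /* Broadcast each x[j] from the packed register instead of reloading it. */
    __m128 x0 = _mm_shuffle_ps(vx, vx, _MM_SHUFFLE(0, 0, 0, 0));
    __m128 x1 = _mm_shuffle_ps(vx, vx, _MM_SHUFFLE(1, 1, 1, 1));
    __m128 x2 = _mm_shuffle_ps(vx, vx, _MM_SHUFFLE(2, 2, 2, 2));
    __m128 x3 = _mm_shuffle_ps(vx, vx, _MM_SHUFFLE(3, 3, 3, 3));

    /* y[0:4] = col0*x0 + col1*x1 + col2*x2 + col3*x3 */
    __m128 acc = _mm_mul_ps(c0, x0);
    acc = _mm_add_ps(acc, _mm_mul_ps(c1, x1));
    acc = _mm_add_ps(acc, _mm_mul_ps(c2, x2));
    acc = _mm_add_ps(acc, _mm_mul_ps(c3, x3));

    /* One packed store for all of y[0..3]. */
    _mm_storeu_ps(y, acc);
}
```

If x and y really are 32-byte aligned, as the __assume_aligned hints claim, _mm_load_ps and _mm_store_ps could be used instead of the unaligned variants.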

Thank you.
