Intel® ISA Extensions

ICPC 13.0.2 generates scalar load instead of packed load

Paul_S_

Hi all,

I'm a little puzzled by the assembly generated for this small piece of Cilk Plus array-notation code:

void gemv(const float* restrict A[4], const float *restrict x, float * restrict y){
    __assume_aligned(y, 32);
    __assume_aligned(x, 32);
    __assume_aligned(A, 32);
    y[0:4]  = A[0:4][0] * x[0];
    y[0:4] += A[0:4][1] * x[1];
    y[0:4] += A[0:4][2] * x[2];
    y[0:4] += A[0:4][3] * x[3];
}
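
For readers less familiar with the array notation, the four statements above should correspond to the plain C loop below. This is only a sketch: the name gemv_ref is hypothetical, and it assumes A holds four row pointers, each pointing to at least four floats. The restrict qualifiers and alignment hints are left out because they do not change the result.

```c
#include <stddef.h>

/* Scalar reference for the array-notation kernel (hypothetical helper).
 * Assumption: A[i] points to row i of a 4x4 matrix. */
void gemv_ref(const float *A[4], const float *x, float *y)
{
    for (size_t i = 0; i < 4; ++i) {
        /* Same accumulation order as the four array-notation statements. */
        float acc = A[i][0] * x[0];
        for (size_t j = 1; j < 4; ++j)
            acc += A[i][j] * x[j];
        y[i] = acc;
    }
}
```

Each y[i] is the dot product of row i with x, which is presumably why a dot-product instruction is a natural fit here.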

Looking at the generated assembly code:

- The compiler changes the algorithm so that it uses the vdpps instruction (most likely because of the bad access pattern of A).
- The loads for A are fine (only four packed loads). However, the loads and stores for x and y are quite bad: the compiler issues four scalar loads/stores for each of x and y. More precisely, here is the sequence of scalar loads generated for x:

vmovss    xmm0, DWORD PTR [rsi]                         
vmovss    xmm1, DWORD PTR [4+rsi]                       
vmovss    xmm2, DWORD PTR [8+rsi]                       
vmovss    xmm3, DWORD PTR [12+rsi]
vunpcklps xmm4, xmm0, xmm1                              
vunpcklps xmm5, xmm2, xmm3                              
vmovlhps  xmm12, xmm4, xmm5
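
To make the effect of this sequence explicit, the sketch below models the three instructions in plain C and compares the result with a single packed load. The vec4 type and all helper names are hypothetical stand-ins for an xmm register and the instructions; they are not intrinsics or compiler output.

```c
#include <string.h>

/* Hypothetical stand-in for the four float lanes of an xmm register. */
typedef struct { float f[4]; } vec4;

/* vmovss xmm, [mem]: load one float into lane 0, zero lanes 1..3. */
static vec4 movss_load(const float *p)
{
    vec4 r = {{ *p, 0.0f, 0.0f, 0.0f }};
    return r;
}

/* vunpcklps dst, a, b: interleave the two low lanes of a and b. */
static vec4 unpcklps(vec4 a, vec4 b)
{
    vec4 r = {{ a.f[0], b.f[0], a.f[1], b.f[1] }};
    return r;
}

/* vmovlhps dst, a, b: low half from a, low half of b into the high half. */
static vec4 movlhps(vec4 a, vec4 b)
{
    vec4 r = {{ a.f[0], a.f[1], b.f[0], b.f[1] }};
    return r;
}

/* The seven-instruction sequence from the listing. */
vec4 load_via_scalars(const float *x)
{
    vec4 x0 = movss_load(x + 0);
    vec4 x1 = movss_load(x + 1);
    vec4 x2 = movss_load(x + 2);
    vec4 x3 = movss_load(x + 3);
    vec4 lo = unpcklps(x0, x1);   /* { x[0], x[1], 0, 0 } */
    vec4 hi = unpcklps(x2, x3);   /* { x[2], x[3], 0, 0 } */
    return movlhps(lo, hi);       /* { x[0], x[1], x[2], x[3] } */
}

/* What a single vmovups xmm, [mem] would produce. */
vec4 load_packed(const float *x)
{
    vec4 r;
    memcpy(r.f, x, sizeof r.f);
    return r;
}
```

Under this model both functions return the same four lanes, so the scalar sequence appears to do nothing beyond what one packed load would do.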

(This code was generated with the Intel icpc 13.0.2 compiler with the following flags: "-xAVX -O3 -S -masm=intel -restrict")

My question is the following: 

Why does the compiler generate this sequence of instructions instead of issuing a single vmovups?

Thank you.
Best,
Paul
