Hi all,
I'm a little puzzled by the assembly code generated for this small piece of Cilk Plus array-notation code:
void gemv(const float* restrict A[4], const float *restrict x, float * restrict y){
__assume_aligned(y, 32);
__assume_aligned(x, 32);
__assume_aligned(A, 32);
y[0:4] = A[0:4][0] * x[0];
y[0:4] += A[0:4][1] * x[1];
y[0:4] += A[0:4][2] * x[2];
y[0:4] += A[0:4][3] * x[3];
}
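For readers less familiar with array notation, here is a minimal scalar sketch of what I understand the kernel above to compute: y = A * x for a 4x4 matrix stored as four row pointers. The name gemv_ref is made up for illustration, and I dropped the restrict qualifiers and alignment hints to keep it plain C.

```c
#include <stddef.h>

/* Scalar reference for the array-notation kernel (hypothetical helper name).
   A[i] points to row i of a 4x4 matrix; computes y[i] = sum_j A[i][j] * x[j],
   accumulating in the same order as the four array-notation statements. */
void gemv_ref(const float *A[4], const float *x, float *y)
{
    for (size_t i = 0; i < 4; ++i) {
        float acc = A[i][0] * x[0];      /* y[0:4]  = A[0:4][0] * x[0] */
        for (size_t j = 1; j < 4; ++j)
            acc += A[i][j] * x[j];       /* y[0:4] += A[0:4][j] * x[j] */
        y[i] = acc;
    }
}
```

Written this way, each y[i] is a dot product of one row of A with x, which would be consistent with the compiler choosing vdpps.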
Looking at the generated assembly code:
- The compiler changes the algorithm so that it uses the vdpps instruction (most likely because of the bad access pattern of A).
- The loads for A are fine (only four packed loads). However, the loads and stores for x and y are quite bad: the compiler issues four scalar loads/stores for each of x and y. For example, here is the generated sequence of scalar loads for x:
vmovss xmm0, DWORD PTR [rsi]
vmovss xmm1, DWORD PTR [4+rsi]
vmovss xmm2, DWORD PTR [8+rsi]
vmovss xmm3, DWORD PTR [12+rsi]
vunpcklps xmm4, xmm0, xmm1
vunpcklps xmm5, xmm2, xmm3
vmovlhps xmm12, xmm4, xmm5
(This code was generated with the Intel icpc 13.0.2 compiler with the following flags: "-xAVX -O3 -S -masm=intel -restrict")
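To make the comparison concrete, here is a rough sketch in SSE intrinsics of the two ways of getting x[0..3] into one register. The function names load4_scalar and load4_packed are made up for illustration, and the mapping from intrinsics to instructions is what I would typically expect under -xAVX, not something guaranteed for every compiler.

```c
#include <xmmintrin.h>

/* Mirrors the generated sequence: four scalar loads merged with shuffles.
   Hypothetical helper name. */
__m128 load4_scalar(const float *x)
{
    __m128 x0 = _mm_load_ss(x + 0);       /* (x[0], 0, 0, 0), cf. vmovss */
    __m128 x1 = _mm_load_ss(x + 1);       /* (x[1], 0, 0, 0) */
    __m128 x2 = _mm_load_ss(x + 2);       /* (x[2], 0, 0, 0) */
    __m128 x3 = _mm_load_ss(x + 3);       /* (x[3], 0, 0, 0) */
    __m128 lo = _mm_unpacklo_ps(x0, x1);  /* (x[0], x[1], 0, 0), cf. vunpcklps */
    __m128 hi = _mm_unpacklo_ps(x2, x3);  /* (x[2], x[3], 0, 0) */
    return _mm_movelh_ps(lo, hi);         /* (x[0], x[1], x[2], x[3]), cf. vmovlhps */
}

/* What I would have expected instead: one unaligned packed load,
   which usually becomes a single movups/vmovups. Hypothetical helper name. */
__m128 load4_packed(const float *x)
{
    return _mm_loadu_ps(x);
}
```

Both functions produce the same register contents; the second simply does it with one load instead of seven instructions.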
My question is the following:
Why does the compiler generate this sequence of instructions instead of issuing a single vmovups?
Thank you.
Best,
Paul