I have been running some tests on a new server with Sandy Bridge E5-2670 CPUs, and I have a problem when running this very simple piece of code:
#define N 20000000
float a[N], b[N], c[N];
[...]
for (j=0; j<N; j++)
[...]
I then compile it in the following ways:
a) icc -O3 test.c -o mytest -xAVX -ip -g -traceback -vec-report2 // with auto-vectorization
b) icc -O3 test.c -o mytest -xAVX -ip -g -traceback -no-vec -vec-report2 // without auto-vectorization
and surprisingly b) is at least 1.5x faster on average than a).
I spent some time on this problem, looked at the assembly, and rewrote the loop with intrinsics,
but the problem persists.
Could you possibly tell me what I am doing wrong and why b) is faster than a)?
Many Thanks!
Luca
With that update, I get the following assembly:
With vectorization enabled:
..B1.1: # Preds ..B1.0
xorl %eax, %eax #5.4
..B1.2: # Preds ..B1.2 ..B1.1
vmovups a(,%rax,4), %ymm0
vaddps b(,%rax,4), %ymm0, %ymm1
vmovntps %ymm1, c(,%rax,4)
addq $8, %rax #5.4
cmpq $20000000, %rax #5.4
jb ..B1.2 # Prob 99% #5.4
..B1.3: # Preds ..B1.2
vzeroupper
ret #7.1
With vectorization disabled:
..B1.1: # Preds ..B1.0
xorl %eax, %eax #5.4
..B1.2: # Preds ..B1.2 ..B1.1
vmovss a(,%rax,8), %xmm0
vmovss 4+a(,%rax,8), %xmm2
vaddss b(,%rax,8), %xmm0, %xmm1
vaddss 4+b(,%rax,8), %xmm2, %xmm3
vmovss %xmm1, c(,%rax,8)
vmovss %xmm3, 4+c(,%rax,8)
incq %rax #5.4
cmpq $10000000, %rax #5.4
jb ..B1.2 # Prob 99%
..B1.3: # Preds ..B1.2
ret #7.1
You can see from the asm that with vectorization enabled the loop body is 6 instructions handling 8 elements per iteration, versus 9 instructions handling 2 elements per iteration without it, so we should be seeing a considerable speedup. Are you seeing similar assembly generated on your end?
The compiler I use at the moment is icc (ICC) 12.1.3 20120212,
so that should be okay.
The assembly is the same when vectorization is on:
xorl %eax, %eax #329.2
.loc 1 324 is_stmt 1
.loc 1 330 is_stmt 1
vmovups a(,%rax,4), %ymm0 #330.18
vaddps b(,%rax,4), %ymm0, %ymm1 #330.18
vmovntps %ymm1, c(,%rax,4) #330.6
.loc 1 329 is_stmt 1
addq $8, %rax #329.2
cmpq $20000000, %rax #329.2
jb ..B6.2 # Prob 99% #329.2
.loc 1 356 is_stmt 1
vzeroupper #356.1
ret #356.1
However, without vectorization I have one extra alignment directive (the .align on the 2nd line) compared to yours:
xorl %eax, %eax #329.2
.align 16,0x90
.loc 1 324 is_stmt 1
.loc 1 330 is_stmt 1
vmovss a(,%rax,8), %xmm0 #330.13
vmovss 4+a(,%rax,8), %xmm2 #330.13
vaddss b(,%rax,8), %xmm0, %xmm1 #330.18
vaddss 4+b(,%rax,8), %xmm2, %xmm3 #330.18
vmovss %xmm1, c(,%rax,8) #330.6
vmovss %xmm3, 4+c(,%rax,8) #330.6
.loc 1 329 is_stmt 1
incq %rax #329.2
cmpq $10000000, %rax #329.2
jb ..B6.2 # Prob 99% #329.2
.loc 1 356 is_stmt 1
ret
--
Re: alignment, I already tried declaring the three arrays like this:
__declspec(align(32)) float a[N], b[N], c[N];
but it did not really help.
--
The whole thing is bizarre: I have another server with Westmere (same clock frequency as
this Sandy Bridge) and this simple exercise behaves as I expected on that machine
(clearly without AVX). That is, the vectorized code is faster. On this new server
it is the opposite!
One more thing: in the process of testing this node, at some point I had only one CPU (E5-2670)
mounted on this new server (as it can hold 2 CPUs), and suddenly the vectorized code was faster
than the non-vectorized code.
Then I mounted the second chip and the odd behaviour reappeared.
Do you think there is a hardware problem, or a setting that I missed?
Thanks, Luca
How does SSE2 code run, with and without #pragma vector nontemporal?
Why don't you initialize your data, to avoid the possibility of arithmetic faults?
That was done in the omitted part (see my [....] before the for loop).
>How does SSE2 code run, with and without #pragma vector nontemporal ?
Okay, I have seen some improvement using "#pragma vector temporal" instead.
With it, my test runs a lot faster, certainly faster than before!
However, the same code then runs slower on Westmere, so for me this is not really
a solution.
1) I wrote "#pragma vector temporal", not nontemporal.
nontemporal does not seem to help at all in terms of performance. But yes, it runs OK.
2) You suggested I test with SSE2. To achieve that, I compiled it like this:
a) icc -O3 test.c -o mytest -msse2 -axSSSE3,SSE4.1,AVX -ip -g -traceback -vec-report2
// with vectorization
b) icc -O3 test.c -o mytest -msse2 -axSSSE3,SSE4.1,AVX -ip -g -traceback -no-vec -vec-report2
// without vectorization
This way, on Westmere it does not use AVX but SSE, while on Sandy Bridge it should use AVX.
I was able to reproduce this difference in performance. I've discovered that using the compiler option "-opt-streaming-stores never" really boosts the performance of the vectorized code. It's counter-intuitive to me though, since this seems like the kind of memory-bound code that would benefit from streaming stores. I've submitted a problem report to our code generator team to have them take a look at what is going on here. I'll update the thread when we make any progress on investigating this.
Hi Luca,
Our developers have been looking into this for some time, but there are platforms/CPUs where this code performs better and others where it performs worse, so there isn't an obvious way to change this one way or the other. Since you know this affects you, continue using -qopt-streaming-stores=never or add the "temporal" clause to your vector pragma. I think that's the best we can do going forward. If you have further questions, let us know.
