I have been running some tests on a new server with Sandy Bridge E5-2670 CPUs, and I have a problem when running this very simple piece of code:
#define N 20000000
float a[N], b[N], c[N];
[...]
for (j=0; j<N; j++)
[...]
I then compile it in the following ways:
a) icc -O3 test.c -o mytest -xAVX -ip -g -traceback -vec-report2 // with auto-vectorization
b) icc -O3 test.c -o mytest -xAVX -ip -g -traceback -no-vec -vec-report2 // without auto-vectorization
and surprisingly b) is at least 1.5x faster on average than a).
I spent some time on this problem, looked at the assembly, and rewrote the loop with intrinsics,
but the problem persists.
Could you possibly tell me what I am doing wrong and why b) is faster than a)?
Many Thanks!
Luca
With that update, I get the following assembly:
With vectorization enabled:
..B1.1: # Preds ..B1.0
xorl %eax, %eax #5.4
..B1.2: # Preds ..B1.2 ..B1.1
vmovups a(,%rax,4), %ymm0
vaddps b(,%rax,4), %ymm0, %ymm1
vmovntps %ymm1, c(,%rax,4)
addq $8, %rax #5.4
cmpq $20000000, %rax #5.4
jb ..B1.2 # Prob 99% #5.4
..B1.3: # Preds ..B1.2
vzeroupper
ret #7.1
With vectorization disabled:
..B1.1: # Preds ..B1.0
xorl %eax, %eax #5.4
..B1.2: # Preds ..B1.2 ..B1.1
vmovss a(,%rax,8), %xmm0
vmovss 4+a(,%rax,8), %xmm2
vaddss b(,%rax,8), %xmm0, %xmm1
vaddss 4+b(,%rax,8), %xmm2, %xmm3
vmovss %xmm1, c(,%rax,8)
vmovss %xmm3, 4+c(,%rax,8)
incq %rax #5.4
cmpq $10000000, %rax #5.4
jb ..B1.2 # Prob 99%
..B1.3: # Preds ..B1.2
ret #7.1
You can see from the asm that with vectorization enabled the loop body is 6 instructions handling 8 elements per iteration, versus 9 instructions handling 2 elements per iteration without it, so we should be seeing a considerable speedup. Are you seeing similar assembly generated on your end?
The compiler I use at the moment is icc (ICC) 12.1.3 20120212,
so that should be okay.
The assembly is the same when vectorization is on:
xorl %eax, %eax #329.2
.loc 1 324 is_stmt 1
.loc 1 330 is_stmt 1
vmovups a(,%rax,4), %ymm0 #330.18
vaddps b(,%rax,4), %ymm0, %ymm1 #330.18
vmovntps %ymm1, c(,%rax,4) #330.6
.loc 1 329 is_stmt 1
addq $8, %rax #329.2
cmpq $20000000, %rax #329.2
jb ..B6.2 # Prob 99% #329.2
.loc 1 356 is_stmt 1
vzeroupper #356.1
ret #356.1
However, without vectorization I have one extra alignment directive (the .align on the 2nd line) compared to yours:
xorl %eax, %eax #329.2
.align 16,0x90
.loc 1 324 is_stmt 1
.loc 1 330 is_stmt 1
vmovss a(,%rax,8), %xmm0 #330.13
vmovss 4+a(,%rax,8), %xmm2 #330.13
vaddss b(,%rax,8), %xmm0, %xmm1 #330.18
vaddss 4+b(,%rax,8), %xmm2, %xmm3 #330.18
vmovss %xmm1, c(,%rax,8) #330.6
vmovss %xmm3, 4+c(,%rax,8) #330.6
.loc 1 329 is_stmt 1
incq %rax #329.2
cmpq $10000000, %rax #329.2
jb ..B6.2 # Prob 99% #329.2
.loc 1 356 is_stmt 1
ret
--
Re: alignment, I already tried declaring the three arrays like this:
__declspec(align(32)) float a[N], b[N], c[N];
but it did not really help.
--
The whole thing is bizarre: I have another server with Westmere (same clock frequency as
this Sandy Bridge) and this simple exercise behaves as I expected on that machine
(clearly without AVX). That is, the vectorized code is faster. On this new server
it is the opposite!
One more thing: in the process of testing this node, at some point I had only one CPU (E5-2670)
mounted on this new server (as it can hold 2 CPUs), and suddenly the vectorized code was faster
than the non-vectorized code.
Then I mounted the second chip and the odd behaviour reappeared.
Do you think there is a hardware problem, or a setting that I missed?
Thanks, Luca
How does SSE2 code run, with and without #pragma vector nontemporal?
Why don't you initialize your data, to avoid the possibility of arithmetic faults?
That was done in the omitted part (see my [....] before the for loop).
>How does SSE2 code run, with and without #pragma vector nontemporal ?
Okay, I have seen some improvement using "#pragma vector temporal" instead.
With it, my test runs a lot faster, certainly faster than before!
However, the same code then runs slower on Westmere, so for me this is not really
a solution.
1) I wrote "#pragma vector temporal", not nontemporal.
nontemporal does not seem to help at all in terms of performance. But yes, it runs OK.
2) You suggested I test with SSE2. To achieve that, I compiled it like this:
a) icc -O3 test.c -o mytest -msse2 -axSSSE3,SSE4.1,AVX -ip -g -traceback -vec-report2
// with vectorization
b) icc -O3 test.c -o mytest -msse2 -axSSSE3,SSE4.1,AVX -ip -g -traceback -no-vec -vec-report2
// without vectorization
This way, on Westmere it does not use AVX but SSE, while on Sandy Bridge it should use AVX.
I was able to reproduce this difference in performance. I've discovered that using the compiler option "-opt-streaming-stores never" really boosts the performance of the vectorized code. It's counter-intuitive to me though, since this seems like the kind of memory-bound code that would benefit from streaming stores. I've submitted a problem report to our code generator team to have them take a look at what is going on here. I'll update the thread when we make any progress on investigating this.
Hi Luca,
Our developers have been looking into this for some time, but there are platforms/CPUs where this code performs better and others where it performs worse, so there isn't an obvious way to change this one way or the other. Since you know this affects you, continue using -qopt-streaming-stores=never or add the "temporal" clause to your vector pragma. I think that's the best we can do going forward. If you have further questions, let us know.
