Intel® ISA Extensions

SSE runtime comparison (gcc 4.6.1)

debasish83
Beginner
I was trying these two code snippets. My arrays are all 16-byte aligned:

#include <emmintrin.h>   // SSE2 intrinsics (__m128d, _mm_load_pd, ...)

// Dot product of x and y with SSE2 intrinsics. Assumes x and y are
// 16-byte aligned and n is a multiple of 4.
inline void vecDotSSE(double * s, double * x, double * y, int n)
{
    int ii;
    __m128d XMM0 = _mm_setzero_pd();   // partial sum 0
    __m128d XMM1 = _mm_setzero_pd();   // partial sum 1
    __m128d XMM2, XMM3, XMM4, XMM5;
    for (ii = 0; ii < n; ii += 4)
    {
        XMM2 = _mm_load_pd(x + ii);
        XMM3 = _mm_load_pd(x + ii + 2);
        XMM4 = _mm_load_pd(y + ii);
        XMM5 = _mm_load_pd(y + ii + 2);
        XMM2 = _mm_mul_pd(XMM2, XMM4);
        XMM3 = _mm_mul_pd(XMM3, XMM5);
        XMM0 = _mm_add_pd(XMM0, XMM2);
        XMM1 = _mm_add_pd(XMM1, XMM3);
    }
    // Horizontal sum: combine the two partial sums, replicate the high
    // half, and add it onto the low half.
    XMM0 = _mm_add_pd(XMM0, XMM1);
    XMM1 = _mm_shuffle_pd(XMM0, XMM0, _MM_SHUFFLE2(1, 1));
    XMM0 = _mm_add_pd(XMM0, XMM1);
    _mm_store_sd(s, XMM0);
}

inline void vecDot(double * s, double * x, double * y, int n)
{
    int i;
    *s = 0.;
    for (i = 0; i < n; ++i)
    {
        *s += x[i] * y[i];
    }
}
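
A minimal sketch of the kind of driver that exercises these kernels (illustrative, not from the original post; posix_memalign is one way to obtain the 16-byte alignment that _mm_load_pd requires):

#include <cstdio>
#include <cstdlib>   // posix_memalign, free

int main()
{
    const int n = 1 << 20;   // ~1M doubles, a multiple of 4 as vecDotSSE assumes
    double *x, *y, s;
    if (posix_memalign(reinterpret_cast<void **>(&x), 16, n * sizeof(double)) != 0 ||
        posix_memalign(reinterpret_cast<void **>(&y), 16, n * sizeof(double)) != 0)
        return 1;
    for (int i = 0; i < n; ++i) { x[i] = 1.0; y[i] = 0.5; }
    vecDotSSE(&s, x, y, n);
    std::printf("%f\n", s);  // use the result so the computation cannot be optimized away
    free(x);
    free(y);
    return 0;
}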

My compile flags:

g++ -Wall -O3 -msse3

These are my runtime numbers on vectors of size 1M:

SSE : 0.0263 s
Non-SSE : 1.87996e-07 s

Does that even make sense?

I have seen a lot of people on the web complaining about the same problem. I will also try the BLAS implementations from ATLAS and Intel MKL to compare SSE BLAS runtimes.

Did you guys do something about FPU performance on the new processors? It seems the FPU is much faster than the SSE arithmetic units.

Thanks.
Deb
debasish83
Beginner
I think I understand what's going on: the SSE version is not inlined by GCC, while the non-SSE version is!

When I prevent inlining of both (by removing the inline keyword from both functions), I see comparable runtimes. Unfortunately, I am still not seeing a good speedup from the SSE code.
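
A sketch of forcing both versions out of line explicitly, using GCC's noinline attribute (equivalent in effect to removing inline; shown for reference, not from the original experiment):

// Keep both functions out of line so neither gains an inlining advantage.
__attribute__((noinline)) void vecDotSSE(double * s, double * x, double * y, int n);
__attribute__((noinline)) void vecDot(double * s, double * x, double * y, int n);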

I am also trying the embree-1.0beta code from Intel and will post results from that experiment.

Thanks.
Deb

debasish83
Beginner
Respective assembly:

vecDotSSE ASM:

.cfi_startproc
xorpd %xmm2, %xmm2
testl %ecx, %ecx
movapd %xmm2, %xmm3
jle .L11
subl $1, %ecx
xorl %eax, %eax
shrl $2, %ecx
mov %ecx, %ecx
addq $1, %rcx
salq $5, %rcx
.p2align 4,,10
.p2align 3
.L12:
movapd (%rsi,%rax), %xmm1
movapd 16(%rsi,%rax), %xmm0
mulpd (%rdx,%rax), %xmm1
mulpd 16(%rdx,%rax), %xmm0
addq $32, %rax
cmpq %rcx, %rax
addpd %xmm1, %xmm3
addpd %xmm0, %xmm2
jne .L12
.L11:
addpd %xmm3, %xmm2
movapd %xmm2, %xmm0
unpckhpd %xmm2, %xmm0
addpd %xmm2, %xmm0
movlpd %xmm0, (%rdi)
ret
.cfi_endproc

vecDot ASM:

.cfi_startproc
xorl %r8d, %r8d
testl %ecx, %ecx
movq %r8, (%rdi)
jle .L15
subl $1, %ecx
movq %r8, -8(%rsp)
xorl %eax, %eax
leaq 8(,%rcx,8), %rcx
movsd -8(%rsp), %xmm1
.p2align 4,,10
.p2align 3
.L17:
movsd (%rsi,%rax), %xmm0
mulsd (%rdx,%rax), %xmm0
addq $8, %rax
cmpq %rcx, %rax
addsd %xmm0, %xmm1
movsd %xmm1, (%rdi)
jne .L17
.L15:
rep
ret
.cfi_endproc

It seems to me that gcc -O3 is also using SSE: there are XMM registers throughout the vecDot code as well!

I will work through the assembly and figure out why the SSE code is performing badly.
TimP
Honored Contributor III
You would need -ffast-math to enable vectorization of sum reductions; gcc ought to handle this quite well. gcc 4.6 has cleaned up the list of aggressive optimizations under fast-math, so it's safer than similar optimization with icc or older gcc. If the compiler doesn't automatically perform scalar replacement on *s (it should, if you use __restrict pointers), it's simple enough to write that in your source code, as below.
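
A sketch of the hand-written scalar replacement (__restrict and the local accumulator are the key points; illustrative, under the assumption that s does not alias x or y):

inline void vecDot(double * __restrict s, const double * __restrict x,
                   const double * __restrict y, int n)
{
    double sum = 0.;              // accumulate in a local that can live in a register
    for (int i = 0; i < n; ++i)
        sum += x[i] * y[i];
    *s = sum;                     // one store to memory, after the loop
}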
As for the ridiculously short time: if the compiler can see that you never use the result of a loop, it may optimize the loop away entirely. This kind of benchmark-cheating optimization has been in high demand for decades.
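
A two-line illustration (hypothetical fragment; x, y, n as in your code):

double s;
vecDot(&s, x, y, n);        // if s is never read afterwards, -O3 may delete the call
printf("%f\n", s);          // making s observable forces the loop to actually run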
debasish83
Beginner
Thanks, Tim, for the quick response. Now I am printing the results as well, so the compiler can't cheat.

If I use the following compile flags: g++ -Wall -O3 -msse3

SSE version : 2.38 s
Non-SSE version : 3.99 s

So there is clearly a 30-40% gain.

Now I added the -ffast-math option. Compile flags: g++ -Wall -O3 -msse3 -ffast-math

SSE version : 2.41 s
Non-SSE version : 2.49 s

With this option, is gcc now also unrolling or vectorizing the non-SSE code? Oddly, if I look at the assembly, the body of the non-SSE function does not look very different from the version compiled without -ffast-math:

.cfi_startproc
xorl %r8d, %r8d
testl %ecx, %ecx
movq %r8, (%rdi)
jle .L15
subl $1, %ecx
movq %r8, -8(%rsp)
xorl %eax, %eax
leaq 8(,%rcx,8), %rcx
movsd -8(%rsp), %xmm1
.p2align 4,,10
.p2align 3
.L17:
movsd (%rsi,%rax), %xmm0
mulsd (%rdx,%rax), %xmm0
addq $8, %rax
cmpq %rcx, %rax
addsd %xmm0, %xmm1
movsd %xmm1, (%rdi)
jne .L17
.L15:
rep
ret
.cfi_endproc
debasish83
Beginner
690 Views
Hi Tim,

This -ffast-math is definitely making the non-SSE code comparable to the SSE version. Can you please let me know what exactly in -ffast-math is helping the scalar code?

Thanks
Deb
TimP
Honored Contributor III
I was assuming you were compiling with an SSE option and asking gcc to vectorize (-O3). A sum reduction cannot be vectorized under strict IEEE semantics, because splitting the sum into independent partial sums reassociates the additions; -ffast-math grants the compiler that freedom. With -ffast-math -O3, gcc auto-vectorizes reductions such as the one you posted, so you would get within 60% of the best possible performance for the loop without changing your C source code. The gcc option -ftree-vectorizer-verbose=n (n >= 1) will give you some vectorization diagnostics.
This is equivalent to icc's -fast or #pragma simd reduction() auto-vectorization for the source code you posted, except that icc unrolls more aggressively to get more performance in the middle range (loop lengths 100-2000).
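
For example, a command line that would show those diagnostics (file name illustrative):

g++ -Wall -O3 -msse3 -ffast-math -ftree-vectorizer-verbose=2 dot.cpp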