
inline void vecDotSSE(double * s, double * x, double * y, int n)
{
    int ii;
    __m128d XMM0 = _mm_setzero_pd();
    __m128d XMM1 = _mm_setzero_pd();
    __m128d XMM2, XMM3, XMM4, XMM5;
    for (ii = 0; ii < n; ii += 4)
    {
        XMM2 = _mm_load_pd(x + ii);
        XMM3 = _mm_load_pd(x + ii + 2);
        XMM4 = _mm_load_pd(y + ii);
        XMM5 = _mm_load_pd(y + ii + 2);
        XMM2 = _mm_mul_pd(XMM2, XMM4);
        XMM3 = _mm_mul_pd(XMM3, XMM5);
        XMM0 = _mm_add_pd(XMM0, XMM2);
        XMM1 = _mm_add_pd(XMM1, XMM3);
    }
    XMM0 = _mm_add_pd(XMM0, XMM1);
    XMM1 = _mm_shuffle_pd(XMM0, XMM0, _MM_SHUFFLE2(1, 1));
    XMM0 = _mm_add_pd(XMM0, XMM1);
    _mm_store_sd(s, XMM0);
}

inline void vecDot(double * s, double * x, double * y, int n)
{
    int i;
    *s = 0.;
    for (i = 0; i < n; ++i)
    {
        *s += x[i] * y[i];
    }
}

My compile flags:

g++ -Wall -O3 -msse3

These are my runtime numbers on vectors of size 1M:

SSE : 0.0263s

Non-SSE : 1.87996e-07

Does that even make sense??

I have seen a lot of people on the web complaining about the same problem. I will also be trying BLAS from ATLAS and Intel MKL for SSE BLAS runtimes.

Did you guys do something about FPU performance on the new processors? It seems the FPUs are much faster than the SSE arithmetic units.

Thanks.

Deb



6 Replies


However, when I do not inline either function (I took the inline out of both), I see comparable runtimes. Unfortunately, I am still not seeing a good speedup from the SSE code.

I am trying the embree-1.0beta code from Intel. I will update the results from that experiment.

Thanks.

Deb


vecDotSSE ASM:

    .cfi_startproc
    xorpd   %xmm2, %xmm2
    testl   %ecx, %ecx
    movapd  %xmm2, %xmm3
    jle     .L11
    subl    $1, %ecx
    xorl    %eax, %eax
    shrl    $2, %ecx
    mov     %ecx, %ecx
    addq    $1, %rcx
    salq    $5, %rcx
    .p2align 4,,10
    .p2align 3
.L12:
    movapd  (%rsi,%rax), %xmm1
    movapd  16(%rsi,%rax), %xmm0
    mulpd   (%rdx,%rax), %xmm1
    mulpd   16(%rdx,%rax), %xmm0
    addq    $32, %rax
    cmpq    %rcx, %rax
    addpd   %xmm1, %xmm3
    addpd   %xmm0, %xmm2
    jne     .L12
.L11:
    addpd   %xmm3, %xmm2
    movapd  %xmm2, %xmm0
    unpckhpd %xmm2, %xmm0
    addpd   %xmm2, %xmm0
    movlpd  %xmm0, (%rdi)
    ret
    .cfi_endproc

vecDot asm:

    .cfi_startproc
    xorl    %r8d, %r8d
    testl   %ecx, %ecx
    movq    %r8, (%rdi)
    jle     .L15
    subl    $1, %ecx
    movq    %r8, -8(%rsp)
    xorl    %eax, %eax
    leaq    8(,%rcx,8), %rcx
    movsd   -8(%rsp), %xmm1
    .p2align 4,,10
    .p2align 3
.L17:
    movsd   (%rsi,%rax), %xmm0
    mulsd   (%rdx,%rax), %xmm0
    addq    $8, %rax
    cmpq    %rcx, %rax
    addsd   %xmm0, %xmm1
    movsd   %xmm1, (%rdi)
    jne     .L17
.L15:
    rep
    ret
    .cfi_endproc

It seems to me that gcc -O3 is using SSE in the scalar version too: the generated code for vecDot is full of xmm registers as well!

I will study the assembly and figure out why the SSE intrinsic code is behaving so badly.


As for the ridiculously short time: if the compiler can see that you never use the result of a loop, it may optimize the loop away entirely. This kind of benchmark-defeating optimization has been in high demand for decades.


If I use the following compile flags: g++ -Wall -O3 -msse3

SSE version : 2.38

Non-SSE version : 3.99

So there is clearly a 30-40% gain.

Now I added the -ffast-math option. Compile flags: g++ -Wall -O3 -msse3 -ffast-math

SSE version : 2.41

Non-SSE version : 2.49

If I look into the assembly, the function body for the non-SSE version looks very similar to before. With this option, is gcc now also unrolling the non-SSE loop? The asm of the non-SSE function does not look very different from the non--ffast-math version:

    .cfi_startproc
    xorl    %r8d, %r8d
    testl   %ecx, %ecx
    movq    %r8, (%rdi)
    jle     .L15
    subl    $1, %ecx
    movq    %r8, -8(%rsp)
    xorl    %eax, %eax
    leaq    8(,%rcx,8), %rcx
    movsd   -8(%rsp), %xmm1
    .p2align 4,,10
    .p2align 3
.L17:
    movsd   (%rsi,%rax), %xmm0
    mulsd   (%rdx,%rax), %xmm0
    addq    $8, %rax
    cmpq    %rcx, %rax
    addsd   %xmm0, %xmm1
    movsd   %xmm1, (%rdi)
    jne     .L17
.L15:
    rep
    ret
    .cfi_endproc


This -ffast-math is definitely making the non-SSE code comparable to the SSE code. Can you please tell me what exactly in -ffast-math is helping the scalar code?

Thanks

Deb


For the source code you posted, this is equivalent to icc's -fast or #pragma simd reduction() auto-vectorization, except that icc will unroll more aggressively to get more performance in the middle range (loop lengths of 100-2000).
