hi, I wrote a simple C program yesterday and tried to optimize the code with SSE, but I found the SSE version is slower than the plain C version.
code:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <conio.h>
typedef __declspec(align(16)) float vec3_t[3];
inline void vec_normalize_sse(vec3_t vec)
{
	_asm {
		mov esi, vec
		movups xmm0, [esi]
		movups xmm1, xmm0
		mulps xmm1, xmm1
		movups xmm2, xmm1
		shufps xmm2, xmm1, 0xe1
		movups xmm3, xmm1
		shufps xmm3, xmm1, 0xc6
		addps xmm1, xmm2
		addps xmm1, xmm3
		shufps xmm1, xmm1, 0x00
		sqrtps xmm1, xmm1
		divps xmm0, xmm1
		movups [esi], xmm0
	}
}
inline void vec_normalize_c(vec3_t vec)
{
	float len;
	len = vec[0]*vec[0] + vec[1]*vec[1] + vec[2]*vec[2];
	len = (float)sqrt(len);
	len = 1.0f/len;
	vec[0] *= len;
	vec[1] *= len;
	vec[2] *= len;
}
int main()
{
	int i, s, e, count;
	vec3_t vec;
	count = 1000000;
	vec[0] = 1.0f;
	vec[1] = 2.0f;
	vec[2] = 3.0f;
	s = clock();
	for (i = 0; i < count; i++) {
		vec[0] += 0.1f;
		vec[1] += 0.1f;
		vec[2] += 0.1f;
		vec_normalize_sse(vec);
	}
	e = clock();
	printf("sse = %d, %f, %f, %f\n", e - s, vec[0], vec[1], vec[2]);
	vec[0] = 1.0f;
	vec[1] = 2.0f;
	vec[2] = 3.0f;
	s = clock();
	for (i = 0; i < count; i++) {
		vec[0] += 0.1f;
		vec[1] += 0.1f;
		vec[2] += 0.1f;
		vec_normalize_c(vec);
	}
	e = clock();
	printf("c = %d, %f, %f, %f\n", e - s, vec[0], vec[1], vec[2]);
	getch();
	return 0;
}
4 Replies
The SSE code you show contains a lot of unaligned moves and shuffles, which can be relatively slow. The only real parallelism in this implementation is the multiply that squares the vector's elements and the division by len, and a lot of time is spent shuffling data around to get there. As a result, the compiler is able to do a better job producing code.
I'm not sure what flags you're using to build this code, but I had good luck with -xW (-QxW on Windows), which produced SSE code that ran significantly faster. With this option the compiler will also generate SSE code, which you may want to compare with your hand-coded asm.
I guess the lesson is not to underestimate the intelligence of the compiler :-)
Let me know if you have any further questions.
Dale
Shouldn't this:
typedef __declspec(align(16)) float vec3_t[3];
Actually say:
typedef __declspec(align(16)) float vec3_t[4];
Even though you actually need only 3 elements?
mov esi, vec
Since you have __declspec(align(16))-ed the vec3_t, you can use movaps, it is faster:
movaps xmm0, [esi]
and instead of copying xmm0 to xmm1 and wasting a register:
mulps xmm0, [esi]
and if you set the 4th element of vec3_t to 0:
haddps xmm0, xmm0 ; note that this is SSE3 instruction!
haddps xmm0, xmm0
Now you have the squared length in all 4 elements of xmm0 without using a single shuffle, although haddps is far from inexpensive itself (14 clocks), so you might still prefer your version, especially if you don't have an SSE3-capable CPU.
But here comes the root of your problem:
sqrtps and divps both utilize the FP_DIV unit, and both have 40-clock latency and, worse, 40-clock throughput, meaning that one cannot start until the other has completely finished.
The only option is to use FP_DIV for one of the operations and a Newton-Raphson approximation for the other (or for both), depending on the required precision. You could also do the divide and use table lookups for the square root, if possible.
Reason: sqrtps and divps are too slow, much slower than sqrtss and divss. The real reason is that your asm algorithm is different from the C algorithm, and it is the slower one.
You can use SSE scalar instructions to rewrite your code to follow your C algorithm. Do not use sqrtps and divps!
Also, vec is better defined as vec[4] to fit the 4-wide SSE instructions; otherwise you may cause exceptions and reduce performance.
If you want to increase the performance further, I suggest:
instead of using sqrtss and divss to obtain a = 1/sqrt(b),
use this scalar asm sequence (a = 1/sqrt(b)):
rsqrtss a,b
mulss b,a
mulss b,a
subss b,C1
mulss a,b
mulss a,C2
where C1 = 3.0 and C2 = -0.5; rsqrtss is a low-precision 1/sqrt instruction.
Or you can try the parallel asm version:
rsqrtps a,b
mulps b,a
mulps b,a
subps b,C1
mulps a,b
mulps a,C2
where C1 = {3.0, 3.0, 3.0, 3.0} and C2 = {-0.5, -0.5, -0.5, -0.5}.
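The sequence above is one Newton-Raphson refinement step applied to the rsqrtss estimate: a1 = -0.5 * a0 * (b*a0*a0 - 3) = 0.5 * a0 * (3 - b*a0*a0). The same steps can be sketched with intrinsics; this is my illustration, not thread code, and rsqrt_nr is a hypothetical name:

```c
#include <xmmintrin.h>  /* SSE: _mm_rsqrt_ss, _mm_set_ss, ... */

/* One Newton-Raphson refinement of the rsqrtss estimate, mirroring
   the C1 = 3.0, C2 = -0.5 sequence above:
   a = -0.5 * a0 * (b*a0*a0 - 3)  ==  0.5 * a0 * (3 - b*a0*a0)   */
static float rsqrt_nr(float b)
{
    __m128 vb = _mm_set_ss(b);
    __m128 a  = _mm_rsqrt_ss(vb);                 /* ~12-bit 1/sqrt(b) */
    __m128 t  = _mm_mul_ss(vb, _mm_mul_ss(a, a)); /* b*a*a             */
    t = _mm_sub_ss(t, _mm_set_ss(3.0f));          /* b*a*a - 3         */
    a = _mm_mul_ss(_mm_mul_ss(a, t), _mm_set_ss(-0.5f));
    return _mm_cvtss_f32(a);
}
```

One refinement step roughly doubles the estimate's bits of precision (from about 12 to around 23), which is usually enough for single-precision normalization.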
Message Edited by p4top on 04-19-2006 08:49 AM
I was talking about the instructions used in general; I wasn't suggesting he use sqrtps.
Depending on the required precision, there are much better combinations available.