hi, I wrote a simple C program yesterday and tried to optimize the code with SSE, but I found the SSE version is slower than the plain C version.
code:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <conio.h>
typedef __declspec(align(16)) float vec3_t[3];
inline void vec_normalize_sse(vec3_t vec)
{
	_asm {
		mov esi, vec
		movups xmm0, [esi]
		movups xmm1, xmm0
		mulps xmm1, xmm1
		movups xmm2, xmm1
		shufps xmm2, xmm1, 0xe1
		movups xmm3, xmm1
		shufps xmm3, xmm1, 0xc6
		addps xmm1, xmm2
		addps xmm1, xmm3
		shufps xmm1, xmm1, 0x00
		sqrtps xmm1, xmm1
		divps xmm0, xmm1
		movups [esi], xmm0
	}
}
inline void vec_normalize_c(vec3_t vec)
{
	float len;
	len = vec[0]*vec[0] + vec[1]*vec[1] + vec[2]*vec[2];
	len = (float)sqrt(len);
	len = 1.0f/len;
	vec[0] *= len;
	vec[1] *= len;
	vec[2] *= len;
}
int main()
{
	int i, s, e, count;
	vec3_t vec;
	count = 1000000;
	vec[0] = 1.0f;
	vec[1] = 2.0f;
	vec[2] = 3.0f;
	s = clock();
	for (i = 0; i < count; i++) {
		vec[0] += 0.1f;
		vec[1] += 0.1f;
		vec[2] += 0.1f;
		vec_normalize_sse(vec);
	}
	e = clock();
	printf("sse = %d, %f, %f, %f\n", e - s, vec[0], vec[1], vec[2]);
	vec[0] = 1.0f;
	vec[1] = 2.0f;
	vec[2] = 3.0f;
	s = clock();
	for (i = 0; i < count; i++) {
		vec[0] += 0.1f;
		vec[1] += 0.1f;
		vec[2] += 0.1f;
		vec_normalize_c(vec);
	}
	e = clock();
	printf("c = %d, %f, %f, %f\n", e - s, vec[0], vec[1], vec[2]);
	getch();
	return 0;
}
4 Replies
The SSE code you show contains a lot of unaligned moves and shuffles, which can be relatively slow. The only real parallelism in this implementation is the multiply that squares the vector's elements and the division by len, and a lot of time is spent shuffling data around to get there. As a result, the compiler is able to do a better job producing code.
I'm not sure what flags you're using to build this code, but I had good luck with -xW (-QxW on Windows), which produced SSE code that ran significantly faster. With this option the compiler will also generate SSE code, which you may want to compare with your hand-coded asm.
I guess the lesson is not to underestimate the intelligence of the compiler :-)
Let me know if you have any further questions.
Dale
Shouldn't this:
typedef __declspec(align(16)) float vec3_t[3];
Actually say:
typedef __declspec(align(16)) float vec3_t[4];
Even though you actually need only 3 elements?
mov esi, vec
Since you have __declspec(align(16))-ed the vec3_t, you can use movaps, it is faster:
movaps xmm0, [esi]
and instead of copying xmm0 to xmm1 and wasting a register:
mulps xmm0, [esi]
and if you set the 4th element of vec3_t to 0:
haddps xmm0, xmm0 ; note that this is SSE3 instruction!
haddps xmm0, xmm0
Now you have the squared length in all 4 elements of xmm0 without using a single shuffle, although haddps is far from inexpensive itself (14 clocks), so you might still prefer your version, especially if you don't have an SSE3-capable CPU.
But here comes the root of your problem:
sqrtps and divps both utilize the FP_DIV unit, and both have 40-clock latency and, worse, 40-clock throughput, meaning that one cannot start until the other has completely finished.
The only option is to use FP_DIV for one of the operations and a Newton-Raphson approximation for the other (or for both), depending on the required precision. You could also do the divide and use table lookups for the square root, if possible.
Reason: sqrtps and divps are too slow, much slower than sqrtss and divss. The real reason is that your asm algorithm is different from the C algorithm, and it is the slower one.
You can use SSE scalar instructions to rewrite your code to follow your C algorithm. Do not use sqrtps and divps!
Also, vec is better defined as vec[4] to fit the 4-wide SSE instructions; otherwise you may cause exceptions and reduce performance.
If you want to increase the performance further, I suggest:
instead of using sqrtss and divss to obtain a = 1/sqrt(b),
use this scalar asm sequence (a = 1/sqrt(b)):
rsqrtss a,b
mulss b,a
mulss b,a
subss b,C1
mulss a,b
mulss a,C2
where C1 = 3.0 and C2 = -0.5; rsqrtss is a low-precision 1/sqrt instruction.
Or you can try the parallel asm version:
rsqrtps a,b
mulps b,a
mulps b,a
subps b,C1
mulps a,b
mulps a,C2
where C1 = {3.0, 3.0, 3.0, 3.0} and C2 = {-0.5, -0.5, -0.5, -0.5}.
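The sequence above is one Newton-Raphson refinement step applied to the rsqrtss estimate: a1 = -0.5 * a0 * (b*a0*a0 - 3) = 0.5 * a0 * (3 - b*a0*a0). The same steps can be sketched with intrinsics; this is my illustration, not thread code, and rsqrt_nr is a hypothetical name:

```c
#include <xmmintrin.h>  /* SSE: _mm_rsqrt_ss, _mm_set_ss, ... */

/* One Newton-Raphson refinement of the rsqrtss estimate, mirroring
   the C1 = 3.0, C2 = -0.5 sequence above:
   a = -0.5 * a0 * (b*a0*a0 - 3)  ==  0.5 * a0 * (3 - b*a0*a0)   */
static float rsqrt_nr(float b)
{
    __m128 vb = _mm_set_ss(b);
    __m128 a  = _mm_rsqrt_ss(vb);                 /* ~12-bit 1/sqrt(b) */
    __m128 t  = _mm_mul_ss(vb, _mm_mul_ss(a, a)); /* b*a*a             */
    t = _mm_sub_ss(t, _mm_set_ss(3.0f));          /* b*a*a - 3         */
    a = _mm_mul_ss(_mm_mul_ss(a, t), _mm_set_ss(-0.5f));
    return _mm_cvtss_f32(a);
}
```

One refinement step roughly doubles the estimate's bits of precision (from about 12 to around 23), which is usually enough for single-precision normalization.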
Message Edited by p4top on 04-19-2006 08:49 AM
I was talking about the instructions used in general; I wasn't suggesting he use sqrtps.
Depending on the required precision, there are much better combinations available.