I assume you're using single-precision, otherwise asking for rsqrt wouldn't make much sense...
It's not a question of the compiler you use, but of the CPU. With any recent SSE-capable CPU you will get the following if you put a 3D vector in one SSE register (which I can't generally recommend):
One subtraction (3 cycles, or 4 on AMD), then one multiplication (4 cycles), then two horizontal add instructions (2 x 5 cycles), and then rsqrt (3 cycles).
If you had 4 x vectors and 4 y vectors, this could be improved by putting the x1 values in one SSE register, the x2 values in another, and so on. You'd then calculate 3 subtractions, which can be pipelined, 3 multiplications, which can be pipelined, two additions, and one rsqrt. Since vertical additions are faster than horizontal additions, this vertical vectorization would give you the results for all 4 pairs of x and y vectors in basically the same time as the single result from the one-vector-per-register layout.
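A minimal sketch of that vertical layout with SSE intrinsics, assuming the quantity you want is 1/|x - y| for four pairs of 3D vectors. The function name inv_dist4 and the separate per-component arrays are my own illustration, not something from your code:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: computes 1/|x - y| for four pairs of 3D vectors.
   x1[i], x2[i], x3[i] are the three components of the i-th x vector,
   and likewise for y (structure-of-arrays layout). */
static void inv_dist4(const float *x1, const float *x2, const float *x3,
                      const float *y1, const float *y2, const float *y3,
                      float *out)
{
    /* Three independent subtractions, one per component. */
    __m128 d1 = _mm_sub_ps(_mm_loadu_ps(x1), _mm_loadu_ps(y1));
    __m128 d2 = _mm_sub_ps(_mm_loadu_ps(x2), _mm_loadu_ps(y2));
    __m128 d3 = _mm_sub_ps(_mm_loadu_ps(x3), _mm_loadu_ps(y3));

    /* Three independent multiplications (squares of the differences). */
    __m128 s1 = _mm_mul_ps(d1, d1);
    __m128 s2 = _mm_mul_ps(d2, d2);
    __m128 s3 = _mm_mul_ps(d3, d3);

    /* Two vertical additions, no horizontal adds needed. */
    __m128 sum = _mm_add_ps(_mm_add_ps(s1, s2), s3);

    /* One approximate reciprocal square root for all four lanes. */
    _mm_storeu_ps(out, _mm_rsqrt_ps(sum));
}
```

Keep in mind that rsqrt is only an approximation (roughly 12 bits of precision), so if you need full single precision you'd have to add a Newton-Raphson step afterwards. The unaligned loads are just to keep the sketch safe for arbitrary pointers; with 16-byte aligned data you could use _mm_load_ps instead.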
Cheers,
Matthias
