floating point operations for 1/r

pilot117 · ‎12-02-2010

Hi,

If I have two 3-D vectors x = [x1, x2, x3] and y = [y1, y2, y3],

how many flops are needed to compute 1/|x-y| using icc?

I assume this can be computed as rsqrt((x1-y1)^2+(x2-y2)^2+(x3-y3)^2)...

thanks

Matthias_Kretz · ‎12-02-2010

I assume you're using single precision, otherwise asking for rsqrt wouldn't make much sense (SSE only provides an rsqrt instruction for single precision)...

It's not a question of the compiler you use, but of the CPU. With any recent SSE-capable CPU you will get the following if you put one 3D vector in one SSE register (which I can't generally recommend):
one subtraction (3 cycles, 4 on AMD), then one multiplication (4 cycles), then two horizontal add instructions (2x 5 cycles), and then rsqrt (3 cycles). That adds up to roughly 20 cycles of latency for a single result.
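
To make the sequence above concrete, here is a minimal sketch with SSE intrinsics, assuming an x86 target. The function name `inv_dist_one` and the zero padding of the fourth lane are assumptions for illustration and not something from the original question. The horizontal adds need SSE3, so a shuffle-based fallback is included for plain SSE builds.

```c
#include <xmmintrin.h>   /* SSE: _mm_sub_ps, _mm_mul_ps, _mm_rsqrt_ss */
#ifdef __SSE3__
#include <pmmintrin.h>   /* SSE3: _mm_hadd_ps */
#endif

/* Hypothetical helper: approximate 1/|x-y| for ONE pair of 3D vectors,
   with each vector packed into a single SSE register (4th lane = 0). */
float inv_dist_one(const float *x, const float *y)
{
    __m128 vx = _mm_set_ps(0.0f, x[2], x[1], x[0]);
    __m128 vy = _mm_set_ps(0.0f, y[2], y[1], y[0]);

    __m128 d  = _mm_sub_ps(vx, vy);   /* one subtraction:    x - y      */
    __m128 sq = _mm_mul_ps(d, d);     /* one multiplication: (x - y)^2  */

#ifdef __SSE3__
    /* two horizontal adds: every lane ends up holding the full sum */
    __m128 s = _mm_hadd_ps(sq, sq);
    s = _mm_hadd_ps(s, s);
#else
    /* SSE-only fallback: emulate the horizontal sum with shuffles */
    __m128 t = _mm_add_ps(sq, _mm_movehl_ps(sq, sq));    /* lanes 0+2, 1+3 */
    __m128 s = _mm_add_ss(t, _mm_shuffle_ps(t, t, 1));   /* lane 0: total  */
#endif

    /* rsqrt on the lowest lane; this is only an approximation */
    return _mm_cvtss_f32(_mm_rsqrt_ss(s));
}
```

Note that rsqrt only returns an approximation (about 12 bits of precision), so the result will not match 1.0f/sqrtf(...) exactly.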

If you had 4 x vectors and 4 y vectors, this could be improved by putting the four x1 values in one SSE register, the four x2 values in another, and so on. Then you'd calculate 3 subtractions, which can be pipelined, 3 multiplications, which can also be pipelined, two additions and one rsqrt. Since vertical additions are faster than horizontal additions, with this vertical vectorization you'd get the results for all 4 pairs of x and y vectors in basically the same time as the single result above.
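
As a rough sketch of that vertical layout, again assuming an x86 target: the function name `inv_dist_four` and the six separate component arrays are hypothetical, chosen only to illustrate the idea. Each array holds the same component of four different vectors, so nine arithmetic instructions (3 sub, 3 mul, 2 add, 1 rsqrt) produce four results.

```c
#include <xmmintrin.h>   /* SSE intrinsics */

/* Hypothetical helper: out[i] = approx. 1/|x_i - y_i| for i = 0..3.
   x1 holds the first component of four x vectors, x2 the second, etc. */
void inv_dist_four(const float *x1, const float *x2, const float *x3,
                   const float *y1, const float *y2, const float *y3,
                   float *out)
{
    /* 3 independent subtractions (can be pipelined) */
    __m128 d1 = _mm_sub_ps(_mm_loadu_ps(x1), _mm_loadu_ps(y1));
    __m128 d2 = _mm_sub_ps(_mm_loadu_ps(x2), _mm_loadu_ps(y2));
    __m128 d3 = _mm_sub_ps(_mm_loadu_ps(x3), _mm_loadu_ps(y3));

    /* 3 independent multiplications (can be pipelined) */
    __m128 s1 = _mm_mul_ps(d1, d1);
    __m128 s2 = _mm_mul_ps(d2, d2);
    __m128 s3 = _mm_mul_ps(d3, d3);

    /* 2 vertical additions instead of horizontal ones */
    __m128 r2 = _mm_add_ps(_mm_add_ps(s1, s2), s3);

    /* 1 rsqrt yields four approximate results at once */
    _mm_storeu_ps(out, _mm_rsqrt_ps(r2));
}
```

The unaligned loads and stores are used only so the sketch works with any float arrays; with 16-byte aligned data the aligned variants could be used instead.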

Cheers,
Matthias
