Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Fast estimate for division

gmorris
Principiante
1.065 Vistas
I was recently introduced to this (Newton-Raphson) technique for getting a faster, lower accuracy estimate of a division, on the PowerPC architecture.

Rather than
y = c / x

use instead:

temp = fres(x)
temp = temp * (2 - x * temp)
y = c * temp

where fres is a PowerPC instruction that gives a single precision estimate of the reciprocal of its argument, and can be called directly by the PowerPC compiler in use. More steps of the N-R iteration
can be added if needed for more accuracy.

Is it possible to do the same operation (or get the same behaviour)
using ifort8 on the x86 architecture? The problem is what to use in place of fres(). I've found the SSE instruction rcpss, but it's not possible (?) to inline assembly with ifort, is that right?


Thanks in advance for any help.
0 kudos
5 Respuestas
Intel_C_Intel
Empleados
1.065 Vistas

Dear GMorris,

In the context of vector loops, the Intel compiler can use NR improvement of an approximated division. For example, the loop

PROGRAM JOHO
REAL Y(16), C(16), X(16)
DO I = 1, 16
Y(I) = C(I) / X(I)
ENDDO
END

is vectorized into one of the following two versions (depending on the switches, here given in Windows syntax):

ifort -Qprec-div- -Qunroll0 -QxP -Fa joho.f

yields:

$B1$3:
movaps xmm0, XMMWORD PTR JOHO$X$0$0[eax]
rcpps xmm1, xmm0
mulps xmm0, xmm1
mulps xmm0, xmm1
addps xmm1, xmm1
subps xmm1, xmm0
mulps xmm1, XMMWORD PTR JOHO$C$0$0[eax]
movaps XMMWORD PTR JOHO$Y$0$0[eax], xmm1
add eax, 16
cmp eax, 64
jb $B1$3

ifort -Qprec-div -Qunroll0 -QxP -Fa joho.f

yields:

$B1$3:
movaps xmm0, XMMWORD PTR JOHO$C$0$0[eax]
divps xmm0, XMMWORD PTR JOHO$X$0$0[eax]
movaps XMMWORD PTR JOHO$Y$0$0[eax], xmm0
add eax, 16
cmp eax, 64
jb $B1$3

For various reasons, this division-control is not available at scalar level. Since I also believe it should be, however,I would appreciate your feature request to Premium Support to make my case stronger.

Aart Bik
http://www.aartbik.com/

gmorris
Principiante
1.065 Vistas
Thanks for your reply, I found it very informative. I'll submit a Premier Support issue and see what they say.

Playing around, I see that everything is different when double precision variables are involved. I would also be interested in the ability to use rcpss followed by (eg) two rounds of N-R iteration to get an estimate of the reciprocal of a double precision variable. (I don't know much about these issues, so it's possible this is not a sensible thing to want to do; but it would be interesting to be able to try!). It sounds like this level of control is not available, though - is that right?
gmorris
Principiante
1.065 Vistas
Whoops, that was a bit dumb. I could just make a single precision copy of the double precision variable and work with that, I suppose.
TimP
Colaborador Distinguido III
1.065 Vistas
Aart quoted Windows compiler options in his response. For the linux compiler, options -mp1 or -prec_div suppress the Newton iteration scheme for single precision parallel and use instead the IEEE accurate instructions.
sqrt() sequences also use N-R for vectorization, and the -mp1 (and -mp) are the only options which suppress N-R in those cases.
The greatest part of the speed gain for the N-R scheme comes because the IEEE accurate divide and sqrt() block the pipeline.
A common objection to the N-R scheme is that the various AMD architectures produce different results, due to varying (often better) accuracy of the approximate reciprocal. The N-R scheme could be modified to produce more often correctly rounded results, giving up some of the performance benefit.
As I think both of you are hinting, there are non-vectorizable situations where a performance advantage could be obtained by the N-R scheme.
gmorris
Principiante
1.065 Vistas
Just for the record, I submitted a feature request to Premier Support about this, as was suggested. They agree that the treatment of vectors and scalars should be consistent, but apparently there is a "related issue" that needs to be solved before this feature can be implemented.
Responder