- Marcar como nuevo
- Favorito
- Suscribir
- Silenciar
- Suscribirse a un feed RSS
- Resaltar
- Imprimir
- Informe de contenido inapropiado
Rather than
y = c / x
use instead:
temp = fres(x)
temp = temp * (2 - x * temp)
y = c * temp
where fres is a PowerPC instruction that gives a single precision estimate of the reciprocal of its argument, and can be called directly by the PowerPC compiler in use. More steps of the N-R iteration
can be added if needed for more accuracy.
Is it possible to do the same operation (or get the same behaviour)
using ifort8 on the x86 architecture? The problem is what to use in place of fres(). I've found the SSE instruction rcpss, but it's not possible (?) to inline assembly with ifort, is that right?
Thanks in advance for any help.
- Etiquetas:
- Intel® Fortran Compiler
Enlace copiado
- Marcar como nuevo
- Favorito
- Suscribir
- Silenciar
- Suscribirse a un feed RSS
- Resaltar
- Imprimir
- Informe de contenido inapropiado
Dear GMorris,
In the context of vector loops, the Intel compiler can use NR improvement of an approximated division. For example, the loop
PROGRAM JOHO
REAL Y(16), C(16), X(16)
DO I = 1, 16
Y(I) = C(I) / X(I)
ENDDO
END
is vectorized into one of the following two versions (depending on the switches, here given in Windows syntax):
yields:
$B1$3:
movaps xmm0, XMMWORD PTR JOHO$X$0$0[eax]
rcpps xmm1, xmm0
mulps xmm0, xmm1
mulps xmm0, xmm1
addps xmm1, xmm1
subps xmm1, xmm0
mulps xmm1, XMMWORD PTR JOHO$C$0$0[eax]
movaps XMMWORD PTR JOHO$Y$0$0[eax], xmm1
add eax, 16
cmp eax, 64
jb $B1$3
yields:
$B1$3:
movaps xmm0, XMMWORD PTR JOHO$C$0$0[eax]
divps xmm0, XMMWORD PTR JOHO$X$0$0[eax]
movaps XMMWORD PTR JOHO$Y$0$0[eax], xmm0
add eax, 16
cmp eax, 64
jb $B1$3
For various reasons, this division-control is not available at scalar level. Since I also believe it should be, however,I would appreciate your feature request to Premium Support to make my case stronger.
Aart Bik
http://www.aartbik.com/
- Marcar como nuevo
- Favorito
- Suscribir
- Silenciar
- Suscribirse a un feed RSS
- Resaltar
- Imprimir
- Informe de contenido inapropiado
Playing around, I see that everything is different when double precision variables are involved. I would also be interested in the ability to use rcpss followed by (eg) two rounds of N-R iteration to get an estimate of the reciprocal of a double precision variable. (I don't know much about these issues, so it's possible this is not a sensible thing to want to do; but it would be interesting to be able to try!). It sounds like this level of control is not available, though - is that right?
- Marcar como nuevo
- Favorito
- Suscribir
- Silenciar
- Suscribirse a un feed RSS
- Resaltar
- Imprimir
- Informe de contenido inapropiado
- Marcar como nuevo
- Favorito
- Suscribir
- Silenciar
- Suscribirse a un feed RSS
- Resaltar
- Imprimir
- Informe de contenido inapropiado
sqrt() sequences also use N-R for vectorization, and the -mp1 (and -mp) are the only options which suppress N-R in those cases.
The greatest part of the speed gain for the N-R scheme comes because the IEEE accurate divide and sqrt() block the pipeline.
A common objection to the N-R scheme is that the various AMD architectures produce different results, due to varying (often better) accuracy of the approximate reciprocal. The N-R scheme could be modified to produce more often correctly rounded results, giving up some of the performance benefit.
As I think both of you are hinting, there are non-vectorizable situations where a performance advantage could be obtained by the N-R scheme.
- Marcar como nuevo
- Favorito
- Suscribir
- Silenciar
- Suscribirse a un feed RSS
- Resaltar
- Imprimir
- Informe de contenido inapropiado

- Suscribirse a un feed RSS
- Marcar tema como nuevo
- Marcar tema como leído
- Flotar este Tema para el usuario actual
- Favorito
- Suscribir
- Página de impresión sencilla