Low Precision VFNMADDSS on SDE (AVX Emulator)

wowtiger · ‎11-11-2008

I used SDE to learn the FMA instruction, but I ran into trouble with this...

result:
3183
3.14159265
10000
FMA:10000.000000-3183.000000*3.141593=0.310305
x87:10000.000000-3183.000000*3.141593=0.310547

Is it a bug?

Code:
#include <stdio.h>

/* Computes -(a*b) + c with the 4-operand VFNMADDSS form accepted by SDE. */
__declspec(naked) float vfnmaddss(float a, float b, float c) {
    __asm {
        movss xmm0,dword ptr [esp+4]      ; a
        movss xmm1,dword ptr [esp+8]      ; b
        movss xmm2,dword ptr [esp+0Ch]    ; c
        vfnmaddss xmm0, xmm0, xmm1, xmm2
        movss dword ptr [esp+4],xmm0
        fld dword ptr [esp+4]             ; return the float in st(0)
        ret
    }
}

int main() {
    float a,b,c;
    scanf("%f",&a);
    scanf("%f",&b);
    scanf("%f",&c);
    printf("FMA:%f-%f*%f=%f\n",c,a,b,vfnmaddss(a,b,c));
    printf("x87:%f-%f*%f=%f\n",c,a,b,-(a*b)+c);
    return 0;
}

gabest · ‎11-12-2008

msvc:

x87:10000.000000-3183.000000*3.141593=0.310305

It's hard to tell which one is right: the input value of PI doesn't fit exactly into a float, and we cannot see the exact results either, because printf rounds not just PI (b) but the results too.
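
To see past the printf rounding mentioned above, one option is to print more digits. The sketch below is only an illustration: format_float_9g is a made-up helper name, not something from the original code. It formats a float with 9 significant digits, which is enough to tell any two distinct floats apart.

```c
#include <stdio.h>

/* Hypothetical helper (not from the thread): write v into buf using 9
   significant digits, enough to identify any float uniquely.
   Returns whatever snprintf returns (the number of characters needed). */
int format_float_9g(char *buf, size_t size, float v) {
    return snprintf(buf, size, "%.9g", (double)v);
}
```

Called with 3.14159265f this produces 3.14159274, which shows that the value actually stored in the float is not the 3.141593 that %f prints.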

MarkC_Intel · ‎01-14-2009

(apologies for the delayed response. I was just notified about this posting.)

Hi, the answers differ because the FMA is a fused multiply-add with no internal rounding step. On x87, you have an "extra" round between the multiply and the add. If your compiler supports the POSIX fused multiply-add routine "fmaf" for single precision, you'll see that you get the same answer that the Intel SDE FMA produces.

% cat fma44.c

#include <stdio.h>
#include <math.h>

int main() {
    float a,b,c,d;
    b = -3183;
    c = 3.14159265;
    d = 10000;
    a = fmaf(b,c,d);
    printf("%f\n",a);
    return 0;
}

% icc -o fma44 fma44.c -lm

% ./fma44

0.310305

gabest · ‎01-14-2009

Hm, this does not explain why msvc ended up with the same results without fma.

MarkC_Intel · ‎01-15-2009

Hi, for this test, MSVC uses SSE on Intel64 but uses x87 on IA32. If you do the computation on IA32 using x87 floating point hardware, then you get the answer that the fused FMA gives, since multiplying two 24b fractions results in a 48b fraction, which easily fits in the 64b fraction of the 80b float. In this case, I believe Windows defaults to using double precision for the x87 floating point stack. Either way, the 53b fraction is sufficient to hold the exact product that would be used in the fused multiply-add.

If you use MSVC on Intel64 and thus use SSE single precision, then it gets the other answer.

Not surprisingly, when there are approximate numerical representations of the input values and numerical cancellation like this, true double precision gives yet a third answer.

Regards,
Mark
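
The 48b-product argument can be tried without FMA hardware. The sketch below uses invented function names. It also assumes that "true double precision" means the inputs themselves are held as doubles, which the post does not spell out. The first function keeps float inputs but carries the product in a double, where it is exact. The second never rounds the inputs to float at all.

```c
/* Float inputs, double intermediate: the 24b x 24b product is exact in a
   double, so only the final conversion back to float can round. */
float fnmadd_via_double(float a, float b, float c) {
    return (float)((double)c - (double)a * (double)b);
}

/* All-double: the inputs are never rounded to float, so b is a closer
   approximation of 3.14159265 and the result differs again. */
double fnmadd_all_double(double a, double b, double c) {
    return c - a * b;
}
```

For the inputs in this thread, fnmadd_via_double(3183.0f, 3.14159265f, 10000.0f) prints as 0.310305 under %f, the same as the fused result, while fnmadd_all_double(3183.0, 3.14159265, 10000.0) prints as 0.310595.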

gabest · ‎01-15-2009

True, it was using double precision. I just thought it would still be different from FMA, due to the way it is calculated.