Intel® ISA Extensions

SSE and multiplication

Frédéric_D_
Beginner

Hi,

I'm doing some tests to compare the code generated by gcc 4.8.3 on CentOS (i7-5600U) with and without SSE optimizations.

Here is my code:

#include <stdio.h>
#include <stdint.h>

/* Union to look at the IEEE-754 single-precision bit fields
 * (bit-field order matches a little-endian layout). */
typedef union float_t
{
    uint32_t i;
    float f;
    struct {
        uint32_t m:23,   /* mantissa */
                 e: 8,   /* biased exponent */
                 s: 1;   /* sign */
    } p;
} float_t;

int main()
{
    float_t a[4] = {};
    float_t b[4] = {};
    float_t c;

    a[0].f = 15.31;
    a[1].f = 641.11;
    a[2].f = 366.86;
    a[3].f = 403.68;

    b[0].f = 117.00;
    b[1].f = 859.38;
    b[2].f = 868.77;
    b[3].f = 974.59;

    c.f  = a[0].f * b[0].f;
    c.f += a[1].f * b[1].f;
    c.f += a[2].f * b[2].f;
    c.f += a[3].f * b[3].f;

    printf ("%f 0x%08x %d:%x:%x\n", c.f, c.i, c.p.s, c.p.m, c.p.e);

    return 0;
}

With gcc -mavx test.c -o test, I get this result: 1264887.875000 0x499a67bf 0:1a67bf:93

With gcc -mno-sse test.c -o test, I get this result: 0.000000 0x499a67be 0:1a67be:93

I'd like to know why the two mantissas are different (1a67bf != 1a67be).

For the first build, the core of the generated asm is:

  4005a0:	c5 fa 10 4d f0       	vmovss -0x10(%rbp),%xmm1
  4005a5:	c5 fa 10 45 e0       	vmovss -0x20(%rbp),%xmm0
  4005aa:	c5 f2 59 c0          	vmulss %xmm0,%xmm1,%xmm0
  4005ae:	c5 fa 11 45 d0       	vmovss %xmm0,-0x30(%rbp)

whereas, for the second, it is:

  4005a0:	d9 45 f0             	flds   -0x10(%rbp)
  4005a3:	d9 45 e0             	flds   -0x20(%rbp)
  4005a6:	de c9                	fmulp  %st,%st(1)
  4005a8:	d9 5d dc             	fstps  -0x24(%rbp)

 

Thanks,

Frédéric

 

MarkC_Intel
Moderator

Unlike SSE and AVX, which use 32 bits to represent single-precision floats (and 64 bits for double precision), the x87 instructions use a longer internal precision in their registers: 80 bits. Values are carried between your steps in registers even though they are also written to memory, so the larger x87 internal precision can result in numerical differences. This is a well-recognized situation.
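
As an illustration, here is a minimal sketch of the two rounding behaviours in plain C (not code from this thread, and assuming a normal x86-64 build where float arithmetic is done in single precision and long double maps to the 80-bit x87 extended type). The first loop rounds every product and every running sum to 32 bits, like the SSE/AVX scalar code, while the second keeps each statement's intermediates in extended precision and rounds to 32 bits only at the store, roughly what the x87 sequence does:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    float a[4] = {15.31f, 641.11f, 366.86f, 403.68f};
    float b[4] = {117.00f, 859.38f, 868.77f, 974.59f};
    int i;

    /* SSE/AVX style: every product and every sum is rounded to 32 bits. */
    float c_sse = 0.0f;
    for (i = 0; i < 4; i++)
        c_sse += a[i] * b[i];

    /* x87 style: product and sum evaluated in 80-bit extended precision,
       rounded to 32 bits only when stored back into the float. */
    float c_x87 = 0.0f;
    for (i = 0; i < 4; i++)
        c_x87 = (float)((long double)a[i] * b[i] + c_x87);

    uint32_t bits_sse, bits_x87;
    memcpy(&bits_sse, &c_sse, sizeof bits_sse);
    memcpy(&bits_x87, &c_x87, sizeof bits_x87);
    printf("sse-style 0x%08x  x87-style 0x%08x\n", bits_sse, bits_x87);
    return 0;
}

Whether the low mantissa bit actually differs depends on the operands, but the extra rounding of each product in the SSE path is where a one-ulp difference like 1a67bf vs 1a67be can come from.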

TimP
Honored Contributor III

Intel compilers default to setting the x87 precision control to 53-bit mode, but neither Linux nor gcc does that, so you will see extra precision when x87 (i486-style) code is generated. Supposedly only a minority of developers still use that.
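
For reference, here is a minimal sketch (the helper name is made up for illustration) of how that 53-bit precision-control mode can be selected on Linux/glibc through the x87 control word; it only affects x87 arithmetic, SSE/AVX instructions are untouched:

#include <fpu_control.h>

/* Set the x87 precision-control field to 53-bit (double), the mode the
   Intel compilers select by default according to the reply above. */
static void set_x87_double_precision(void)
{
    fpu_control_t cw;
    _FPU_GETCW(cw);                           /* read the current control word */
    cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE; /* clear the PC bits, select 53-bit */
    _FPU_SETCW(cw);                           /* write it back */
}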

Frédéric_D_
Beginner

Thanks for these explanations.

jimdempseyatthecove
Honored Contributor III

Also, this sequence:

    c.f  = a[0].f * b[0].f;
    c.f += a[1].f * b[1].f;
    c.f += a[2].f * b[2].f;
    c.f += a[3].f * b[3].f;

could have been performed by using the FPU register stack to hold the accumulating result c.f until the final store, thus improving the accuracy:

flds   -0x10(%rbp)
flds   -0x20(%rbp)
fmulp  %st,%st(1)
flds   -0x30(%rbp)
flds   -0x40(%rbp)
fmulp  %st,%st(1)
faddp  %st,%st(1)
flds   -0x50(%rbp)
flds   -0x60(%rbp)
fmulp  %st,%st(1)
faddp  %st,%st(1)
flds   -0x70(%rbp)
flds   -0x80(%rbp)
fmulp  %st,%st(1)
faddp  %st,%st(1)
fstps  -0x24(%rbp)

Something like the above. Note that there is only one store. The newer AVX-capable processors also have fused multiply-add (FMA) instructions, which give you similar functionality; however, after each fused multiply-add, the result is rounded to the precision of the storage (32 or 64 bits). Therefore the x87 FPU instructions do provide additional accuracy over SSE/AVX for instruction sequences that accumulate partial results within registers.
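
As an illustration, here is a minimal sketch of the same dot product written with the scalar FMA intrinsic (not code from this thread; it assumes FMA-capable hardware and gcc -mfma). Each _mm_fmadd_ss rounds its fused multiply-add result to 32 bits, so the individual products are never rounded separately, but the running sum is still a 32-bit float between steps, unlike the x87 stack version above:

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    float a[4] = {15.31f, 641.11f, 366.86f, 403.68f};
    float b[4] = {117.00f, 859.38f, 868.77f, 974.59f};
    int i;

    __m128 acc = _mm_setzero_ps();      /* running sum kept in the low element */
    for (i = 0; i < 4; i++)
        acc = _mm_fmadd_ss(_mm_set_ss(a[i]), _mm_set_ss(b[i]), acc);

    printf("%f\n", _mm_cvtss_f32(acc));
    return 0;
}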

Jim Dempsey
