Intel® ISA Extensions
Use hardware-based isolation and memory encryption to provide more code protection in your solutions.

SSE and multiplication

Frédéric_D_
Beginner
697 Views

Hi,

I'm doing some tests to compare the code generated by gcc 4.8.3 on a CentOS i7-5600U core while using sse optimizations.

Here is my code :

#include <stdio.h>
#include <stdint.h>

typedef union float_t
{
    uint32_t i;
    float f;
    struct {
        uint32_t m:23,
                 e: 8,
                 s: 1;
    } p;
} float_t;

int main()
{
    float_t a[4] = {};
    float_t b[4] = {};
    float_t c;

    a[0].f = 15.31;
    a[1].f = 641.11;
    a[2].f = 366.86;
    a[3].f = 403.68;

    b[0].f = 117.00;
    b[1].f = 859.38;
    b[2].f = 868.77;
    b[3].f = 974.59;

    c.f  = a[0].f * b[0].f;
    c.f += a[1].f * b[1].f;
    c.f += a[2].f * b[2].f;
    c.f += a[3].f * b[3].f;

    printf ("%f 0x%08x %d:%x:%x\n", c.f, c.i, c.p.s, c.p.m, c.p.e);

    return 0;
}

With the flags gcc -mavx test.c -o test, I've got this result : 1264887.875000 0x499a67bf 0:1a67bf:93

With the flags gcc -mno-sse test.c -o test, I've got this result : 0.000000 0x499a67be 0:1a67be:93

I'd like to know why the 2 mantissa are different (1a67bf != 1a67be) ?

For the first one, the generated asm is based on :

  4005a0:	c5 fa 10 4d f0       	vmovss -0x10(%rbp),%xmm1
  4005a5:	c5 fa 10 45 e0       	vmovss -0x20(%rbp),%xmm0
  4005aa:	c5 f2 59 c0          	vmulss %xmm0,%xmm1,%xmm0
  4005ae:	c5 fa 11 45 d0       	vmovss %xmm0,-0x30(%rbp)

whereas, for the second, it's :

  4005a0:	d9 45 f0             	flds   -0x10(%rbp)
  4005a3:	d9 45 e0             	flds   -0x20(%rbp)
  4005a6:	de c9                	fmulp  %st,%st(1)
  4005a8:	d9 5d dc             	fstps  -0x24(%rbp)

 

Thanks,

Frédéric

 

0 Kudos
4 Replies
MarkC_Intel
Moderator
697 Views

Unlike SSE and AVX which use 32b to represent single precision floats (and 64b for double precision floats), the x87 instructions use longer internal precision in registers -- 80b.. Values are conveyed in registers between your steps even though they are also written to memory.  So the larger x87 internal precision can result in numerical differences. This is a well recognized situation. 

0 Kudos
TimP
Honored Contributor III
697 Views

Intel compilers default to setting 53 bit precision mode, but neither Linux nor gcc do that. So you will see extra precision in i486 mode. Supposedly only a minority of developers still use that.

0 Kudos
Frédéric_D_
Beginner
697 Views

Thanks for these explanations.

0 Kudos
jimdempseyatthecove
Honored Contributor III
697 Views

Also,

    c.f  = a[0].f * b[0].f;
    c.f += a[1].f * b[1].f;
    c.f += a[2].f * b[2].f;
    c.f += a[3].f * b[3].f;

Could have been performed by using the FPU register stack to hold the accumulation of the result c.f until the final store. Thus improving the accuracy.

flds   -0x10(%rbp)
flds   -0x20(%rbp)
fmulp  %st,%st(1)
flds   -0x30(%rbp)
flds   -0x40(%rbp)
fmulp  %st,%st(1)
faddp  %st,%st(1)
flds   -0x50(%rbp)
flds   -0x60(%rbp)
fmulp  %st,%st(1)
faddp  %st,%st(1)
flds   -0x70(%rbp)
flds   -0x80(%rbp)
fmulp  %st,%st(1)
fadd  %st,%st(1)
fstps  -0x24(%rbp)

Something like the above. Note, there is only one store. The newer AVX instructions have a Fused Multiply and Add, which gives you similar functionality, however after each multiply and add, the result is rounded to the precision (64/32 bits) of the storage. Therefore the FPU instructions do provide additional accuracy over SSE/AVX for instruction sequences that accumulate partial results within registers.

Jim Dempsey

0 Kudos
Reply