Intel® ISA Extensions

SSE and multiplication

Frédéric_D_
Beginner

Hi,

I'm doing some tests to compare the code generated by gcc 4.8.3 on CentOS (Core i7-5600U) with and without SSE optimizations.

Here is my code:

#include <stdio.h>
#include <stdint.h>

/* Overlay a float with its raw bits and its IEEE-754 fields. */
typedef union float_t
{
    uint32_t i;          /* raw bit pattern */
    float f;
    struct {             /* bit-field order assumes the little-endian x86 layout */
        uint32_t m:23,   /* mantissa */
                 e: 8,   /* exponent */
                 s: 1;   /* sign */
    } p;
} float_t;

int main()
{
    float_t a[4] = {};
    float_t b[4] = {};
    float_t c;

    a[0].f = 15.31;
    a[1].f = 641.11;
    a[2].f = 366.86;
    a[3].f = 403.68;

    b[0].f = 117.00;
    b[1].f = 859.38;
    b[2].f = 868.77;
    b[3].f = 974.59;

    c.f  = a[0].f * b[0].f;
    c.f += a[1].f * b[1].f;
    c.f += a[2].f * b[2].f;
    c.f += a[3].f * b[3].f;

    printf ("%f 0x%08x %d:%x:%x\n", c.f, c.i, c.p.s, c.p.m, c.p.e);

    return 0;
}

Compiling with gcc -mavx test.c -o test, I get this result: 1264887.875000 0x499a67bf 0:1a67bf:93

Compiling with gcc -mno-sse test.c -o test, I get this result: 0.000000 0x499a67be 0:1a67be:93

I'd like to know why the two mantissas are different (1a67bf != 1a67be).

For the first one, the generated asm is based on:

  4005a0:	c5 fa 10 4d f0       	vmovss -0x10(%rbp),%xmm1
  4005a5:	c5 fa 10 45 e0       	vmovss -0x20(%rbp),%xmm0
  4005aa:	c5 f2 59 c0          	vmulss %xmm0,%xmm1,%xmm0
  4005ae:	c5 fa 11 45 d0       	vmovss %xmm0,-0x30(%rbp)

whereas, for the second, it is:

  4005a0:	d9 45 f0             	flds   -0x10(%rbp)
  4005a3:	d9 45 e0             	flds   -0x20(%rbp)
  4005a6:	de c9                	fmulp  %st,%st(1)
  4005a8:	d9 5d dc             	fstps  -0x24(%rbp)

 

Thanks,

Frédéric

 

4 Replies
MarkC_Intel
Moderator

Unlike SSE and AVX, which use 32 bits to represent single-precision floats (and 64 bits for double-precision floats), the x87 instructions use a longer internal precision in registers: 80 bits. Values are carried in registers between your steps even though they are also written to memory, so the larger x87 internal precision can result in numerical differences. This is a well-recognized situation.
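To make that concrete, here is a minimal sketch of the two rounding behaviours, assuming an x86 Linux build where long double is the 80-bit x87 extended type (the variable names are only illustrative). The sse_like accumulator rounds to float after the multiply and again after the add, like the scalar SSE/AVX code; the x87_like accumulator keeps the product and the sum in extended precision and rounds to float only at the store, mirroring the flds/fmulp/faddp/fstps sequence above:

#include <stdio.h>
#include <string.h>

/* Return the raw bit pattern of a float. */
static unsigned bits(float f) { unsigned u; memcpy(&u, &f, sizeof u); return u; }

int main()
{
    float a[4] = {15.31f, 641.11f, 366.86f, 403.68f};
    float b[4] = {117.00f, 859.38f, 868.77f, 974.59f};
    float sse_like = 0.0f, x87_like = 0.0f;
    int i;

    for (i = 0; i < 4; i++) {
        /* SSE/AVX scalar code: the product is rounded to float,
           then the sum is rounded to float again. */
        sse_like += a[i] * b[i];

        /* x87 code: product and sum are evaluated with the extended
           significand, then the store rounds once to float
           (assumes long double is the 80-bit x87 type). */
        x87_like = (float)((long double)a[i] * b[i] + x87_like);
    }

    printf("SSE-like: %f 0x%08x\n", sse_like, bits(sse_like));
    printf("x87-like: %f 0x%08x\n", x87_like, bits(x87_like));
    return 0;
}

Built without -ffast-math, the two accumulators can differ in the last bit of the mantissa, which is the kind of one-ulp discrepancy shown in the question.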

TimP
Honored Contributor III

Intel compilers default to setting 53-bit precision mode, but neither Linux nor gcc does that, so you will see extra precision in i486 (x87-only) mode. Supposedly only a minority of developers still use that.
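If matching rounding is wanted without changing the instruction set, the x87 precision-control field can also be narrowed at run time. A minimal sketch, assuming x86 Linux with glibc's <fpu_control.h>:

#include <fpu_control.h>

/* Set the x87 precision-control field to 24-bit (single) significands,
   so x87 multiplies and adds round like SSE scalar float operations.
   Only the significand width changes; the 80-bit registers keep their
   wider exponent range, so results near overflow/underflow can still
   differ. */
static void set_x87_single_precision(void)
{
    fpu_control_t cw;

    _FPU_GETCW(cw);
    cw = (cw & ~_FPU_EXTENDED) | _FPU_SINGLE;  /* _FPU_EXTENDED masks the PC bits */
    _FPU_SETCW(cw);
}

gcc can do the equivalent at program startup with -mpc32, or -mpc64 for the 53-bit mode mentioned above.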

Frédéric_D_
Beginner

Thanks for these explanations.

jimdempseyatthecove
Honored Contributor III

Also,

    c.f  = a[0].f * b[0].f;
    c.f += a[1].f * b[1].f;
    c.f += a[2].f * b[2].f;
    c.f += a[3].f * b[3].f;

could have been performed by using the FPU register stack to hold the accumulating result c.f until the final store, thus improving the accuracy:

flds   -0x10(%rbp)
flds   -0x20(%rbp)
fmulp  %st,%st(1)
flds   -0x30(%rbp)
flds   -0x40(%rbp)
fmulp  %st,%st(1)
faddp  %st,%st(1)
flds   -0x50(%rbp)
flds   -0x60(%rbp)
fmulp  %st,%st(1)
faddp  %st,%st(1)
flds   -0x70(%rbp)
flds   -0x80(%rbp)
fmulp  %st,%st(1)
faddp  %st,%st(1)
fstps  -0x24(%rbp)

Something like the above. Note that there is only one store. Newer processors also provide fused multiply-add (FMA) instructions alongside AVX, which give you similar functionality; however, after each fused multiply-add the result is rounded to the precision of the storage type (64 or 32 bits). Therefore the x87 FPU instructions do provide additional accuracy over SSE/AVX for instruction sequences that accumulate partial results within registers.
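For comparison, a minimal sketch of the same dot product using the FMA form, assuming an FMA-capable processor such as the i7-5600U and a build with gcc -mfma. Each _mm_fmadd_ss() keeps the 48-bit product exact and rounds only once per step, but the running sum is still rounded to float each iteration, unlike an x87 accumulation held at 80 bits until the final fstps:

#include <immintrin.h>
#include <stdio.h>

int main()
{
    float a[4] = {15.31f, 641.11f, 366.86f, 403.68f};
    float b[4] = {117.00f, 859.38f, 868.77f, 974.59f};
    __m128 acc = _mm_setzero_ps();
    int i;

    for (i = 0; i < 4; i++) {
        /* low lane: acc = a[i] * b[i] + acc, with a single rounding */
        acc = _mm_fmadd_ss(_mm_load_ss(&a[i]), _mm_load_ss(&b[i]), acc);
    }

    printf("%f\n", _mm_cvtss_f32(acc));
    return 0;
}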

Jim Dempsey
