Hi,
I'm running some tests on CentOS with an Intel Core i7-5600U to compare the code generated by gcc 4.8.3 with and without SSE optimizations.
Here is my code:

#include <stdio.h>
#include <stdint.h>

typedef union float_t {
    uint32_t i;
    float f;
    struct {
        uint32_t m:23, e:8, s:1;
    } p;
} float_t;

int main() {
    float_t a[4] = {};
    float_t b[4] = {};
    float_t c;

    a[0].f = 15.31;  a[1].f = 641.11; a[2].f = 366.86; a[3].f = 403.68;
    b[0].f = 117.00; b[1].f = 859.38; b[2].f = 868.77; b[3].f = 974.59;

    c.f  = a[0].f * b[0].f;
    c.f += a[1].f * b[1].f;
    c.f += a[2].f * b[2].f;
    c.f += a[3].f * b[3].f;

    printf("%f 0x%08x %d:%x:%x\n", c.f, c.i, c.p.s, c.p.m, c.p.e);
    return 0;
}
With gcc -mavx test.c -o test, I get this result: 1264887.875000 0x499a67bf 0:1a67bf:93
With gcc -mno-sse test.c -o test, I get this result: 0.000000 0x499a67be 0:1a67be:93
I'd like to know why the two mantissas are different (1a67bf != 1a67be).
For the first build, the relevant generated asm is:
4005a0: c5 fa 10 4d f0    vmovss -0x10(%rbp),%xmm1
4005a5: c5 fa 10 45 e0    vmovss -0x20(%rbp),%xmm0
4005aa: c5 f2 59 c0       vmulss %xmm0,%xmm1,%xmm0
4005ae: c5 fa 11 45 d0    vmovss %xmm0,-0x30(%rbp)
whereas, for the second, it is:
4005a0: d9 45 f0          flds  -0x10(%rbp)
4005a3: d9 45 e0          flds  -0x20(%rbp)
4005a6: de c9             fmulp %st,%st(1)
4005a8: d9 5d dc          fstps -0x24(%rbp)
Thanks,
Frédéric
- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
Unlike SSE and AVX, which use 32 bits to represent single-precision floats (and 64 bits for double-precision floats), the x87 instructions use a longer internal precision in registers: 80 bits. Values are carried between your steps in registers even though they are also written to memory, so the larger x87 internal precision can lead to numerical differences. This is a well-recognized situation.
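You can see the same effect in portable C with a minimal sketch like the one below (it does not reproduce the exact double-rounding path of your two binaries, but it shows how widening the intermediate precision can flip the last mantissa bit). Accumulating in long double approximates keeping the running value in an 80-bit x87 register, while accumulating in float rounds to 32 bits after every step, as the scalar SSE/AVX code does (assuming the default x86-64 -mfpmath=sse).

#include <stdio.h>
#include <string.h>

int main(void) {
    float a[4] = {15.31f, 641.11f, 366.86f, 403.68f};
    float b[4] = {117.00f, 859.38f, 868.77f, 974.59f};
    long double wide = 0.0L;   /* x87-like: intermediates kept wide          */
    float narrow = 0.0f;       /* SSE-like: rounded to float at each step    */
    unsigned int wbits, nbits;
    float wres;
    int i;

    for (i = 0; i < 4; i++) {
        wide += (long double)a[i] * b[i];  /* product and sum kept in 80 bits */
        narrow += a[i] * b[i];             /* product and sum rounded to float */
    }
    wres = (float)wide;                    /* single rounding at the very end */

    memcpy(&wbits, &wres, sizeof wbits);
    memcpy(&nbits, &narrow, sizeof nbits);
    printf("wide   %f 0x%08x\n", wres, wbits);
    printf("narrow %f 0x%08x\n", narrow, nbits);
    return 0;
}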
Intel compilers default to setting the x87 precision control to 53-bit mode, but neither Linux nor gcc does that, so you will see the extra precision in i486 (x87) mode. Supposedly only a minority of developers still use that.
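If you want the gcc-built x87 code to behave like the Intel default, the precision-control field can be set from the program itself. On Linux/glibc a minimal sketch looks like this (it changes only the x87 significand width, not the exponent range, and has no effect on SSE/AVX math):

#include <fpu_control.h>

/* Switch the x87 FPU from its default 64-bit (extended) significand to
   53-bit (double) precision, which is what the Intel compilers set up
   at program start.  Use _FPU_SINGLE instead for 24-bit precision. */
static void set_x87_double_precision(void)
{
    fpu_control_t cw;
    _FPU_GETCW(cw);                            /* read the control word       */
    cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE;  /* clear PC bits, select 53-bit */
    _FPU_SETCW(cw);                            /* write it back               */
}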
Thanks for these explanations.
Also, the accumulation

c.f  = a[0].f * b[0].f;
c.f += a[1].f * b[1].f;
c.f += a[2].f * b[2].f;
c.f += a[3].f * b[3].f;

could have been performed by keeping the running result c.f on the FPU register stack until the final store, which would improve the accuracy.
flds  -0x10(%rbp)
flds  -0x20(%rbp)
fmulp %st,%st(1)
flds  -0x30(%rbp)
flds  -0x40(%rbp)
fmulp %st,%st(1)
faddp %st,%st(1)
flds  -0x50(%rbp)
flds  -0x60(%rbp)
fmulp %st,%st(1)
faddp %st,%st(1)
flds  -0x70(%rbp)
flds  -0x80(%rbp)
fmulp %st,%st(1)
faddp %st,%st(1)
fstps -0x24(%rbp)
Something like the above. Note that there is only one store. The newer AVX extensions include fused multiply-add (FMA) instructions, which give you similar functionality; however, after each fused multiply-add the result is rounded to the precision of the storage type (32 or 64 bits). Therefore the x87 FPU instructions do provide additional accuracy over SSE/AVX for instruction sequences that accumulate partial results within registers.
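As a rough illustration of that difference in C, here is a sketch using C99's fmaf(), assuming it maps to a real fused multiply-add (e.g. when built with -mfma on an FMA-capable core): each fmaf() rounds a[i]*b[i] + c only once, but still to 32 bits, whereas the x87 sequence above keeps the accumulator at 80 bits until the final fstps.

#include <math.h>   /* fmaf(): fused multiply-add with a single rounding */

/* Dot product accumulated with fused multiply-adds: the product is not
   rounded separately, but the running sum is rounded to float after
   every step. */
static float dot4_fma(const float a[4], const float b[4])
{
    float c = 0.0f;
    int i;
    for (i = 0; i < 4; i++)
        c = fmaf(a[i], b[i], c);
    return c;
}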
Jim Dempsey