Quote:jimdempseyatthecove

unrue · ‎04-10-2015

Dear Intel developers,

I'm using Intel Intrinsics with Intel 13.1.3. I have some rounding error by using _mm_mul_ps routine. This is the piece of code:

       __m128 denom_tmp = _mm_setzero_ps();
       __m128  sample_tmp = sample_r1;       
       __m128  sample_tmp2;
        for(j = 0; j < tot_iter; j+=UNROLL_STEP) {
            sample_tmp2 = _mm_mul_ps(*sample_tmp, *sample_tmp);
            denom_tmp = _mm_add_ps(sample_tmp2, denom_tmp);
            sample_tmp++; 
        }

sample_r1 is filled before that code. My goal is simply to do a square of each elements of sample_r1.These are the first four values of sample_r1: (-570,911 -236,614 36,6958 27,5522).

After the loop above, the first four results of denom_tmp are: (325940 55986,2 1346,58 759,121).

The correct results done by hand and compared with scalar version is:( 325939,369921 55986,184996 1346,58173764, 759,12372484). So, I have an incredible rounding error. Maybe I'm using the multiplication routine in a wrong manner? Can I increase the precision in some way?

Thanks in advance.

EDIT: I'm in debug mode.

jimdempseyatthecove · ‎04-10-2015

Although the precision of a float is 24 bits (~7.225 digits), the least significant bit may contain a round-off error depending on initial value. The product of two such floats (with round-off error) effectively squares the error (doubles the number of bits in error). Meaning the first product of two floats initialized with round off error might be accurate to ~6.8 digits. The results you show are within this margin of error.

If you require higher precision, then use double precision floats in C and ..._pd in the intrinsics.

Jim Dempsey

unrue · ‎04-10-2015

jimdempseyatthecove wrote:

Although the precision of a float is 24 bits (~7.225 digits), the least significant bit may contain a round-off error depending on initial value. The product of two such floats (with round-off error) effectively squares the error (doubles the number of bits in error). Meaning the first product of two floats initialized with round off error might be accurate to ~6.8 digits. The results you show are within this margin of error.

If you require higher precision, then use double precision floats in C and ..._pd in the intrinsics.

Jim Dempsey

Hi Jim, thanks for your reply. But why I see two different behaviour among serial and parallel with Intrinsics? The precision should be the same, but the results are quite different. Starting from the same float value, the square serial and parallel should be the same, or not?

jimdempseyatthecove · ‎04-10-2015

>>But why I see two different behaviour among serial and parallel with Intrinsics?

You have not shown enough code for me to see what you are doing in the parallel loop.

Your #1 listed code is in error or is paraphrased incompletely as you are using sample_tmp as both a pointer and an __m128.

If your actual code is performing a sum of the squares of an array, the serial section will run the array and the sums from first element of the array to the last element of the array and accumulating the round off errors in the order from first to last in the entire array.

When you perform this in parallel (possibly with using a reduction operator on the sum), each thread takes a section of the array. The first section should compute the same result (with the same error) up to the end point of that section. The thread taking the second section, will start with a temporary accumulator that is 0.0f (or __m128 vector of these if you coded that way). The first result of the second slice is adding to 0.0 and not to a partial sum containing some error). The second slice of the array has the same issue. Then at the end of the parallel region where the reduction occurs, the reduction of partial sums with errors cannot produce (well not likely to produce) the same error as that of the serial section.

You can have a similar issue where you have an array of whole numbers (without error) but where the sum (or sum of squares) exceeds the precision of the type (float or double). In this situation, error can be reduced by performing the operation from the smallest numbers to the largest numbers. It takes time to sort the input, but you can make collections of big, medium and small numbers or perform the summation using a filter in one pass (e.g. filter into say 10 buckets).

Jim Dempsey

Jeff_Arnold · ‎04-11-2015

Obligatory reference: What Every Computer Scientist Should Know About Floating-Point Arithmetic. If you haven't read it, you should read it now. If you have read it, you may want to read it again.

There are several reasons your result is not 325939.369921 (which is the exact value of the square of 570.911).

You are not actually squaring 570.911. There is no single-precision floating-point number whose value is 570.911. The single-precision floating-point number closest to 570.911 has the value 570.9110107421875 which is greater than 570.911 by approximately 10^(-5). Thus, you are actually computing the square of (570.911 + 0.0000107421875).
There is no single-precision floating-point number whose value is 325939.369921. Thus it is not possible to generate the result 325939.369921 using single-precision floating-point operations and operands. The single-precision floating-point value closest to 325939.369921 is 325939.375.

As to your "result" of 325940 for this case, the result you see may be influenced by the debugging environment (e.g., perhaps it is rounding to 5 significant digits for some reason).

Another check would be to examine the values in hex so that you see the exact bits in the floating-point result. For single-precision, 325939.375 should have the value 0x489f266c.

The program

#include <stdio.h>
int main(int argc, char* argv[] ) {
    float a = 570.911;
    float b = a*a;
    printf("a: %f  b: %f (%#x)\n",a,b,*(unsigned int*)(&b));
    return 0;
}

will give 325939.375 and 0x489f266c for b as expected. (MacPorts gcc 4.9.2_1 on Mac OS X 10.10.3.)

Similarly

#include <stdio.h>
#include <xmmintrin.h>
int main(int argc, char* argv[] ) {
    const float a[4] __attribute__ ((aligned (16))) = {-570.911f, -236.614f, 36.6958f, 27.5522f};
    __m128 b;
    __m128 c;
    float d[4] __attribute__ ((aligned (16)));
    b = _mm_load_ps( a );
    c = _mm_mul_ps(b,b);
    _mm_store_ps(d,c);
    printf("Result: %f %f %f %f\n",d[0],d[1],d[2],d[3]);
    return 0;
}

gives

Result: 325939.375000 55986.183594 1346.581787 759.123718

[Edited on 2015-04-13 to correct the Mac OS X version to 10.10.3.]

unrue · ‎04-13-2015

jimdempseyatthecove wrote:

>>But why I see two different behaviour among serial and parallel with Intrinsics?

You have not shown enough code for me to see what you are doing in the parallel loop.

Your #1 listed code is in error or is paraphrased incompletely as you are using sample_tmp as both a pointer and an __m128.

If your actual code is performing a sum of the squares of an array, the serial section will run the array and the sums from first element of the array to the last element of the array and accumulating the round off errors in the order from first to last in the entire array.

When you perform this in parallel (possibly with using a reduction operator on the sum), each thread takes a section of the array. The first section should compute the same result (with the same error) up to the end point of that section. The thread taking the second section, will start with a temporary accumulator that is 0.0f (or __m128 vector of these if you coded that way). The first result of the second slice is adding to 0.0 and not to a partial sum containing some error). The second slice of the array has the same issue. Then at the end of the parallel region where the reduction occurs, the reduction of partial sums with errors cannot produce (well not likely to produce) the same error as that of the serial section.

You can have a similar issue where you have an array of whole numbers (without error) but where the sum (or sum of squares) exceeds the precision of the type (float or double). In this situation, error can be reduced by performing the operation from the smallest numbers to the largest numbers. It takes time to sort the input, but you can make collections of big, medium and small numbers or perform the summation using a filter in one pass (e.g. filter into say 10 buckets).

Jim Dempsey

Hi Jim, thanks for your reply. sample_tmp is a pointer, so I did a little mistake when I posted my code and the code should be correct. In effect I'm not doing operations in the same order as the serial, so maybe the best way is try to do the same order of the serial, if possible.

Rounding error with Intrinsics?