Solved: Why does Cilk Plus __sec_reduce_add differ from a sequential addition?

Vladimir_F_2 · ‎11-20-2015

I'm finding a significant error for certain input values when I use a particular offset to add values. Due to the nature of the algorithm I'm considering (not relevant to the question), I need to start a sum value with the first element of an array. I was able to extract a minimal working example, which is appended below.

Even though there are less than 50 floating point additions involved, I'm seeing error far beyond that order of magnitude of ulps, and I don't understand what I am doing wrong. See the following code snippet, along with the invocation and what my results are:

// icc -std=c++11 -O3 commutative.cpp -o commutative -lm -lcilkrts -fp-model precise -fp-model source
// echo 0x1.2a30cp+36 -0x1.04b3b6p+35 0x1.92204p+32 -0x1.b9319p+34 0x1.1a88f4p+33 -0x1.39b0f2p+35 0x1.744a68p+31 -0x1.73385p+34 0x1.6ad72ep+34 -0x1.8a076cp+34 0x1.dc6578p+34 0x1.8f075ep+35 0x1.d1bd22p+34 0x1.ff211p+30 0x1.43b8cap+36 -0x1.1efd2p+32 0x1.1e646cp+35 0x1.c91e16p+34 -0x1.b1e3eap+35 -0x1.52ce1p+32 -0x1.cf6f2cp+34 -0x1.c7007ep+35 -0x1.efe1fp+35 0x1.90143cp+34 | ./commutative

#include <cstdlib>
#include <cstdio>
#include <cmath>
#include <vector>

#include <cilk/cilk.h>

int main(int argc, char**) {
    if (argc != 1) {
        printf("Usage: echo 1 2 3 | commutative\n");
        return 1;
    }

    std::vector<float> vec;
    while(!feof(stdin)) {
        float f;
        if (scanf("%a", &f) > 0) {
            vec.push_back(f);
        }
    }

    auto sz = vec.size();
    printf("size %d\n", sz);

    float* arr = &vec.front();
    float sum1 = arr[0];
    sum1 += __sec_reduce_add(arr[1:sz-1] + arr[1:sz-1]);

    float sum2 = arr[0];
    for (int i = 1; i < sz; ++i) sum2 += arr + arr;

    if (sum1 != sum2) printf("ERROR! unequal sums (%f vs %f), rel. err %f\n",
                             sum1, sum2, (float) std::abs((sum1 - sum2) / sum2));
    printf("sum1 %a\n", sum1);
    printf("sum2 %a\n", sum2);
    return sum1 != sum2;
}

size 24
ERROR! unequal sums (-13099008.000000 vs -13139968.000000), rel. err 0.003117
sum1 -0x1.8fcp+23
sum2 -0x1.91p+23

$ icc --version
icc (ICC) 15.0.2 20150121
Copyright (C) 1985-2015 Intel Corporation.  All rights reserved.

jimdempseyatthecove · ‎11-27-2015

In your #2 post, second example, you are summing numbers that differ in magnitude by as much as 10^18.

When summing (and other types of operations), the order of operations can greatly affect the outcome of the results.

For the highest precision of output, sum the numbers from lowest magnitude to highest magnitude. Or at least, sum numbers of similar magnitude, then sum the sums (lowest magnitude to highest magnitude). Doing otherwise may/will result in lesser magnitude numbers not being included at all.

What you have now is comparing one arbitrary low accuracy summation against a different arbitrary low accuracy summation.

Jim Dempsey

View solution in original post

Vladimir_F_2 · ‎11-22-2015

Here's a particularly egregious example (20% relative error, with only about 150 operations!). The attached file, multiple.c, when compiled into an executable called multiple, finds a dot product that's biased by the first value. For input, in hex floating point, of the form "x0 x1 x2 x3 x4 w0 w1 w2 w3 w4" in standard in, with a dimension of 5, it computes the dot product of odd indices of x with even indices of w, leaving off the last element. Then it adds in the x0 bias. In other words, the desired result would be x0 + x1 * w0 + x3 * w2.

Here's a "correct" run:

$ echo '1 2 3 4 5 6 7 8' | ./multiple 4
size 4
dots 0x1.6p+3 0x1.6p+3
relative err 0.000000

Here's something that came up in practice:

$ echo '-0x1.336f26p+18 0x1.21f34p+15 0x1.7ed2f6p+19 -0x1.ccde3ep+19 0x1.458488p+17 -0x1.1f30f2p+19 0x1.13b5dcp+18 -0x1.db376p+16 0x1.6844c8p+20 0x1.82c654p+18 0x1.82073ep+19 0x1.996914p+19 -0x1.7aa87ap+18 -0x1.58a008p+16 -0x1.2fe9eap+18 -0x1.7234ap+17 0x1.c6dc5cp+17 0x1.bb6dbcp+19 -0x1.288178p+19 0x1.35aefp+16 -0x1.533036p+18 0x1.52871p+16 0x1.b0953ap+19 -0x1.09c72cp+18 -0x1.319d86p+19 0x1.f559f8p+18 -0x1.0264ccp+20 -0x1.a5644cp+18 -0x1.d04bd8p+17 -0x1.f5f148p+16 0x1.147abcp+20 -0x1.72714ep+19 -0x1.6e79f8p+17 0x1.589a7ap+17 -0x1.f4bf78p+15 -0x1.23736cp+20 -0x1.461342p+18 0x1.2062ep+17 -0x1.a0cddap+19 0x1.59d574p+18 -0x1.b4aae2p+16 -0x1.0d996cp+19 -0x1.3f3214p+19 0x1.26672p+16 0x1.78fd4ep+20 -0x1.d675d2p+19 -0x1.3bde2ap+19  0x1.ec1dap-2 0x1.c0ff94p-1 -0x1.137f88p-1 0x1.af8f94p-1 -0x1.fedb4cp-1 -0x1.119528p-4 -0x1.af0986p-2 -0x1.d06e38p-1 0x1.2fb542p-1 -0x1.9c3224p-1 0x1.fb6e82p-1 0x1.10f8cp-3 0x1.700898p-2 0x1.ddc9dap-1 -0x1.4a8fbcp-1 0x1.86fd6cp-1 -0x1.f5bd8ap-1 -0x1.97eed2p-3 -0x1.2f62dep-2 -0x1.e90332p-1 0x1.63f044p-1 -0x1.7009a8p-1 0x1.edcee8p-1 0x1.0e893ep-2 0x1.dac486p-3 0x1.f20d6cp-1 -0x1.7bb9d4p-1 0x1.577118p-1 -0x1.e3abaap-1 -0x1.4fe5cp-2 -0x1.54a47cp-3 -0x1.f8de34p-1 0x1.91d13ap-1 -0x1.3d4fdcp-1 0x1.d75f6cp-1 0x1.8fc234p-2 0x1.99fdfp-4 0x1.fd6dbcp-1 -0x1.a61d34p-1 0x1.21c3d2p-1 -0x1.c8f83ap-1 -0x1.cdd598p-2 -0x1.11bc4ap-5 -0x1.ffb6cep-1 0x1.b8868ep-1 -0x1.04ec78p-1 0x1.700898p-2' | ./multiple 47
size 47
dots 0x1.dp-1 0x1.2p+0
relative err 0.194444

Here's another very bad instance:

$ echo '1 2 3 4 5 6 7 8 9 0' | ./multiple 5
size 5
dots 0x1.ap+3 0x1.68p+5
relative err 0.711111

All experiments are run on an Intel(R) Xeon(R) E5-2670 0. OS:

$ lsb_release -a
LSB Version:    :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID: CentOS
Description:    CentOS release 6.6 (Final)
Release:        6.6
Codename:       Final

TimP · ‎11-24-2015

Particularly when you set options to bring icc toward standard compliance, you can't expect the compiler to depart much from standard order. I don't know that the reducer is required not to optimize when you set various -fp-model options, but even if so:

In your first example, you force the order of additions to differ. In order to have a chance to see matching results, try e.g.

`32`	`float` `sum2 = 0;`

`33`	`for` `(int` `i = 1; i < sz; ++i) sum2 += arr + arr;` `sum2 += arr[0];`

Vladimir_F_2 · ‎11-24-2015

Thanks for the response, Tim.

Are you looking at the most current version of the post? I think in both the __sec_reduce_add and the sequential add I initialize both sums to 0. In multiple.c, you are right to point out the wrong ordering. I've corrected it (attached), but I'm still seeing 4% relative error on the in-line example, which is again too much for the number of operations considered.

jimdempseyatthecove · ‎11-27-2015

In your #2 post, second example, you are summing numbers that differ in magnitude by as much as 10^18.

When summing (and other types of operations), the order of operations can greatly affect the outcome of the results.

For the highest precision of output, sum the numbers from lowest magnitude to highest magnitude. Or at least, sum numbers of similar magnitude, then sum the sums (lowest magnitude to highest magnitude). Doing otherwise may/will result in lesser magnitude numbers not being included at all.

What you have now is comparing one arbitrary low accuracy summation against a different arbitrary low accuracy summation.

Jim Dempsey

Vladimir_F_2 · ‎11-27-2015

Ah, you're right Jim - I didn't consider the fact that the high-magnitude numbers may cancel, which would greatly affect the results if the low-magnitude numbers were added in before vs. after (because, as you mention, adding them in before, we would lose that information).

To confirm, I replicated the example of multiple.c switching to double precision (which should be able to represent the 10^18 difference without accuracy loss), and this was indeed the case.

It seems that the highly divergent results are unavoidable with floats.

Thanks for taking the time to answer the question!