reducers under nested iterations

Yves_V_Intel · ‎03-31-2016

Hello, I am doing some SpMV-related work and exploring the use of CilkPlus. I had a question related to reducers that I could not find out myself reading the documentation. In short: is there a simple or performant way of declaring a logical set of reducers or a reducer 'holder' such that an inner cilk_for uses its own reducer hyperobject, without the outer cilk_for having to share the same hyperobject over all of its strands.

Consider the following C99-CilkPlus loop code, which calculates a sparse binary matrix-vector multiplications for eight vectors simultaneously:

  cilk_for(int row = 0; row < A->nrow; row++) {
    double tmp[8] = {0};
    for (int i = row_ptr[row]; i < row_ptr[row + 1]; i++) {
      int col = cols << 3;
      tmp[:] += X[col:8];
    }
    int r = row << 3;
    Y = tmp[:];
  }

Now consider the case where one would want to futher parallelize the inner loop. Now, even in OpenMP4.0 I get into trouble here, as I cannot declare an array in #pragma omp reduction(+:tmp). Similarly, I cannot use the built-in opadd reducer in CilkPlus as double[8] is not a simple numeric datatype, so I create my custom add reducer for a double[8] vector:

void reduce_vecsum(void* reducer, void* left, void* right) {
  vdp8* vl = (vdp8*)left;
  vdp8* vr = (vdp8*)right;
  (*vl)[:] += (*vr)[:];
}
void identity_vecsum(void* reducer, void* v) {
  (*(vdp8*)v)[:] = 0;
}
CILK_C_DECLARE_REDUCER(vdp8) cilk_c_vecsum_reducer =
  CILK_C_INIT_REDUCER(vdp8,
                      reduce_vecsum,
                      identity_vecsum,
                      __cilkrts_hyperobject_noop_destroy,
                      {0,0,0,0,0,0,0,0});
vdp8* vecsum_view() {
  return (vdp8*)REDUCER_VIEW(cilk_c_vecsum_reducer);
}

The big question is, considering the determinacy guarantees of reducers, would it be correct to do the following:

  CILK_C_REGISTER_REDUCER(cilk_c_vecsum_reducer); //runtime starts managing thread-local views, no need for manual tmp[]
  cilk_for (int row = 0; row < A->nrow; row++) {
    vdp8 *vsum;
    vsum = vecsum_view();
    (*vsum)[:] = 0;
    cilk_for (int i = row_ptr[row]; i < row_ptr[row + 1]; i++) {
      vdp8 *vtmp;
      int col = cols << 3; // multiplying with 8
      vtmp = vecsum_view(); // grab a local view, will reduce automatically on task/strand joins
      (*vtmp)[:] += X[col:8];
    }
    int r = row << 3;
    vsum = vecsum_view(); // grab the sum
    Y = (*vsum)[:];  // commit to the output vector
    // note: outer cilk_for will perform further reductions, although we do not need the result
  }
  CILK_C_UNREGISTER_REDUCER(cilk_c_vecsum_reducer);

My worry is that a steal of strands from the inner cilk_for might cause the sums of two different rows to become mingled. The secondary worry is the overhead of performing superfluous reductions of the reducer at the joins of the outer cilk_for loop. In other words, is the above correct and is there a better way of doing something similar?

note: I am using the above as an example, in reality the inner loop is complete overkill. However, I am working on a blocked version which does display the same nested loop structure, with the same need to reduce on the output vector.

Pablo_H_Intel · ‎03-31-2016

Congratulations on figuring out how to do reductions in Cilk Plus using C. The C++ interface is much simpler, but you have your reasons for using C and you have it right except you are theoretically reducing over the wrong scope. Your concern about the inner sums getting intermingled is unfounded -- the summing operation within the inner loop will accumulate only into the view visible within the outer loop. Since you are zeroing this view before you use it, and since you do not use the final value of the reduction on exit from the outer loop, your code should produce correct results. Your concern about superfluous reductions in the outer loop, however, is a real one. The theoretical notion of a view is that you should not apply any operation that is not part of the reduction. In your case, you are zeroing the view before the inner loop and then using the partial sum after the inner loop. Assignment is not part of the associative addition operation, and will result in a nonsense value at the end of the outer loop. However, Cilk Plus reducers explicitly allow you to depend on the partial sum -- we do it quite a bit ourselves -- but it is slightly unclean.

The preferred alternative is to define the reducer as a local variable inside the outer loop, as follows:

cilk_for (int row = 0; row < A->nrow; row++) {
  CILK_C_DECLARE_REDUCER(vdp8) cilk_c_vecsum_reducer =
    CILK_C_INIT_REDUCER(vdp8,
                        reduce_vecsum,
                        identity_vecsum,
                        __cilkrts_hyperobject_noop_destroy,
                        {0,0,0,0,0,0,0,0});
  CILK_C_REGISTER_REDUCER(cilk_c_vecsum_reducer);
  cilk_for (int i = row_ptr[row]; i < row_ptr[row + 1]; i++) {
    vdp8 *vtmp = (vdp8*) REDUCER_VIEW(cilk_c_vecsum_reducer);
    int col = cols << 3; // multiplying with 8
    (*vtmp)[:] += X[col:8];
  }
  int r = row << 3;
  vsum = (vdp8*) REDUCER_VIEW(cilk_c_vecsum_reducer); // grab the sum
  Y = (*vsum)[:];  // commit to the output vector
  CILK_C_UNREGISTER_REDUCER(cilk_c_vecsum_reducer);
}

Here, cilk_c_vecsum_reducer has the same scope and lifetime as the tmp variable did in the original code. You are doing a reduction only on the inner loop instead of over both loops. However, your version might still be faster because, although it does an unnecessary reduction on the join of the outer loop, it avoids the overhead of registering and unregistering the reducer each time through. If you benchmark both versions (and you should), I would guess that your version is the same or faster than mine, even if it's slightly "unclean." So everything I've written turns out to be a long way of saying that I think you have the best possible version already. Well done! Let us know how your benchmarks turn out.

Yves_V_Intel · ‎04-01-2016

Thanks, it appears declaring and registering reducers inside the scope is a tiny bit faster than my version on my current dataset. Note however that the nonzero elements in each row is several orders of magnitude smaller than the number of rows, which explains why the superfluous reduction overhead is larger than the reducer creation overhead.

Thank you, this was very insightful.