I have a quick question concerning the strategy that the Cilk Plus runtime system uses to divide up cilk_for iterations. I've seen a few examples showing how this works when the number of iterations is a power of 2, but how is it done in the general case? Does the grain size play a role here? Or is the number of iterations rounded up to the next power of 2, with some kind of work stealing kicking in for the workers that have nothing to do?
If anyone could give me some info on that, or direct me to a file/documentation that explains how this works, I would be very grateful!
Cheers,
Chris
The source is available, both in the GCC 4.7 "cilkplus" branch and at the Cilk website: http://software.intel.com/en-us/articles/download-intel-cilk-plus-source/. cilk_for is implemented in cilk-abi-cilk-for.cpp; the function you're looking for is cilk_for_recursive().
But to answer your question: it simply divides the range in half, spawning one half and calling the other. Simple integer division:
count_t mid = low + count / 2;
- Barry
Here's the full text of cilk_for_recursive():
/*
 * cilk_for_recursive
 *
 * Templatized function to implement the recursive divide-and-conquer
 * algorithm that's how we implement a cilk_for.
 *
 * low   - Low loop index we're considering in this portion of the algorithm
 * high  - High loop index we're considering in this portion of the algorithm
 * body  - lambda function for the cilk_for loop body
 * data  - data used by the lambda function
 * grain - grain size
 * loop_root_pedigree - __cilkrts_pedigree node we generated for the root of
 *         the cilk_for loop to flatten out the internal nodes
 */
template <typename count_t, typename F>
static void cilk_for_recursive(count_t low, count_t high,
                               F body, void *data, int grain,
                               __cilkrts_pedigree *loop_root_pedigree)
{
tail_recurse:
    count_t count = high - low;
    // Invariant: count > 0, grain >= 1
    if (count > grain)
    {
        // Invariant: count >= 2
        count_t mid = low + count / 2;
        _Cilk_spawn cilk_for_recursive(low, mid, body, data, grain,
                                       /* sf->worker, */ loop_root_pedigree);
        low = mid;
        goto tail_recurse;
    }

    // Call the cilk_for loop body lambda function passed in by the compiler to
    // execute one grain
    call_cilk_for_loop_body(low, high, body, data, loop_root_pedigree);
}
By default, the grain size is calculated as loop_range/8P, where "P" is the number of workers. This usually gives you enough "slackness" to spread the work around and compensate for varying amounts of work in each grain. You can override it using a pragma, but we find that in most cases you'll only slow things down if you play with the grain size, or at best tune for a particular number of workers. We've joked that the correct settings for most users are either "1" or "many."
- Barry
Yes, I've already read elsewhere that the grain size is calculated to be a good fit, so I won't really be playing around with that. Maybe just to see what happens!
I'm still wondering about a few things, though. I've written a small program that uses a reducer and a cilk_for loop with a nested for loop, and the regular version seems to run quite a bit faster than the parallelized version, even without the reducer (which produces wrong results anyway):
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <sys/time.h>
#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>
using namespace std;
int main()
{
cilk::reducer_opadd<int> sum;
srand(time(NULL));
struct timeval start_t, end_t;
gettimeofday(&start_t, 0);
cilk_for (int i = 0; i < 1000; i++)
{
for (int j = 0; j < i; j++)
sum += j;
}
gettimeofday(&end_t, 0);
double wall_time_in_ms = (end_t.tv_sec - start_t.tv_sec) * 1000.0 +
(end_t.tv_usec - start_t.tv_usec)/1000.0;
cout << "Parallel result: " << sum.get_value() << " in: " << wall_time_in_ms << "ms" << endl;
sum.set_value(0);
gettimeofday(&start_t, 0);
for (int i = 0; i < 1000; i++)
{
for (int j = 0; j < i; j++)
sum += j;
}
gettimeofday(&end_t, 0);
wall_time_in_ms = (end_t.tv_sec - start_t.tv_sec) * 1000.0 +
(end_t.tv_usec - start_t.tv_usec)/1000.0;
cout << "Actual result: " << sum.get_value() << " in: " << wall_time_in_ms << "ms" << endl;
return 0;
}
Am I generating too much overhead here? The only thing I can imagine is that there are too many iterations with too little work, which would drive the overhead up a lot, wouldn't it?
Cheers,
Chris
You got it. There's *VERY* little work to amortize the additional overhead of the cilk_for over.
And even if you cranked up the number of iterations (say, to 1000000 instead of 1000), you're still dealing with a very unbalanced set of work, so you won't see linear performance improvements.
- Barry
I'm getting identical results from your test application, both parallel and serial:
icc chris.cpp
./a.out
Parallel result: 1661677000 in: 10.296ms
Actual result: 1661677000 in: 0.087ms
What difference did you see? And which compiler are you using?
- Barry
I meant that I get wrong results when I don't use the reducer, which is to be expected. Sorry for the mix-up there.
Cheers,
Chris