Software Archive
Read-only legacy content

cilk_for's divide and conquer strategy

chrisfeist
Beginner
Hello,

I have a quick question concerning the strategy that the Cilk Plus runtime system uses to divide up cilk_for iterations. I saw a few examples showing how this works when the number of iterations is a power of 2, but how is it done in the general case? Does the grain size play a role here? Or is the number of iterations rounded up to the next power of 2, with some kind of work stealing kicking in for the workers that have nothing to do?

If anyone could give me some info on that, or direct me to a file/documentation that explains how this works, I would be very grateful!

Cheers,
Chris
7 Replies
Barry_T_Intel
Employee

The source is available, both in the GCC 4.7 "cilkplus" branch and at the Cilk website: http://software.intel.com/en-us/articles/download-intel-cilk-plus-source/. cilk_for is implemented in cilk-abi-cilk-for.cpp; the function you're looking for is cilk_for_recursive().

But to answer your question: it simply divides the range in half, spawning one half and calling the other. Simple integer division:

count_t mid = low + count / 2;

- Barry

BradleyKuszmaul
Beginner
Adding to Barry's response: the grain size does play a role. It determines when the recursion stops.
Barry_T_Intel
Employee

Here's the full text of cilk_for_recursive():

/*
 * cilk_for_recursive
 *
 * Templatized function to implement the recursive divide-and-conquer
 * algorithm that's how we implement a cilk_for.
 *
 * low   - Low loop index we're considering in this portion of the algorithm
 * high  - High loop index we're considering in this portion of the algorithm
 * body  - lambda function for the cilk_for loop body
 * data  - data used by the lambda function
 * grain - grain size
 * loop_root_pedigree - __cilkrts_pedigree node we generated for the root of
 *                      the cilk_for loop to flatten out the internal nodes
 */
template <typename count_t, typename F>
static void cilk_for_recursive(count_t low, count_t high,
                               F body, void *data, int grain,
                               __cilkrts_pedigree *loop_root_pedigree)
{
tail_recurse:
    count_t count = high - low;
    // Invariant: count > 0, grain >= 1
    if (count > grain)
    {
        // Invariant: count >= 2
        count_t mid = low + count / 2;
        _Cilk_spawn cilk_for_recursive(low, mid, body, data, grain,
                                       /* sf->worker, */ loop_root_pedigree);
        low = mid;
        goto tail_recurse;
    }

    // Call the cilk_for loop body lambda function passed in by the compiler to
    // execute one grain
    call_cilk_for_loop_body(low, high, body, data, loop_root_pedigree);
}

By default, grainsize is calculated as loop_range/8P, where "P" is the number of workers. This usually gives you enough "slackness" to spread the work around and compensate for varying amounts of work in each grain. You can override it using a pragma, but we find that in most cases you'll only slow things down if you play with grainsize, or at best tune for a particular number of workers. We've joked that the correct settings for most users are either "1" or "many."

- Barry

chrisfeist
Beginner
Okay that makes sense. Thanks for the great responses!
Yes, I've already read elsewhere that the grainsize is calculated for a good fit, so I won't really be playing around with that. Maybe just to see what happens!

I'm still wondering about a few things though. I've written a small program that uses a reducer and a cilk_for loop with a nested for loop, and the regular version seems to run quite a bit faster than the parallelized version, even without the reducer (which gives wrong results anyway):

#include <iostream>
#include <cstdlib>
#include <ctime>
#include <sys/time.h>
#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>

using namespace std;

int main()
{
    cilk::reducer_opadd<long> sum;

    srand(time(NULL));
    struct timeval start_t, end_t;
    gettimeofday(&start_t, 0);

    cilk_for (int i = 0; i < 1000; i++)
    {
        for (int j = 0; j < i; j++)
            sum += j;
    }

    gettimeofday(&end_t, 0);
    double wall_time_in_ms = (end_t.tv_sec - start_t.tv_sec) * 1000.0 +
                             (end_t.tv_usec - start_t.tv_usec) / 1000.0;
    cout << "Parallel result: " << sum.get_value() << " in: " << wall_time_in_ms << "ms" << endl;

    sum.set_value(0);
    gettimeofday(&start_t, 0);

    for (int i = 0; i < 1000; i++)
    {
        for (int j = 0; j < i; j++)
            sum += j;
    }

    gettimeofday(&end_t, 0);
    wall_time_in_ms = (end_t.tv_sec - start_t.tv_sec) * 1000.0 +
                      (end_t.tv_usec - start_t.tv_usec) / 1000.0;
    cout << "Actual result: " << sum.get_value() << " in: " << wall_time_in_ms << "ms" << endl;

    return 0;
}

Am I generating too much overhead here? The only thing I can imagine is that there are too many iterations with too little work, which would drive the overhead up a lot, wouldn't it?

Cheers,
Chris
Barry_T_Intel
Employee

You got it. There's *VERY* little work to amortize the additional overhead of the cilk_for over.

And even if you cranked up the number of iterations (say, to 1000000 instead of 1000), you're still dealing with a very unbalanced set of work, so you won't see linear performance improvements.

- Barry

Barry_T_Intel
Employee
> even without the reducer (which results in wrong results anyways)

I'm getting identical results from your test application, both parallel and serial:

icc chris.cpp
./a.out
Parallel result: 1661677000 in: 10.296ms
Actual result: 1661677000 in: 0.087ms

What difference did you see? And which compiler are you using?

- Barry
chrisfeist
Beginner
Thanks for the quick reply.
I meant that I get wrong results when I don't use the reducers, which is to be expected. Sorry for the mix-up there.

Cheers,
Chris
