I have a quick question concerning the strategy that the Cilk Plus runtime system uses to divide up cilk_for iterations. I've seen a few examples showing how this works when the number of iterations is a power of 2, but how is it done in the general case? Does the grain size play a role here? Or is the number of iterations rounded up to the next power of 2, with some kind of work stealing kicking in for the workers that have nothing to do?
If anyone could give me some info on that, or direct me to a file/documentation that explains how this works, I would be very grateful!
Cheers,
Chris
The source is available, both in the GCC 4.7 "cilkplus" branch and at the Cilk website: http://software.intel.com/en-us/articles/download-intel-cilk-plus-source/. cilk_for is implemented in cilk-abi-cilk-for.cpp; the function you're looking for is cilk_for_recursive().
But to answer your question: it simply divides the range in half, spawning one half and calling the other. Simple integer division:
count_t mid = low + count / 2;
- Barry
Here's the full text of cilk_for_recursive():
/*
 * cilk_for_recursive
 *
 * Templatized function to implement the recursive divide-and-conquer
 * algorithm that's how we implement a cilk_for.
 *
 * low   - Low loop index we're considering in this portion of the algorithm
 * high  - High loop index we're considering in this portion of the algorithm
 * body  - lambda function for the cilk_for loop body
 * data  - data used by the lambda function
 * grain - grain size
 * loop_root_pedigree - __cilkrts_pedigree node we generated for the root of
 *         the cilk_for loop to flatten out the internal nodes
 */
template <typename count_t, typename F>
static void cilk_for_recursive(count_t low, count_t high,
                               F body, void *data, int grain,
                               __cilkrts_pedigree *loop_root_pedigree)
{
tail_recurse:
    count_t count = high - low;
    // Invariant: count > 0, grain >= 1
    if (count > grain)
    {
        // Invariant: count >= 2
        count_t mid = low + count / 2;
        _Cilk_spawn cilk_for_recursive(low, mid, body, data, grain,
                                       /* sf->worker, */ loop_root_pedigree);
        low = mid;
        goto tail_recurse;
    }

    // Call the cilk_for loop body lambda function passed in by the compiler to
    // execute one grain
    call_cilk_for_loop_body(low, high, body, data, loop_root_pedigree);
}
By default, the grain size is calculated as loop_range/8P, where "P" is the number of workers. This usually gives you enough "slackness" to spread the work around and compensate for varying amounts of work in each grain. You can override it using a pragma, but we find that in most cases you'll only slow things down if you play with the grain size, or at best tune for a particular number of workers. We've joked that the correct settings for most users are either "1" or "many."
- Barry
Yes, I've already read elsewhere that the grain size is calculated to be a good fit, so I won't really be playing around with that. Maybe just to see what happens!
I'm still wondering about a few things, though. I've written a small program that uses a reducer and a cilk_for loop with a nested for loop, and the regular version seems to run quite a bit faster than the parallelized version, even without the reducer (which produces wrong results anyway):
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <sys/time.h>
#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>
using namespace std;
int main()
{
cilk::reducer_opadd<int> sum;
srand(time(NULL));
struct timeval start_t, end_t;
gettimeofday(&start_t, 0);
cilk_for (int i = 0; i < 1000; i++)
{
for (int j = 0; j < i; j++)
sum += j;
}
gettimeofday(&end_t, 0);
double wall_time_in_ms = (end_t.tv_sec - start_t.tv_sec) * 1000.0 +
(end_t.tv_usec - start_t.tv_usec)/1000.0;
cout << "Parallel result: " << sum.get_value() << " in: " << wall_time_in_ms << "ms" << endl;
sum.set_value(0);
gettimeofday(&start_t, 0);
for (int i = 0; i < 1000; i++)
{
for (int j = 0; j < i; j++)
sum += j;
}
gettimeofday(&end_t, 0);
wall_time_in_ms = (end_t.tv_sec - start_t.tv_sec) * 1000.0 +
(end_t.tv_usec - start_t.tv_usec)/1000.0;
cout << "Actual result: " << sum.get_value() << " in: " << wall_time_in_ms << "ms" << endl;
return 0;
}
Am I generating too much overhead here? The only thing I can imagine is that there are too many iterations with too little work, which would drive the overhead up a lot, wouldn't it?
Cheers,
Chris
You got it. There's *VERY* little work to amortize the additional overhead of the cilk_for over.
And even if you cranked up the number of iterations (say, to 1000000 instead of 1000), you're still dealing with a very unbalanced set of work, so you won't see linear performance improvements.
- Barry
I'm getting identical results from your test application, both parallel and serial:
icc chris.cpp
./a.out
Parallel result: 1661677000 in: 10.296ms
Actual result: 1661677000 in: 0.087ms
What difference did you see? And which compiler are you using?
- Barry
I meant that I get wrong results when I don't use the reducer, which is to be expected. Sorry for the mix-up there.
Cheers,
Chris