
Cilk Plus Performance - Overhead

knikas
Beginner
Hi all, I downloaded and built the gcc47-cilkplus branch. I managed to build the library using binutils version 2.20.1, as I had some problems with versions 2.22 and 2.18. I then tried to experiment with the following code:
[bash]#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <cilk/cilk.h>

int fib (int n) { 
   if (n<2) return n; 
   else { 
          int x, y; 
#ifdef cilk 
          x = cilk_spawn fib (n-1); 
          y = cilk_spawn fib (n-2);
          cilk_sync;
#else 
          x = fib(n-1);
          y = fib(n-2);
#endif 
          return (x+y); 
 } } 

int main (int argc, char *argv[]) { 
   int n, result; struct timeval t1, t2; 
   double time; n = atoi(argv[1]); 
   gettimeofday(&t1, 0); 
#ifdef cilk 
   result = cilk_spawn fib(n); 
   cilk_sync; 
#else 
   result = fib(n); 
#endif 
   gettimeofday(&t2, 0); 
   time=(double)((t2.tv_sec-t1.tv_sec)*1000000+t2.tv_usec-t1.tv_usec)/1000000; 
   printf ("Result: %d\n", result); 
   printf("Time : %.4f\n", time); 
   return 0; 
}[/bash]
For the compilation I export LD_LIBRARY_PATH and LIBRARY_PATH and I run:
[bash]gcc -Wall -Wno-unused-variable -O3 -o fib-serial-gcc47-O3 fib.c -ldl -lcilkrts
gcc -Wall -Wno-unused-variable -O3 -Dcilk -o fib-cilk fib.c -ldl -lcilkrts[/bash]
for the serial and the parallel versions. I limit CILK_NWORKERS to 1 and I get the following results for fib(40):
serial: Time : 1.0540
cilk:   Time : 27.3902
Even for 8 workers (I have an 8-core machine) cilk reports 3.4457... still slower than the serial version! Something seems to be wrong. Maybe something to do with code optimizations? Or do you think something went wrong while building gcc?
Kind regards,
Kostis
jimdempseyatthecove
Honored Contributor III
>> but also exploit nested parallelism in both sides

This is correct, but the call to g() is not via cilk_spawn, and thus reduces the number of spawns by ~1/2.
While this is good programming practice, it is not following the problem premise of a fully parallel recursive algorithm.

In the original TBB example, they incorporated a cutoff threshold. If the (remaining) n was below this threshold, the serial Fib function was called. This too, while good programming practice, circumvents the fully parallel nature of the implementation.

The point being made here is that "fully parallel" isn't always "fully best"... and this is part of the learning experience of this problem.

As an extended example of avoiding unnecessary parallelism in a recursive parallel implementation (I'll wait for your "that's cheating"), consider the QuickThread implementation. QuickThread has a function call that will return the number of idle threads. The following code uses this function to determine whether it is advantageous to spawn the recursion or call it directly.

[cpp]// support task for qtParallelFib( int64_t n );
// returns result indirectly through a pointer argument as opposed to a return value
// (tasks have void return)
//
// NOTE: the forum software ate the angle-bracketed parts of this listing.
// The cutoff test, the idle-thread test, and the spawning branch are
// reconstructed below; CutOff, SerialFib, qt_IdleThreadCount and
// parallel_invoke stand in for the actual QuickThread names.
void qtParallelFibLambda( int64_t n, int64_t* result)
{
   // see if below cutoff
   if( n < CutOff ) {
      *result = SerialFib(n);
      return;
   }
   int64_t x, y;
   if( qt_IdleThreadCount() > 0 ) {
      // idle threads available - spawn one leg of the recursion
      parallel_invoke(
         [&](){ qtParallelFibLambda(n-2, &y); },
         [&](){ qtParallelFibLambda(n-1, &x); } );
   } else {
      // no waiting threads - recurse directly on this thread
      qtParallelFibLambda(n-2, &y);
      qtParallelFibLambda(n-1, &x);
   }
   // join legs
   *result = x + y;
}

// Shell function to return result as return value
// as opposed to indirect pointer to result location.
int64_t qtParallelFibLambda( int64_t n )
{
   int64_t result;
   qtParallelFibLambda(n, &result);
   return result;
}
[/cpp]
Is this better programming? Yes. Is it "fully parallel"? No.

Jim Dempsey

BradleyKuszmaul
Beginner
Consider these two code fragments:

fun1() {
    spawn f();
    spawn g();
    sync;
}

fun2() {
    spawn f();
    g();
    sync;
}
In this case fun1 and fun2 offer the same amount of parallelism. There is no sense in which fun2 is less parallel than fun1. It doesn't make sense to say "half the calls are serial", because serialization is a property that relates two different parts of the code. In both of these examples, f() and g() execute in parallel. To say that g() is serial, you have to say what it is serial with. In the second example, g() is serial with the sync, but the sync serializes on the previously spawned functions, so g() is also serial with the sync in the first example.
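In Cilk Plus terms, the one-spawn version looks like the following (a minimal sketch with the same structure as fun2(), not code taken from any post in this thread):

[cpp]#include <cilk/cilk.h>

// Only the first recursive call is spawned; the second runs in the
// continuation, exactly like g() in fun2() above, yet the two calls
// can still execute in parallel.
int fib(int n)
{
    if (n < 2) return n;
    int x = cilk_spawn fib(n - 1);  // may be stolen by another worker
    int y = fib(n - 2);             // runs in the continuation
    cilk_sync;                      // join before combining the legs
    return x + y;
}[/cpp]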
Read Jim Sukha's post carefully. He knows what he's talking about.
BradleyKuszmaul
Beginner
I have also seen performance anomalies in recent versions of Cilk. I have a big code, and when I enable Cilk it slows everything down. This is preventing me from shipping product that uses Cilk.

I also have a "PR" problem with the Cilk scheduler. When there is nothing to steal, the worker threads happily chew up all the otherwise unused CPU cycles. My beta customers don't like that, since they think it will impact other processes running on the same server. It would be really helpful if some way could be found to make it so that when users run "top" they don't see my software chewing up all available CPU cycles.
jimdempseyatthecove
Honored Contributor III
Bradley,

>>I have a big code, and when I enable Cilk it slows everything down.

I suspect that you are a new user to parallel programming and thus using Cilk improperly.

There is an adage: "vector inner, parallel outer".

It looks like you may have profiled the serial version of your big code, found the hot-spot (innermost) routines, and parallelized those using Cilk Plus.

The better practice is to see if you can vectorize the innermost routines, then parallelize the code at the uppermost appropriate level that calls those routines. Following this practice reduces the frequency of spawn-join. Spawn-join has a cost, and it is your job as the programmer to assure the total cost is appropriate for the total benefit.
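A hypothetical sketch of that restructuring (the names scale_rows, a, b, c, rows and cols are illustrative only, not from any code in this thread): the parallelism is introduced once at the outer loop with cilk_for, while the innermost loop stays a plain loop the compiler can vectorize.

[cpp]#include <cilk/cilk.h>

void scale_rows(float *a, const float *b, float c, int rows, int cols)
{
    cilk_for (int i = 0; i < rows; ++i)      // parallelism at the outermost level
    {
        for (int j = 0; j < cols; ++j)       // vectorizable innermost loop
            a[i * cols + j] = b[i * cols + j] * c;
    }
}[/cpp]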

>>"PR" problem with the Cilk scheduler

Most tasking schedulers (Cilk Plus, TBB, OpenMP, QuickThread) employ a technique based on the assumption that when you terminate a parallel region you will very soon enter the next one. This is true for nearly all applications. With this assumption in place, the better strategy is, to use an analogy, to keep the race car at the starting line with the engine revving, clutch depressed, and in gear. When the start flag drops, pop the clutch and take off immediately. The alternative, when momentarily done, is to park the car in the garage and turn the engine off, in which case the time to start the engine and drive back to the starting line is additional latency. Additionally, by parking the thread you may be yielding some of the time remaining on your rental agreement with the O/S; when it is time to start the impending parallel region, one or more of your threads may have to wait until some other task/application finishes its time slice (some systems are preemptive time-sliced, others are not).
The trade-off here is: CPU cycles (fuel/engine wear) vs latency (time to go).

If you read the programmer's guide for the parallel programming tool, you may find that most of these provide a tuning parameter that specifies how long a thread (race car) waits (revving its engine) before deciding to suspend (park in the garage). In OpenMP this is KMP_BLOCKTIME. I will leave it as an exercise for you to discover whether this tuning parameter is available in Cilk Plus (you never know what additional useful information you may discover while digging around).
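For the OpenMP case that looks like the following (a minimal sketch; kmp_set_blocktime() is an Intel OpenMP runtime extension declared in Intel's omp.h, and the same setting can be made through the KMP_BLOCKTIME environment variable):

[cpp]#include <omp.h>

int main(void)
{
    // Tell idle worker threads to suspend (park in the garage) almost
    // immediately instead of spin-waiting between parallel regions.
    kmp_set_blocktime(0);          // same effect as: export KMP_BLOCKTIME=0

    #pragma omp parallel
    {
        // ... parallel work ...
    }

    // long serial section follows: workers are parked rather than burning CPU
    return 0;
}[/cpp]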

Before you go and set the block time to 0, examine your current parallelization attempt to see if you can push the parallelism to outer levels, as I suspect that this is the root of your problem.

Jim Dempsey
BradleyKuszmaul
Beginner
>> I suspect that you are a new user to parallel programming and thus using Cilk improperly.
I was one of the original authors of Cilk in 1992. I've written many high-performance codes using both MIT Cilk and Intel Cilk, and most of the codes are available on the web. For example, in addition to my academic papers on the web, you can find a bunch of my Cilk code on the intel forums (by looking through my postings).

I might find your Cilk-related comments more helpful if they related to some specific experience you have with Cilk, rather than extrapolating from some other parallel programming environment such as OpenMP. I'm sure you know more than I do about OpenMP, but it's not Cilk. For example, Cilk does not require a parameter corresponding to KMP_BLOCKTIME.

I will leave it as an exercise for yourself to learn something about Cilk (and please share your experiences).

I could be using Cilk improperly. I did check to make sure that I have released the parking brake, however.

-Bradley
jimdempseyatthecove
Honored Contributor III
You are correct in that I am not a big-time Cilk Plus programmer; I use other paradigms.

The OpenMP KMP_BLOCKTIME is not a task-call parameter; rather, it is a global setting that specifies a "spin wait" time before a thread suspends itself. There is also a runtime function call you can use on some implementations to dynamically vary the spin-wait time.

When your code is:

serial
parallel
short serial
parallel
short serial
...

You may find it beneficial to have a block time that is longer than the runtime of the short serial sections. "Block time" is a misnomer; it really means "maximum wait time".

When your code is

serial
parallel
long serial (seconds)
parallel
long serial
...

Then you may find a zero wait time beneficial, as this conserves power, may speed up the serial code on "Turbo Boost" type processors, and will release processing capability to other threads/applications.

I would be very surprised to find that Cilk Plus does not have a tuning means for the wait time (block time).

-----------

At the top of this forum thread, the recursive Fibonacci series was used as an example program for recursive parallel programming.

While this is an excellent simple example, it is also one of the worst uses of parallel programming and should never be used as a measure of what to expect parallel programming to do for your application. Any reasonable parallel programming guide will state somewhere something to the effect of: "The cost of invoking a parallel region is not zero. This cost must be factored into the decision of the use and placement of the parallel regions."

In the Fib series, the computation is "x = y + z" (plus the inordinate overhead of redundantly calling itself).
In the cases where you can get the parallel version to run faster, you still suffer from the "redundantly calling itself" syndrome.

The following is a faster serial recursive method that removes redundancy:


[cpp]__int64 SerialFibMedium( int n, __int64& Prior )
{
    if( n == 2 )
    {
        Prior = 1;
        return 1;
    }
    __int64 PriorPrior;
    Prior = SerialFibMedium( n-1, PriorPrior );
    return Prior + PriorPrior;
}

__int64 SerialFibMedium( int n )
{
    if( n < 2 )
        return n;
    __int64 Prior;
    return SerialFibMedium( n, Prior );
}[/cpp]

The initial call is made to the second (single-argument) routine.
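For example (an illustrative caller only, not part of the original post):

[cpp]#include <stdio.h>

int main()
{
    // the single-argument shell handles n < 2 and seeds Prior internally
    __int64 f40 = SerialFibMedium(40);
    printf("fib(40) = %lld\n", (long long)f40);   // prints 102334155
    return 0;
}[/cpp]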

While you can still "parallelize" this, the cost of one spawn will exceed the runtime of the serial code even for an "n" large enough to overflow the __int64 result.

The point I am making is: do not parallelize code when it is not appropriate.

Jim Dempsey

BradleyKuszmaul
Beginner
> I would be very surprised to find that Cilk Plus does not have a tuning means for the wait time (block time).
MIT Cilk did not have such a tuning knob. I've never seen one in Intel Cilk. Surprise!
I'm not really interested in reading about OpenMP on a Cilk forum.
BradleyKuszmaul
Beginner
I think that one can learn a lot about a parallel programming environment by examining how it does on this kind of simple code.
When I use icc to take this measurement on my laptop (which has two real cores, with HT enabled, so Cilk thinks there are 4 cores), here is what I see:

[bash]$ icc fib.c -o fib -O2 -cilk-serialize
$ ./fib 40
Result: 102334155
Time : 0.5314
$ icc fib.c -o fib -O2
$ ./fib 40
Result: 102334155
Time : 4.2463
$ CILK_NWORKERS=1 ./fib 40
Result: 102334155
Time : 10.9888
[/bash]
So there's about a 20x overhead for using Cilk for this example on ICC. And on my laptop I see a 2.58-fold speedup, which is pretty good considering that it's really only two cores.

By using Jim's optimization (don't spawn the second call to fib), the 1-worker speed goes to 6.44s and the 4-worker speed to 2.26s.

One can argue that this isn't the right way to write a fib program, but that's not the point. The point is that we can see what the overhead of the cilk mechanism is. A spawn costs about the same as 20 function calls.

If you do care about making fib run faster, then even the linear-time fib is too slow. There is a log-time solution based on matrix multiplication that runs in only O(log n) multiplies. Consider this 2x2 matrix, call it M.

(1 1)
(1 0)

Take it to the power of n, and you'll see that fib(n) appears in the upper right corner. M^n can be calculated in log time by repeated squaring. For example M^40 can be calculated by
M^32 * M^8
where M^8 can be calculated as (((M^2)^2)^2)
and M^32 can be calculated as (((M^8)^2)^2)

The number of bits in the output grows quickly (roughly speaking, fib(n) requires about 2n/3 bits), and the last multiply may dominate the cost. It probably takes something like O(n log n log log n) bit operations to do it using a fast multiplication algorithm. (I haven't worked this out carefully, so it might be off by a little.) The fast multiply can probably run in parallel, so you can make it faster on a multiprocessor.
Incidentally, the linear-time algorithm performs n additions, and the length of the number grows by a constant number of bits at each step, so it takes O(n^2) bit operations.
So if I have a machine with 64 gigabits of storage (like my laptop), then I can hope to compute fib(10,000,000,000) without too much trouble. (I don't think I want to wait around for the answer from the linear-number-of-adds algorithm.)
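Here is a minimal serial sketch of the repeated-squaring idea (using plain 64-bit integers, so it is only exact up to fib(93); the huge inputs discussed above would need an arbitrary-precision multiply):

[cpp]#include <stdio.h>
#include <stdint.h>

// 2x2 matrix M = [[1,1],[1,0]]; M^n = [[fib(n+1),fib(n)],[fib(n),fib(n-1)]].
typedef struct { uint64_t a, b, c, d; } mat2;

static mat2 mat_mul(mat2 x, mat2 y)
{
    mat2 r = { x.a*y.a + x.b*y.c, x.a*y.b + x.b*y.d,
               x.c*y.a + x.d*y.c, x.c*y.b + x.d*y.d };
    return r;
}

// fib(n) by repeated squaring: O(log n) 2x2 matrix multiplies.
static uint64_t fib_fast(unsigned n)
{
    mat2 result = { 1, 0, 0, 1 };          // identity
    mat2 base   = { 1, 1, 1, 0 };          // the matrix M
    while (n) {
        if (n & 1) result = mat_mul(result, base);
        base = mat_mul(base, base);
        n >>= 1;
    }
    return result.b;                        // the upper-right entry is fib(n)
}

int main(void)
{
    printf("%llu\n", (unsigned long long)fib_fast(40));   // prints 102334155
    return 0;
}[/cpp]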
This would be a great project for a budding Cilk programmer to do, and to write up.
-Bradley



jimdempseyatthecove
Honored Contributor III
Bradley,

Good summary.

>>A spawn costs about the same as 20 function calls

In this example it would be 20x redundant recursions.

To measure the cost ratio relative to a plain function call:

[cpp]// assure this function is not inlined
void fooInc(int& foo)
{
    ++foo;
}

int foo = 0;

// time this
for(int i=0; i != yourLimit; ++i)
    fooInc(foo);

// time this
// ignore the fact that foo is not atomically incremented
// we are only interested in the spawn overhead
for(int i=0; i < yourLimit; ++i)
{
    cilk_spawn fooInc(foo);
    cilk_sync;
}

// time this
// ignore the fact that foo is not atomically incremented
// we are only interested in the spawn overhead
for(int i=0; i < yourLimit; i += 2)
{
    cilk_spawn fooInc(foo);
    cilk_spawn fooInc(foo);
    cilk_sync;
}[/cpp]

... and so on, increasing the number of spawns in the loop body (and the loop increment) up to the number of spawns of interest.

This will give you a representative idea of the overhead.

-----------------------------

Additional test that may be of interest

replace fooInc with

[cpp]// compile such that not inlined
void stall(__int64 n)
{
    __int64 end = _rdtsc() + n;
    while(_rdtsc() < end)
        continue;
}
//...

__int64 n;
for(n=0; n < 10000; ++n)
{
    __int64 begin = _rdtsc();
    cilk_for( int i=0; i < 10000; ++i)
        stall(n);
    __int64 elapse = _rdtsc() - begin;
    if(elapse < n)
        break;
}
cout << "break even = " << n << endl;[/cpp]

The break-even point will vary depending on the number of worker threads.

Jim Dempsey

BradleyKuszmaul
Beginner
Please go ahead and measure the things you find interesting. And if you want to report it for some other tool too, that might be interesting as well. I measured what I thought was interesting.
The reason we program in parallel is for performance, and the exponential-time fib program shows us something about how to get high performance on parallel codes. Knowing that the spawn overhead is about 20 function calls tells me that I should coarsen the base case of the recursion so that something on the order of 20 function calls is done without Cilk. Then the spawn cost will match the non-spawn cost, and I'll be within a factor of two of optimal. I might want to make 200 function calls in the base case to get the spawn cost to be less than 5% of the runtime.
The idea is to understand how you might transform your code to get high performance. If your recursive program already performs at least ten microseconds' worth of work at the leaves of the computation, then you don't have to do anything. If it's a lot less, like 1 ns worth of work, you will be paying a lot for the overhead of spawn.
So let's see what happens. The code is shown far below. I cleaned it up so that it compiles with -W -Wall without emitting any warnings. It takes two arguments: N, so that we compute fib(N), and M, the size of the base case (that is, values below M are computed with the serial code). It's still exponential-time code (that's not the optimization we're trying to understand). The game here is to show how, by coarsening the base case, one can avoid the spawn overhead.
So here's the result for computing fib(40) [The original wallpaper is shown far below.]
[bash]Serial code: 0.7s
 one worker:
   base=2:  6.65s
   base=3:  4.22s
   base=4:  2.85s
   base=5:  2.07s
   base=6:  1.49s
   base=7:  1.19s
   base=8:  0.96s
   base=9:  0.85s
   base=10: 0.77s
   base=11: 0.72s
   base=12: 0.69s
   base=13: 0.66s
   base=14: 0.66s

 two workers:
   base=14: 0.35s
 four workers:
   base=14: 0.25s

Speedup two workers: 2.00
Speedup four workers: 2.79[/bash]

Since fib(8) is 21, that's where the coarsened base case comes close to the 20x overhead, so the call fib1(8) costs about the same as a spawn. At that point we might expect the spawns to take half the time and the work to take the other half. In fact, at that point the spawn overhead is about 1/3 of the runtime, which is not too bad for an estimate. By the time we get to fib(14) for the base case (fib(14) = 377), we've essentially removed all the Cilk overhead. If we go from that point to running on two cores, we get a speedup of 2.00 over the serial code, and with 4 workers (again on my laptop with 2 cores running HT), the speedup is 2.79.
Here's the code:

[cpp]#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <cilk/cilk.h>

static int basecasesize=2;

static int fib1 (int n) __attribute__((const));
static int fib1 (int n) {
    if (n<2) return n;
    else {
#pragma warning (disable:981)
        return fib1(n-1)+fib1(n-2);
#pragma warning (default:981)
    }
}

/* From here on the original listing was truncated by the forum software;
   fib() is reconstructed from the surrounding description: below the
   base-case size it falls through to the serial fib1(), otherwise it
   spawns one leg and computes the other in the continuation. */
static int fib (int n) {
    if (n < basecasesize) return fib1(n);
    else {
        int x = cilk_spawn fib(n-1);
        int y = fib(n-2);
        cilk_sync;
        return x+y;
    }
}

/* main() was also lost; it is the same as the main() at the top of this
   thread, except that it additionally reads the base-case size:
   basecasesize = atoi(argv[2]); */
[/cpp]
And here's the wallpaper. Note that with icc, you can compile the cilk code with all the cilk features nullified by using -cilk-serialize.
[bash]$ icc fib.c -o fib-s -O2 -Wall -W -cilk-serialize
$ ./fib-s 40 2
Result: 102334155
Time : 0.6965
$ icc fib.c -o fib -O2 -Wall -W
$ CILK_NWORKERS=1 ./fib 40 2
Result: 102334155
Time : 6.6448
$ CILK_NWORKERS=1 ./fib 40 3
Result: 102334155
Time : 4.2223
$ CILK_NWORKERS=1 ./fib 40 4
Result: 102334155
Time : 2.8528
$ CILK_NWORKERS=1 ./fib 40 5
Result: 102334155
Time : 2.0746
$ CILK_NWORKERS=1 ./fib 40 6
Result: 102334155
Time : 1.4851
$ CILK_NWORKERS=1 ./fib 40 7
Result: 102334155
Time : 1.1948
$ CILK_NWORKERS=1 ./fib 40 8
Result: 102334155
Time : 0.9635
$ CILK_NWORKERS=1 ./fib 40 9
Result: 102334155
Time : 0.8447
$ CILK_NWORKERS=1 ./fib 40 10
Result: 102334155
Time : 0.7685
$ CILK_NWORKERS=1 ./fib 40 11
Result: 102334155
Time : 0.7251
$ CILK_NWORKERS=1 ./fib 40 12
Result: 102334155
Time : 0.6851
$ CILK_NWORKERS=1 ./fib 40 13
Result: 102334155
Time : 0.6585
$ CILK_NWORKERS=1 ./fib 40 14
Result: 102334155
Time : 0.6569
$ CILK_NWORKERS=2 ./fib 40 14
Result: 102334155
Time : 0.3479
$ CILK_NWORKERS=4 ./fib 40 14
Result: 102334155
Time : 0.2489
[/bash]


