Software Archive
Read-only legacy content

Cilk Plus Performance - Overhead

knikas
Beginner
3,242 Views
Hi all, I downloaded and built the gcc47-cilkplus branch. I managed to build the library using binutils version 2.20.1, as I had some problems with versions 2.22 and 2.18. I then tried to experiment with the following code:
[cpp]#include <stdio.h>
#include <syst/time.h>

int fib (int n) {
   if (n<2) return n;
   else {
      int x, y;
#ifdef cilk
      x = cilk_spawn fib (n-1);
      y = cilk_spawn fib (n-2);
      cilk_sync;
#else
      x = fib(n-1);
      y = fib(n-2);
#endif
      return (x+y);
   }
}

int main (int argc, char *argv[]) { 
   int n, result;
   struct timeval t1, t2;
   double time;
   n = atoi(argv[1]);
   gettimeofday(&t1, 0); 
#ifdef cilk 
   result = cilk_spawn fib(n); 
   cilk_sync; 
#else 
   result = fib(n); 
#endif 
   gettimeofday(&t2, 0); 
   time=(double)((t2.tv_sec-t1.tv_sec)*1000000+t2.tv_usec-t1.tv_usec)/1000000; 
   printf ("Result: %d\n", result); 
   printf("Time : %.4f\n", time); 
   return 0; 
}[/cpp]
For the compilation I export LD_LIBRARY_PATH and LIBRARY_PATH and I run:
[bash]gcc -Wall -Wno-unused-variable -O3 -o fib-serial-gcc47-O3 fib.c -ldl -lcilkrts
gcc -Wall -Wno-unused-variable -O3 -Dcilk -o fib-cilk fib.c -ldl -lcilkrts[/bash]
for the serial and the parallel version. I limit CILK_NWORKERS to 1 and I get the following results for fib(40):
serial: Time : 1.0540
cilk: Time : 27.3902
Even for 8 workers (I have an 8-core machine) cilk reports 3.4457... still slower than the serial version! Something seems to be wrong. Maybe something to do with code optimizations? Or do you think something went wrong while building gcc?
Kind regards,
Kostis
30 Replies
Balaji_I_Intel
Employee
2,419 Views
Hello Kostis,
I cut and pasted your code, compiled it with the flags you used, and ran it for fib(40). My parallel fib was way faster than serial fib.

One thing I notice is that you are using #include <syst/time.h>, and my compiler flagged it as an error. I changed it to <sys/time.h>.

Do you have multiple or proprietary gettimeofday implementations installed on your machine?


Thanks,

Balaji V. Iyer.
Georg_Z_Intel
Employee
2,419 Views
Hello,

using this
[cpp]x = cilk_spawn fib (n-1);  // -> executed by X
// created thread Y that continues here
y = cilk_spawn fib (n-2); // -> executed by Y
// created thread Z that continues here
cilk_sync; // synchronize: wait for X, Y & Z to finish[/cpp]
...is very inefficient. What you're doing here is spawning two new threads (Y & Z) - thread X is the main thread entering the fib(...) function initially. Whilst threads X & Y are doing work, thread Z is just there to wait.

Better solution:
[cpp]x = cilk_spawn fib (n-1);
y = fib (n-2);
cilk_sync;[/cpp]

This should reduce your overhead.

Best regards,

Georg Zitzlsberger
knikas
Beginner
2,419 Views
Hi guys,
yes, I agree this is a better solution. Still, this is not the problem. I am consistently getting this strange overhead comparing the serial version to the parallel one (with 1 worker or more) for Fibonacci, for Floyd-Warshall, and other algorithms. There seems to be another problem, something to do with the library and perhaps the way I have built it.
The syst/time was just a typo while copying the code into the forum. I use the usual sys/time.h. As Balaji mentions, I would expect the parallel version with 1 worker to be maybe slightly slower than the serial version, and then get faster as the number of cores increases.
Kind regards,
Kostis
knikas
Beginner
2,419 Views
Following my previous comments, the compilation seems to ignore optimization flags. I compiled the code with the -O0, -O1, -O2 and -O3 flags for both the serial and the parallel version. Here are the results:
-O0 : serial --> Time : 2.6085 , parallel --> Time : 26.2700
-O1 : serial --> Time : 2.2422 , parallel --> Time : 27.1657
-O2 : serial --> Time : 1.9993 , parallel --> Time : 28.6697
-O3 : serial --> Time : 1.0544 , parallel --> Time : 27.4960
I remind you this is the time for the parallel version with 1 Cilk worker. So, while the optimization flags have an effect on the serial code, there seems to be none on the parallel code. Is this normal? Could this be a sign that something has gone wrong in the building process?
Kind regards,
Kostis
Balaji_I_Intel
Employee
2,419 Views
Hello Kostis,
When Cilk was implemented, we did not modify the existing GCC scheduler. Also, I have seen that the GCC instruction scheduler does not do an astonishing job of scheduling functions inside another function (a spawn helper is implemented that way). This is probably why you are not seeing much difference in time.

Thanks,

Balaji V. Iyer.
knikas
Beginner
2,419 Views
Hi Balaji, just to clarify...

When you run this code with CILK_NWORKERS=1, does it execute slower than the serial code on your machine? And if yes, how much slower? Because, if you are not seeing huge differences, then the problem must be with my gcc build. My build is based on the code committed to the SVN on 08/12/2011 by hjl.

Kind regards,

Kostis
Balaji_I_Intel
Employee
2,419 Views
Hi Kostis,
We are currently looking into this. However, please note that a 10X overhead for a program that is almost pure overhead is not unexpected.

Thanks,

Balaji V. Iyer.
knikas
Beginner
2,419 Views
Hi guys,
to offer some more info...I downloaded the latest non-commercial version of the composer_xe compiler.

icc --version

icc (ICC) 12.1.2 20111128

Copyright (C) 1985-2011 Intel Corporation. All rights reserved.

I used it to compile the same code (with -O3 flag) and compare again serial execution vs parallel execution with 1 thread. Here are the results for fib(40):

serial --> Time : 2.6701

parallel (with CILK_NWORKERS=1) --> Time : 29.7030

so, similar behavior to the gcc build. Even when I raise CILK_NWORKERS to 8, it runs in 3.62 secs, which means it is still slower than the serial version!

Another reason why this is strange is that when I used Cilk++ in the past for the same experiment, I didn't observe that much difference between the serial version and the parallel version with one worker. I will try to reproduce this experiment as well and report the numbers.

Kind regards,

Kostis

Georg_Z_Intel
Employee
2,419 Views
Hello Kostis,

we've observed the same on some Linux distributions, too. It only seems to affect a few systems. We're currently investigating...

What Linux distribution are you using and is it 32 or 64 bit?

Thank you for your feedback!

Georg Zitzlsberger

knikas
Beginner
2,419 Views
Hello Georg,

all the previous results come from Debian Wheezy. I have also observed the same behaviour on Fedora-14, Fedora-16 and Ubuntu 11.04. All of these are 64-bit distributions.

Which distribution are you using that does not exhibit this behaviour?

Kind regards,


Kostis
Georg_Z_Intel
Employee
2,419 Views
Hello Kostis,

I've quickly tested some of our 64-bit systems. Those worked:

SLES10
RHEL5
Fedora 12

Best regards,

Georg Zitzlsberger
William_Leiserson__I
2,419 Views
Hi Knikas,

I'm running Ubuntu 11.10 (64-bit). I took your code and removed the extra spawns and found ~10x overhead on one worker. That seemed pretty high to me, too. On the other hand, recursive fib is a diabolical example, since there's basically nothing except overhead in the algorithm. In that sense, fib is the proper function to use to measure the overhead of calling a Cilk function (as opposed to a C++ function). The overhead itself is in every spawning function: a larger frame to accommodate Cilk data structures, accessing TLS to get the worker, no optimizing away the frame pointer, verifying that the function is sync'd before returning, etc.

There is, of course, a lot of optimization opportunity in all that -- especially in the case of fib -- and knocking down the overhead would be a big win. Naturally, one anticipates it going down over time, but 10x is not altogether surprising on fib.
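To make those costs concrete, here is a loose, hypothetical sketch of the shape every spawning function takes on. All names are invented and this is not the real Cilk runtime ABI; it's just plain C mimicking the bookkeeping listed above:

[cpp]/* Hypothetical sketch only: invented names, NOT the real runtime ABI. */
#include <stdio.h>

/* "larger frame to accommodate Cilk data structures" */
struct cilk_frame {
    int has_outstanding_children;   /* set when a child may have been stolen */
    struct cilk_frame *parent;
};

/* "access TLS to get the worker" */
static __thread struct cilk_frame *current_frame;

static void enter_spawning_fn(struct cilk_frame *sf) {
    sf->has_outstanding_children = 0;
    sf->parent = current_frame;     /* push onto this worker's frame chain */
    current_frame = sf;
}

static void leave_spawning_fn(struct cilk_frame *sf) {
    /* "verify that the function is sync'd before returning" */
    if (sf->has_outstanding_children) {
        /* the real runtime would suspend here until stolen children return */
    }
    current_frame = sf->parent;
}

/* A spawning function pays this setup/teardown even when n < 2 and no
 * spawn ever happens, which is why fib is almost pure overhead. */
int fib_shaped(int n) {
    struct cilk_frame sf;
    enter_spawning_fn(&sf);
    int r = (n < 2) ? n : fib_shaped(n - 1) + fib_shaped(n - 2);
    leave_spawning_fn(&sf);
    return r;
}

int main(void) { printf("%d\n", fib_shaped(20)); return 0; }[/cpp]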

Regarding Cilk++, I'd definitely like to see the comparison if you run the experiment. My recollection was something on the order of 3-4x on fib.

For completeness, here is the code I ran to get my timing:

[cpp]#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <cilk/cilk.h>

typedef int (*fib_t) (int);

int sfib (int n) {
    if (n < 2) return n;
    else {
        int x, y;
        x = sfib(n - 1);
        y = sfib(n - 2);
        return x + y;
    }
}

int pfib (int n) {
    if (n < 2) return n;
    else {
        int x, y;
        x = cilk_spawn pfib(n - 1);
        y = pfib(n - 2);
        cilk_sync;
        return x + y;
    }
}

void time_it (fib_t f, int n)
{
    int result;
    struct timeval t1, t2;
    double time; 
    gettimeofday(&t1, 0);
    result = f(n);
    gettimeofday(&t2, 0);
    time=(double)((t2.tv_sec-t1.tv_sec)*1000000+t2.tv_usec-t1.tv_usec)/1000000;
    printf ("Result: %dn", result);
    printf("Time : %.4fn", time);
}

int main (int argc, char *argv[]) {
    int n;
    n = atoi(argv[1]);
    time_it(sfib, n);
    time_it(pfib, n);
    return 0;
}
[/cpp]
knikas
Beginner
2,419 Views
Hi William,

I was wondering the same thing, so I performed the same experiment with Cilk++ (build 8503). For fib(40) the serial version took 1.8860 sec while the parallel version with 1 worker took 16.69 sec. So a lot faster than the 27 sec needed with GCC or ICC, but still a huge overhead. I also tried your code (to remove 1 spawn and reduce the overhead), which took 11.2355 sec. So in this case I have a ~10x overhead.

I also tried the cilkc MIT compiler. In this case the parallel execution with 1 worker takes 7.433 sec, so a ~7x overhead.


Kind regards,

Kostis
jimdempseyatthecove
Distinguished Contributor III
2,419 Views
The parallel Fibonacci is an excellent demonstration of the tasking overhead of an inefficient recursive algorithm. For the Fibonacci series, a fast iterative method run by one thread will, IMHO, always be faster than a parallel method.
[cpp]// Serial non-recursive method to calculate Fibonacci series
// *** Note, several orders of magnitude faster than all
// *** recursive techniques.
long SerialFib2( long n )
{
	if( n<2 )
		return n;
	long fib0 = 1;	// most recent
	long fib1 = 1;	// 1 prior to most recent
	long fib2 = 0;	// 2 prior to most recent
	long i;
	for(i=2; i<n; ++i)
	{
		fib2 = fib1;
		fib1 = fib0;
		fib0 = fib1 + fib2;
	}
	return fib0;
}[/cpp]
The loop above costs, per iteration:

3 register to register moves
1 register,register add
1 add 1
1 compare
1 branch

(before unrolling by the compiler and/or other optimizations like register swapping)

The above is 7 instructions, possibly 4 clock ticks per iteration.

A call (near) followed by a ret, that is to say a call to an empty void function, already exceeds this, which in turn means that your tasking library (Cilk Plus) would require negative clock ticks (in parallel) to surpass the performance of the single-threaded iterative process.

The point of the parallel fib is to teach recursive programming, not to teach how to program efficiently. What the student running the test program should observe is the cost of tasking compared to the work per iteration, and whether this cost can be effectively amortized.

Jim Dempsey

Georg_Z_Intel
Employee
2,419 Views
Hello Jim,

yes, there are more efficient algorithms to calculate Fibonacci.
The recursive one above is O(2^n); an iterative approach like yours is O(n).
There's even a closed-form version with just O(1).
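(As an aside, here is a sketch of that closed-form version, using Binet's formula; it's exact only while double precision holds, up to roughly n = 70:)

[cpp]/* Sketch of the closed-form Fibonacci: F(n) = round(phi^n / sqrt(5)),
 * since the |psi|^n / sqrt(5) term is < 0.5 and rounds away. Link with -lm. */
#include <stdio.h>
#include <math.h>

long fib_closed_form(int n) {
    const double sqrt5 = sqrt(5.0);
    const double phi = (1.0 + sqrt5) / 2.0;   /* golden ratio */
    return lround(pow(phi, n) / sqrt5);
}

int main(void) { printf("%ld\n", fib_closed_form(40)); return 0; }  /* 102334155 */[/cpp]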

Anyway, the idea of using Fibonacci with Intel Cilk Plus is to easily demonstrate how it can be applied and how it works. The simplicity helps in understanding the tasking model involved. In general you can use other, more complex problems as well (I'd like to see some from the community, not here, but in a dedicated thread!).

You said that the runtime using the recursive algorithm with Intel Cilk Plus should be close to the iterative one (for all n). Well, that's like comparing apples with pears (O(2^n) to O(n) respectively).

What we want to show here is that the same (recursive!) algorithm runs faster using multiple cores than sequentially.

If I had to implement a mathematical library I'd prefer the closed-form version (for big n), ignoring the recursive & iterative approaches altogether. However, we're just exercising here...

The problem described above is that on some systems we see a reproducible slowdown. The task scheduling should usually behave consistently on all systems. Maybe it's just a small problem in the scheduler or runtime library.

Best regards,

Georg Zitzlsberger
jimdempseyatthecove
Distinguished Contributor III
2,419 Views
>>The problem that's described above is that on some systems we see a reproducible slowdown

Yes, ... Time for VTune

>>The task scheduling usually should be consistent on all systems.

But not necessarily between systems, nor relative efficiency between systems.

Jim
jimdempseyatthecove
Distinguished Contributor III
2,419 Views
>>What we want to show here is that the same (recursive!) algorithm is running faster with using multiple cores than sequentially

Isn't this "same (recursive)" (in parallel) when both legs are spawned?

As opposed to:

Half same recursive in parallel (one leg using task spawn)
Other half same recursive via call stack (other leg using function call)

Don't get me wrong, this example is a good learning tool. It illustrates a case where you split work between the current thread and a worker thread (and recurse down doing the same thing).

Jim Dempsey
Jim_S_Intel
Employee
2,419 Views

As I believe was mentioned earlier in the thread, spawning the second recursive leg is redundant for this example. In this case, spawning one task and calling the other is "equivalent" to spawning both halves, because the continuation of the second spawned task would do nothing but immediately sync.

To illustrate the difference more concretely, consider the following two program fragments.

[cpp]// Example 1:
int x = 0, y = 0;
cilk_spawn f();
x++;
g();
y++;
cilk_sync;[/cpp]

[cpp]// Example 2:
int x = 0, y = 0;
cilk_spawn f();
x++;
cilk_spawn g();
y++;
cilk_sync;[/cpp]


Assume f() and g() are serial functions.
In the first example, there are two strands that can potentially execute in parallel:
Strand 1: the initialization of x and y, and then f()
Strand 2: x++, g(), and then y++.

In the second example, there are three strands that can potentially execute in parallel:
Strand 1: the initialization of x and y, and then f()
Strand 2: x++ and then g()
Strand 3: y++.
Strand 3 cannot begin executing until strand 2 has executed x++ and reached the cilk_spawn of g().

For fib(), one could spawn the second recursive call, but since its "strand 3" does nothing, it only adds overhead.
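For anyone who wants to experiment, here is a minimal self-contained version of Example 2; the bodies of f() and g() are stubs invented purely for illustration:

[cpp]/* Runnable version of Example 2; f() and g() are invented stubs.
 * Build like the earlier examples in this thread, e.g. with -lcilkrts. */
#include <stdio.h>
#include <cilk/cilk.h>

static void f(void) { printf("f()\n"); }
static void g(void) { printf("g()\n"); }

int main(void) {
    int x = 0, y = 0;
    cilk_spawn f();   /* strand 2 (x++ onward) may run in parallel with f() */
    x++;
    cilk_spawn g();   /* strand 3 (y++) may run in parallel with g() */
    y++;
    cilk_sync;        /* wait for f(), g() and all strands to finish */
    printf("x=%d y=%d\n", x, y);
    return 0;
}[/cpp]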

Cheers,

Jim

jimdempseyatthecove
Distinguished Contributor III
2,419 Views
>>As I believe was mentioned earlier in the post, spawning the second recursive leg is redundant for this example. In this case, spawning one task and calling the other task is "equivalent" to spawning both halves

Except for the overhead. You cannot claim the implementation is fully parallel when half the implementation is recursive serial. Apples and oranges.

Using one leg in parallel _is_ faster than using both legs in parallel, and an actual implementation should favor this over both legs in parallel. However, this does not discount the "requirement" of the sample program running fully recursively parallel. To do this you need to run both legs in parallel.

A similar situation is where you want to illustrate a parallel for loop with, unknown to the compiler, only 1 iteration. This is "equivalent" to issuing the body statement(s) without the loop control and without the task setup for the cilk_for. However, making the substitution and then claiming it is equivalent to the parallel implementation is "cooking" the results data.

Making these otherwise "stupid" programming choices is valuable in that it provides you with the means to measure the tasking overhead, and then have that information available to make your programming decisions, such as parallelizing one or both legs and what size to set your cut-off threshold.
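For instance, a minimal sketch of that cut-off idea (the threshold of 20 is arbitrary, chosen here purely for illustration):

[cpp]/* Sketch only: spawn while the subproblem is large enough to amortize
 * the tasking overhead; below CUTOFF, fall back to plain serial recursion. */
#include <stdio.h>
#include <stdlib.h>
#include <cilk/cilk.h>

#define CUTOFF 20   /* arbitrary illustrative threshold */

static int fib_serial(int n) {
    if (n < 2) return n;
    return fib_serial(n - 1) + fib_serial(n - 2);
}

int fib_cutoff(int n) {
    if (n < CUTOFF) return fib_serial(n);   /* no spawn overhead below the cut-off */
    int x = cilk_spawn fib_cutoff(n - 1);   /* spawn one leg */
    int y = fib_cutoff(n - 2);              /* call the other on this worker */
    cilk_sync;
    return x + y;
}

int main(int argc, char *argv[]) {
    int n = (argc > 1) ? atoi(argv[1]) : 40;
    printf("fib(%d) = %d\n", n, fib_cutoff(n));
    return 0;
}[/cpp]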

Jim Dempsey
Jim_S_Intel
Employee
2,326 Views
I agree, there is a lot to learn from studying even seemingly "simple" examples such as fib, and it is definitely important to understand what you are measuring when trying to make performance comparisons.

Just wanted to clarify for any future readers who might stumble upon this post, because it is possible that programmers new to Cilk and parallel programming might get confused by my original description or the statements "half the implementation is recursive serial" and "you need to run both legs in parallel".
In the following code:

[cpp]cilk_spawn f();
g();
cilk_sync;[/cpp]

f() and g() can both contain parallelism nested inside, and Cilk will not only allow f() and g() to execute in parallel, but will also exploit nested parallelism on both sides, f() and g(). If g() itself has spawns nested inside, the function g() is NOT guaranteed to execute completely serially. In the case of fib(), f and g happen to be the same function, which does recursively spawn.
Thus, in a divide-and-conquer algorithm which divides into two subproblems, the preferred approach is to avoid spawning the second subproblem (i.e., g()). Cilk can still exploit parallelism due to spawns nested inside the second subproblem g().

Anyway, I'm probably repeating a lot of information that everyone has summarized in earlier posts. But I just wanted to clarify, because recursive divide-and-conquer algorithms are a common and important usage pattern for Cilk, and we want to encourage programmers to write efficient Cilk code.
Cheers,

Jim