[bash]
#include <stdio.h>
#include <stdlib.h>
#include <syst/time.h>
#ifdef cilk
#include <cilk/cilk.h>
#endif

int fib (int n)
{
    if (n < 2)
        return n;
    else {
        int x, y;
#ifdef cilk
        x = cilk_spawn fib (n-1);
        y = cilk_spawn fib (n-2);
        cilk_sync;
#else
        x = fib(n-1);
        y = fib(n-2);
#endif
        return (x+y);
    }
}

int main (int argc, char *argv[])
{
    int n, result;
    struct timeval t1, t2;
    double time;

    n = atoi(argv[1]);
    gettimeofday(&t1, 0);
#ifdef cilk
    result = cilk_spawn fib(n);
    cilk_sync;
#else
    result = fib(n);
#endif
    gettimeofday(&t2, 0);
    time = (double)((t2.tv_sec-t1.tv_sec)*1000000 + t2.tv_usec-t1.tv_usec)/1000000;
    printf("Result: %d\n", result);
    printf("Time  : %.4f\n", time);
    return 0;
}
[/bash]
[bash]
gcc -Wall -Wno-unused-variable -O3 -o fib-serial-gcc47-O3 fib.c -ldl -lcilkrts
gcc -Wall -Wno-unused-variable -O3 -Dcilk -o fib-cilk fib.c -ldl -lcilkrts
[/bash]
for the serial and the parallel version. I limit CILK_NWORKERS to 1 and I get the following results for fib(40):
I cut and pasted your code, compiled it using the flags you used, and ran it for fib(40). My parallel fib was way faster than the serial fib.
One thing I noticed is that you are using #include <syst/time.h>, which my compiler flagged as an error. I changed it to <sys/time.h>.
Do you have a multiple or proprietary gettimeofday installed on your machine?
Thanks,
Balaji V. Iyer.
Using this
[bash]
x = cilk_spawn fib (n-1); // -> executed by X
// created thread Y that continues here
y = cilk_spawn fib (n-2); // -> executed by Y
// created thread Z that continues here
cilk_sync; // synchronize: wait for X, Y & Z to finish
[/bash]
is very inefficient. What you're doing here is spawning two new threads (Y & Z); thread X is the main thread entering the fib(...) function initially. Whilst threads X & Y are doing work, thread Z is just there to wait.
Better solution:
[bash]
x = cilk_spawn fib (n-1);
y = fib (n-2);
cilk_sync;
[/bash]
This should reduce your overhead.
Best regards,
Georg Zitzlsberger
When Cilk was implemented, we did not modify the existing GCC scheduler. Also, I have seen that the GCC instruction scheduler does not do an astonishing job of scheduling functions inside another function (a spawn helper is implemented that way). This is probably why you are not seeing much difference in time.
Thanks,
Balaji V. Iyer.
When you run this code with CILK_NWORKERS=1, does it execute slower than the serial code on your machine? And if yes, how much slower? If you are not seeing huge differences, then the problem must be with my GCC build. My build is based on the code committed to the SVN on 08/12/2011 by hjl.
Kind regards,
Kostis
We are currently looking into this. However, please note that a 10X overhead for a program that is almost pure overhead is not unexpected.
Thanks,
Balaji V. Iyer.
[bash]
icc --version
icc (ICC) 12.1.2 20111128
Copyright (C) 1985-2011 Intel Corporation. All rights reserved.
[/bash]
I used it to compile the same code (with the -O3 flag) and again compared serial execution vs. parallel execution with 1 worker. Here are the results for fib(40):
serial --> Time : 2.6701
parallel (with CILK_NWORKERS=1) --> Time : 29.7030
So, similar behavior to the gcc build. Even when I raise CILK_NWORKERS to 8, it runs in 3.62 secs, which means it is still slower than the serial version!
Another reason why this is strange is that when I used Cilk++ for the same experiment in the past, I didn't observe that much difference between the serial version and the parallel version with one worker. I will try to reproduce this experiment as well and report the numbers.
Kind regards,
Kostis
We've observed the same for some Linux distributions, too. It only seems to affect a few systems. We're currently investigating...
What Linux distribution are you using and is it 32 or 64 bit?
Thank you for your feedback!
Georg Zitzlsberger
All the previous results come from Debian Wheezy. I have also observed the same behaviour on Fedora 14, Fedora 16 and Ubuntu 11.04. All of these are 64-bit distributions.
Which distribution are you using that does not exhibit this behaviour?
Kind regards,
Kostis
I've quickly tested some of our 64 bit systems. Those worked:
SLES10
RHEL5
Fedora 12
Best regards,
Georg Zitzlsberger
I'm running Ubuntu 11.10 (64-bit). I took your code, removed the extra spawns, and found ~10x overhead on one worker. That seemed pretty high to me, too. On the other hand, recursive fib is a diabolical example since there's basically nothing except overhead in the algorithm. In that sense, fib is the proper function for measuring the overhead of calling a Cilk function (as opposed to a C++ function). The overhead itself is in every spawning function: a larger frame to accommodate Cilk data structures, accessing TLS to get the worker, no optimizing away the frame pointer, verifying that the function is sync'd before returning, etc.
There is, of course, a lot of optimization opportunity in all that -- especially in the case of fib -- and knocking down the overhead would be a big win. Naturally, one anticipates it going down over time, but 10x is not altogether surprising on fib.
Regarding Cilk++, I'd definitely like to see the comparison if you run the experiment. My recollection was something on the order of 3-4x on fib.
For completeness, here is the code I ran to get my timing:
[cpp]
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <cilk/cilk.h>

typedef int (*fib_t) (int);

int sfib (int n)
{
    if (n < 2)
        return n;
    else {
        int x, y;
        x = sfib(n - 1);
        y = sfib(n - 2);
        return x + y;
    }
}

int pfib (int n)
{
    if (n < 2)
        return n;
    else {
        int x, y;
        x = cilk_spawn pfib(n - 1);
        y = pfib(n - 2);
        cilk_sync;
        return x + y;
    }
}

void time_it (fib_t f, int n)
{
    int result;
    struct timeval t1, t2;
    double time;

    gettimeofday(&t1, 0);
    result = f(n);
    gettimeofday(&t2, 0);
    time = (double)((t2.tv_sec-t1.tv_sec)*1000000 + t2.tv_usec-t1.tv_usec)/1000000;
    printf("Result: %d\n", result);
    printf("Time  : %.4f\n", time);
}

int main (int argc, char *argv[])
{
    int n;
    n = atoi(argv[1]);
    time_it(sfib, n);
    time_it(pfib, n);
    return 0;
}
[/cpp]
[cpp]
// Serial non-recursive method to calculate Fibonacci series
// *** Note, several orders of magnitude faster than all
// *** recursive techniques.
long SerialFib2( long n )
{
    if( n<2 ) return n;
    long fib0 = 1; // most recent
    long fib1 = 1; // 1 prior to most recent
    long fib2 = 0; // 2 prior to most recent
    long i;
    for( i=2; i<n; ++i )
    {
        fib2 = fib1;
        fib1 = fib0;
        fib0 = fib1 + fib2;
    }
    return fib0;
}
[/cpp]
The above loop has, per iteration over n, a cost of:
3 register to register moves
1 register,register add
1 add 1
1 compare
1 branch
(before unrolling by the compiler and/or other optimizations like register swapping)
The above is 7 instructions, possibly 4 clock ticks per iteration.
A call (near) to ret, that is to say a call to a void function, exceeds this.
Which in turn means that your tasking library (Cilk Plus) would require negative clock ticks (in parallel) to surpass the performance of the single-threaded iterative process.
The point of the parallel fib is to teach recursive programming, not to teach how to program efficiently. What the student running the test program should observe is the cost of tasking compared to the work per iteration, and whether this cost can be effectively amortized.
Jim Dempsey
Yes, there are more efficient algorithms to calculate Fibonacci.
The recursive one above is O(2^n); an iterative approach like yours is O(n).
There's even a closed-form version with just O(1).
Anyway, the idea of using Fibonacci with Intel Cilk Plus is to easily demonstrate how it can be applied and how it works. The simplicity helps in understanding the tasking model involved. In general you can use other, more complex problems as well (I'd like to see some from the community; not here, but on a dedicated thread!).
You said that the runtime using the recursive algorithm with Intel Cilk Plus should be close to the iterative one (for all n). Well, that's like comparing apples with pears (O(2^n) to O(n), respectively).
What we want to show here is that the same (recursive!) algorithm runs faster using multiple cores than sequentially.
If I had to implement a mathematical library I'd prefer the closed-form version (for big n), ignoring the recursive & iterative approaches altogether. However, we're exercising here...
The problem described above is that on some systems we see a reproducible slowdown. Task scheduling should usually be consistent on all systems. Maybe it's just a small problem in the scheduler or runtime library.
Best regards,
Georg Zitzlsberger
Yes, ... Time for VTune
>>The task scheduling usually should be consistent on all systems.
But not necessarily between systems, nor relative efficiency between systems.
Jim
Isn't this the "same (recursive)" algorithm (in parallel) only when both legs are spawned?
As opposed to:
half the same recursion in parallel (one leg using a task spawn),
the other half of the same recursion via the call stack (the other leg using a function call)?
Don't get me wrong, this example is a good learning tool. It illustrates a case where you split work between the current thread and a worker thread (and recurse down doing the same thing).
Jim Dempsey
As I believe was mentioned earlier in the thread, spawning the second recursive leg is redundant for this example. In this case, spawning one task and calling the other is "equivalent" to spawning both halves, because the continuation of the second spawned task would do nothing but immediately sync.
To illustrate the difference more concretely, consider the following two program fragments.
[cpp]
// Example 1:
int x = 0, y = 0;
cilk_spawn f();
x++;
g();
y++;
cilk_sync;
[/cpp]
[cpp]
// Example 2:
int x = 0, y = 0;
cilk_spawn f();
x++;
cilk_spawn g();
y++;
cilk_sync;
[/cpp]
Assume f() and g() are serial functions.
In the first example, there are two strands that can potentially execute in parallel:
Strand 1: the initialization of x and y, and then f().
Strand 2: x++, g(), and then y++.
In the second example, there are three strands that can potentially execute in parallel:
Strand 1: the initialization of x and y, and then f().
Strand 2: x++ and then g().
Strand 3: y++.
Strand 3 cannot begin executing until strand 2 has executed x++ and reached the cilk_spawn of g().
For fib(), one could spawn the second recursive call, but since its "strand 3" does nothing, it only adds overhead.
Cheers,
Jim
Except for the overhead. You cannot claim the implementation is fully parallel when half the implementation is recursive serial. Apples and oranges.
Using one leg in parallel _is_ faster than using both legs in parallel, and an actual implementation should favor this over both legs in parallel. However, this does not discount the "requirement" that the sample program run fully recursively parallel. To do that, you need to run both legs in parallel.
A similar situation is where you want to illustrate a parallel for loop with, unknown to the compiler, only 1 iteration. This is "equivalent" to issuing the body statement(s) without the loop control and without the task setup for the Cilk_for. However, making the substitution and then claiming it is equivalent to the parallel implementation is "cooking" the results data.
Making these otherwise "stupid" programming choices is valuable in that it provides you with the means to measure the tasking overhead, and then have the information available to make your programming decisions, such as parallelizing one or both legs and what size to set your cut-off threshold.
Jim Dempsey
Just wanted to clarify for any future readers who might stumble upon this post, because programmers new to Cilk and parallel programming might get confused by my original description or by the statements "half the implementation is recursive serial" and "you need to run both legs in parallel".
In the following code:
[cpp]
cilk_spawn f();
g();
cilk_sync;
[/cpp]
f() and g() can both contain parallelism nested inside, and Cilk will not only allow f() and g() to execute in parallel, but will also exploit the nested parallelism in both sides, f() and g(). If g() itself has spawns nested inside, the function g() is NOT guaranteed to execute completely serially. In the case of fib(), f and g happen to be the same function, which does recursively spawn.
Thus, in a divide-and-conquer algorithm that divides into two subproblems, the preferred approach is to avoid spawning the second subproblem (i.e., g()). Cilk can still exploit parallelism due to spawns nested inside the second subproblem g().
Anyway, I'm probably repeating a lot of information that everyone has summarized in earlier posts. But I just wanted to clarify, because recursive divide-and-conquer algorithms are a common and important usage pattern for Cilk, and we want to encourage programmers to write efficient Cilk code.
Cheers,
Jim
