Solved: The overhead of cilk_spawn

Amos_W_ · ‎04-23-2016

Hello,

I read the discussion, https://software.intel.com/en-us/forums/intel-cilk-plus/topic/265700, and have a question about the overhead of cilk_spawn. In theory, Cilk has no overhead in spawn generation because of work first strategy. In fact, parallel version has ten times overhead than serial version. So, why does this overhead come from? I try to figure out what the problem is by running VTune. It turns out that Cilk spends much time on doing '__cilkrts_leave_frame.' I cannot get clear picture by reading assembly code of '__cilkrts_leave_frame'. Is there anyone who knows this procedure? Any insight from run-time side would be helpful.

My program is the basic one as follows.

x = cilk_spawn fib (n-1); 
y = fib (n-2); 
cilk_sync;  
return x+y;

Amos

Pablo_H_Intel · ‎04-25-2016

Actually, although we like to tout how low the overhead is for a cilk_spawn, the fact is that it is realistically 5 to 15 times slower than a regular function call. Much of that overhead is in the memory barrier that is needed to correctly synchronize the work queue between the current worker and a potential thief, even if a steal does not occur. (If a steal does occur, the overhead is even greater, but that overhead is presumably small compared to the benefit of parallel execution.) In a simple program like the fibonacci program above, the overheads are noticeable. In fact, we use the fib() program as a way to measure spawn overheads because the program is almost entirely overhead, with very little real work done. In a more realistic example, such as parallel quicksort, the overhead is actually negligible, as Yves V. said. Even so, it is sometimes beneficial to "coarsen" your calculation by performing the last few levels of recursion without any spawns, since the base often does very little work.

View solution in original post

Barry_T_Intel · ‎04-24-2016

The sources are available at https://www.cilkplus.org/download#runtime-sources, along with lots of other information.

Keep in mind that fib is essentially all overhead, so it's pretty much a worst case.

- Barry

Yves_V_Intel · ‎04-25-2016

cilk_spawn itself has indeed negligable overhead, you are only telling the compiler and runtime that this is a point he _can_ steal work, which does give some overhead. (note: compiling with -cilk-serialize produces no-overhead sequential code) The overhead comes both from worker threads stealing work and from synchronization points. The actual runtime overhead entirely depends on how many worker threads you have running.

Now, an idle worker thread that stealing work is an expensive operation, but is relatively rare. Important is that an idle thread when workstealing will try to steal as much work as possible. In your fib example, this means that it will try to take the work that was split at the topmost cilk_spawn. For two threads, only one steal operation should be performed, regardless of how large your n is. (note: because of minute imbalances, it is possible more steals happen as one thread finishes faster than the other)

The big problem in the fib example is the synchronization, the actual 'work' happening is just adding two numbers, but logically you have to wait for the two numbers to become available. Again, this synchronization depends on the number of threads. In an ideal scenario, two threads will only have to synchronize once, at the very top.

If you want to fairly measure the 'parallel' overhead for using Cilk, try using a small number of threads (e.g. CILK_NWORKERS=2). Alternatively, try to increase 'n', ensuring that there is actually enough work for each thread to do. A rule of thumb is that your sequential code should at least take a second.

If you actually meant 'sequential' overhead of using cilk, compare the sequential version (-cilk-serialize flag) to the parallel version with one worker (CILK_NWORKERS=1).

There is a good description of the execution model in the documentation: https://software.intel.com/en-us/node/522599

Pablo_H_Intel · ‎04-25-2016

Actually, although we like to tout how low the overhead is for a cilk_spawn, the fact is that it is realistically 5 to 15 times slower than a regular function call. Much of that overhead is in the memory barrier that is needed to correctly synchronize the work queue between the current worker and a potential thief, even if a steal does not occur. (If a steal does occur, the overhead is even greater, but that overhead is presumably small compared to the benefit of parallel execution.) In a simple program like the fibonacci program above, the overheads are noticeable. In fact, we use the fib() program as a way to measure spawn overheads because the program is almost entirely overhead, with very little real work done. In a more realistic example, such as parallel quicksort, the overhead is actually negligible, as Yves V. said. Even so, it is sometimes beneficial to "coarsen" your calculation by performing the last few levels of recursion without any spawns, since the base often does very little work.