Spawn slows down execution of a serial function

Frederik_S_ · ‎03-03-2016

Hey there,

I have a created a very simple program that does not make any sense in particular, but made me wounder where the observed overhead does come from:

#include <stdio.h>
#include <string.h>
#include <cilk/cilk.h>
#include <chrono>

int num = 45;

long fib(int n)
{	
	std::chrono::time_point<std::chrono::high_resolution_clock> t1;	
	if( n == num )
	{
		t1 = std::chrono::high_resolution_clock::now();	
		fprintf(stdout, "STARTED FIB\n");
	}
	
	if( n < 2 ) return n;
	
	long result = fib(n-1) + fib(n-2);
	
	if( n == num )
	{
		auto t2 = std::chrono::high_resolution_clock::now();
		auto int_ms = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1);
		fprintf(stdout, "FINISHED FIB in %ldms.\n", int_ms.count());
	}
	
	return result;	
}

int main(int argc, char** argv ) {
  cilk_spawn fib( num );
  cilk_sync;
  // fib(num);
  return 0;
}

Of course the spawn does not at all improve the program's performance, but I did it, because I wanted to measure the overhead of using spawns and syncs.

I run this program twice one time with the given main function and one time without using cilk_spawn and cilk_sync at all (using line 34 instead of 32 and 33). The output was as follows:

without spawn:
STARTED FIB
FINISHED FIB in 11836ms.

with spawn:
STARTED FIB
FINISHED FIB in 18320ms.

So why is the pure execution of the function that I didn't touched at all slowed down by this amount? Of course there is a overhead by creating threads and spawning and waiting for the spawn to be done, but my measurment takes place within the executed function and that function is not modified at all by Cilk. There are several new helper function created around this function call, but the function stays as it is. So why is the same serial function execution suddenly 7 seconds slower than before? The overhead introduced by Cilk should not be measured by my function, should it?

I am very grateful for any possible explanation on this topic.

ARCH_R_Intel · ‎03-03-2016

Are you on a machine with hyperthreaded cores?

The Cilk runtime is designed on the key assumption that once a spawn happens, there will shortly be many more spawns, enough to consume all the available threads. Thus a single spawn fires up all the threads, which become thieves looking for work. Given the key assumption, these workers will quickly find work to do. But in your example, they just waste some bandwidth, and more importantly, cause one hyperthread to pointlessly use some of a core's resources.

To see this effect without recompiling, compile the Cilk Plus version of your code and run it with different settings of the environment variable CILK_NWORKERS. With CILK_NWORKERS=1 the effect should almost completely disappear. With CILK_NWORKERS=2 the effect was slight for me on a 16-core box.

This raises the question of how to measure the overhead of spawns and syncs. If a program obeys the key assumption above, there's two sources of overhead: general interference with optimizations (modern compilers do magic for serial fib) and extra machine instructions/resources used for spawn/sync. Here's a stab at measuring the extra machine instructions/resources. Write recursive fib as usual, but split it into two functions fib and gib that mutually recurse on each other. Have fib always spawn one of its gib children. Put the two functions in separate translation units, compile them separately (to prevent too much compiler cleverness), and hope that the linker is not too clever about optimizing. Write two version of gib: one that spawns one of its fib children and one that does not. Now time the program for the two versions of gib. Both will keep the machine busy, but one will have more spawns. Measure the time difference and divide by the difference in the number of spawns.

Barry_T_Intel · ‎03-03-2016

Jim Sukha wrote an excellent article on getting the most out of your CilkPlus application: Why is Cilk™ Plus not speeding up my program?

- Barry

Hansang_B_Intel · ‎03-03-2016

I think this issue is related to https://software.intel.com/en-us/forums/intel-cilk-plus/topic/590700.

In summary, the runtime keeps every worker (thread) busy even though there is nothing to steal, and it is likely that you see some performance drop if the system has hyperthreads turned on. You can quickly check how much CPU time is spent using a command like "time".

This issue has been alleviated recently, and you can try the latest version of open-source runtime (4420) available at https://www.cilkplus.org.

Frederik_S_ · ‎03-03-2016

Arch D. Robison (Intel) wrote:

Are you on a machine with hyperthreaded cores?

The Cilk runtime is designed on the key assumption that once a spawn happens, there will shortly be many more spawns, enough to consume all the available threads. Thus a single spawn fires up all the threads, which become thieves looking for work. Given the key assumption, these workers will quickly find work to do. But in your example, they just waste some bandwidth, and more importantly, cause one hyperthread to pointlessly use some of a core's resources.

To see this effect without recompiling, compile the Cilk Plus version of your code and run it with different settings of the environment variable CILK_NWORKERS. With CILK_NWORKERS=1 the effect should almost completely disappear. With CILK_NWORKERS=2 the effect was slight for me on a 16-core box.

Thank you, for your explaining answer. This is a sensible explanation to me why my program suddenly was slower than I did expect it to be. With only one worker it is indeed almost as fast as the serial function (I measured this before, but forgot to mention it in my original post.

Barry Tannenbaum (Intel)https://www.cilkplus.org.

Thank you for this hint. I built the latest version and it improved my timings a bit. It is now 11.8s vs 12.8s, which seems to be much better :).