Software Archive
Read-only legacy content
17065 Discussions

Cilk Plus does not meet Performance Expectations


I wrote a benchmark testing a parallel Longest Common Subsequence (LCS) algorithm. It uses a divide-and-conquer method over a 2-dimensional dynamic-programming table. It does not get the expected speedup as the worker count increases.

The basic method code looks as follows:

void LCSsectsolve(int ** stor, int startX, int startY, int width){
	for(int i = startX; i < startX + width; ++i){
		for(int j = startY; j < startY + width; ++j){
			if(a[i-1] == b[j-1]){
				stor[i][j] = stor[i-1][j-1] + 1;
			} else {
				stor[i][j] = std::max(stor[i-1][j], stor[i][j-1]);
			}
		}
	}
}
int parallelNEWdivLCS(int ** stor, int startX, int startY, int width, int base){
	if(width <= base){
		LCSsectsolve(stor, startX, startY, width);
	} else {
		parallelNEWdivLCS(stor, startX, startY, width/2, base);          // top-left first
		cilk_spawn parallelNEWdivLCS(stor, startX + width/2, startY, width/2, base);
		parallelNEWdivLCS(stor, startX, startY + width/2, width/2, base); // runs alongside the spawn
		cilk_sync;                                                        // both off-diagonal quadrants must finish
		parallelNEWdivLCS(stor, startX + width/2, startY + width/2, width/2, base);
	}
	return stor[startX + width - 1][startY + width - 1];
}

The code runs as expected (similar to serial) on a single worker, and gets roughly the expected speedup on 2 and 4 workers, but only about 6x speedup on 8 workers, and does similarly if not worse on 16 and 32 workers. I have tried a variety of base-case sizes and test-case sizes, but none improve the performance. Using several performance-measuring tools (perf, PAPI) I have determined that the work in the base cases is roughly constant. I believe I should have enough parallelism and work to expect better speedup. Enabling cilk_profile tells me that the time spent in the runtime scheduler is high: the workers only spend about 30% of their time doing work. If I disable the cilk_yield call in the scheduler, most of the remaining time is spent trying, but failing, to steal work.

I am running on x86_64 GNU/Linux non-hyperthreaded 32 cores of Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz with LLVM-Cilk. 

I have also tried making the storage a 1-dimensional array to increase locality, but it does not change the performance. I can provide the full code if needed, run any tests you think may help pin down the issue, and answer any clarifying questions.
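For reference, this is how the worker sweep can be scripted (CILK_NWORKERS is the standard Cilk Plus worker-count variable; the benchmark binary and input file names are placeholders):

```shell
# Sweep worker counts and time each run; ./lcs_bench and input.txt are placeholders.
for w in 1 2 4 8 16 32; do
  echo "workers=$w"
  CILK_NWORKERS=$w /usr/bin/time -f "%e s" ./lcs_bench input.txt
done
```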

Thanks in Advance!

2 Replies
Black Belt

Could you compare against OpenMP, where you could try either to optimize scheduling and affinity or to approximate the scheduling expected from Cilk(tm) Plus?  You would likely need thread/worker affinity for scaling across CPUs, assuming that you have a 4-CPU platform.  Intensive work stealing may not be an effective strategy when the work and affinity can be scheduled in advance; expecting otherwise may be unrealistic.

As you have a situation apparently involving potential cache line prefetch sharing among adjacent threads or workers, OMP_PROC_BIND=close might improve performance.
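For example, an affinity setup along those lines might look like this (thread count and binary name are placeholders; OMP_PLACES=cores is the usual companion to OMP_PROC_BIND):

```shell
# Keep OpenMP threads on adjacent cores so neighbouring threads can share
# prefetched cache lines; ./lcs_bench is a placeholder binary name.
export OMP_PLACES=cores
export OMP_PROC_BIND=close
export OMP_NUM_THREADS=8
./lcs_bench
```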

The E5 4-CPU platform can be frustrating when attempting to scale a threading performance model across CPUs.  Customers tend not to read the fine print when deploying these (not even noticing that it is a NUMA platform).  It may be better suited to a multi-level parallelism model using a distributed memory model such as MPI to run an independent process on each CPU.
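A sketch of that multi-level model, using Open MPI's mpirun placement flags (the hybrid binary name is a placeholder):

```shell
# One MPI rank per socket on the 4-socket machine, each rank bound to its
# own NUMA domain and free to thread within it; ./lcs_hybrid is a placeholder.
mpirun -np 4 --map-by socket --bind-to socket ./lcs_hybrid
```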

Black Belt

The scaling problem you are observing with LCSsectsolve is that the compute section has very little computation compared to memory fetch and store. IOW the function is memory-bandwidth limited.

*** Also, more importantly ***

Due to temporal requirements, the algorithm as written cannot be parallelized (and produce the correct result).

The stor[i][j] values depend on elements of stor written in prior iterations: stor[i-1][j-1], stor[i-1][j], and stor[i][j-1].

Jim Dempsey