Software Archive

Less performance on 16 cores than on 4?!

sdfsadfasdf_s_
Beginner

Hi there,

I evaluated my Cilk application using "taskset -c 0-(x-1) MYPROGRAM" to analyze its scaling behavior.

I was very surprised to see that the performance increases up to a certain number of cores but then decreases.

For 2 cores I get a speedup of 1.85; for 4, 3.15; for 8, 4.34. But with 12 cores the performance drops to a speedup close to what 2 cores give (1.99), and 16 cores perform only slightly better (2.11).

How is such a behavior possible? Either an idle thread can steal work or it can't?! Or may the work packets be too coarse-grained, so that the stealing overhead destroys the performance once too many cores are in use?!
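If it is the granularity, I guess I could pin the packet size explicitly with the grainsize pragma to test that theory; a minimal sketch (the loop body and the grain size of 512 are only placeholders, not my real code):

#include <cilk/cilk.h>

// Placeholder loop standing in for one work packet of the real application.
// The grainsize pragma fixes how many iterations each chunk of the cilk_for
// contains; 512 is an arbitrary value to experiment with.
void scale_array(double* a, int n)
{
    #pragma cilk grainsize = 512
    cilk_for (int i = 0; i < n; ++i)
        a[i] *= 2.0;
}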

ARCH_R_Intel
Employee

Can you post your example, or an example with similar behavior?  Does your machine have a single socket (processor chip) or multiple sockets?  Some ways the behavior can happen:

  • As you note, work chunks might be too coarse, so some threads starve and just run interference trying to steal.
  • The communication cost of stealing work might exceed the gains from parallelism. For example, short back-to-back cilk_for loops with high bandwidth/compute ratios are sometimes slowed down by random work-stealing (see the sketch after this list).
  • A thread calls a package that is internally multithreaded, thus causing oversubscription.
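Here is a minimal sketch of the pattern in the second bullet; the array type and arithmetic are arbitrary, the point is the high ratio of bytes moved to work done:

#include <cilk/cilk.h>
#include <vector>

// Two short, back-to-back cilk_for loops that do almost no arithmetic per
// element. Each loop is limited by memory bandwidth, so extra workers mostly
// add work-stealing traffic instead of useful parallelism.
void back_to_back(std::vector<double>& a, const std::vector<double>& b)
{
    const std::size_t n = a.size();

    cilk_for (std::size_t i = 0; i < n; ++i)
        a[i] = 2.0 * b[i];      // roughly one multiply per 16 bytes moved

    cilk_for (std::size_t i = 0; i < n; ++i)
        a[i] += b[i];           // again bandwidth-bound
}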
TimP
Honored Contributor III

Shouldn't CILK_NWORKERS also be set for such a purpose?

As my 12-core box is a dual-socket Westmere (HT disabled), to run on 8 cores with Cilk(tm) Plus, using data initialized under OpenMP, I use:


export KMP_AFFINITY="proclist=[1,3,4,5,7,9,10,11],explicit"    (effectively setting OMP_NUM_THREADS=8)

export CILK_NWORKERS=8

taskset -c 1,3,4,5,7,9,10,11 ./a.out


The intent is to use a single core on each of the 4 cache links per CPU.

With that setup I see the whole expected range of scaling characteristics: some cases run 50% faster with 12 workers than with 8, and one runs 60% faster on 8 cores than on 12. I don't see an obvious way to predict which cases show which characteristic. Given the 4 cache paths per CPU, it's not entirely a surprise to see some cases max out at 8 workers (possibly sooner when the work is not carefully distributed).

Even with the verbose option set, it's not clear how taskset affects KMP_AFFINITY, but the assumption that taskset does not influence KMP_AFFINITY, yet does (as you found) limit the Cilk runtime, seems to work out.
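One way to check what each runtime actually ends up with under taskset is to ask them directly; a small check program along these lines (assuming both the OpenMP and Cilk Plus headers are available):

#include <cstdio>
#include <omp.h>
#include <cilk/cilk_api.h>

int main()
{
    // Thread count OpenMP would use for the next parallel region.
    std::printf("OpenMP max threads: %d\n", omp_get_max_threads());

    // Worker count the Cilk Plus runtime will use (CILK_NWORKERS or default).
    std::printf("Cilk workers:       %d\n", __cilkrts_get_nworkers());

    return 0;
}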

