Hi there,
I evaluated my Cilk application with "taskset -c 0-(x-1) MYPROGRAM" to analyze its scaling behavior.
I was very surprised to see that performance increases up to a certain number of cores but decreases afterwards.
With 2 cores I gain a speedup of 1.85; with 4, I gain 3.15; with 8, 4.34. But with 12 cores the performance drops down
to a speedup close to the one gained by 2 cores (1.99).
16 cores perform slightly better (2.11).
How is such a behaviour possible? Either an idle thread can steal work or it can't! Or might the work packets be too coarse-grained, so that the stealing overhead destroys the performance when too many cores are in use?
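For reference, the sweep looks roughly like this (MYPROGRAM stands in for my actual binary; the commands are printed as a dry run, drop the "echo" to actually execute and time them):

```shell
#!/bin/sh
# Sketch of the scaling sweep: pin MYPROGRAM (placeholder name) to the
# first N cores with taskset, for each core count of interest.
for n in 2 4 8 12 16; do
    last=$((n - 1))
    # taskset restricts the process (and all its Cilk workers) to cores 0..n-1
    echo taskset -c "0-$last" ./MYPROGRAM
done
```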
Can you post your example, or an example with similar behavior? Does your machine have a single socket (processor chip) or multiple sockets? Some ways the behavior can happen:
- As you note, work chunks might be too coarse, and so some threads starve and just run interference trying to steal.
- The communication cost of stealing work might exceed the gains from parallelism. For example, short back-to-back cilk_for loops with high bandwidth/compute ratios are sometimes slowed down by random work-stealing.
- A thread calls a package that is internally multithreaded, thus causing oversubscription.
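On the first point: Cilk Plus lets you override the default chunking of a cilk_for with the grainsize pragma, which is one knob worth experimenting with. A minimal sketch (the loop body and the grainsize value are made up for illustration; it needs a Cilk Plus compiler):

```c
#include <cilk/cilk.h>

void scale(double *a, long n, double s)
{
    /* With too small a grainsize, each steal moves little work and the
       stealing overhead dominates; with too large a grainsize, idle
       workers starve near the end.  2048 is an arbitrary starting
       point to tune, not a recommendation. */
    #pragma cilk grainsize = 2048
    cilk_for (long i = 0; i < n; ++i)
        a[i] *= s;
}
```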
Shouldn't CILK_NWORKERS also be set for such a purpose?
As my 12-core box is a dual-socket Westmere (HT disabled), to run on 8 cores with Cilk(tm) Plus using data initialized under OpenMP, I use:
export KMP_AFFINITY="proclist=[1,3,4,5,7,9,10,11],explicit" (effectively setting OMP_NUM_THREADS=8)
export CILK_NWORKERS=8
taskset -c 1,3,4,5,7,9,10,11 ./a.out
The intent is to use a single core on each of the 4 cache paths per CPU.
With that setup I get the whole expected range of scaling characteristics: some cases run 50% faster with 12 workers than with 8, and one runs 60% faster on 8 cores than on 12. I don't see an obvious way to predict which cases fall into which category. Given the 4 cache paths per CPU, it's not entirely a surprise to see some cases max out at 8 workers (possibly sooner when the workers are not carefully distributed).
Even with the verbose option set, it's not clear how taskset affects KMP_AFFINITY, but the assumption that taskset doesn't influence the OpenMP affinity, while it does (as you found) limit cilkrts, seems to work out.
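One way to see which CPU mask a process actually inherited, independently of what KMP_AFFINITY reports, is to read the kernel's own view from /proc; both the Cilk workers and the OpenMP threads start from this same inherited mask:

```shell
#!/bin/sh
# Print the CPU affinity mask the current shell inherited (Linux).
# Run this under taskset to confirm the mask actually took effect, e.g.:
#   taskset -c 1,3,4,5,7,9,10,11 sh -c 'grep Cpus_allowed_list /proc/self/status'
grep Cpus_allowed_list /proc/self/status
```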