Software Archive
Read-only legacy content

why does cilk_spawn give me such bad performance?

rnickb
Beginner

I wrote a simple benchmark that fills two vectors with a sequence of randomly computed numbers. I'm comparing tbb, cilk, and std::async. The results I get are (milliseconds)

time cilk: 3725
time async: 2367
time tbb: 2397
time serial: 4610
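
For reference, each timed section boils down to roughly the following sketch (this is a simplified stand-in, not my exact code; the fill kernel and vector sizes are placeholders):

#include <cilk/cilk.h>
#include <tbb/tbb.h>
#include <chrono>
#include <future>
#include <iostream>
#include <random>
#include <vector>

static void fill(std::vector<double>& v) {
    std::mt19937_64 gen(std::random_device{}());
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    for (auto& x : v) x = dist(gen);
}

static std::chrono::steady_clock::time_point t0;
static void start() { t0 = std::chrono::steady_clock::now(); }
static void report(const char* name) {
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - t0).count();
    std::cout << "time " << name << ": " << ms << std::endl;
}

int main() {
    std::vector<double> a(50000000), b(50000000);  // sizes are a guess

    start();
    cilk_spawn fill(a);  // one spawn and one sync: two-way parallelism
    fill(b);
    cilk_sync;
    report("cilk");

    start();
    auto f = std::async(std::launch::async, fill, std::ref(a));
    fill(b);
    f.get();
    report("async");

    start();
    tbb::parallel_invoke([&] { fill(a); }, [&] { fill(b); });
    report("tbb");

    start();
    fill(a);
    fill(b);
    report("serial");
    return 0;
}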

I compiled with (icc 2016)

icc -std=c++14 -xHost -O3 -tbb -pthread benchmark.cpp

And I'm running on

uname -a
Linux localhost 3.10.25-gentoo #1 SMP Wed Jan 29 22:44:07 EST 2014 x86_64 Intel(R) Xeon(R) CPU E3-1245 v3 @ 3.40GHz GenuineIntel GNU/Linux

Why is cilk so much slower than tbb and std::async?

4 Replies
Jim_S_Intel
Employee
1,029 Views

I'm not in front of a system that has the latest compiler, so I can't run your code today. But here are a few things you might experiment with:

1. For the Cilk Plus version, it looks like there is only two-way parallelism in the code, because of the one spawn and one sync? Thus, I would try the experiment with 1 or 2 threads on all the platforms (e.g., setting CILK_NWORKERS=1 or CILK_NWORKERS=2) to see if there is still a difference. There are some cases that we haven't necessarily optimized for as heavily as TBB, or where we made different tradeoffs (e.g., the case of having many more worker threads than tasks available), so it is possible you might have run into one of those cases.
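
For example, assuming a bash-style shell:

CILK_NWORKERS=2 ./benchmark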

Is this code representative of a real world example, or something that is simplified to isolate a performance issue?

2. If you run all the versions with a single thread, do you get about the same running times across the versions? (For Cilk Plus, that would be setting CILK_NWORKERS=1.) If not, then there might be some (lack of) compiler optimization getting in the way.

3. Does it matter if the order of the benchmarks is switched around, or if you run them in a loop? I could imagine there being some initialization code somewhere that you are measuring in one but that isn't getting included in the others. I also don't know how reliable the results will be if you run all four versions in the same program. Ideally for this benchmark, the threads created by each platform would not interfere with each other, but I haven't tried it myself to see whether that is true.

Thanks,

Jim
rnickb
Beginner

Thanks for getting back to me so quickly, Jim.

1. It does perform comparably to TBB if I set CILK_NWORKERS=2. But I don't understand why it can't give decent performance with the default number of workers in cases where there are only a few tasks, the way TBB does.

The code was just meant to be a simple example to test the performance of Cilk and TBB, but it could easily correspond to some real-world problems.

2. Yes, running with a single thread gives times comparable to the serial version of the code.

3. I've tried switching the order, separating the versions into separate programs, and running the test twice (to mitigate initialization overhead). The results don't differ by anything significant.

Bradley_K_
New Contributor I

I suspect that what's going on is that __cilkrts_yield() isn't quite yielding enough, and CPU cycles are being wasted, especially on a hyperthreaded system.

I wasn't able to get exactly that benchmark to compile on my system, but I built a similar version that does compile (it's a long, sad story about bad gcc 5 headers and icc; maybe another time).

If I run 

taskset 3 ./benchmark

I get good behavior. (taskset 3 restricts the benchmark to two logical CPUs.) The problem is that with hyperthreading, the unsuccessful stealer is doing a __cilkrts_yield(), which isn't enough to keep it from using up execution resources that the working thread needs.

I'm not sure what the solution should be. If I replace sched_yield() [which is called by __cilkrts_yield()] with this code

int sched_yield(void) { for (int i = 0; i < 4000; i++) _mm_pause(); return 0; }

then the performance is much better, but not perfect. If I do

int sched_yield(void) { for (int i = 0; i < 4000; i++) _mm_pause(); usleep(1); return 0; }

the performance is better still.

If I set the usleep argument to 1000, then tbb and cilk behave essentially the same.
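
(One way to wire such a replacement in without rebuilding anything is an LD_PRELOAD shim along these lines; this is an illustrative sketch, not necessarily how I produced the numbers above.)

// yield_shim.cpp -- hypothetical interposer for sched_yield().
// Build and run, assuming icc:
//   icc -shared -fPIC -o yield_shim.so yield_shim.cpp
//   LD_PRELOAD=./yield_shim.so ./benchmark
#include <immintrin.h>
#include <unistd.h>

extern "C" int sched_yield(void)
{
    for (int i = 0; i < 4000; i++)
        _mm_pause();  // let the sibling hyperthread use the core's resources
    usleep(1);        // then briefly give the CPU back to the OS
    return 0;
}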

I haven't tried hacking the cilkrts to do it, but I suspect the right answer is a graduated backoff, something like this (a rough sketch in code follows the list):

  1. If the steal_failure_count is O(CILK_NWORKERS), then don't do anything.
  2. If the steal_failure_count is of medium magnitude, then do some _mm_pause() instructions.
  3. If the steal_failure_count is of medium-large magnitude, then also do a sched_yield().
  4. If the steal_failure_count is very large, then do a usleep(1000) or usleep(10000).
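
In code, the idea might look like the following (the thresholds here are made up for illustration; a real fix would live inside the runtime's steal loop):

#include <immintrin.h>
#include <sched.h>
#include <unistd.h>

// Hypothetical backoff for a worker that keeps failing to steal.
// "failures" is this worker's consecutive steal-failure count.
static void steal_backoff(unsigned long failures, unsigned long nworkers)
{
    if (failures < nworkers) {
        // 1. O(CILK_NWORKERS) failures: just try to steal again
    } else if (failures < 64 * nworkers) {
        for (int i = 0; i < 40; i++) _mm_pause();  // 2. spin politely
    } else if (failures < 1024 * nworkers) {
        sched_yield();                             // 3. yield the time slice
    } else {
        usleep(1000);                              // 4. actually sleep
    }
}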

I have to do something like this if I want to include Cilk in production code for databases. It turns out that database operators use CPU load to determine whether the system is healthy, and they hate that libcilkrts pegs the CPU load at 100%. Those usleep() calls become important for the database people.

Another approach would be to use a pthread condition variable to wake up a worker that has gone to sleep after several unsuccessful steals. But this idea is only about 1/4-baked.

-Bradley
Jim_S_Intel
Employee

Thanks, Bradley! We had a similar bug report a while ago which turned out to be __cilkrts_yield()-related, which would confirm your analysis. I thought we had put in a usleep at one point to fix it, but maybe that change didn't make it out into a released version, or we didn't tune the yield loop quite right... :(

My original thought was that there are minor differences in how TBB and Cilk Plus deal with waking threads up and putting them to sleep. But I would agree that the yield problem sounds like the much more likely culprit.


Anyway, it sounds like we should take another look at the code and see if we can find a good fix...
Thanks,

Jim