
Cilk vs OpenMP (2x speedup even with one thread)

magicdream
Beginner
My program uses cilk_for in the main loop and a parallel reducer in the inner for loop.
Cilk is a really magic thing for me :). I got a 2x speedup over OpenMP with any number of worker threads, including 1 (I set 1 thread as described in http://software.intel.com/en-us/forums/showthread.php?t=83541). How is this possible?
The graphs in the task manager also look strange to me. With 2 worker threads, at first only one core is working hard, and only after a while does the second core start working; the delay before the second core kicks in is quite long. With OpenMP I get a full load on both cores from start to finish, yet the execution time is twice as long.
I don't understand where Intel Cilk gets this speedup with 1(!) or 2 workers, especially with 1. How can you explain this?
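
Roughly, the structure is like the sketch below. This is only a minimal sketch: the loop sizes, the double element type, and the worker-count call are placeholders, and the real code is in the gist linked later in the thread.

    // Minimal sketch of the Cilk Plus version (placeholder sizes and types).
    #include <cilk/cilk.h>
    #include <cilk/cilk_api.h>
    #include <cilk/reducer_opadd.h>
    #include <cstdio>

    int main()
    {
        // One way to force a single worker, as in the linked thread;
        // setting CILK_NWORKERS=1 in the environment works as well.
        __cilkrts_set_param("nworkers", "1");

        const int n = 4000, m = 4000;          // placeholder problem size
        cilk::reducer_opadd<double> sum(0.0);  // race-free accumulator

        cilk_for (int i = 0; i < n; ++i) {     // outer loop: parallel
            for (int j = 0; j < m; ++j)        // inner loop: feeds the reducer
                sum += (double)i * j;
        }

        std::printf("%f\n", sum.get_value());
        return 0;
    }

And the OpenMP version it is compared against:

    // Minimal sketch of the OpenMP version (same placeholder sizes).
    #include <omp.h>
    #include <cstdio>

    int main()
    {
        omp_set_num_threads(1);                // match the single-worker Cilk run

        const int n = 4000, m = 4000;
        double sum = 0.0;

        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < m; ++j)
                sum += (double)i * j;

        std::printf("%f\n", sum);
        return 0;
    }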
Barry_T_Intel
Employee
The speedup with 1 worker may be a result of the bug we discussed in the thread on the Cilk++ forum.

Can you post the source of your test so we can try it here?

- Barry
magicdream
Beginner
The source code is available at https://gist.github.com/1038511. There are two versions of my program: Cilk and OpenMP. I assembled each one into a single file for simpler compilation and viewing. Sorry if it doesn't compile as a single file, but it should be fine to read.

Both versions were compiled with ICPC v12.0.4 and -O3. I tried this on my Intel Core 2 Duo and on an Intel Xeon X5365 machine (2 processors).
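
For reference, the compile lines would have been roughly the following (file names are placeholders; -openmp enables the OpenMP pragmas on that compiler, while Cilk Plus needs no extra switch):

    icpc -O3 -o test_cilk test_cilk.cpp
    icpc -O3 -openmp -o test_omp test_omp.cpp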
magicdream
Beginner
Maybe Intel Cilk Plus uses CPU vector features (SSE) in the reduction? I think that would be a good explanation.
Brandon_H_Intel
Employee
I took a look at this - actually the OpenMP* reduction was vectorizing but the Cilk Plus version was not, surprisingly. I've submitted a problem report to our vectorizer team on that.
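
For anyone who wants to check this on their own build, the compiler's vectorization report is one way to see which loops were vectorized (the report level shown here is just a suggestion, and the file names are placeholders):

    icpc -O3 -vec-report2 test_cilk.cpp
    icpc -O3 -openmp -vec-report2 test_omp.cpp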

For the one-thread case, I think the "lazy scheduler" of Cilk Plus, where the bulk of the runtime initialization doesn't happen until actual parallel work starts, helps it perform better. The two-thread case is less clear to me, but the workload is still pretty small - it might be interesting to see what happens with a workload that takes a bit more time and might amortize the scheduling overheads.
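
If the worker count is not hard-coded in the source, the runtime environment variables make it easy to rerun the same binaries with different thread counts while experimenting with a larger problem size (binary names are placeholders):

    CILK_NWORKERS=1 ./test_cilk
    CILK_NWORKERS=2 ./test_cilk
    OMP_NUM_THREADS=1 ./test_omp
    OMP_NUM_THREADS=2 ./test_omp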