I have code with several calls to omp_set_lock() and omp_unset_lock(). When I compile it with icc I see a considerable slowdown: running with more than one thread, the time stabilizes at about double the single-thread time. When I compile with gcc, however, I get a good (but not great) speedup up to 16 threads. Comparing the running times at 16 threads, the icc build is about 52 times slower than the gcc build!
Obviously there seems to be some issue with how icc handles locks. Can anyone confirm whether this is a known issue, and whether there is a known workaround?
I am using gcc version 4.1.2 20080704 and icc version 11.1. The system has four E7-4850 CPUs.
This question might get a more useful response on the icc forum. I believe it is a known issue: libiomp5 is said to emphasize correctness over performance in its implementation of locks and barriers, while the priority in libgomp seems to be the other way round. I've heard of some performance opportunities being identified in libiomp5, but I'm still waiting for something to test. What affinity settings do you use in each case?
KMP_AFFINITY=compact ought to place your threads consistently on the smallest possible number of CPUs, while KMP_AFFINITY=none restores the default (each thread can migrate to any core the OS scheduler makes available). GOMP_CPU_AFFINITY, as understood by libgomp, is also emulated by the Intel library; GOMP_CPU_AFFINITY=0-15 should be equivalent to KMP_AFFINITY="proclist=[0-15],explicit". You may need the verbose option of KMP_AFFINITY to verify how the cores and logical processors are numbered. I didn't see whether you are testing with HyperThreading enabled.
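For concreteness, the settings above could be applied like this (the binary names `prog_icc` and `prog_gcc` are placeholders for your two builds):

```shell
# Run the icc build with threads packed onto the fewest packages;
# "verbose" prints the thread-to-core mapping so you can check the numbering.
export OMP_NUM_THREADS=16
KMP_AFFINITY="verbose,compact" ./prog_icc

# Run the gcc/libgomp build pinned explicitly to logical CPUs 0-15.
GOMP_CPU_AFFINITY="0-15" ./prog_gcc

# The Intel runtime also honors GOMP_CPU_AFFINITY, so the same variable
# can be used for an apples-to-apples comparison:
GOMP_CPU_AFFINITY="0-15" ./prog_icc
```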