Hi, I'm threading loops via OpenMP and Pthreads. I am working on a 4-socket quad-core machine (4 packages with 4 cores per package) with Xeon E7340 processors. Below, (a) and (b) are two loops that I threaded with OpenMP. The only difference is the body of the for loop. I set the thread count via "export OMP_NUM_THREADS=4" and set the affinity via KMP_AFFINITY="explicit,proclist=[....]".
The odd thing is that if I pin the 4 threads to cores that reside on different packages, I get a factor of 4 speedup for both loops. However, if I pin the threads to 4 different cores on the same package, I get a factor of 4 speedup for loop (b) but no speedup for loop (a).
I don't believe there is any cache thrashing going on (each core has its own L1 cache, and 2 cores per package share the same 4 MB L2 cache), because I can set the OpenMP parameters so that each thread acts on chunks larger than the cache. The behavior also occurs for loop sizes spanning many orders of magnitude (tested up to 2^28). I also don't believe this is an issue with OpenMP/Pthreads initialization, since I do get the speedup for loop (a) when all cores reside on different packages. The same thing happens with Pthreads, where affinity is set via pthread_setaffinity_np(....). (Also, if I use all 16 cores, I get a factor of 16 speedup for loop (b), but only 4 for loop (a).)
I am using the 64-bit version of the MKL library 20100414Z, linking the following libraries: -L$(MKL_PATH) -liomp5 -lmkl_intel_lp64 -lmkl_core -lmkl_sequential -lpthread -lrt, and using the following compiler options: -O3 -ip -axPTW -D_GNU_SOURCE -openmp (similar results occur if I turn off optimizations with -O0).
Can somebody explain why I get the expected speedup for both loops when the threads are pinned to cores on different packages, but for only one of the loops when the threads are pinned to cores on the same package?
(a) #pragma omp parallel for default (none) \
for (i=0;i<n;i++) c[i]=a[i]*b[i];
(b) #pragma omp parallel for default (none) \
for (i=0;i<n;i++) c[i]=(sin(a[i])+cos(b[i]))*exp(-a[i]);
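For reference, loop (a) can be written out as a complete function. This is only a minimal sketch, not the poster's actual code: the function name multiply_arrays, the double element type, and the shared(a, b, c, n) clause (standing in for whatever followed the line continuation in the post) are all assumptions. If the compiler is run without its OpenMP flag, the pragma is ignored and the loop simply runs serially.

```c
/* Loop (a) as a function: c[i] = a[i] * b[i] for i in [0, n).
 * Assumptions (not from the original post): the name multiply_arrays,
 * double elements, and the contents of the shared(...) clause. */
void multiply_arrays(const double *a, const double *b, double *c, long n)
{
    long i;

    /* default(none) requires every variable used in the region to be
     * listed explicitly; the loop index i is made private automatically
     * because it is the iteration variable of the parallel for. */
    #pragma omp parallel for default(none) shared(a, b, c, n)
    for (i = 0; i < n; i++) {
        c[i] = a[i] * b[i];
    }
}
```

Loop (b) would differ only in the statement inside the for body.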
Loop a) is memory-bandwidth limited. If your memory system is set up for NUMA (a BIOS setting), then each socket has full-speed access to the memory attached to its own processor and slower access to memory attached to other processors. On the other hand, if the memory system interleaves addresses, then each socket gets the faster access about 1/4 of the time and the slower access about 3/4 of the time; however, because the addresses are interleaved, more fetches/stores can be in flight at once (whether they actually are, I am not certain). You will see better performance in a NUMA configuration when all (or most) memory accesses go to locally attached memory (or hit in cache). Either way, four threads pinned to one package compete for that package's share of memory bandwidth, whereas threads spread across four packages each get their own.
Loop b) is compute intensive, or at least has a higher number of compute cycles versus memory fetch/store cycles. It appears that there is sufficient memory bandwidth in one socket that it does not affect your scaling test for that loop.
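One way to check the bandwidth explanation is to convert a measured run time of loop (a) into an effective memory bandwidth, then compare the same-package and different-package runs. The following is a rough sketch under stated assumptions: two reads and one write per iteration with no cache reuse, 8-byte elements if the arrays hold doubles, and hypothetical helper names (loop_a_bytes, effective_bandwidth). Write-allocate traffic is ignored, so the true figure may be somewhat higher.

```c
/* Bytes moved by loop (a) over n iterations.
 * Assumes each iteration reads one element of a, one of b, and writes
 * one element of c, each elem_size bytes wide (8 for double). */
double loop_a_bytes(double n, double elem_size)
{
    return 3.0 * elem_size * n;
}

/* Effective bandwidth in bytes per second for a run that took
 * `seconds` of wall-clock time. Returns 0 for a non-positive time. */
double effective_bandwidth(double n, double elem_size, double seconds)
{
    if (seconds <= 0.0) {
        return 0.0;
    }
    return loop_a_bytes(n, elem_size) / seconds;
}
```

For n = 2^28 and 8-byte elements this comes to 6 GiB of traffic per pass. If the figure for four threads on one package is about the same as for a single thread, the loop is probably saturating that package's path to memory. Loop b) does far more arithmetic per byte moved, so it would not be expected to hit the same limit.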