Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Threads Underperforming (OMP/PThreads)


Hi, I'm threading loops via OMP and PThreads. I am working on a system with four quad-core Xeon E7340 processors (4 packages with 4 cores per package). Below, (a) and (b) are two loops that I threaded with OMP. The only difference is in the body of the for loop. I set the thread count via "export OMP_NUM_THREADS=4" and set the affinity via KMP_AFFINITY="explicit,proclist=[....]".

The odd thing is that if I pin the 4 threads to cores that reside on different packages, I get a factor of 4 speedup for both loops. However, if I pin the threads to 4 different cores on the same package, I get a factor of 4 speedup for loop (b) but no speedup for loop (a). (Also, if I use all 16 cores, I get a factor of 16 speedup for loop (b), but only 4 for loop (a).)

I don't believe there is any cache thrashing going on (each core has its own L1 cache, and each pair of cores in a package shares a 4 MB L2 cache), because I can set the OMP parameters such that each thread acts on chunks larger than the cache. The behavior also occurs for loop sizes spanning many orders of magnitude (tested up to 2^28). I also don't believe this is an issue with OMP/PThreads initialization, since I do get the speedup for loop (a) when all cores reside on different packages. The same thing happens with pthreads, where affinity is set via pthread_setaffinity_np(....).

I am using the 64-bit version of the MKL library 20100414Z, linking the following libraries: -L$(MKL_PATH) -liomp5 -lmkl_intel_lp64 -lmkl_core -lmkl_sequential -lpthread -lrt, and using the following compiler options: -O3 -ip -axPTW -D_GNU_SOURCE -openmp (similar results occur if I turn off optimizations with -O0).

Can somebody explain why both loops get the expected speedup when the threads are pinned to cores on different packages, but only some loops speed up when the threads are pinned to cores on the same package?



(a) #pragma omp parallel for default(none) \
        private(i) shared(a,b,c,n)
    for (i=0;i<n;i++) c[i]=a[i]*b[i];

(b) #pragma omp parallel for default(none) \
        private(i) shared(a,b,c,n)
    for (i=0;i<n;i++) c[i]=(sin(a[i])+cos(b[i]))*exp(-a[i]);
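For readers who want to reproduce the pthreads variant of the pinning described above, here is a minimal sketch, assuming Linux with glibc. The helper names `pin_self_to_cpu` and `first_allowed_cpu` are hypothetical and are not taken from the original post. Which logical CPU numbers share a package is machine specific, so the CPU number passed in has to come from the actual topology of the system.

```c
#define _GNU_SOURCE            /* needed for cpu_set_t and pthread_setaffinity_np */
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/* Hypothetical helper: pin the calling thread to a single logical CPU.
   Returns 0 on success, or the error code from pthread_setaffinity_np. */
int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;

    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* Hypothetical helper: lowest-numbered logical CPU the calling thread
   is currently allowed to run on, or -1 if the mask cannot be read. */
int first_allowed_cpu(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    if (pthread_getaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}
```

Each worker thread would call `pin_self_to_cpu` at the start of its thread function. Comparing a run pinned within one package against a run pinned across packages then only requires changing the CPU numbers.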

2 Replies
Valued Contributor II
>>...Can somebody explain why in both loops I get the expected speed up when the threads are
>>pinned cores on different packages...

Here are a couple of comments:

1. '...I don't believe there is any cache thrashing...' Did you run VTune to verify it?
2. I think you get the expected speedups because different packages do not share cache lines (again, use VTune to confirm it).
3. Did you review the datasheet for the E7340 chip on the website for details about its cache lines?
4. A complete test case would help to reproduce the problem, and I could try to verify your results on an Ivy Bridge system (A).
5. '...tested up 2^28...' For what data type (single- or double-precision)?

(A) Intel Core i7-3840QM (Ivy Bridge / 4 cores / 8 logical CPUs / 32GB of physical memory / 96GB of virtual memory)
Black Belt

Loop a) is memory bandwidth limited. If your memory system is set up for NUMA (a BIOS setting), then each socket has full-speed access to the memory attached to its own processor, and slower access to memory attached to other processors. On the other hand, if the memory system interleaves addresses, then every socket gets the faster access 1/4th of the time and the slower access 3/4ths of the time. Being interleaved, however, more fetches/stores can be in flight (whether they actually are, I am not certain). You will see faster performance in the NUMA configuration when all (or most) memory accesses go to locally attached memory (or hit in cache).

Loop b) is compute intensive, or at least has a higher number of compute cycles versus memory fetch/store cycles. It appears that there is sufficient memory bandwidth in one socket so as not to affect your scaling test.
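One way to check the bandwidth explanation is to convert the run time of loop a) into bytes moved per second. The following is a minimal sketch, assuming double-precision arrays (the thread does not state the data type). The function names are hypothetical, and the byte count ignores any extra write-allocate traffic.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Loop a): per element, two loads, one store and a single multiply. */
void loop_a(const double *a, const double *b, double *c, long n)
{
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}

/* Bytes loop a) reads and writes for n elements (assumes doubles;
   write-allocate traffic is not counted). */
double loop_a_bytes(long n)
{
    return 3.0 * (double)sizeof(double) * (double)n;
}

/* Effective bandwidth in GB/s for a given byte count and run time. */
double bandwidth_gbs(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / seconds / 1e9;
}

/* Wall-clock seconds for one pass of loop a), using C11 timespec_get. */
double time_loop_a(const double *a, const double *b, double *c, long n)
{
    struct timespec t0, t1;

    timespec_get(&t0, TIME_UTC);
    loop_a(a, b, c, n);
    timespec_get(&t1, TIME_UTC);
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
}
```

With arrays much larger than the caches, `bandwidth_gbs(loop_a_bytes(n), time_loop_a(a, b, c, n))` gives a rough figure for each pinning. If that figure stays flat as threads are added within one package but rises when the threads are spread across packages, that would be consistent with loop a) saturating the memory path of a single socket.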

Jim Dempsey
