I am running a visualization program (visualizing a large dataset) where I can use either MPI or pthreads. When I run it on my desktop, which has an Intel i7-2600K (4 cores, 8 threads), I get better performance using pthreads (with many threads, e.g., 32) than with MPI, which seems normal (I guess). But when I run the same code on one node of a cluster, which has an Intel Xeon E5-2680 v2 (10 cores, 20 threads), the pthreads performance is worse than MPI: about 180s using pthreads compared to 70s using MPI. Even worse, the pthreads performance on the Xeon E5-2680 v2 is lower than on the i7-2600K: it's around 100s on the 2600K but 180s on the E5-2680 (same number of threads on both). I checked with the top command and all the cores are active when I run the program.
So my question is why is that happening? Is there some other way I should be compiling the code on the E5-2680?
Are there some variables I should set, like KMP_AFFINITY or something else?
Any suggestions will be most welcome.
pthreads will not pay attention to KMP_AFFINITY.
pthreads (well, sched) does contain APIs to manipulate thread affinity.
32 software threads on 8 or 20 hardware threads (4/10 cores) may experience large variability in performance depending on which cores they run on. I suggest you look at the sched_... APIs, learn how to affinitize, and then determine the best thread assignment (logical processor assignment) for each of your pthreads. Note, you can affinitize a thread to a subset of logical processors and not just a single logical processor.
VTune will provide you some metrics to guide assigning pthreads to specific cores. If some of your threads are I/O bound, you might consider pinning that (those) thread(s) to a single logical processor (and not use that logical processor for compute-intensive pthreads).
Thanks for the replies guys.
On the i7, I was using 8 MPI processes and 20 on the E5, to match the number of hardware threads on each. The threads are not I/O bound; it's a ray-casting application where each thread works on one independent ray, so no barriers are needed. The times I'm quoting are the average of several runs, which seem to be fairly consistent.
When KMP_AFFINITY was set to compact instead of scatter, it did have an effect, as it limited my threads to only one core. So I'm wondering if there are some other parameters to set.
In the meantime, I'll look at VTune to see if that can help me debug things. More suggestions are welcome :)
Does your pthread version of the application have any OpenMP parallel regions?
If so, then be sure to create your pthreads prior to entering the first OpenMP parallel region. A pthread you create defaults to the affinity of the thread creating it. Prior to the first OpenMP parallel region, the main thread's affinity is limited only to that of the process, which is typically all hardware threads of all CPUs; in your case, this may be all threads of one specific CPU (10 cores/20 hardware threads). After the first OpenMP parallel region, when using KMP_AFFINITY, the main thread is restricted to one logical processor (one of the hardware threads), thus leading all your subsequently created pthreads to be restricted to the main thread's logical processor (CPU/hw thread).
Also, if your application is hybrid (a mixture of OpenMP and pthreads), consider experimenting with setting KMP_BLOCKTIME, and/or calling kmp_set_blocktime(ms) with 0 (or 1), such that the OpenMP threads suspend either immediately or after the shortest possible wait when they exit the outermost parallel region.
Nope, the application is only pthreads and MPI, no OpenMP.
What I'm thinking could also be happening is that maybe one thread is not being allowed to run to completion but is being swapped out by the scheduler to let another thread run, then swapped back in, and maybe repeatedly. This would slow it down. Is there some setting that would allow a thread to run to completion before swapping it out?
Use sched_getaffinity at program start and at each thread start. This function returns a bitmap of type cpu_set_t of the logical processors that the thread is permitted to run on. The main thread typically has a map of all the logical processors available, however this can be manipulated from outside the program (system admin).
The function sched_setaffinity can be used to select a subset of the logical processors that the thread is permitted to run on. The subset can be one or more of the logical processors in the set the process is restricted to.
Caution: a spawned thread generally inherits the affinity pinning of the thread that spawned it. Therefore, in the code that spawns the threads, do the spawning _before_ you affinity-pin the spawning thread. The spawned thread can then re-pin itself to any logical processors in the process set. IOW, get the process affinity mask at the start of the program and save it for future reference (for your threads to select subsets from).
There are other sched_xxx that you should look at.
Now, for the main thread and the spawned threads, use pthread_setaffinity_np to select a subset of the process affinities. Do this at thread startup.
I tried pthread_setaffinity_np and checked whether it binds my threads to one core, and it did. Unfortunately that did not help with performance.
I've learned today that they have disabled hyperthreading on the cluster with the Intel Xeon E5-2680 v2, and I strongly suspect that this is the reason for the slowdown. I'll try turning off hyperthreading on my desktop and see if that results in a slowdown on my system too.
>>I tried the pthread_setaffinity_np and checked that it binds my threads to one core and it did.
I assume you mean each thread to a different core.
At least until you fill up the cores; then you start placing two threads per core, etc.
Generally, you might want to assign weights to each thread, where the weight is a function of cache requirements and compute time, as well as possibly priority. Then at program start you create a table of assigned weights, one element per core. As each thread starts, it searches the table for the lowest assigned weight, adds its weight to that element, and then sets its affinity to that core. This way your 32 threads get distributed in a balanced manner. If this produces non-optimal performance, you can enhance the assignment algorithm with flags: thread N not to be with thread M, thread X can be with thread Y, no other thread to share my core, don't affinitize this thread, low priority, high priority, ...
Word of caution: do not abuse thread priorities. Priorities only work when everything on the system cooperates; otherwise every programmer will take the highest priority for all their threads (IOW, the system now has a single priority).