I have a Fortran code that uses both MPI and OpenMP. I have done some profiling on an 8-core Windows laptop, varying the number of MPI tasks vs. OpenMP threads, and have some understanding of where performance bottlenecks for each parallel method might surface. The problem arises when I port the code to a Linux cluster with several 8-core nodes: there, my OpenMP thread performance is very poor. Running 8 MPI tasks per node is significantly faster than 8 OpenMP threads per node (1 MPI task), but even 4 MPI tasks + 2 OpenMP threads each runs very slowly, more so than I can attribute solely to thread starvation. I saw a few related posts in this area and am hoping for further insight and recommendations on this issue. What I have tried so far ...
1. setenv OMP_WAIT_POLICY active ## seems to make sense
2. setenv KMP_BLOCKTIME 1 ## this is counter to what I have read, but when I set it to a large number (25000) the code is very slow
3. removed some old "unlimited" limit settings (viz., stacksize, coresize) that I have had since the "dawn of time." This also helped OpenMP thread performance significantly.
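For reference, here is a sketch of the settings above in Bourne-shell syntax (the post uses csh's setenv); the stack-size value is an example, not a recommendation:

```shell
# Keep idle OpenMP threads spinning between parallel regions instead of sleeping
export OMP_WAIT_POLICY=active
# Milliseconds a thread busy-waits after a parallel region before sleeping
# (Intel runtime); a large value here made the code very slow in my case
export KMP_BLOCKTIME=1
# Replace the old "unlimited" stack with a bounded value (example: 8 MB)
ulimit -s 8192
echo "OMP_WAIT_POLICY=$OMP_WAIT_POLICY KMP_BLOCKTIME=$KMP_BLOCKTIME"
```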
It seems I am looking for ways to reasonably ensure my OpenMP threads don't vanish between the parallel regions in the code, and to keep these threads as lightweight on the system as possible. The corrections above do not seem to affect MPI task performance. Are there any other recommendations? By the way, the MPI tasks use an mvapich library on a cluster with InfiniBand. The code is compiled with "-openmp" (-Qopenmp).
Thank you in advance.
I'm guessing that you haven't done anything to control affinity when you combine multiple ranks with OpenMP on a node. This is particularly important if your nodes are NUMA, where you should pin each rank to a group of cores that share a cache, and take care to spread threads across distinct physical cores (which may be difficult if HyperThreading is enabled).
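Before changing anything, it may help to verify where your threads are actually landing. A hedged sketch of the usual inspection knobs (Intel OpenMP runtime and Linux tooling assumed):

```shell
# Ask the OpenMP runtime to print its effective ICVs (places, bind policy,
# thread count) at program startup
export OMP_DISPLAY_ENV=true
# Intel-runtime-specific: print each thread's core binding as it is created
export KMP_AFFINITY=verbose
# On Linux, "taskset -cp <pid>" shows the core mask a running process
# inherited from its launcher (e.g. from mvapich's own pinning)
echo "OMP_DISPLAY_ENV=$OMP_DISPLAY_ENV KMP_AFFINITY=$KMP_AFFINITY"
```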
It's not clear to me, even after reading around, whether mvapich has hybrid affinity options similar to Intel MPI (where this works by default) or Open MPI (where you must specify it). It seems that mvapich is (or was, four years ago) not designed for this, but it should work if you disable mvapich's own affinity and set up your job to specify a separate OpenMP affinity group for each rank, using OMP_PLACES or KMP_AFFINITY.
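That last approach can be sketched as a per-rank wrapper script. This is a sketch under assumptions, not a tested recipe: mvapich2 exports the local rank as MV2_COMM_WORLD_LOCAL_RANK and honors MV2_ENABLE_AFFINITY=0 to disable its own pinning; adjust both names for your library version. Cores are assumed to be numbered consecutively within a NUMA domain.

```shell
#!/bin/sh
# Per-rank wrapper: give each MPI rank its own block of consecutive cores
# via OMP_PLACES, after disabling mvapich's built-in affinity.
# Usage (hypothetical): mpirun -np 4 ./wrapper.sh ./a.out
export MV2_ENABLE_AFFINITY=0          # let OpenMP, not mvapich, do the pinning
RANK=${MV2_COMM_WORLD_LOCAL_RANK:-0}  # this rank's index on the node
NT=${OMP_NUM_THREADS:-2}              # OpenMP threads per rank
FIRST=$((RANK * NT))                  # first core of this rank's block
LAST=$((FIRST + NT - 1))              # last core of this rank's block
export OMP_PLACES="{$FIRST}:$NT"      # NT single-core places from FIRST up
export OMP_PROC_BIND=close            # pack the rank's threads onto its places
echo "rank $RANK -> cores $FIRST-$LAST"
if [ $# -gt 0 ]; then exec "$@"; fi   # hand off to the real executable
```

With 4 ranks x 2 threads on an 8-core node this maps rank 0 to cores 0-1, rank 1 to cores 2-3, and so on, which keeps each rank's threads sharing a cache. KMP_AFFINITY could express the same binding for the Intel runtime specifically.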