Intel MKL performance degrade a lot when I combine it with openMPI

I am using the Intel Math Kernel Library (MKL) to implement my algorithm, and I set the number of threads to 16. My program works well on its own. However, when I combine MKL with MPI and run my program with

mpirun -n 1 ./MMNET_MPI

I expected this to give the same result as running my program directly:

./MMNET_MPI

However, the performance degrades a lot: although I request 16 threads, only 2 or 3 are actually active. I am not sure what the problem is. The relevant part of my MKL code is as follows.

void LMMCPU::multXXTTrace(double *out, const double *vec) const {

  double *snpBlock = ALIGN_ALLOCATE_DOUBLES(Npad * snpsPerBlock);
  double (*workTable)[4] = (double (*)[4]) ALIGN_ALLOCATE_DOUBLES(omp_get_max_threads() * 256 * sizeof(*workTable));

  // store the temp result
  double *temp1 = ALIGN_ALLOCATE_DOUBLES(snpsPerBlock);
  for (uint64 m0 = 0; m0 < M; m0 += snpsPerBlock) {
    uint64 snpsPerBlockCrop = std::min(M, m0 + snpsPerBlock) - m0;
#pragma omp parallel for
    for (uint64 mPlus = 0; mPlus < snpsPerBlockCrop; mPlus++) {
      uint64 m = m0 + mPlus;
      if (projMaskSnps)
        buildMaskedSnpCovCompVec(snpBlock + mPlus * Npad, m,
                                 workTable + (omp_get_thread_num() << 8));
      else
        memset(snpBlock + mPlus * Npad, 0, Npad * sizeof(snpBlock[0]));
    }

    for (uint64 iter = 0; iter < estIteration; iter++) {
      // compute A = X^T * V
      MKL_INT row = Npad;
      MKL_INT col = snpsPerBlockCrop;
      double alpha = 1.0;
      MKL_INT lda = Npad;
      MKL_INT incx = 1;
      double beta = 0.0;
      MKL_INT incy = 1;
      cblas_dgemv(CblasColMajor,
                  CblasTrans,
                  row,
                  col,
                  alpha,
                  snpBlock,
                  lda,
                  vec + iter * Npad,
                  incx,
                  beta,
                  temp1,
                  incy);

      // compute XA
      double beta1 = 1.0;
      cblas_dgemv(CblasColMajor, CblasNoTrans, row, col, alpha, snpBlock, lda, temp1, incx, beta1, out + iter * Npad,
                  incy);

    }

  }
  ALIGN_FREE(snpBlock);
  ALIGN_FREE(workTable);
  ALIGN_FREE(temp1);
}

 

7 Replies
Black Belt

I don't remember the details for OpenMPI, but it is important to check the binding of the MPI tasks -- it is common for MPI stacks to default to binding each MPI task to the same set of cores (a good choice if they are all running on different nodes, and a bad choice if they are all running on the same node). Your description of only seeing 2-3 threads running points to this sort of problem.

You probably also want to set the environment variable "MKL_NUM_THREADS" to the number of physical cores divided by the number of MPI tasks running on those cores.  By default, each *process* running MKL will try to use all the cores that are available, and in this case each MPI rank is a different process.
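As a sketch of that rule of thumb (the core count is taken from the lscpu output later in this thread; the rank count is an assumed, illustrative value):

```shell
# Sketch only: 44 physical cores on the node (from lscpu below), with an
# assumed 4 MPI ranks placed on that node => 11 MKL threads per rank.
CORES=44
RANKS=4
export MKL_NUM_THREADS=$(( CORES / RANKS ))
echo "MKL_NUM_THREADS=$MKL_NUM_THREADS"
```

The launch (e.g. `mpirun -n 4 ./MMNET_MPI`) then inherits this variable, subject to whatever binding policy the MPI stack applies.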

"Dr. Bandwidth"
Novice

Did you run this on Intel- or AMD-based hardware?

best

Michael


McCalpin, John (Blackbelt) wrote:

I don't remember the details for OpenMPI, but it is important to check the binding of the MPI tasks -- it is common for MPI stacks to default to binding each MPI task to the same set of cores (a good choice if they are all running on different nodes and a bad choice if they are all running on the same node).   Your description of only seeing 2-3 threads running sounds like it is pointing to this sort of problem.

You probably also want to set the environment variable "MKL_NUM_THREADS" to the number of physical cores divided by the number of MPI tasks running on those cores.  By default, each *process* running MKL will try to use all the cores that are available, and in this case each MPI rank is a different process.

Actually, I have set the number of threads in my program using "omp_set_num_threads". At first, I also believed there was something wrong with the MPI task binding. However, I ran another test: when I executed only the for loop in my program using `#pragma omp parallel for`, I found I could use all 16 threads normally. That is what confuses me: as soon as I call the MKL API, only 2 or 3 threads are active.
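One quick way to check whether binding is the culprit (a diagnostic sketch using standard Linux tools, not part of the original program) is to inspect the CPU affinity the launched process actually receives:

```shell
# Sketch: how many CPUs may this process actually use? nproc honors the
# affinity mask, so if mpirun bound the rank to only 2-3 cores, that is all
# OpenMP/MKL can ever run on -- matching the observed 2-3 active threads.
nproc
grep Cpus_allowed_list /proc/self/status   # explicit affinity list on Linux
```

Running the same commands under the launcher, e.g. `mpirun -n 1 sh -c 'nproc'`, shows the mask a rank inherits.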


drMikeT wrote:

Did you run this on an Intel or an AMD based h/w ?  

best

Michael

 

My CPU information is as follows.

 

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              44
On-line CPU(s) list: 0-43
Thread(s) per core:  1
Core(s) per socket:  22
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz
Stepping:            4
CPU MHz:             1252.786
CPU max MHz:         2101.0000
CPU min MHz:         1000.0000
BogoMIPS:            4200.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            30976K
NUMA node0 CPU(s):   0-21
NUMA node1 CPU(s):   22-43

 


>>mpirun -n 1 ./MMNET_MPI...
>>Actually, I have set the threads used in my program by using "omp_set_num_threads".

The mpirun -n 1 {image} will restrict the MPI job to one node. A node is one "memory sub-system", be it physical or virtual; a node is essentially a socket (physical CPU chip). These may be on the same motherboard or networked in some manner. The system in #5 has two sockets, thus two nodes. However, it may also be networked in a cluster, and thus the complete MPI "system" may have many more nodes available to it.

The "-n 1" restricts the MPI application to one of the available nodes. Which one is not specified on the command line, but a preference can be expressed via an environment variable or a config file.

Assuming nothing is specified as to which host or node to run on, the application may be free to choose the node (possibly the least used, or always node 0). For the system listed in post #5, the application will use one of the NUMA nodes with 22 cores. This will be 22 processes, one per core. These are pinned processes (at least with the Intel MPI system).

Now, within each process (each rank), you have specified 16 OpenMP threads. *** This means each rank will run 16 OpenMP threads restricted to its single core (hardware thread). I do not think this is what you intended.

You have not configured your run to match your expectations.

What you could do is

mpirun -n 1 -ppn 2 ./MMNET_MPI...      (-ppn n is processes per node)

and then use up to 11 OpenMP threads (which would be the default on your system with -ppn 2).
This states: use 1 node, split that node into 2 ranks (processes), restrict (pin) each process to 1/2 the logical processors on each node (only 1 node used).

*** MKL has two libraries:
1) a threaded library using OpenMP *** that is to be linked into a single-threaded process ***
2) a single-threaded library *** that is to be linked into a multi-threaded process ***

The default is 1).

*** Because each of your ranks is multi-threaded using OpenMP, you should link in the single-threaded MKL library.
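For reference, the usual link-line difference looks roughly like this (a sketch based on common MKL link-line conventions for 64-bit Linux, not taken from the poster's build; verify against the MKL Link Line Advisor for your exact version and compiler):

```shell
# Sketch: sequential vs. threaded MKL link lines (library names assumed from
# standard MKL conventions).
SEQ="-lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl"
THR="-lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl"
echo "sequential: $SEQ"
echo "threaded:   $THR"
```

Linking the sequential variant makes each cblas_dgemv call run on the calling thread only, so the OpenMP parallelism you set up yourself is the only threading in play.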

You could also perform:

mpirun -n 2 -ppn 1 ./MMNET_MPI

and then select 16 threads for use by OpenMP.

This would provide two processes (1 per node), each with 16 of the 22 logical processors per node.
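Note that -ppn is Intel MPI syntax; since the original post uses Open MPI's mpirun, the equivalent placement options would look roughly like this (a sketch; check the mpirun(1) man page for your Open MPI version, as option names have shifted across releases):

```shell
# Sketch: Open MPI equivalents of the Intel MPI examples above (flags from
# Open MPI's standard mpirun options; ./MMNET_MPI is the poster's binary).
export OMP_NUM_THREADS=16
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
# Two ranks on one node, one per socket, each bound to its socket:
#   mpirun -n 2 --map-by ppr:1:socket --bind-to socket ./MMNET_MPI
# One rank per node, threads free to use the whole node:
#   mpirun -n 2 --map-by ppr:1:node --bind-to none ./MMNET_MPI
# Add --report-bindings to print the binding each rank actually receives.
```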

It is your responsibility to divvy up the available compute resources in a meaningful (productive) manner.

Jim Dempsey

Moderator

Hi Shunkang,

Did you get the solution you are looking for?

Do let us know.

Thanks

Prasanth

Moderator

Hi Shunkang,

We are closing this thread, assuming your performance issue was resolved after following Jim's suggestions.

Please raise a new thread if you have any further issues.

 

Thanks

Prasanth 
