Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Performance degrades when combining MPI with MKL

zhang__Shunkang
Beginner

I am new to MPI. I wrote my program using the Intel Math Kernel Library, and I want to compute a matrix-matrix multiplication by blocks, which means that I split the large matrix X into many small matrices along the columns, as shown below. My matrix is large, so each time I only compute an (N, M) x (M, N) product, where I can set M manually.

XX^Ty = X_1X_1^Ty + X_2X_2^Ty + ... + X_nX_n^Ty
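For illustration only, here is a minimal standalone sketch (not my real code; the sizes, names, and plain loops are just for clarity) of what I mean by accumulating the column blocks:

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main() {
  const int N = 8, M = 6, blockSize = 2;  // tiny sizes, just to illustrate the blocking
  std::vector<double> X(N * M), y(N), direct(N, 0.0), blocked(N, 0.0);
  for (auto &v : X) v = std::rand() / (double)RAND_MAX;
  for (auto &v : y) v = std::rand() / (double)RAND_MAX;

  // direct: out = X * (X^T * y), X stored column-major (N x M)
  std::vector<double> t(M, 0.0);
  for (int j = 0; j < M; ++j)
    for (int i = 0; i < N; ++i) t[j] += X[j * N + i] * y[i];
  for (int j = 0; j < M; ++j)
    for (int i = 0; i < N; ++i) direct[i] += X[j * N + i] * t[j];

  // blocked: accumulate X_b * (X_b^T * y) over column blocks of width blockSize
  for (int b0 = 0; b0 < M; b0 += blockSize) {
    const int bw = std::min(blockSize, M - b0);
    std::vector<double> tb(bw, 0.0);
    for (int j = 0; j < bw; ++j)
      for (int i = 0; i < N; ++i) tb[j] += X[(b0 + j) * N + i] * y[i];
    for (int j = 0; j < bw; ++j)
      for (int i = 0; i < N; ++i) blocked[i] += X[(b0 + j) * N + i] * tb[j];
  }

  double maxDiff = 0.0;
  for (int i = 0; i < N; ++i) maxDiff = std::max(maxDiff, std::fabs(direct[i] - blocked[i]));
  std::printf("max |direct - blocked| = %g\n", maxDiff);  // should be ~1e-16
  return 0;
}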

I first set the total number of threads to 16 and M to 1024, then run my program directly as follows. When I check my CPU state, I find that the CPU usage is 1600%, which is normal.

./MMNET_MPI --block 1024 --numThreads 16

However, when I run my program with MPI as follows, the CPU usage is only 200-300%. Strangely, if I change the block size to 64, I get a small improvement, up to about 1200% CPU usage.

mpirun -n 1 --bind-to none ./MMNET_MPI --block 1024 --numThreads 16

I do not know what the problem is. It seems that mpirun applies some default settings that affect my program. The following is part of my matrix multiplication code. The `#pragma omp parallel for` directive extracts the small N by M block from the compressed format in parallel. After that, I use cblas_dgemv to compute the matrix-vector products.

 

void LMMCPU::multXXTTrace(double *out, const double *vec) const {

  double *snpBlock = ALIGN_ALLOCATE_DOUBLES(Npad * snpsPerBlock);
  double (*workTable)[4] = (double (*)[4]) ALIGN_ALLOCATE_DOUBLES(omp_get_max_threads() * 256 * sizeof(*workTable));

  // store the temp result
  double *temp1 = ALIGN_ALLOCATE_DOUBLES(snpsPerBlock);
  for (uint64 m0 = 0; m0 < M; m0 += snpsPerBlock) {
    uint64 snpsPerBLockCrop = std::min(M, m0 + snpsPerBlock) - m0;
#pragma omp parallel for
    for (uint64 mPlus = 0; mPlus < snpsPerBLockCrop; mPlus++) {
      uint64 m = m0 + mPlus;
      if (projMaskSnps)
        buildMaskedSnpCovCompVec(snpBlock + mPlus * Npad, m,
                                 workTable + (omp_get_thread_num() << 8));
      else
        memset(snpBlock + mPlus * Npad, 0, Npad * sizeof(snpBlock[0]));
    }

    // compute A=X^TV
    MKL_INT row = Npad;
    MKL_INT col = snpsPerBLockCrop;
    double alpha = 1.0;
    MKL_INT lda = Npad;
    MKL_INT incx = 1;
    double beta = 0.0;
    MKL_INT incy = 1;
    cblas_dgemv(CblasColMajor,
                CblasTrans,
                row,
                col,
                alpha,
                snpBlock,
                lda,
                vec,
                incx,
                beta,
                temp1,
                incy);

    // compute XA
    double beta1 = 1.0;
    cblas_dgemv(CblasColMajor, CblasNoTrans, row, col, alpha, snpBlock, lda, temp1, incx,
                beta1, out, incy);
  }
  ALIGN_FREE(snpBlock);
  ALIGN_FREE(workTable);
  ALIGN_FREE(temp1);
}

Actually, I have checked that the following part can fully use the CPU resources. So it seems that the problem is with the cblas_dgemv calls.

 

#pragma omp parallel for
    for (uint64 mPlus = 0; mPlus < snpsPerBLockCrop; mPlus++) {
      uint64 m = m0 + mPlus;
      if (projMaskSnps)
        buildMaskedSnpCovCompVec(snpBlock + mPlus * Npad, m,
                                 workTable + (omp_get_thread_num() << 8));
      else
        memset(snpBlock + mPlus * Npad, 0, Npad * sizeof(snpBlock[0]));
    }

 

My CPU information is as follows.

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              44
On-line CPU(s) list: 0-43
Thread(s) per core:  1
Core(s) per socket:  22
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz
Stepping:            4
CPU MHz:             1252.786
CPU max MHz:         2101.0000
CPU min MHz:         1000.0000
BogoMIPS:            4200.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            30976K
NUMA node0 CPU(s):   0-21
NUMA node1 CPU(s):   22-43

 

Kirill_V_Intel
Employee

Hello,

It's hard to say. How do you compile and link your application? Which OpenMP runtime are you using? Have you seen the same behavior with Intel MPI (I guess you're using Open MPI)? Do you set affinity (e.g., via KMP_AFFINITY for Intel OpenMP)?
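For example, something along these lines (assuming Intel OpenMP and your existing launch line; the exact settings are just illustrative):

export OMP_NUM_THREADS=16
export KMP_AFFINITY=verbose,granularity=fine,compact
mpirun -n 1 --bind-to none ./MMNET_MPI --block 1024 --numThreads 16

The verbose output from KMP_AFFINITY should show how many OpenMP threads are created and where they get pinned when the program runs under mpirun.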

There are multiple things you can do to investigate. You can check the bindings via --report-bindings and use --cpu-set to provide explicitly the set of cores to be used. You can do "export MKL_VERBOSE=1" and see whether the output from the gemv calls shows anything weird. You can try to create a simple reproducer where only the calls to gemv exist. If the problem still shows up there, you can try to replace gemv with some simple, scalable multi-threaded code (like adding two vectors) to check whether the issue comes from the way you set up your run configuration.
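For instance, a gemv-only reproducer could be as simple as the sketch below (sizes are arbitrary; build and link it exactly the same way as your application, then compare running it directly vs. under mpirun):

#include <mkl.h>
#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
  const MKL_INT n = 8192, m = 1024;   // one (n x m) block, similar to --block 1024
  std::vector<double> a((size_t)n * m, 1.0), x(n, 1.0), y(m, 0.0);

  // report the thread counts the runtimes actually see under this launch method
  std::printf("omp_get_max_threads() = %d, mkl_get_max_threads() = %d\n",
              omp_get_max_threads(), mkl_get_max_threads());

  double t0 = dsecnd();               // MKL wall-clock timer
  for (int it = 0; it < 200; ++it)
    cblas_dgemv(CblasColMajor, CblasTrans, n, m, 1.0, a.data(), n,
                x.data(), 1, 0.0, y.data(), 1);
  std::printf("200 dgemv calls took %.3f s\n", dsecnd() - t0);
  return 0;
}

If this small program also drops from 1600% to 200-300% CPU under mpirun, the problem is in the run configuration (binding, affinity, or thread settings) rather than in your application code.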

Best,
Kirill

Gennady_F_Intel
Moderator

It probably makes sense to try the existing p[s,d,c,z]gemm routines. Have you tried them?
