
I am new to MPI. My program uses the Intel Math Kernel Library, and I want to compute a matrix product block by block: I split the large matrix X into many smaller matrices along its columns, as shown below. Because the matrix is large, each step only computes an (N, M) x (M, N) product, where I can set M manually.

XX^Ty = X_1X_1^Ty + X_2X_2^Ty + ... + X_nX_n^Ty
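The decomposition above can be sketched in plain C++ without MKL. This is only an illustration of the identity, assuming X is stored column-major with N rows and M columns; the names `addBlockXXTy` and `multXXTyBlocked` are hypothetical and are not part of the program discussed here.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Accumulates out += X_b * (X_b^T * y) for the column block [m0, m1) of X.
// X is N x M in column-major order, so column j starts at X[j * N].
void addBlockXXTy(const std::vector<double>& X, std::size_t N,
                  std::size_t m0, std::size_t m1,
                  const std::vector<double>& y, std::vector<double>& out) {
    std::vector<double> temp(m1 - m0, 0.0);
    // temp = X_b^T * y (one dot product per column of the block)
    for (std::size_t j = m0; j < m1; ++j)
        for (std::size_t i = 0; i < N; ++i)
            temp[j - m0] += X[j * N + i] * y[i];
    // out += X_b * temp (one scaled column added per column of the block)
    for (std::size_t j = m0; j < m1; ++j)
        for (std::size_t i = 0; i < N; ++i)
            out[i] += X[j * N + i] * temp[j - m0];
}

// Computes X X^T y as the sum of the per-block terms X_b X_b^T y.
std::vector<double> multXXTyBlocked(const std::vector<double>& X,
                                    std::size_t N, std::size_t M,
                                    const std::vector<double>& y,
                                    std::size_t blockSize) {
    if (blockSize == 0) blockSize = M;  // treat 0 as "one single block"
    std::vector<double> out(N, 0.0);
    for (std::size_t m0 = 0; m0 < M; m0 += blockSize) {
        const std::size_t m1 = std::min(M, m0 + blockSize);
        addBlockXXTy(X, N, m0, m1, y, out);
    }
    return out;
}
```

Calling `multXXTyBlocked` with different values of `blockSize` on the same inputs should give the same vector up to floating-point rounding, since only the grouping of the columns changes.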

I first set the total number of threads to 16 and M to 1024, and ran my program directly as follows. I checked the CPU state and found that the CPU usage was 1600%, which is what I expected.

./MMNET_MPI --block 1024 --numThreads 16

However, when I ran my program through MPI as follows, the CPU usage was only 200-300%. Strangely, when I changed the block size to 64, the CPU usage improved to 1200%.

mpirun -n 1 --bind-to none ./MMNET_MPI --block 1024 --numThreads 16

I do not know what the problem is. It seems that mpirun applies some default setting that affects my program. The following is part of my matrix multiplication code. The `#pragma omp parallel for` loop extracts the small N by M matrix from its compressed format in parallel. After that I use cblas_dgemv to compute the products.

    void LMMCPU::multXXTTrace(double *out, const double *vec) const {
        double *snpBlock = ALIGN_ALLOCATE_DOUBLES(Npad * snpsPerBlock);
        double (*workTable)[4] = (double (*)[4]) ALIGN_ALLOCATE_DOUBLES(omp_get_max_threads() * 256 * sizeof(*workTable));

        // store the temp result
        double *temp1 = ALIGN_ALLOCATE_DOUBLES(snpsPerBlock);

        for (uint64 m0 = 0; m0 < M; m0 += snpsPerBlock) {
            uint64 snpsPerBLockCrop = std::min(M, m0 + snpsPerBlock) - m0;
    #pragma omp parallel for
            for (uint64 mPlus = 0; mPlus < snpsPerBLockCrop; mPlus++) {
                uint64 m = m0 + mPlus;
                if (projMaskSnps)
                    buildMaskedSnpCovCompVec(snpBlock + mPlus * Npad, m, workTable + (omp_get_thread_num() << 8));
                else
                    memset(snpBlock + mPlus * Npad, 0, Npad * sizeof(snpBlock[0]));
            }

            // compute A=X^TV
            MKL_INT row = Npad;
            MKL_INT col = snpsPerBLockCrop;
            double alpha = 1.0;
            MKL_INT lda = Npad;
            MKL_INT incx = 1;
            double beta = 0.0;
            MKL_INT incy = 1;
            cblas_dgemv(CblasColMajor, CblasTrans, row, col, alpha, snpBlock, lda, vec, incx, beta, temp1, incy);

            // compute XA
            double beta1 = 1.0;
            cblas_dgemv(CblasColMajor, CblasNoTrans, row, col, alpha, snpBlock, lda, temp1, incx, beta1, out, incy);
        }

        ALIGN_FREE(snpBlock);
        ALIGN_FREE(workTable);
        ALIGN_FREE(temp1);
    }

In fact, I have checked that the following part can fully use the CPU resources, so the problem seems to lie with cblas_dgemv.

    #pragma omp parallel for
    for (uint64 mPlus = 0; mPlus < snpsPerBLockCrop; mPlus++) {
        uint64 m = m0 + mPlus;
        if (projMaskSnps)
            buildMaskedSnpCovCompVec(snpBlock + mPlus * Npad, m, workTable + (omp_get_thread_num() << 8));
        else
            memset(snpBlock + mPlus * Npad, 0, Npad * sizeof(snpBlock[0]));
    }

My CPU information is as follows.

    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                44
    On-line CPU(s) list:   0-43
    Thread(s) per core:    1
    Core(s) per socket:    22
    Socket(s):             2
    NUMA node(s):          2
    Vendor ID:             GenuineIntel
    CPU family:            6
    Model:                 85
    Model name:            Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz
    Stepping:              4
    CPU MHz:               1252.786
    CPU max MHz:           2101.0000
    CPU min MHz:           1000.0000
    BogoMIPS:              4200.00
    Virtualization:        VT-x
    L1d cache:             32K
    L1i cache:             32K
    L2 cache:              1024K
    L3 cache:              30976K
    NUMA node0 CPU(s):     0-21
    NUMA node1 CPU(s):     22-43


Hello,

It's hard to say. How do you compile and link your application? Which OpenMP are you using? Have you seen the same with Intel MPI (I guess you're using OpenMPI)? Do you set affinity (e.g., via KMP_AFFINITY, for Intel OpenMP)?

There are multiple things you can do to investigate. You can check the bindings via --report-bindings, and use --cpu-set to provide explicitly the set of cores to be used. You can do "export MKL_VERBOSE=1" and see if the output from the gemv calls shows anything weird. You can try to create a simple reproducer that contains only the calls to gemv. If the problem still exists, you can try replacing gemv with some simple, scalable multi-threaded code (like adding two vectors) to check whether the issue comes from the way you set up your run configuration.
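As a complement to --report-bindings, the binding can also be inspected from inside the process. The following is a minimal sketch, assuming Linux with glibc (it relies on `sched_getaffinity` and `sysconf`); the helper names `allowedCpus` and `onlineCpuCount` are hypothetical. If the list of allowed CPUs is much shorter under mpirun than in a direct run, the launcher is restricting the process and the extra threads have to share those cores.

```cpp
#include <sched.h>    // sched_getaffinity, cpu_set_t, CPU_* macros (Linux/glibc)
#include <unistd.h>   // sysconf
#include <vector>

// Returns the IDs of the CPUs the calling process is allowed to run on.
// An empty vector means the affinity query failed.
std::vector<int> allowedCpus() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    std::vector<int> cpus;
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) return cpus;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &mask)) cpus.push_back(cpu);
    return cpus;
}

// Returns the number of CPUs currently online in the system.
long onlineCpuCount() {
    return sysconf(_SC_NPROCESSORS_ONLN);
}
```

Printing `allowedCpus().size()` next to `onlineCpuCount()` at program start, once for the direct run and once under mpirun, gives a quick comparison of the two run configurations.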

Best,

Kirill


It probably makes sense to try the existing p[s,d,c,z]gemm routines. Have you tried them?
