Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Multithreading when called by C++, not when called by R

George_L_2
Beginner
564 Views

Hi everyone, I have been struggling with a problem for quite some time, and I would greatly appreciate your input.

I have a program that uses Intel MKL's dgemm function many times.  In fact, to demonstrate my problem, I used exactly the same code as in this dgemm tutorial:  https://software.intel.com/en-us/node/429920.

The program is compiled into a shared library that is used in two different ways: 1) linked to a C++ program that can be run from the command line, and 2) loaded by R using Rcpp.  So it is identical code that is just being loaded in two different ways.

When I set m=10042, k=62318, and n=220 (see the above link for the rest of the code), both the R code and the C++ code give exactly the same result and multithread beautifully, so they have similar speed.  However, when n=22 (so a smaller matrix), the C++ version multithreads and gives the correct answer very quickly, while the R version runs in a single thread and gives the same answer, but much more slowly.

In other words, with the smaller matrix, when R calls the function, MKL decides to use a single thread, but when my C++ program calls it, MKL uses multiple threads with much improved speed.  My program contains many matrix multiplications of this sort, so as it stands the C++ program is dramatically faster than the R program--all because MKL decides to multithread in one but not the other.

I am aware that MKL uses some heuristics to decide when multithreading is worth the overhead.  However, I feel that these heuristics are working to my detriment.  Also, how could these heuristics differ depending on whether R or C++ is calling the library?

EDIT 1 (8/23 UPDATE):

When I use the dgemm() function (i.e. not cblas_dgemm()), my problem is resolved (all cases multithread).  I do not know, however, how dgemm() differs from cblas_dgemm(); I thought the latter was just a CBLAS-style interface to dgemm().  I thought it might be because dgemm() assumes column-major order, but I tried cblas_dgemm() in both column-major and row-major, and it makes no difference.

What is the distinction between dgemm() and cblas_dgemm(), and how could it affect multithreading in this case?

0 Kudos
6 Replies
TimP
Honored Contributor III
564 Views
Check that your R link line is set for MKL parallel. You may wish to compare it against the recommendations of the MKL Link Advisor utility. Check the environment variables for OMP_NUM_THREADS and the MKL settings. In the unlikely case where MKL is called inside a parallel region, you will need further settings.
0 Kudos
George_L_2
Beginner
564 Views

Tim Prince wrote:

Check that your R link line is set for MKL parallel. You may wish to compare it against the recommendations of the MKL Link Advisor utility. Check the environment variables for OMP_NUM_THREADS and the MKL settings. In the unlikely case where MKL is called inside a parallel region, you will need further settings.

Thanks Tim, I appreciate the response.

The link line is set in accordance with the MKL Link Advisor utility.  Also, I already tried the omp_set_num_threads()/mkl_set_num_threads() functions, but to no avail.

Please note that this code does multithread, but only when the matrix is big enough (e.g. if k=100); when it is smaller (k=20), it does not.  However, it always multithreads from the C++ code, which is why this is so strange.

0 Kudos
TimP
Honored Contributor III
564 Views

MKL tries to optimize the number of threads according to problem size and the selected instruction set.  For example, AVX2 probably gives the best performance at 1 thread in the range where you are seeing that selection.  The thresholds have sometimes changed with MKL updates.  I can't see how calling from R would produce different thresholds than calling from C++, unless MKL detects that R is spinning a thread and doesn't want to wait for it to become available.

If mixing the OpenMP-threaded MKL parallel model with Cilk(tm) Plus, I would restrict the number of threads used by each, so that KMP_BLOCKTIME doesn't prevent Cilk from getting threads immediately on demand.

0 Kudos
George_L_2
Beginner
564 Views

Tim Prince wrote:

MKL tries to optimize the number of threads according to problem size and the selected instruction set.  For example, AVX2 probably gives the best performance at 1 thread in the range where you are seeing that selection.  The thresholds have sometimes changed with MKL updates.  I can't see how calling from R would produce different thresholds than calling from C++, unless MKL detects that R is spinning a thread and doesn't want to wait for it to become available.

If mixing the OpenMP-threaded MKL parallel model with Cilk(tm) Plus, I would restrict the number of threads used by each, so that KMP_BLOCKTIME doesn't prevent Cilk from getting threads immediately on demand.

Thank you for the input, Tim.

I do have one more piece of information I just discovered.  When I use the dgemm() function (i.e. not cblas_dgemm()), my problem is resolved.  I do not know, however, how dgemm() differs from cblas_dgemm(); I thought the latter was just a CBLAS-style interface to dgemm().  I thought it might be because dgemm() assumes column-major order, but I tried cblas_dgemm() in both column-major and row-major, and it makes no difference.

Can you give me some insight into the distinction between dgemm() and cblas_dgemm()?

0 Kudos
TimP
Honored Contributor III
564 Views

cblas_dgemm should work interchangeably with MKL or the libblas provided as an OS install option. Check your link options and which shared objects ldd shows as active.  It should not matter where cblas_dgemm is linked from, unless that affects which dgemm is active.

0 Kudos
George_L_2
Beginner
564 Views

Tim Prince wrote:

cblas_dgemm should work interchangeably with MKL or the libblas provided as an OS install option. Check your link options and which shared objects ldd shows as active.  It should not matter where cblas_dgemm is linked from, unless that affects which dgemm is active.

I agree, that makes a lot of sense.  

When I run pmap on the R process (after it has loaded my library), I see the following:

00007f2a23e92000  11152K r-x--  /opt/intel/composer_xe_2015.1.133/mkl/lib/intel64/libmkl_gnu_thread.so
00007f2a24976000   2048K -----  /opt/intel/composer_xe_2015.1.133/mkl/lib/intel64/libmkl_gnu_thread.so
00007f2a24b76000      4K r----  /opt/intel/composer_xe_2015.1.133/mkl/lib/intel64/libmkl_gnu_thread.so
00007f2a24b77000     56K rw---  /opt/intel/composer_xe_2015.1.133/mkl/lib/intel64/libmkl_gnu_thread.so
00007f2a24b85000  25468K r-x--  /opt/intel/composer_xe_2015.1.133/mkl/lib/intel64/libmkl_core.so
00007f2a26464000   2048K -----  /opt/intel/composer_xe_2015.1.133/mkl/lib/intel64/libmkl_core.so
00007f2a26664000     96K r----  /opt/intel/composer_xe_2015.1.133/mkl/lib/intel64/libmkl_core.so
00007f2a2667c000     56K rw---  /opt/intel/composer_xe_2015.1.133/mkl/lib/intel64/libmkl_core.so
00007f2a2668a000    336K rw---    [ anon ]
00007f2a266de000   6772K r-x--  /opt/intel/composer_xe_2015.1.133/mkl/lib/intel64/libmkl_intel_ilp64.so
00007f2a26d7b000   2044K -----  /opt/intel/composer_xe_2015.1.133/mkl/lib/intel64/libmkl_intel_ilp64.so
00007f2a26f7a000      4K r----  /opt/intel/composer_xe_2015.1.133/mkl/lib/intel64/libmkl_intel_ilp64.so
00007f2a26f7b000     64K rw---  /opt/intel/composer_xe_2015.1.133/mkl/lib/intel64/libmkl_intel_ilp64.so

....
00007f2a64712000  22472K r-x--  /usr/lib/openblas-base/libblas.so.3
00007f2a65d04000   2048K -----  /usr/lib/openblas-base/libblas.so.3
00007f2a65f04000     28K r----  /usr/lib/openblas-base/libblas.so.3
00007f2a65f0b000     68K rw---  /usr/lib/openblas-base/libblas.so.3
....

Most notably, it has this OpenBLAS implementation of BLAS, because it is linked into the actual R installation, and I assume it is accessible from the C code.  So when my software calls cblas_dgemm, it must be calling the OpenBLAS implementation and not the Intel MKL one.  I confirmed this by using nm to find that libblas.so.3 also exports a dynamic symbol called cblas_dgemm.

I tried using objcopy to rename cblas_dgemm to something like cblas_dgemm_g, as detailed here, but when my code calls cblas_dgemm_g after linking against the new library, I keep getting "undefined reference" errors for the new name, because cblas_dgemm is a dynamic symbol that objcopy cannot rename.

How can I make my code call the cblas_dgemm from MKL?  I don't want to recompile R against Intel MKL, because that is not a feasible solution for end users of my package.

0 Kudos
Reply