Performance of matmul vs dgemm for small size matrices

e112974 — Wed, 29 Oct 2014 10:47:00 GMT

Hi,

My question is about improving the performance of the following line:

------------------------

MKM = MD*FA1 - MATMUL(MATMUL(MATMUL(ME,MQ),TRANSPOSE(MG)),TRANSPOSE(ME)) + MATMUL(MATMUL(MATMUL(ME,MG),VA),VR)

------------------------

This line is executed for every element in a finite element implementation and, according to the performance wizard, it is the bottleneck.

All the matrices are at most 12x12. I have tried using DGEMM in the following way:

------------------------

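! BETA is written as 0.0D0 because DGEMM expects a DOUBLE PRECISION value there.

! MDUMMY3 = ME*MQ (12x3)
CALL DGEMM('N', 'N', 12, 3, 12, 1.0D0, ME, 12, MQ, 12, 0.0D0, MDUMMY3, 12)

! MDUMMY4 = MDUMMY3*TRANSPOSE(MG) (12x12)
CALL DGEMM('N', 'T', 12, 12, 3, 1.0D0, MDUMMY3, 12, MG, 12, 0.0D0, MDUMMY4, 12)

! MDUMMY5 = MDUMMY4*TRANSPOSE(ME) (12x12)
CALL DGEMM('N', 'T', 12, 12, 12, 1.0D0, MDUMMY4, 12, ME, 12, 0.0D0, MDUMMY5, 12)

! MDUMMY6 = ME*MG (12x3)
CALL DGEMM('N', 'N', 12, 3, 12, 1.0D0, ME, 12, MG, 12, 0.0D0, MDUMMY6, 12)

! MDUMMY7 = MDUMMY6*VA (12x1)
CALL DGEMM('N', 'N', 12, 1, 3, 1.0D0, MDUMMY6, 12, VA, 12, 0.0D0, MDUMMY7, 12)

! MDUMMY8 = MDUMMY7*VR (12x12)
CALL DGEMM('N', 'N', 12, 12, 1, 1.0D0, MDUMMY7, 12, VR, 1, 0.0D0, MDUMMY8, 12)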

MKM = MD*FA1 - MDUMMY5 + MDUMMY8

------------------------

However, it did not provide any improvement (I think it was even a little slower).
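To make it easier to map the six calls back to the original one-liner, here is a minimal sketch of the same chain written with MATMUL. The module and function names (`mkm_chain`, `mkm_stepwise`) and the temporaries `t3`..`t8` are made up for illustration. The shapes (ME 12x12, MQ 12x3, MG 12x3, VA 3x1, VR 1x12) are inferred from the DGEMM arguments, and FA1 is assumed to be a scalar; none of this is confirmed by the actual declarations.

```fortran
module mkm_chain
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Hypothetical helper: evaluates the expression step by step, one
  ! temporary per DGEMM call (t3..t8 correspond to MDUMMY3..MDUMMY8).
  ! All shapes are assumptions inferred from the DGEMM arguments.
  function mkm_stepwise(md, fa1, me, mq, mg, va, vr) result(mkm)
    real(dp), intent(in) :: md(12,12), fa1
    real(dp), intent(in) :: me(12,12), mq(12,3), mg(12,3)
    real(dp), intent(in) :: va(3,1), vr(1,12)
    real(dp) :: mkm(12,12)
    real(dp) :: t3(12,3), t4(12,12), t5(12,12)
    real(dp) :: t6(12,3), t7(12,1), t8(12,12)

    t3 = matmul(me, mq)              ! 12x12 times 12x3
    t4 = matmul(t3, transpose(mg))   ! 12x3  times 3x12
    t5 = matmul(t4, transpose(me))   ! 12x12 times 12x12
    t6 = matmul(me, mg)              ! 12x12 times 12x3
    t7 = matmul(t6, va)              ! 12x3  times 3x1
    t8 = matmul(t7, vr)              ! 12x1  times 1x12
    mkm = md*fa1 - t5 + t8
  end function mkm_stepwise
end module mkm_chain
```

Comparing the result of `mkm_stepwise` with the DGEMM version on a few random inputs is a cheap way to confirm that the transposes and leading dimensions in the calls are what was intended.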

Is there any MKL function or setting that would help speed up this line?

Thank you very much in advance,

Murat

TimP — Wed, 29 Oct 2014 11:43:28 GMT

Check the write-up about MKL_INLINE_SEQ, e.g.:

https://software.intel.com/sites/default/files/managed/8c/ef/Intel-MKL-11.2-beta-webinar--Introducing-new-features.pdf

If you're using the opt-matmul option (set either explicitly or by -O3), it may not be surprising that you get similar results. In the past, I got the best matmul results by setting -O3 but turning off opt-matmul when the problem was not large enough to benefit from automatic threading. You might also try setting MKL threads to 1 or linking the MKL sequential library, whether you use MKL explicitly or via opt-matmul, in case MKL uses too many threads when you don't specify the number.
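As a rough illustration of the single-thread suggestion, a sketch like the following could be tried. It assumes the program is linked against MKL with the default LP64 (32-bit integer) interface; the program name and test data are made up. Setting the environment variable MKL_NUM_THREADS=1 before running should have a similar effect without code changes.

```fortran
program single_thread_mkl
  implicit none
  ! Both routines are resolved from MKL at link time (not self-contained).
  external :: mkl_set_num_threads, dgemm
  double precision :: a(12,12), b(12,12), c(12,12)

  ! Ask MKL to use one thread for all subsequent calls.
  call mkl_set_num_threads(1)

  a = 1.0d0
  b = 1.0d0
  ! C = A*B for a 12x12 problem; BETA = 0.0D0 so C need not be initialized.
  call dgemm('N', 'N', 12, 12, 12, 1.0d0, a, 12, b, 12, 0.0d0, c, 12)
  print *, c(1,1)
end program single_thread_mkl
```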

Murat_G_Intel — Wed, 29 Oct 2014 17:26:00 GMT

The slides above refer to the MKL 11.2 beta release; the name of this feature (and its preprocessor macro) has since been changed to MKL_DIRECT_CALL (or MKL_DIRECT_CALL_SEQ). I'm sorry for the confusion.

You can check the KB article here describing the feature: https://software.intel.com/en-us/articles/improve-intel-mkl-performance-for-small-problems-the-use-of-mkl-direct-call

The MKL 11.2 User's Guide also has a section on this: https://software.intel.com/en-us/node/528553

You need MKL 11.2, which is the first MKL release that supports MKL_DIRECT_CALL(_SEQ). This feature skips error checking and some of the intermediate function calls for small matrix operations to improve their performance. In addition to this feature, MKL 11.2 has some small-matrix improvements that should help for the sizes above.

Thank you!

e112974 — Wed, 29 Oct 2014 17:41:12 GMT

Guys, thank you very much for your replies.

I am compiling my code using Visual Studio 2010 + Intel Parallel Studio XE 2011 (which I believe has MKL 10.3?).

So I guess I can't make use of MKL_DIRECT_CALL in that case, right? But if I got a later version of MKL, would there be a way to set this option from Visual Studio?

I should also mention that I am compiling a dynamic link library which I call from MATLAB. I don't know whether this makes things even more complicated.

Best regards,

Murat

TimP — Wed, 29 Oct 2014 20:44:00 GMT

OK, I guess the up-to-date MKL 11.2 slides were presented this week but can't be found by Google search yet.

It looks like, once you have the new MKL version, you would need to add the specified INCLUDE to your source file, add the include path in the compile properties, and make sure the fpp preprocessing option is set.
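A minimal sketch of what that might look like in a Fortran source file follows. The include file name `mkl_direct_call.fi` is taken from my reading of the MKL documentation rather than from this thread, the command line in the comments is an assumption for the Intel Fortran compiler on Windows, and the program name and data are made up. Whether the include works with a given source form and line length should be checked against the MKL User's Guide.

```fortran
! Sketch only: needs MKL 11.2 or later and the fpp preprocessor.
! Assumed Intel Fortran command line on Windows (adjust as needed):
!   ifort /fpp /DMKL_DIRECT_CALL_SEQ /Qmkl:sequential small_dgemm.f90
#include "mkl_direct_call.fi"
program small_dgemm
  implicit none
  double precision :: me(12,12), mq(12,3), mdummy3(12,3)

  me = 1.0d0
  mq = 2.0d0
  ! Same shape as the first call in the question. With direct call enabled,
  ! MKL may skip argument checking, so the arguments must be correct.
  call dgemm('N', 'N', 12, 3, 12, 1.0d0, me, 12, mq, 12, 0.0d0, mdummy3, 12)
  print *, mdummy3(1,1)
end program small_dgemm
```

Without /DMKL_DIRECT_CALL or /DMKL_DIRECT_CALL_SEQ on the command line, the call should simply go through the regular DGEMM entry point, which makes it easy to time both variants.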

Ying_H_Intel — Mon, 03 Nov 2014 06:29:58 GMT

Hi e112974,

Right, you can't use MKL_DIRECT_CALL with Visual Studio 2010 + Intel Parallel Studio XE 2011 (MKL 10.3, please check here).

If you have MKL 11.2, you can set the option /DMKL_DIRECT_CALL in the MSVC IDE environment. For example, open the project property pages => C/C++ tab => Command Line => Additional Options.

For a C program on a Linux system, simply add -DMKL_DIRECT_CALL or -DMKL_DIRECT_CALL_SEQ. On Windows, the syntax is /DMKL_DIRECT_CALL or /DMKL_DIRECT_CALL_SEQ. Usually, the flag -std=c99 (/Qstd=c99 on Windows) is also needed. This has been tested on mainstream C and C++ compilers such as Intel C++ Compiler, GCC, Microsoft Visual Studio, etc.

Regarding MKL usage from MATLAB, we haven't tried a dynamic library (DLL) built with this option. But I guess it should work (although I'm not sure about the performance gain), with:

>"C:\Program Files (x86)\Intel\Composer XE 2013 SP1\mkl\bin\mklvars.bat" intel64

>set BLAS_VERSION=mkl_rt.dll

>set LAPACK_VERSION=mkl_rt.dll

Best Regards,

Ying