My question is about improving the performance of the following line:
MKM = MD*FA1 - MATMUL(MATMUL(MATMUL(ME,MQ),TRANSPOSE(MG)),TRANSPOSE(ME)) + MATMUL(MATMUL(MATMUL(ME,MG),VA),VR)
This line is executed for every element in a finite element implementation and, according to the performance wizard, it is the bottleneck.
All the matrices are at most 12x12 in size. I have tried using DGEMM in the following way:
! alpha and beta are DOUBLE PRECISION arguments, so beta is written 0.0D0 rather than the integer 0
CALL DGEMM('N', 'N', 12, 3, 12, 1.0D0, ME, 12, MQ, 12, 0.0D0, MDUMMY3, 12)
CALL DGEMM('N', 'T', 12, 12, 3, 1.0D0, MDUMMY3, 12, MG, 12, 0.0D0, MDUMMY4, 12)
CALL DGEMM('N', 'T', 12, 12, 12, 1.0D0, MDUMMY4, 12, ME, 12, 0.0D0, MDUMMY5, 12)
CALL DGEMM('N', 'N', 12, 3, 12, 1.0D0, ME, 12, MG, 12, 0.0D0, MDUMMY6, 12)
CALL DGEMM('N', 'N', 12, 1, 3, 1.0D0, MDUMMY6, 12, VA, 12, 0.0D0, MDUMMY7, 12)
CALL DGEMM('N', 'N', 12, 12, 1, 1.0D0, MDUMMY7, 12, VR, 1, 0.0D0, MDUMMY8, 12)
MKM = MD*FA1 - MDUMMY5 + MDUMMY8
However, it did not provide any improvement (I think it was even slightly slower).
I was wondering whether any MKL function or setting would help to speed up this line.
Thank you very much in advance,
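A possible algebraic shortcut for the expression in the question, independent of which MKL version is available: since TRANSPOSE(MG) times TRANSPOSE(ME) equals TRANSPOSE(ME times MG), the 12x12-by-12x12 product can be avoided, and the product ME*MG can be shared by both terms. The sketch below uses plain Fortran with MATMUL only. It assumes the shapes implied by the DGEMM calls above (ME 12x12, MQ and MG 12x3, VA 3x1, VR 1x12) and that FA1 is a scalar; the module and function names are made up for illustration.

```fortran
module mkm_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Reference form: the expression exactly as written in the question.
  ! Shapes and the scalar FA1 are assumptions, not confirmed by the post.
  function mkm_reference(md, fa1, me, mq, mg, va, vr) result(mkm)
    real(dp), intent(in) :: md(12,12), fa1, me(12,12)
    real(dp), intent(in) :: mq(12,3), mg(12,3), va(3,1), vr(1,12)
    real(dp) :: mkm(12,12)
    mkm = md*fa1 &
        - matmul(matmul(matmul(me, mq), transpose(mg)), transpose(me)) &
        + matmul(matmul(matmul(me, mg), va), vr)
  end function mkm_reference

  ! Regrouped form: uses MG^T * ME^T = (ME*MG)^T and computes ME*MG once.
  function mkm_regrouped(md, fa1, me, mq, mg, va, vr) result(mkm)
    real(dp), intent(in) :: md(12,12), fa1, me(12,12)
    real(dp), intent(in) :: mq(12,3), mg(12,3), va(3,1), vr(1,12)
    real(dp) :: mkm(12,12)
    real(dp) :: t1(12,3), t2(12,3), w(12,1)
    t1 = matmul(me, mq)   ! 12x3
    t2 = matmul(me, mg)   ! 12x3, reused by both terms
    w  = matmul(t2, va)   ! 12x1
    ! 12x3 times 3x12 replaces the 12x12 times 12x12 product;
    ! matmul(w, vr) is the outer product of a column and a row.
    mkm = md*fa1 - matmul(t1, transpose(t2)) + matmul(w, vr)
  end function mkm_regrouped

end module mkm_sketch
```

Under these assumed shapes the regrouped form needs roughly 1,500 multiply-adds per element instead of roughly 3,200. Whether that shows up as a measurable speed-up would have to be timed in the actual code.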
Check the write-up about MKL_INLINE_SEQ e.g.
If you're using the opt-matmul option (set either explicitly or by -O3), it may not be surprising that you get similar results. In the past, I got the best MATMUL results by setting -O3 but turning off opt-matmul when the problem was not large enough to benefit from automatic threading. You might also try setting the number of MKL threads to 1, or linking the MKL sequential library, whether you use MKL explicitly or via opt-matmul, in case MKL uses too many threads when you don't specify the count.
The slides above refer to the MKL 11.2 beta release; the name of this feature (and of the preprocessor macro) has since been changed to MKL_DIRECT_CALL (or MKL_DIRECT_CALL_SEQ). I'm sorry for the confusion.
You can check the KB article here describing the feature: https://software.intel.com/en-us/articles/improve-intel-mkl-performance-for-small-problems-the-use-of-mkl-direct-call
MKL 11.2 User's guide also has a section on this: https://software.intel.com/en-us/node/528553
You need MKL 11.2, which is the first MKL release that supports MKL_DIRECT_CALL(_SEQ). This feature skips error checking and some of the intermediate function calls for small matrix operations to improve their performance. In addition to this feature, MKL 11.2 has some small-matrix improvements that should help for the sizes above.
Guys, thank you very much for your replies.
I am compiling my code using Visual Studio 2010 + Intel Parallel Studio XE 2011 (which I believe includes MKL 10.3?).
So I guess I can't make use of MKL_DIRECT_CALL in that case, right? But still, if I got a later version of MKL, would there be a way to set this option from Visual Studio?
I should also mention that I am compiling a dynamic link library which I call from within MATLAB. I don't know whether this makes things more complicated or not.
OK, I guess the up-to-date MKL 11.2 slides were presented this week but aren't found by a Google search yet.
It looks like, once you get the new MKL version, you would need to add the specified INCLUDE to your source file, add the include path in the compile properties, and make sure the fpp preprocessing option is set.
Right, you can't make use of MKL_DIRECT_CALL with Visual Studio 2010 + Intel Parallel Studio XE 2011 (which ships MKL 10.3).
If you have MKL 11.2, you can set the option /DMKL_DIRECT_CALL in the MSVC IDE environment. For example, open the project property page => C/C++ tab => Command Line => Additional Options.
For a program in the C language on a Linux system, simply add -DMKL_DIRECT_CALL or -DMKL_DIRECT_CALL_SEQ. On Windows, the syntax is /DMKL_DIRECT_CALL or /DMKL_DIRECT_CALL_SEQ. Usually, the flag -std=c99 (/Qstd=c99 on Windows) is also needed. This has been tested on mainstream C and C++ compilers such as Intel C++ Compiler, GCC, Microsoft Visual Studio, etc.
Regarding MKL usage from MATLAB through a dynamic library built with this option: we haven't tried it. But I guess it should work (although I'm not sure about the performance gain), with
>"C:\Program Files (x86)\Intel\Composer XE 2013 SP1\mkl\bin\mklvars.bat" intel64