fastest multithreaded way to take matrix inverse??

Vineet_Y_ · ‎03-28-2013

Hi

I have two questions about MKL-LAPACK

(1) What is the best (multithreaded) way to invert an arbitrary or a real symmetric matrix

I have tried using dgetrf and dgetri but they are considerably slower than Matlab. I am using dgetrf and dgetri as I have to use the factorization produced by dgetrf multiple times to solve linear systems of equations by providing a different right hand side in each iteration in an inverse modeling problem. I can also use dgetrs after dgetrf for solving my problem but speed wise it does not improve things.

(2) I know MATLAB internally uses MKL so which routine does it use to get much better (I know it is very difficult to answer this) performance than directly using MKL. In my application I have to invert a matrix with dimension 45000X45000 which is very slow and only uses a single thread in MKL in comparison to MATLAB.

A SAMPLE FORTRAN CODE THAT SHOWS MY IMPLEMENTATION OF DGETRF AND DGETRI IS ATTACHED. IN THE ATTACHED CODE I CREATE A POSITIVE DEFINITE MATRIX FOR INVERSION .HOWEVER IN THE REAL APPLICATION THE MATRIX IS NOT POSITIVE DEFINITE

Other details: It is a 64 bit application

Platform: Intel(R) Core(TM) i7-2600 CPU 3.40GHz for the test code/16GB of RAM

************************Application Specific Platform***************************************

vendor_id : GenuineIntel

RAM: 96 GB, cpu family : 6, model : 44, model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz

stepping : 2, cpu MHz : 2801.000, cache size : 12288 KB, cpu cores : 6, apicid : 0, fpu : yes

fpu_exception : yes, cpuid level : 11, wp : yes

bogomips : 5600.36, clflush size : 64, cache_alignment : 64, address sizes : 40 bits physical, 48 bits virtual

********************************************************************************************************

MKL: Parallel version

MKL libraries included: mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib mkl_lapack95_lp64.lib

Many thanks

Zhang_Z_Intel · ‎03-28-2013

getrf in MKL is threaded, but getri is not. For a full list of threaded LAPACK routiens in MKL, see here.

In addition to multithreading, data alignment is also very important to get good performance. You didn't do this in your sample code. On i7-2600 CPU, you should align your input data to 32 byte boundaries to get good performance.

These are the MKL libraries you need to link with: mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib. You need to explicitly link libiomp5md.lib only if you are not using Intel Fortran compiler. If you use Intel Fortran compiler, please pass /Qopenmp

Can you please provide more information?

What Fortran compiler do you use?
What is your MKL version?
Did you set OMP_NUM_THREADS or MKL_NUM_THREADS?
How big is the difference between MKL and Matlab run times?

Thanks.

Vineet_Y_ · ‎03-28-2013

Matlab code is simply: a=rand(5000); tic; b=a\speye(length(a)); toc which takes ~ 12 seconds whereas dgetrf+dgetri takes = 36.88

I am using Intel Parallel Studio 2013 on a Windows Machine (Intel(R) Core(TM) i7-2600 CPU) with Visual Studio as a development environment and a cluster with intel xeon processors (as mentioned above) with composer_xe_2011 for running codes. I did not set OMP_NUM_THREADS as dgemm in the same code uses all the cores but if it is required than I can set it.

On another note I can use dsytrf and dsytrs for my work (real symmetric matrices) and as per documentation both are threaded routines. However the error by using these routines in comparison to dgetrt/dgetri is extremely large that I avoid using it. I will upload the code to demonstrate these errors, in this case it might be possible that I am not specifying some parameters correctly but the answer from info shows everything as correct

Many Thanks

Vineet

Zhang_Z_Intel · ‎03-29-2013

Vineet,

A big problem in your code was that the work array was not allocated with proper size. After workspace query, you need to allocate the work array with the size returned by the query. After fixing this problem, I saw the performance improves almost 5x on an i7-2600. Please see it attached. I compiled it with: ifort /align:array32byte /Qopenmp /Qmkl source1.f90

Zhang

Vineet_Y_ · ‎03-30-2013

Thanks. I have assigned appropriate workspace and then its performance is better or equivalent to Matlab. The changes that I made in the source file are:

call dgetrf(a_rows,a_rows,a,a_rows,ipiv,info)
! workspace query
call dgetri(a_rows,a,a_rows,ipiv,work,-1,info)
nmax=work(1)
deallocate(work)
allocate(work(nmax))
! invert matrix a
call dgetri(a_rows,a,a_rows,ipiv,work,nmax,info)

Many thanks for you help

SergeyKostrov · ‎03-30-2013

Hi Vineet, >>...I have assigned appropriate workspace and then its performance is better or equivalent to Matlab. The changes that >>I made in the source file.. Could you upload a completely updated version of your test-case, please? Thanks in advance.

Vineet_Y_ · ‎04-01-2013

Hi Sergey

I have uploaded the code

Vineet

Zhang_Z_Intel · ‎04-01-2013

Vineet,

Another thing that can improve performance is proper data alignment. You'd align the dynamically allocated arrays to the 32-byte boundaries. There are multiple ways to do it. But the simplest one is probably passing "/align:array32byte" on your ifort command line.

Note Matlad uses many MKL functions internally. It's no surprise that the performance is not very different from one to another for some problems.

Zhang

Vineet_Y_ · ‎04-01-2013

Hi

Zhang

Does

statement like: !dec$ attributes align: 32 :: a , b where a and b are arrays help in aligning the data . I have seen some performance gain by using this statement with composer xe 2013 on Intel(R) Core(TM) i7-2600 CPU 3.40GHz for the test code/16GB of RAM machine

Can you let me know what kind of compiler switches can be used to align data with composer xe 2011

Vineet

Zhang_Z_Intel · ‎04-01-2013

The !dec$ attribute align:32 directive helps to align static and COMMON data. I'm not sure if it also applies to dynamically allocated data. The /align:array32byte compiler switch seems to work for all cases. This switch is available in Intel Composer XE 2011.

You can take a look at this article for more about data alignment in FORTRAN: http://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization