Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Why is matrix inversion (dpotrf & dpotri) faster than multiplication (dgemm) for matrices of the same size?

TonyNie
Beginner

Dear all,

I'm using Intel C++ Compiler 19.1 integrated into Visual Studio 2019, with MKL on Windows.

Recently, I found that matrix inversion (using LAPACKE_dpotrf and LAPACKE_dpotri) seems to be about 2 times faster than multiplication (using cblas_dgemm) for an N-by-N square matrix of the same size. However, the total number of floating-point operations (flops) should be approximately the same for both: for inversion, flops = 1/3 * N^3 [dpotrf] + 2/3 * N^3 [dpotri] = N^3, and for multiplication, flops = N^w with w <= 3.0.

The following is the code I used for the timing test:

// Matrix inversion (N = 3717)
LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'U', N, MAT_A, N);  // warm-up call, not timed
LAPACKE_dpotri(LAPACK_COL_MAJOR, 'U', N, MAT_A, N);
time = dsecnd();
for (i = 0; i < COUNT; i++)     // COUNT = 100
{
    LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'U', N, MAT_A, N);  // Cholesky factorization, in place
    LAPACKE_dpotri(LAPACK_COL_MAJOR, 'U', N, MAT_A, N);  // inverse from the Cholesky factor, in place
}
time = dsecnd() - time;
T_INV = time / COUNT;  // average seconds per inversion

// Matrix multiplication (MAT_C = MAT_A * MAT_B)
cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, N, N, N, 1.0, MAT_A, N, MAT_B, N, 0.0, MAT_C, N);  // warm-up call, not timed
time = dsecnd();
for (i = 0; i < COUNT; i++)
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, N, N, N, 1.0, MAT_A, N, MAT_B, N, 0.0, MAT_C, N);
time = dsecnd() - time;
T_MM = time / COUNT;  // average seconds per multiplication

where N is the matrix size (N = 3717) and COUNT = 100.

The average time for one N-by-N matrix inversion, T_INV, is about 0.79 seconds, and the average time for multiplying two N-by-N matrices, T_MM, is about 1.57 seconds.

The inversion is clearly faster than the multiplication, and I cannot figure out why. Is it because only the upper-triangular part is needed for the inversion? Or is my timing test flawed?

Thank you very much!

Best regards

 

 

 
RahulV_intel
Moderator

Hi,


Thanks for reporting this issue. I've forwarded your query to the MKL experts. They will get in touch with you.


Regards,

Rahul


Gennady_F_Intel
Moderator

Thanks for the case. Which CPU type are you running on?

 

TonyNie
Beginner

Hi, @Gennady_F_Intel

Many thanks for your answer and test!

I'm using an Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz 1.99 GHz with 16GB RAM on Windows 10, and the compiler is Intel C++ 19.1.

The test was run in DEBUG mode, and the matrix I used for inversion is a symmetric positive definite matrix of size 3717-by-3717.

So based on your test, matrix inversion is about 3~4 times slower than multiplication. What is the size of your square matrix?

Furthermore, it is still not clear to me why the time cost of inversion and multiplication should differ significantly for matrices of the same size. Are there any general rules for comparing the performance of the MKL matrix inversion and multiplication functions at the same matrix size, such as the flop complexity?

Thank you very much!

Best regards

 
Gennady_F_Intel
Moderator

Here is what I see on an AVX-512 based system (RH7, lp64 mode, MKL 2020 u4):

./a.out

MKL_VERBOSE Intel(R) MKL 2020.0 Update 4 Product build 20200917 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Intel(R) Deep Learning Boost (Intel(R) DL Boost), Lnx 2.90GHz lp64 intel_thread:

Inversion =0.172078 ,sec

dgemm   =0.0497837 ,sec




Gennady_F_Intel
Moderator

Yes, I used exactly the same sizes (square case) and number of loops as you pointed out. You can set/export the MKL_VERBOSE environment variable and post the very first lines of the output so that we can check the exact version of MKL you are running.


TonyNie
Beginner

Hi, @Gennady_F_Intel 

Thank you!

However, I am not familiar with the MKL_VERBOSE setting. Could you please give me some instructions on how to enable it? I'm using Windows 10 with the Visual Studio 2019 IDE.

Gennady_F_Intel
Moderator

You can find this information in the MKL Developer Guide. If you run under the VS IDE, you can call mkl_verbose(true) before your MKL calls and check the output.


TonyNie
Beginner

Hi, @Gennady_F_Intel 

Many thanks for the instruction!

Here are my mkl_verbose results, with 3 loops each for inversion (DPOTRF & DPOTRI) and multiplication (DGEMM). Multiplication is still slower than inversion in my case.

MKL_VERBOSE Intel(R) MKL 2020.0 Update 4 Product build 20200917 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Win 1.80GHz cdecl intel_thread

MKL_VERBOSE DPOTRF(U,3717,00000240BEF82070,3717,0) 206.48ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4
MKL_VERBOSE DPOTRI(U,3717,00000240BEF82070,3717,0) 580.41ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4
MKL_VERBOSE DPOTRF(U,3717,00000240BEF82070,3717,0) 188.61ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4
MKL_VERBOSE DPOTRI(U,3717,00000240BEF82070,3717,0) 455.48ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4
MKL_VERBOSE DPOTRF(U,3717,00000240BEF82070,3717,0) 188.05ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4
MKL_VERBOSE DPOTRI(U,3717,00000240BEF82070,3717,0) 496.18ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4

MKL_VERBOSE DGEMM(N,N,3717,3717,3717,000000F2FBCFF550,00000240BEF82070,3717,00000240C58F4070,3717,000000F2FBCFF578,00000240CC26E070,3717) 949.66ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4
MKL_VERBOSE DGEMM(N,N,3717,3717,3717,000000F2FBCFF550,00000240BEF82070,3717,00000240C58F4070,3717,000000F2FBCFF578,00000240CC26E070,3717) 942.71ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4
MKL_VERBOSE DGEMM(N,N,3717,3717,3717,000000F2FBCFF550,00000240BEF82070,3717,00000240C58F4070,3717,000000F2FBCFF578,00000240CC26E070,3717) 961.10ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4

Could you figure out the reason for that?

Thank you very much!

Best regards

 

 
Gennady_F_Intel
Moderator

OK, the verbose output shows that the AVX2 code path is called, and I see you run only 4 threads. Is this a Windows OS?

TonyNie
Beginner

Hi, @Gennady_F_Intel 

Yes, I was running the code on Windows. The CPU of my laptop is an Intel i7-8565U with 4 cores and 8 threads. Thank you~

 

 
Gennady_F_Intel
Moderator

OK, I reproduced the problem with the AVX2 code branch on Linux as well. My guess is that gemm is not well optimized on the AVX2 code path for some specific input problem sizes. You may escalate this issue to the official Intel Online Service Center.

Running the same code on an AVX-512 based system with problem sizes from 1K to 10K, I see that gemm outperforms potrf/potri.

Regards,

Gennady


Gennady_F_Intel
Moderator

This issue is being closed and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.