Why is matrix inversion (dpotrf & dpotri) faster than multiplication (dgemm) for matrices of the same size?

TonyNie · 01-28-2021

Dear all,

I'm using Intel C++ Compiler 19.1 integrated into Visual Studio 2019, with MKL, on Windows.

Recently, I found that matrix inversion (using LAPACKE_dpotrf & LAPACKE_dpotri) seems to be faster than multiplication (using cblas_dgemm) by a factor of 2 for N-by-N square matrices of the same size. However, I expected the total number of floating-point operations (flops) to be approximately the same for both: for inversion, flops = 1/3 * N^3 [dpotrf] + 2/3 * N^3 [dpotri] = N^3, and for multiplication, flops = N^w with w <= 3.0.

The following is the code I used for the timing test:

// Matrix Inversion (N=3717)
LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'U', N, MAT_A, N); // warm-up call, excluded from the timing
LAPACKE_dpotri(LAPACK_COL_MAJOR, 'U', N, MAT_A, N);
time = dsecnd();
for (i = 0; i < COUNT; i++) // COUNT = 100
{
    LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'U', N, MAT_A, N);
    LAPACKE_dpotri(LAPACK_COL_MAJOR, 'U', N, MAT_A, N);
}
time = dsecnd() - time;
T_INV = time / COUNT; // average seconds per inversion

// Matrix Multiplication (MAT_C = MAT_A * MAT_B)
cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, N, N, N, 1.0, MAT_A, N, MAT_B, N, 0.0, MAT_C, N); // warm-up call, excluded from the timing
time = dsecnd();
for (i = 0; i < COUNT; i++)
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, N, N, N, 1.0, MAT_A, N, MAT_B, N, 0.0, MAT_C, N);
time = dsecnd() - time;
T_MM = time / COUNT; // average seconds per multiplication

Here N is the matrix size (N = 3717) and COUNT = 100.

The average time for one N-by-N matrix inversion, T_INV, is about 0.79 seconds,

and the average time for multiplying two N-by-N matrices, T_MM, is about 1.57 seconds.

The inversion is clearly faster than the multiplication, and I cannot figure out why. Is it because only the upper-triangular part is used in the inversion? Or is my timing test flawed?

Thank you very much!

Best regards
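
For reference, the flop accounting behind this question can be sketched as below. This is a minimal sketch, not MKL code: it assumes the standard leading-order operation counts for the classical algorithms, namely N^3/3 for dpotrf, 2N^3/3 for dpotri, and 2*M*N*K for dgemm (one multiply and one add per inner-product term), and all function names are hypothetical helpers. Under that assumption dgemm performs about 2N^3 flops, twice the inversion total, rather than the N^w with w <= 3 assumed in the question.

```c
#include <stdio.h>

/* Leading-order flop counts; lower-order N^2 and N terms are dropped. */

/* Cholesky factorization (dpotrf): about N^3 / 3 flops. */
double flops_potrf(double n) { return n * n * n / 3.0; }

/* Inverse from the Cholesky factor (dpotri): about 2 N^3 / 3 flops. */
double flops_potri(double n) { return 2.0 * n * n * n / 3.0; }

/* Full inversion as timed in the question: dpotrf followed by dpotri. */
double flops_inversion(double n) { return flops_potrf(n) + flops_potri(n); }

/* Classical matrix multiply (dgemm): one multiply and one add per term. */
double flops_gemm(double m, double n, double k) { return 2.0 * m * n * k; }

/* Achieved rate in GFLOP/s for a flop count and a wall time in seconds. */
double gflops(double flops, double seconds) { return flops / seconds / 1e9; }

/* Hypothetical helper: print both rates for one size and pair of timings. */
void print_rates(double n, double t_inv, double t_mm)
{
    printf("inversion: %.1f GFLOP/s\n", gflops(flops_inversion(n), t_inv));
    printf("dgemm:     %.1f GFLOP/s\n", gflops(flops_gemm(n, n, n), t_mm));
}
```

Calling print_rates(3717.0, 0.79, 1.57) with the timings above gives roughly 65 GFLOP/s for both operations. Under these counts the two measurements reflect about the same arithmetic rate, and the factor of 2 in time matches the factor of 2 in work.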

RahulV_intel · 02-01-2021

Hi,

Thanks for reporting this issue. I've forwarded your query to the MKL experts. They will get in touch with you.

Regards,

Rahul

TonyNie · 02-02-2021

@RahulV_intel

Thank you!

Gennady_F_Intel · 02-02-2021

Thanks for the case. Which CPU are you running this on?

TonyNie · 02-02-2021

Hi, @Gennady_F_Intel

Many thanks for your answer and test!

I'm using an Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz 1.99 GHz with 16 GB RAM on Windows 10, and the compiler is Intel C++ 19.1.

The test was run in DEBUG mode, and the matrix I used for inversion is a symmetric positive definite matrix of size 3717-by-3717.

So based on your test, matrix inversion is about 3 to 4 times slower than multiplication. What is the size of your square matrix?

Furthermore, it is still not clear to me why the time cost of inversion and multiplication for matrices of the same size should differ so significantly. Are there any general rules for comparing the performance of the MKL matrix inversion and multiplication functions for matrices of the same size, such as flop complexity?

Thank you very much!

Best regards

Gennady_F_Intel · 02-02-2021

Here is what I see on an AVX-512 based system, RH7, lp64 mode, MKL 2020 u4:

./a.out

MKL_VERBOSE Intel(R) MKL 2020.0 Update 4 Product build 20200917 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Intel(R) Deep Learning Boost (Intel(R) DL Boost), Lnx 2.90GHz lp64 intel_thread:

Inversion =0.172078 ,sec

dgemm =0.0497837 ,sec

Gennady_F_Intel · 02-02-2021

Yes, I used exactly the same sizes (square case) and number of loops as you pointed out. You may set/export the MKL_VERBOSE environment variable and give us the very first lines of the output so we can check the exact version of MKL you are running.

TonyNie · 02-02-2021

Hi, @Gennady_F_Intel

Thank you!

However, I am not familiar with the MKL_VERBOSE setting. Could you please give me some instructions on how to use it? I'm using Windows 10 with the Visual Studio 2019 IDE.

Gennady_F_Intel · 02-02-2021

You can find this information in the MKL Developer Guide. If you run the program under the VS IDE, you may call mkl_verbose(true) before your MKL calls and see the output.

TonyNie · 02-03-2021

Hi, @Gennady_F_Intel

Many thanks for the instruction!

Here are my mkl_verbose results, with 3 loops each for inversion (DPOTRF & DPOTRI) and multiplication (DGEMM). Multiplication still seems to be slower than inversion in my case.

MKL_VERBOSE Intel(R) MKL 2020.0 Update 4 Product build 20200917 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Win 1.80GHz cdecl intel_thread

MKL_VERBOSE DPOTRF(U,3717,00000240BEF82070,3717,0) 206.48ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4
MKL_VERBOSE DPOTRI(U,3717,00000240BEF82070,3717,0) 580.41ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4
MKL_VERBOSE DPOTRF(U,3717,00000240BEF82070,3717,0) 188.61ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4
MKL_VERBOSE DPOTRI(U,3717,00000240BEF82070,3717,0) 455.48ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4
MKL_VERBOSE DPOTRF(U,3717,00000240BEF82070,3717,0) 188.05ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4
MKL_VERBOSE DPOTRI(U,3717,00000240BEF82070,3717,0) 496.18ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4

MKL_VERBOSE DGEMM(N,N,3717,3717,3717,000000F2FBCFF550,00000240BEF82070,3717,00000240C58F4070,3717,000000F2FBCFF578,00000240CC26E070,3717) 949.66ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4
MKL_VERBOSE DGEMM(N,N,3717,3717,3717,000000F2FBCFF550,00000240BEF82070,3717,00000240C58F4070,3717,000000F2FBCFF578,00000240CC26E070,3717) 942.71ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4
MKL_VERBOSE DGEMM(N,N,3717,3717,3717,000000F2FBCFF550,00000240BEF82070,3717,00000240C58F4070,3717,000000F2FBCFF578,00000240CC26E070,3717) 961.10ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4

Could you figure out the reason for that?

Thank you very much!

Best regards
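
To compare the calls in this log on a per-flop basis, the time of each line can be extracted and converted to a rate. The following is a minimal sketch in plain C, not part of MKL: parse_verbose_ms and gflops_from_ms are hypothetical helpers, and the parser assumes the line shape seen in this thread ("MKL_VERBOSE NAME(args) <time>ms ...") and accepts only times printed in ms.

```c
#include <stdlib.h>
#include <string.h>

/* Parse one MKL_VERBOSE call line of the shape seen in this thread:
 *   "MKL_VERBOSE NAME(args...) 123.45ms ..."
 * On success, stores the routine name and the time in milliseconds and
 * returns 1. Returns 0 (outputs untouched) for anything else, such as the
 * version banner line or a time printed in a unit other than ms. */
int parse_verbose_ms(const char *line, char *name, size_t name_cap, double *ms)
{
    const char *prefix = "MKL_VERBOSE ";
    size_t plen = strlen(prefix);
    if (strncmp(line, prefix, plen) != 0) return 0;

    const char *start = line + plen;          /* start of routine name */
    const char *open = strchr(start, '(');    /* start of argument list */
    if (open == NULL) return 0;
    const char *close = strchr(open, ')');    /* end of argument list */
    if (close == NULL) return 0;

    size_t len = (size_t)(open - start);
    if (len == 0 || len >= name_cap) return 0;

    /* The time follows the closing parenthesis; strtod skips the space. */
    char *end;
    double value = strtod(close + 1, &end);
    if (end == close + 1 || strncmp(end, "ms", 2) != 0) return 0;

    memcpy(name, start, len);
    name[len] = '\0';
    *ms = value;
    return 1;
}

/* Rate in GFLOP/s for a flop count and a time in milliseconds. */
double gflops_from_ms(double flops, double ms)
{
    return flops / (ms * 1e6);
}
```

Assuming the usual leading-order counts (about N^3 flops for DPOTRF plus DPOTRI, about 2N^3 for DGEMM) and N = 3717, the averages in this log work out to roughly 108 GFLOP/s for DGEMM (about 951 ms per call) and roughly 73 GFLOP/s for the DPOTRF/DPOTRI pair (about 705 ms per pair). By that measure DGEMM takes longer per call but does about twice the arithmetic.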

Gennady_F_Intel · 02-04-2021

OK, the verbose output shows that the AVX2 code path is called. I see you run only 4 threads. Is this Windows?

TonyNie · 02-04-2021

Hi, @Gennady_F_Intel

Yes, I was running the code on Windows. The CPU of my laptop is an Intel i7-8565U with 4 cores and 8 threads. Thank you!

Gennady_F_Intel · 02-05-2021

OK, I reproduced the problem with the AVX2 code branch on Linux as well. My guess is that gemm is not well optimized for the AVX2 code path for some specific input problem sizes. You may escalate this issue to the official Intel Online Service Center.

Running the same code on an AVX-512 based system with problem sizes from 1K to 10K, I see that gemm outperforms potrf/potri.

Regards,

Gennady

Gennady_F_Intel · 02-25-2021

This issue is being closed and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.