TonyNie

Beginner

01-28-2021
02:24 AM

216 Views

Why matrix inversion (dpotrf & dpotri) faster than multiplication (dgemm) for matrix of same size?

Dear all,

I'm using the Intel C++ Compiler 19.1 integrated into Visual Studio 2019, with MKL, on Windows.

Recently, I found that matrix inversion (using LAPACKE_dpotrf and LAPACKE_dpotri) seems to be about 2 times faster than multiplication (using cblas_dgemm) for a square N-by-N matrix of the same size. However, the total number of floating-point operations (flops) should be approximately the same for inversion and multiplication: for inversion, flops = 1/3 * (N^3) [dpotrf] + 2/3 * (N^3) [dpotri] = N^3, and for multiplication, flops = N^w, with w<=3.0.
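For reference, a sketch of the leading-order operation counts usually quoted for these routines, counting each multiplication and each addition as one flop. This convention is an assumption brought in here, not something stated in the post, and it describes the classical algorithms rather than what any particular MKL build executes:

```latex
\begin{aligned}
\text{flops}(\texttt{dpotrf}) &\approx \tfrac{1}{3} N^3, \\
\text{flops}(\texttt{dpotri}) &\approx \tfrac{2}{3} N^3, \\
\text{flops}(\texttt{dgemm})  &\approx 2 N^3 \quad (\text{for } M = N = K).
\end{aligned}
```

Under that convention the multiplication performs about twice as many flops as the Cholesky-based inversion, so a factor of about 2 in run time would be consistent with both running at a similar flop rate.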

This is the code I used for the timing test:

    // Matrix inversion (N = 3717, COUNT = 100)
    // Note: dpotrf and dpotri work in place, so every call below operates on
    // the result left in MAT_A by the previous call.
    LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'U', N, MAT_A, N); // warm-up call, not timed
    LAPACKE_dpotri(LAPACK_COL_MAJOR, 'U', N, MAT_A, N);

    time = dsecnd();
    for (i = 0; i < COUNT; i++)
    {
        LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'U', N, MAT_A, N);
        LAPACKE_dpotri(LAPACK_COL_MAJOR, 'U', N, MAT_A, N);
    }
    time = dsecnd() - time;
    T_INV = time / COUNT; // average seconds per inversion

    // Matrix multiplication (MAT_C = MAT_A * MAT_B)
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, N, N, N, 1.0, MAT_A, N, MAT_B, N, 0.0, MAT_C, N); // warm-up call, not timed

    time = dsecnd();
    for (i = 0; i < COUNT; i++)
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, N, N, N, 1.0, MAT_A, N, MAT_B, N, 0.0, MAT_C, N);
    time = dsecnd() - time;
    T_MM = time / COUNT; // average seconds per multiplication

Here N = 3717 is the matrix size and COUNT = 100.

The average time for an N-by-N matrix inversion, T_INV, is about 0.79 seconds, and the average time for multiplying two N-by-N matrices, T_MM, is about 1.57 seconds.

The inversion is clearly faster than the multiplication, and I could not figure out why. Is it because only the upper-triangular part is needed for the inversion? Or is my timing test flawed?

Thank you very much!

Best regards

13 Replies

RahulV_intel

Moderator

02-01-2021
05:26 AM

188 Views

Hi,

Thanks for reporting this issue. I've forwarded your query to the MKL experts. They will get in touch with you.

Regards,

Rahul

Gennady_F_Intel

Moderator

02-02-2021
05:21 AM

178 Views

Thanks for the case. What type of CPU are you running this on?

Gennady_F_Intel

Moderator

02-02-2021
06:59 AM

173 Views

Here is what I see on an AVX-512 based system (RH7, lp64 mode, MKL 2020 u4):

./a.out

MKL_VERBOSE Intel(R) MKL 2020.0 Update 4 Product build 20200917 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Intel(R) Deep Learning Boost (Intel(R) DL Boost), Lnx 2.90GHz lp64 intel_thread:

**Inversion =0.172078 ,sec**

**dgemm =0.0497837 ,sec**

TonyNie

Beginner

02-02-2021
07:22 AM

168 Views

Hi, @Gennady_F_Intel

Many thanks for your answer and test!

I'm using an Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz 1.99 GHz with 16GB RAM on Windows 10, and the compiler is Intel C++ 19.1.

The test was run in Debug mode, and the matrix I used for the inversion is a symmetric positive definite matrix of size 3717-by-3717.

Based on your test, the matrix inversion is about 3 to 4 times slower than the multiplication. What is the size of your square matrix?

Furthermore, it is still not clear to me why the time cost of inversion and multiplication should differ so much for matrices of the same size. Are there any general rules for comparing the performance of the MKL matrix inversion and multiplication functions for matrices of the same size, such as the flop complexity?

Thank you very much!

Best regards

TonyNie

Beginner

02-02-2021
07:22 AM

167 Views

Thank you!

Gennady_F_Intel

Moderator

02-02-2021
07:28 AM

161 Views

TonyNie

Beginner

02-02-2021
08:00 AM

153 Views

Hi, @Gennady_F_Intel

Thank you!

However, I am not familiar with the MKL_VERBOSE setting. Could you please give me some instructions on how to enable it? I'm using Windows 10 with the Visual Studio 2019 IDE.

Gennady_F_Intel

Moderator

02-02-2021
09:51 AM

149 Views

TonyNie

Beginner

02-03-2021
03:08 AM

139 Views

Hi, @Gennady_F_Intel

Many thanks for the instruction!

Here are my MKL_VERBOSE results, with 3 iterations each for inversion (DPOTRF and DPOTRI) and multiplication (DGEMM). The multiplication still seems to be slower than the inversion in my case.

    MKL_VERBOSE Intel(R) MKL 2020.0 Update 4 Product build 20200917 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Win 1.80GHz cdecl intel_thread
    MKL_VERBOSE DPOTRF(U,3717,00000240BEF82070,3717,0) 206.48ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4
    MKL_VERBOSE DPOTRI(U,3717,00000240BEF82070,3717,0) 580.41ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4
    MKL_VERBOSE DPOTRF(U,3717,00000240BEF82070,3717,0) 188.61ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4
    MKL_VERBOSE DPOTRI(U,3717,00000240BEF82070,3717,0) 455.48ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4
    MKL_VERBOSE DPOTRF(U,3717,00000240BEF82070,3717,0) 188.05ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4
    MKL_VERBOSE DPOTRI(U,3717,00000240BEF82070,3717,0) 496.18ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4
    MKL_VERBOSE DGEMM(N,N,3717,3717,3717,000000F2FBCFF550,00000240BEF82070,3717,00000240C58F4070,3717,000000F2FBCFF578,00000240CC26E070,3717) 949.66ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4
    MKL_VERBOSE DGEMM(N,N,3717,3717,3717,000000F2FBCFF550,00000240BEF82070,3717,00000240C58F4070,3717,000000F2FBCFF578,00000240CC26E070,3717) 942.71ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4
    MKL_VERBOSE DGEMM(N,N,3717,3717,3717,000000F2FBCFF550,00000240BEF82070,3717,00000240C58F4070,3717,000000F2FBCFF578,00000240CC26E070,3717) 961.10ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4

Could you figure out the reason for that?

Thank you very much!

Best regards

Gennady_F_Intel

Moderator

02-04-2021
04:43 AM

119 Views

TonyNie

Beginner

02-04-2021
06:59 AM

113 Views

Hi, @Gennady_F_Intel

Yes, I was running the code on Windows. The CPU of my laptop is an Intel i7-8565U with 4 cores and 8 threads. Thank you!

Gennady_F_Intel

Moderator

02-05-2021
03:29 AM

70 Views

OK, I reproduced the problem with the AVX2 code branch on Linux as well. My guess is that gemm is not well optimized on the AVX2 code path for some specific input problem sizes. You may escalate this issue to the official Intel Online Service Center.

Running the same code on an AVX-512 based system with problem sizes from 1K to 10K, I see that gemm outperforms potrf/potri.

Regards,

Gennady

Gennady_F_Intel

Moderator

02-25-2021
08:37 PM

31 Views
