Intel® oneAPI Math Kernel Library

Massive Slowdown in cblas_scal for Intel 2020/2021

John_Young
New Contributor I

Hi,

We've tracked a severe performance hit in our codes to the function cblas_scal. The slowdown first appears in Intel MKL 2020 and still occurs with Intel 2021. It seems to occur when cblas_cscal is called from inside a threaded region and not when it is called from a non-threaded region.

Attached is a test case that we ran on our Linux cluster, with results for Intel 2018, 2019, 2020, and 2021. We compiled with

     icc -O2 -qopenmp -mkl cblas_test.cpp 

The basic timing results are (with 16 threads in the thread regions):

MKL Version : 2018.0.03  2019.0.4  2020.0.4    2021.1
Non-Threaded:     20.07     23.42     20.68     20.08
Threaded    :      1.67      1.89     38.54     38.30

Note the catastrophic slowdown in 2020/2021 when cblas_cscal is called from a threaded region: it is even slower than the non-threaded loop and about 20 times slower than the corresponding times in 2018/2019.

Thanks,

John

 

John_Young
New Contributor I

The test case in my original post had some errors. The single-precision complex version, cblas_cscal, was being called instead of the double-precision real version, cblas_dscal. I've attached a corrected test case compiled with:

    icc -qopenmp -O2 -mkl  cblas_test.cpp scale.cpp

The efficiency issue still remains. 

I also tested cblas_dcopy and I see the same type of issue.  For 2018/2019, dcopy in the threaded loops shows significant improvement over the non-threaded loops.  However, for 2020/2021 calling dcopy in the threaded loops is twice as slow as calling dcopy from the non-threaded loops.  If you replace dcopy with direct copies, then no slowdown is observed.
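
A direct copy of the kind mentioned above might look like the sketch below. The name copy_direct is hypothetical, and the argument order simply mirrors cblas_dcopy(n, x, incx, y, incy). It assumes positive strides only, whereas the BLAS routine also accepts negative increments.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical plain-loop stand-in for cblas_dcopy(n, x, incx, y, incy).
// Assumption: both strides are positive (the BLAS routine also allows negative ones).
void copy_direct(int n, const double* x, int incx, double* y, int incy) {
    assert(incx > 0 && incy > 0);
    for (int i = 0; i < n; ++i) {
        // Element i of the logical vector lives at offset i * stride.
        y[static_cast<std::size_t>(i) * incy] = x[static_cast<std::size_t>(i) * incx];
    }
}
```

Because the loop has no library call in it, any remaining difference between the threaded and non-threaded timings would come from the OpenMP loop around it rather than from MKL.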

mecej4
Honored Contributor III

Here are timings (seconds) on an Intel NUC with an i7-10710U processor (6 cores, 12 threads) running Windows 10-64.

MKL Version :  2020.0.4    2021.1
Non-Threaded:     13.68     12.87
Threaded    :     14.79     13.92
John_Young
New Contributor I

Hi mecej4,

The threaded timings are not nearly as poor as on our Linux cluster. However, they show no improvement from threading. Would it be possible for you to run the test with Intel 2018 and Intel 2019 to verify that threading produces efficient results there?

The results I see indicate that there is some (major) threading issue that crept into MKL somewhere in the 2020 release (not sure which exact update).

I've also been able to verify that the same issue arises with the la_getrs function.  I'm guessing this means it probably affects many MKL functions.

John

John_Young
New Contributor I

Here are timings if I replace the MKL cblas_dscal call with a plain loop:

MKL Version : 2018.0.03  2019.0.4  2020.0.4    2021.1
Non-Threaded:      9.21      8.50      8.49      8.49
Threaded    :      0.89      0.79      0.81      0.82

Actually, in this case, there really isn't any MKL involved. But this helps show that there is an issue with threading efficiency in MKL 2020 and 2021 that was not present in 2019 and before.

For such simple OpenMP code, there is no reason that the MKL 2020 and 2021 BLAS functions shouldn't have much better threading efficiency.
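
The plain-loop comparison can be sketched roughly as follows. The attached cblas_test.cpp is not reproduced in this thread, so the names scale_plain and time_scaling, the layout of the data as independent blocks, and the loop structure are all assumptions made for illustration, not the actual test.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Plain-loop stand-in for cblas_dscal(n, alpha, x, 1): x[i] *= alpha.
inline void scale_plain(std::size_t n, double alpha, double* x) {
    for (std::size_t i = 0; i < n; ++i) x[i] *= alpha;
}

// Hypothetical harness: scales each independent block `reps` times and
// returns the elapsed wall-clock time in seconds.
// threaded == true  -> the loop over blocks runs as an OpenMP parallel for
// threaded == false -> the same loop runs serially
// Build with -qopenmp (icc) or -fopenmp (gcc/clang); without it the pragma
// is ignored and both modes run serially.
double time_scaling(std::vector<std::vector<double>>& data, double alpha,
                    int reps, bool threaded) {
    assert(reps >= 0);
    const long nblocks = static_cast<long>(data.size());
    const auto start = std::chrono::steady_clock::now();
    #pragma omp parallel for if(threaded) schedule(static)
    for (long b = 0; b < nblocks; ++b) {
        // Each iteration touches only its own block, so there are no data races.
        for (int r = 0; r < reps; ++r)
            scale_plain(data[b].size(), alpha, data[b].data());
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Swapping the body of scale_plain for a call to cblas_dscal would give the MKL variant of the same measurement, with each thread calling the library on its own block.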

 

mecej4
Honored Contributor III

John_Young:

The only older Parallel Studio version that I have installed on this rather new Intel NUC is 2013 SP1, dated circa 2014 -- I felt that the 2017-2019 versions were not worth copying and reinstalling from a now-retired PC, but the 2013 SP1 was needed to support some other software that I use.

I changed one line in your program:

std::cout << std::string(buf) << "\n";

to

std::cout << buf << "\n";

in order to please the older Intel C compiler, but that should not affect the timings.

Here are the results:

S:\LANG\MKL>cblas_test
Intel(R) Math Kernel Library Version 11.1.4 Product Build 20140806 for Intel(R) 64 architecture applications
OpenMP procs/maxThreads = 12 / 12
TIME for Non-Threaded: 10.326
TIME for Threaded: 2.42484
DONE

These results strongly reinforce your findings on Linux.

 

John_Young
New Contributor I

Thanks for checking an earlier version.

Could this issue be escalated to the development team?

RahulV_intel
Moderator

Hi,


Thanks for reporting this issue. We are forwarding this query to the MKL experts. They will get back to you.


Thanks,

Rahul


Khang_N_Intel
Employee

I tested on both Windows and Linux systems. I was able to reproduce the issue on the Windows system, but not on the Linux system:

MKL Version :  2020.0.4  2021.2.0

Windows 10
Non-Threaded:     17.65      16.2
Threaded    :     19.97     18.61

Linux Ubuntu 18.04 LTS
Non-Threaded:     14.52     14.62
Threaded    :     13.79     13.52

I was not able to get access to a cluster to test. I tested only on single-processor systems.


John_Young
New Contributor I

Hi Khang,

 

I apologize that I could not reply sooner. Thank you for looking at this problem. This is still a major issue for our codes.

 

Without knowing how many threads you were using, I cannot be 100% sure, but I would say that both your Windows and Linux systems exhibit the issue. For example, if you were using two threads on Linux, then your threaded timing should have dropped to around 8 seconds (with 4 threads, you should see timings around 4 seconds). Even though you saw a small speedup on Linux instead of a slowdown, the parallel efficiency is terrible.
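
The arithmetic behind this argument can be written down as a small sketch. The function names are hypothetical, and the thread counts of 2 and 4 are only the examples used above, since the actual thread count for those runs was not reported.

```cpp
#include <cassert>

// Speedup of the threaded run relative to the non-threaded run.
double speedup(double t_serial, double t_threaded) {
    assert(t_serial > 0.0 && t_threaded > 0.0);
    return t_serial / t_threaded;
}

// Parallel efficiency: the fraction of ideal linear speedup that was
// actually achieved on `threads` threads (1.0 means perfect scaling).
double parallel_efficiency(double t_serial, double t_threaded, int threads) {
    assert(threads > 0);
    return speedup(t_serial, t_threaded) / threads;
}
```

With the Linux timings for 2020.0.4 (14.52 s non-threaded, 13.79 s threaded), the speedup is about 1.05, which corresponds to an efficiency of roughly 0.53 if 2 threads were used and roughly 0.26 if 4 were used. For comparison, the 2019.0.4 cluster timings from the first post (23.42 s and 1.89 s on 16 threads) give a speedup of about 12.4 and an efficiency of about 0.77.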

 

If you are able, please try to run the same simulation using Intel MKL 2019.  I think you would see the threaded timings drop significantly. 

 

Thank you,

John

Khang_N_Intel
Employee

Hi John,

The issue will be addressed in the upcoming release of oneMKL, 2021.3.

Khang


John_Young
New Contributor I

Khang,

 

That is good news. Thanks for letting us know.

 

John

 

MRajesh_intel
Moderator

Hi,

 

The oneMKL 2021.3 version is now available for download. Could you please try the latest version and let us know if the issue still persists?

 

Regards

Rajesh.

 

MRajesh_intel
Moderator

Hi,


Can you please provide an update regarding the issue?


Regards

Rajesh.


John_Young
New Contributor I

Hi Rajesh,

 

We have put in a request, but we are still waiting on our system administrators to install the latest Intel libraries.    As soon as the libraries are installed, I'll verify the issue is fixed.

Best,

John

John_Young
New Contributor I

Rajesh,

 

The issue seems to be fixed in 2021.3.  Thanks for your help.


Best,

John

MRajesh_intel
Moderator

 Hi,


Thanks for the confirmation!


As this issue has been resolved, we will no longer respond to this thread. If you require any additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.


Have a good day.


Regards

Rajesh


John_Young
New Contributor I

Here are the final timings (in seconds) with the fixed code:

cblas_dscal
MKL Version :  2018.0.3  2019.0.4  2020.0.4    2021.1    2021.3
Non-Threaded:     19.59     22.33     17.84     18.38     17.75
Threaded    :      1.60      2.09     37.56     37.97      1.52


cblas_dcopy
MKL Version :  2018.0.3  2019.0.4  2020.0.4    2021.1    2021.3
Non-Threaded:     20.55     24.77     23.26     22.82     22.75
Threaded    :      1.81      2.27     42.54     42.22      1.91

 

 
