John_Young
New Contributor I
228 Views

Massive Slowdown in cblas_scal for Intel 2020/2021

Hi,

We've traced a severe performance hit in our codes to the function cblas_scal. The slowdown first appears in Intel MKL 2020 and is still present in Intel 2021. It seems to occur when cblas_cscal is called from a threaded region, and not when it is called from a non-threaded region.

Attached is a test case that we ran on our Linux cluster, with results for Intel 2018, 2019, 2020, and 2021. We compiled with

     icc -O2 -qopenmp -mkl cblas_test.cpp 

The basic timing results are (with 16 threads in the thread regions):

MKL VERSION                  2018.0.03   2019.0.4   2020.0.4   2021.1
TIME(s) for Non-Threaded:    20.07       23.42      20.68      20.08
TIME(s) for Threaded:        1.67        1.89       38.54      38.30

Note the catastrophic slowdown in 2020/2021 when cblas_cscal is called from a threaded region: it is even slower than the non-threaded loop, and roughly 20 times slower than the corresponding times in 2018/2019.

Thanks,

John

 

7 Replies
John_Young
New Contributor I
207 Views

The test case in my original post had some errors. The single-precision complex version, cblas_cscal, was being called instead of the double-precision real version. I've attached a corrected test case compiled with:

    icc -qopenmp -O2 -mkl  cblas_test.cpp scale.cpp

The efficiency issue still remains. 

I also tested cblas_dcopy and I see the same type of issue.  For 2018/2019, dcopy in the threaded loops shows significant improvement over the non-threaded loops.  However, for 2020/2021 calling dcopy in the threaded loops is twice as slow as calling dcopy from the non-threaded loops.  If you replace dcopy with direct copies, then no slowdown is observed.
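
For readers without the attachment, here is a minimal sketch of what a "direct copy" in place of cblas_dcopy could look like. The function name copy_plain is hypothetical and is not taken from the attached test case; the sketch also assumes positive strides only, whereas cblas_dcopy accepts negative ones as well.

```cpp
#include <cstddef>

// Hypothetical plain-loop stand-in for cblas_dcopy(n, x, incx, y, incy).
// Assumption: incx and incy are positive strides.
inline void copy_plain(std::size_t n,
                       const double* x, std::size_t incx,
                       double* y, std::size_t incy) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i * incy] = x[i * incx];  // element-by-element copy, no library call
    }
}
```

Called from inside an OpenMP loop, a function like this performs no library call at all, so whatever timing it shows comes from the OpenMP loop alone.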

mecej4
Black Belt
193 Views

Here are timings (seconds) on an Intel NUC with an i7-10710U processor (6 cores, 12 threads) running Windows 10-64.

MKL VERSION      2020.0.4   2021.1
Non-Threaded     13.68      12.87
Threaded         14.79      13.92
John_Young
New Contributor I
171 Views

Hi mecej4,

The threaded timings are not nearly as poor as on our Linux cluster. However, they show no improvement from threading. Would it be possible for you to run the test with Intel 2018 and Intel 2019 to verify that threading gives efficient results there?

The results I see indicate that there is some (major) threading issue that crept into MKL somewhere in the 2020 release (not sure which exact update).

I've also been able to verify that the same issue arises with the la_getrs function.  I'm guessing this means it probably affects many MKL functions.

John

John_Young
New Contributor I
165 Views

Here are timings if I replace the MKL cblas_dscal call with a plain loop:

MKL VERSION                  2018.0.03   2019.0.4   2020.0.4   2021.1
TIME(s) for Non-Threaded:    9.21        8.50       8.49       8.49
TIME(s) for Threaded:        0.89        0.79       0.81       0.82

In this case there really isn't any MKL involved. But it helps show that there is a threading-efficiency issue in MKL 2020 and 2021 that was not present in 2019 and before.

For such simple OpenMP code, there is no reason that MKL 2020 and 2021 blas functions shouldn't have much better threading efficiency.
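
To illustrate the kind of comparison being timed, here is a minimal sketch of a plain-loop version. It assumes the work is a set of independent vectors that are each scaled once, either serially or from an OpenMP-parallel outer loop. The names scale_plain and time_scaling are hypothetical and are not the attached cblas_test.cpp.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Plain-loop stand-in for cblas_dscal(n, alpha, x, 1): x[i] *= alpha.
inline void scale_plain(std::size_t n, double alpha, double* x) {
    for (std::size_t i = 0; i < n; ++i) {
        x[i] *= alpha;
    }
}

// Scales every row once and returns the elapsed wall-clock time in seconds.
// With threaded == true the outer loop is shared among OpenMP threads.
// Build with -qopenmp (icc) or -fopenmp (gcc/clang); without that flag the
// pragma is ignored and both modes run serially.
double time_scaling(std::vector<std::vector<double>>& rows,
                    double alpha, bool threaded) {
    const auto start = std::chrono::steady_clock::now();
    const long nrows = static_cast<long>(rows.size());
    #pragma omp parallel for if(threaded) schedule(static)
    for (long r = 0; r < nrows; ++r) {
        scale_plain(rows[r].size(), alpha, rows[r].data());
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Swapping the body of scale_plain for a call to cblas_dscal would give the MKL variant of the same measurement, assuming the attached test is structured along these lines.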

 

mecej4
Black Belt
162 Views

John_Young:

The only older Parallel Studio version that I have installed on this rather new Intel NUC is 2013 SP1, dated circa 2014 -- I felt that the 2017-2019 versions were not worth copying and reinstalling from a now-retired PC, but the 2013 SP1 was needed to support some other software that I use.

I changed one line in your program:

std::cout << std::string(buf) << "\n";

to

std::cout << buf << "\n";

in order to please the older Intel C compiler, but that should not affect the timings.

Here are the results:

S:\LANG\MKL>cblas_test
Intel(R) Math Kernel Library Version 11.1.4 Product Build 20140806 for Intel(R) 64 architecture applications
OpenMP procs/maxThreads = 12 / 12
TIME for Non-Threaded: 10.326
TIME for Threaded: 2.42484
DONE

These results strongly reinforce your findings on Linux.

 

John_Young
New Contributor I
150 Views

Thanks for checking an earlier version.

Could this issue be escalated to the development team?

RahulV_intel
Moderator
120 Views

Hi,


Thanks for reporting this issue. We are forwarding this query to the MKL experts. They will get back to you.


Thanks,

Rahul