Massive Slowdown in cblas_scal for Intel 2020/2021

John_Young · ‎01-21-2021

Hi,

We've tracked a severe performance hit in our codes to the function cblas_scal. The efficiency hit shows up starting in Intel MKL 2020 and still occurs with Intel 2021. It seems to occur when calling cblas_cscal from a threaded region and does not seem to occur when calling cblas_cscal from a non-threaded region.

Attached is a test case that we ran on our Linux cluster, with results for Intel 2018, 2019, 2020, and 2021. We compiled with

icc -O2 -qopenmp -mkl cblas_test.cpp

The basic timing results are (with 16 threads in the thread regions):

MKL VERSION                 2018.0.03   2019.0.4   2020.0.4   2021.1
TIME(s) for Non-Threaded:   20.07       23.42      20.68      20.08
TIME(s) for Threaded:       1.67        1.89       38.54      38.30

Note the catastrophic slowdown in 2020/2021 when cblas_cscal is called from a threaded region: it is even slower than the non-threaded loop and more than 20 times slower than the corresponding times in 2018/2019.

Thanks,

John

John_Young · ‎01-21-2021

The test case in my original post had some errors. The single-precision complex version, cblas_cscal, was being called instead of the double-precision real version, cblas_dscal. I've attached a corrected test case compiled with:

icc -qopenmp -O2 -mkl cblas_test.cpp scale.cpp

The efficiency issue still remains.

I also tested cblas_dcopy and I see the same type of issue. For 2018/2019, dcopy in the threaded loops shows significant improvement over the non-threaded loops. However, for 2020/2021 calling dcopy in the threaded loops is twice as slow as calling dcopy from the non-threaded loops. If you replace dcopy with direct copies, then no slowdown is observed.
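The attached test case is not reproduced in this thread. As a rough illustration only, a "direct copy" replacement for cblas_dcopy could look like the sketch below. The name direct_copy is hypothetical, and the sketch assumes positive strides, which is narrower than what cblas_dcopy accepts.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for cblas_dcopy(n, x, incx, y, incy).
// Assumption: both strides are positive; negative increments are not handled.
void direct_copy(std::size_t n, const double* x, std::size_t incx,
                 double* y, std::size_t incy) {
    assert(incx > 0 && incy > 0);
    for (std::size_t i = 0; i < n; ++i) {
        y[i * incy] = x[i * incx];  // element-wise strided copy, no library call
    }
}
```

Calling a function like this from inside the OpenMP loop, in place of cblas_dcopy, takes MKL out of the threaded region entirely.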

mecej4 · ‎01-21-2021

Here are timings (seconds) on an Intel NUC with an i7-10710U processor (6 cores, 12 threads) running Windows 10-64.

              2020.0.4   2021.1
Non-Threaded  13.68      12.87
Threaded      14.79      13.92

John_Young · ‎01-22-2021

Hi mecej4,

The threaded timings are not nearly as poor as on our Linux cluster. However, they show no improvement from threading. Would it be possible for you to run the test with Intel 2018 and Intel 2019 to verify that threading produces efficient results there?

The results I see indicate that there is some (major) threading issue that crept into MKL somewhere in the 2020 release (not sure which exact update).

I've also been able to verify that the same issue arises with the la_getrs function. I'm guessing this means it probably affects many MKL functions.

John

John_Young · ‎01-22-2021

Here are timings if I replace the MKL cblas_dscal call with a plain loop

MKL VERSION                 2018.0.03   2019.0.4   2020.0.4   2021.1
TIME(s) for Non-Threaded:   9.21        8.50       8.49       8.49
TIME(s) for Threaded:       0.89        0.79       0.81       0.82

Actually, in this case there really isn't any MKL involved. But it helps show that the threading-efficiency issue lies in MKL 2020 and 2021 and was not present in 2019 and before.

For such simple OpenMP code, there is no reason that the MKL 2020 and 2021 BLAS functions shouldn't have much better threading efficiency.
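The modified test case is not shown in the thread, so the following is only a sketch of what a plain-loop baseline of this kind might look like. The names scale_plain and time_row_scaling and the row-major matrix layout are assumptions made for illustration. They are not taken from the attachment.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for cblas_dscal(n, alpha, x, 1): scales x in place.
void scale_plain(std::size_t n, double alpha, double* x) {
    for (std::size_t i = 0; i < n; ++i) x[i] *= alpha;
}

// Scales every row of a rows x cols row-major matrix; returns elapsed seconds.
// With threaded == true the row loop runs as an OpenMP parallel for.
// Build with -qopenmp (icc) or -fopenmp (gcc/clang); without the flag the
// pragma is ignored and both variants run serially.
double time_row_scaling(std::vector<double>& data, std::size_t rows,
                        std::size_t cols, double alpha, bool threaded) {
    assert(data.size() == rows * cols);
    const long nrows = static_cast<long>(rows);
    const auto start = std::chrono::steady_clock::now();
    #pragma omp parallel for if(threaded) schedule(static)
    for (long r = 0; r < nrows; ++r) {
        scale_plain(cols, alpha, data.data() + static_cast<std::size_t>(r) * cols);
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Swapping the scale_plain call for cblas_dscal(cols, alpha, row_pointer, 1) would give the MKL variant of the same measurement. The only difference between the two runs is then the routine called inside the threaded loop.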

mecej4 · ‎01-22-2021

John_Young:

The only older Parallel Studio version that I have installed on this rather new Intel NUC is 2013 SP1, dated circa 2014 -- I felt that the 2017-2019 versions were not worth copying and reinstalling from a now-retired PC, but the 2013 SP1 was needed to support some other software that I use.

I changed one line in your program:

std::cout << std::string(buf) << "\n";

to

std::cout << buf << "\n";

in order to please the older Intel C compiler, but that should not affect the timings.

Here are the results:

S:\LANG\MKL>cblas_test
Intel(R) Math Kernel Library Version 11.1.4 Product Build 20140806 for Intel(R) 64 architecture applications
OpenMP procs/maxThreads = 12 / 12
TIME for Non-Threaded: 10.326
TIME for Threaded: 2.42484
DONE

These results strongly reinforce your findings on Linux.

John_Young · ‎01-22-2021

Thanks for checking an earlier version.

Could this issue be escalated to the development team?

RahulV_intel · ‎01-25-2021

Hi,

Thanks for reporting this issue. We are forwarding this query to the MKL experts. They will get back to you.

Thanks,

Rahul

Khang_N_Intel · ‎05-24-2021

I tested on both Windows and Linux systems. I was able to reproduce the issue on the Windows system, but not on the Linux system:

MKL Version              2020.0.4   2021.2.0

Windows 10
Non-Threaded             17.65      16.2
Threaded                 19.97      18.61

Linux Ubuntu 18.04 LTS
Non-Threaded             14.52      14.62
Threaded                 13.79      13.52

I was not able to get access to a cluster to test. I just tested on single-processor systems.

John_Young · ‎06-01-2021

Hi Khang,

I apologize I could not reply sooner. Thank you for looking at this problem. This is still a major issue for our codes.

Without knowing how many threads you were using, I cannot say 100%, but I would say that both your Windows and Linux systems exhibit the issue. For example, if you were using two threads on Linux, then your threaded timing should have dropped to around 8 seconds (if 4 threads, then you should see timings around 4 seconds). Even though you saw a small speedup on Linux instead of a slowdown, the parallel efficiency is terrible.
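The reasoning above comes down to two ratios: speedup (non-threaded time divided by threaded time) and parallel efficiency (speedup divided by thread count). A minimal sketch of that arithmetic follows. The struct and function names are made up for illustration.

```cpp
#include <cassert>

// Hypothetical helper types and names, used only to illustrate the arithmetic.
struct ScalingResult {
    double speedup;     // serial time / threaded time
    double efficiency;  // speedup / number of threads (1.0 is ideal)
};

ScalingResult parallel_efficiency(double serial_seconds,
                                  double threaded_seconds, int threads) {
    assert(serial_seconds > 0.0 && threaded_seconds > 0.0 && threads > 0);
    const double speedup = serial_seconds / threaded_seconds;
    return {speedup, speedup / threads};
}

// Threaded time expected under perfect scaling.
double ideal_threaded_seconds(double serial_seconds, int threads) {
    assert(serial_seconds > 0.0 && threads > 0);
    return serial_seconds / threads;
}
```

With the timings from the first post (16 threads), parallel_efficiency(20.07, 1.67, 16) gives a speedup of about 12 and an efficiency of about 0.75 for MKL 2018. parallel_efficiency(20.68, 38.54, 16) gives a speedup of about 0.54 for MKL 2020, which is slower than serial. For the Linux result above, ideal_threaded_seconds(14.52, 2) is 7.26 and ideal_threaded_seconds(14.52, 4) is 3.63, against a measured 13.79 seconds.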

If you are able, please try to run the same simulation using Intel MKL 2019. I think you would see the threaded timings drop significantly.

Thank you,

John

Khang_N_Intel · ‎06-01-2021

Hi John,

The issue will be addressed in the upcoming release of oneMKL, 2021.3.

Khang

John_Young · ‎06-01-2021

Khang,

That is good news. Thanks for letting us know.

John

MRajesh_intel · ‎06-30-2021

Hi,

The oneMKL 2021.3 version is now available to download. Can you please try on the latest version and let us know if the issue still persists?

Regards

Rajesh.

MRajesh_intel · ‎07-07-2021

Hi,

Can you please provide an update regarding the issue?

Regards

Rajesh.

John_Young · ‎07-07-2021

Hi Rajesh,

We have put in a request, but we are still waiting on our system administrators to install the latest Intel libraries. As soon as the libraries are installed, I'll verify the issue is fixed.

Best,

John

John_Young · ‎07-12-2021

Rajesh,

The issue seems to be fixed in 2021.3. Thanks for your help.

Best,

John

MRajesh_intel · ‎07-12-2021

Hi,

Thanks for the confirmation!

As this issue has been resolved, we will no longer respond to this thread. If you require any additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

Have a good day.

Regards

Rajesh

John_Young · ‎07-12-2021

Here are the final timings (in seconds) with the fixed code:

cblas_dscal
MKL Version :  2018.0.3   2019.0.4   2020.0.4   2021.1   2021.3
Non-Threaded:  19.59      22.33      17.84      18.38    17.75
Threaded    :  1.60       2.09       37.56      37.97    1.52

cblas_dcopy
MKL Version :  2018.0.3   2019.0.4   2020.0.4   2021.1   2021.3
Non-Threaded:  20.55      24.77      23.26      22.82    22.75
Threaded    :  1.81       2.27       42.54      42.22    1.91