Hi,
We've tracked a severe performance hit in our codes to the function cblas_cscal. The slowdown first appears in Intel MKL 2020 and still occurs with Intel MKL 2021. It seems to occur when cblas_cscal is called from a threaded region and not when it is called from a non-threaded region.
Attached is a test case that we ran on our Linux cluster with results for Intel 2018, 2019, 2020, and 2021. We compiled with
icc -O2 -qopenmp -mkl cblas_test.cpp
The basic timing results are (with 16 threads in the thread regions):
MKL VERSION                 2018.0.03   2019.0.4   2020.0.4   2021.1
TIME(s) for Non-Threaded:       20.07      23.42      20.68    20.08
TIME(s) for Threaded:            1.67       1.89      38.54    38.30
Note the catastrophic slowdown in 2020/2021 when cblas_cscal is called from a threaded region: it is even slower than the non-threaded loop and roughly 20 times slower than the corresponding times in 2018/2019.
Thanks,
John
The test case in my original post had some errors. The single-precision complex version, cblas_cscal, was being called instead of the double-precision real version, cblas_dscal. I've attached a corrected test case compiled with:
icc -qopenmp -O2 -mkl cblas_test.cpp scale.cpp
The efficiency issue still remains.
I also tested cblas_dcopy and I see the same type of issue. For 2018/2019, dcopy in the threaded loops shows significant improvement over the non-threaded loops. However, for 2020/2021 calling dcopy in the threaded loops is twice as slow as calling dcopy from the non-threaded loops. If you replace dcopy with direct copies, then no slowdown is observed.
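For readers without the attachment, here is a minimal sketch of what a "direct copy" replacement for cblas_dcopy might look like. The names copy_plain and copy_vector are hypothetical and are not taken from the attached test case, and the sketch assumes positive strides only (the real BLAS routine also accepts negative increments).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for cblas_dcopy(n, x, incx, y, incy).
// Assumption: both strides are positive; negative BLAS increments are not handled.
void copy_plain(std::size_t n, const double* x, std::size_t incx,
                double* y, std::size_t incy) {
    assert(incx > 0 && incy > 0);
    for (std::size_t i = 0; i < n; ++i) {
        y[i * incy] = x[i * incx];
    }
}

// Convenience wrapper: unit-stride copy of a whole vector.
std::vector<double> copy_vector(const std::vector<double>& x) {
    std::vector<double> y(x.size());
    copy_plain(x.size(), x.data(), 1, y.data(), 1);
    return y;
}
```

Because the loop contains no library call, any threading slowdown that disappears with this version would point at the library routine rather than at the surrounding OpenMP code.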
Here are timings (seconds) on an Intel NUC with an i7-10710U processor (6 cores, 12 threads) running Windows 10-64.
MKL Version | 2020.0.4 | 2021.1 |
Non-Threaded | 13.68 | 12.87 |
Threaded | 14.79 | 13.92 |
Hi mecej4,
The threaded timings are not nearly as poor as on our Linux cluster. However, they show no improvement from threading. Could you run the test with Intel 2018 and Intel 2019 to verify that threading produces efficient results there?
The results I see indicate that there is some (major) threading issue that crept into MKL somewhere in the 2020 release (not sure which exact update).
I've also been able to verify that the same issue arises with the la_getrs function. I'm guessing this means it probably affects many MKL functions.
John
Here are timings if I replace the MKL cblas_dscal call with a plain loop:
MKL VERSION                 2018.0.03   2019.0.4   2020.0.4   2021.1
TIME(s) for Non-Threaded:        9.21       8.50       8.49     8.49
TIME(s) for Threaded:            0.89       0.79       0.81     0.82
In this case no MKL routine is involved in the timed loops. The plain loop threads well with every version, which points to the slowdown being in MKL 2020 and 2021 rather than in the test's OpenMP code; it was not present in 2019 and before.
For such simple OpenMP code, there is no reason the MKL 2020 and 2021 BLAS functions shouldn't have much better threading efficiency.
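As a rough illustration of the plain-loop variant, the sketch below scales a batch of vectors either serially or from an OpenMP parallel loop and times the passes. The function names and the batch-of-vectors layout are assumptions for illustration; the structure of the attached test case may differ. The pragma is simply ignored if the file is compiled without OpenMP.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Plain-loop stand-in for cblas_dscal(n, alpha, x, 1).
void scale_plain(std::size_t n, double alpha, double* x) {
    for (std::size_t i = 0; i < n; ++i) {
        x[i] *= alpha;
    }
}

// Scales every vector in the batch once. With OpenMP enabled and
// threaded == true, the outer loop is split across threads.
void scale_batch(std::vector<std::vector<double>>& batch, double alpha,
                 bool threaded) {
    const long count = static_cast<long>(batch.size());
    #pragma omp parallel for if(threaded)
    for (long i = 0; i < count; ++i) {
        scale_plain(batch[i].size(), alpha, batch[i].data());
    }
}

// Wall-clock seconds for `repeats` passes over the batch.
double time_batch(std::vector<std::vector<double>>& batch, double alpha,
                  bool threaded, int repeats) {
    assert(repeats > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        scale_batch(batch, alpha, threaded);
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Calling time_batch once with threaded set to false and once with it set to true gives the two rows of a table like the one above; swapping scale_plain for the library call is then a one-line change.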
John_Young:
The only older Parallel Studio version that I have installed on this rather new Intel NUC is 2013 SP1, dated circa 2014 -- I felt that the 2017-2019 versions were not worth copying and reinstalling from a now-retired PC, but the 2013 SP1 was needed to support some other software that I use.
I changed one line in your program:
std::cout << std::string(buf) << "\n";
to
std::cout << buf << "\n";
in order to please the older Intel C compiler, but that should not affect the timings.
Here are the results:
S:\LANG\MKL>cblas_test
Intel(R) Math Kernel Library Version 11.1.4 Product Build 20140806 for Intel(R) 64 architecture applications
OpenMP procs/maxThreads = 12 / 12
TIME for Non-Threaded: 10.326
TIME for Threaded: 2.42484
DONE
These results strongly reinforce your findings on Linux.
Thanks for checking an earlier version.
Could this issue be escalated to the development team?
Hi,
Thanks for reporting this issue. We are forwarding this query to the MKL experts. They will get back to you.
Thanks,
Rahul
I tested on both Windows and Linux systems. I was able to reproduce the issue on the Windows system, but not on the Linux system:
MKL Version | 2020.0.4 | 2021.2.0 |
Windows 10 | | |
Non-Threaded | 17.65 | 16.2 |
Threaded | 19.97 | 18.61 |
Linux Ubuntu 18.04 LTS | | |
Non-Threaded | 14.52 | 14.62 |
Threaded | 13.79 | 13.52 |
I was not able to get access to a cluster to test, so I tested only on single-processor systems.
Hi Khang,
I apologize that I could not reply sooner. Thank you for looking at this problem. This is still a major issue for our codes.
Without knowing how many threads you were using, I cannot be certain, but I would say that both your Windows and Linux systems exhibit the issue. For example, if you were using two threads on Linux, your threaded timing should have dropped to around 8 seconds (with 4 threads, to around 4 seconds). Even though you saw a small speedup on Linux instead of a slowdown, the parallel efficiency is terrible.
If you are able, please try to run the same simulation using Intel MKL 2019. I think you would see the threaded timings drop significantly.
Thank you,
John
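To make the efficiency argument above concrete, here is a small sketch of the usual definitions: speedup is the non-threaded time divided by the threaded time, and parallel efficiency is the speedup divided by the thread count. The function names are illustrative, and the thread count of 2 used below is only the hypothetical from the post, since the actual count was not reported.

```cpp
#include <cassert>

// Speedup: non-threaded time divided by threaded time.
double speedup(double serial_seconds, double threaded_seconds) {
    assert(serial_seconds > 0.0 && threaded_seconds > 0.0);
    return serial_seconds / threaded_seconds;
}

// Parallel efficiency: speedup per thread; 1.0 would be ideal scaling.
double parallel_efficiency(double serial_seconds, double threaded_seconds,
                           int threads) {
    assert(threads > 0);
    return speedup(serial_seconds, threaded_seconds) / threads;
}
```

With the Linux 2020.0.4 timings (14.52 s and 13.79 s) and an assumed 2 threads, the speedup is about 1.05 and the efficiency about 0.53. For comparison, the 2019.0.4 cluster timings from the first post (23.42 s and 1.89 s on 16 threads) give a speedup of about 12.4 and an efficiency of about 0.77.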
Hi John,
The issue will be addressed in the upcoming release of oneMKL, 2021.3.
Khang
Khang,
That is good news. Thanks for letting us know.
John
Hi,
The oneMKL 2021.3 version is now available to download. Can you please try on the latest version and let us know if the issue still persists?
Regards
Rajesh.
Hi,
Can you please provide an update regarding the issue?
Regards
Rajesh.
Hi Rajesh,
We have put in a request, but we are still waiting on our system administrators to install the latest Intel libraries. As soon as the libraries are installed, I'll verify the issue is fixed.
Best,
John
Rajesh,
The issue seems to be fixed in 2021.3. Thanks for your help.
Best,
John
Hi,
Thanks for the confirmation!
As this issue has been resolved, we will no longer respond to this thread. If you require any additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.
Have a good day.
Regards
Rajesh
Here are the final timings (in seconds) with the fixed code:
cblas_dscal
MKL Version : 2018.0.3 2019.0.4 2020.0.4 2021.1 2021.3
Non-Threaded: 19.59 22.33 17.84 18.38 17.75
Threaded : 1.60 2.09 37.56 37.97 1.52
cblas_dcopy
MKL Version : 2018.0.3 2019.0.4 2020.0.4 2021.1 2021.3
Non-Threaded: 20.55 24.77 23.26 22.82 22.75
Threaded : 1.81 2.27 42.54 42.22 1.91
