After investigating a performance regression on our side after updating MKL from version
2022.0-Product Build 20211112
to version
2023.1-Product Build 2023030
we discovered the following:
When running ~gelss in an omp region with mkl_dynamic turned on, MKL is expected to dynamically adjust its threading to fill the machine's physical capabilities. For example, on a 24-core machine, an omp loop of 12 threads should let mkl_dynamic switch to mkl_get_max_threads=2 inside the region. However, for certain matrix sizes, ~gelss exits the region with mkl_get_max_threads still set to 2. Because mkl_dynamic is on, all subsequent calls stay stuck running with 2 threads, causing an immense performance loss.
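The pattern described above can be sketched roughly as follows. This is a hedged illustration, not the attached reproducer: the zgelss arguments are elided, mkl_dynamic is assumed to be at its default (on), and the thread counts assume the 24-core / 12-thread example:

```fortran
! Sketch only: zgelss arguments elided, mkl_dynamic assumed on (default).
program mkl_dynamic_sketch
  use omp_lib
  implicit none
  integer, external :: mkl_get_max_threads

  call omp_set_num_threads(12)   ! 12 omp threads on a 24-core machine
!$omp parallel
  ! call zgelss(...)             ! mkl_dynamic lowers MKL threading here
!$omp end parallel

  ! Expected after the region: mkl_get_max_threads() restored to 24.
  ! Observed for some matrix sizes: still stuck at the lowered value.
  print *, 'mkl_get_max_threads = ', mkl_get_max_threads()
  print *, 'omp_get_max_threads = ', omp_get_max_threads()
end program mkl_dynamic_sketch
```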
My personal guess is that the new MKL version added a path in gelss that returns early, skipping the restore of the thread count from 2 back to 24 that mkl_dynamic is expected to perform. The bug does not reproduce for all matrix sizes, which also suggests a specific decision path in gelss causes the issue, likely a performance improvement that only triggers for specific matrices.
I created and attached a reproducing case that illustrates the issue with both a failing and a succeeding matrix size. In case it is relevant, the seed was 2 on my machine to reproduce the exact matrices, but from my tests it is mainly the matrix size that seemed relevant.
For those struggling with the same bug: since mkl_set_num_threads is unresponsive and incapable of restoring the bugged mkl_dynamic state to a higher value, the following "repair" after calling MKL's ~gelss allows your program to continue with proper MKL multithreading:
! Issue
!$omp parallel
call zgelss(...)
!$omp end parallel

! Bandaid fix
!$omp parallel
dummy = mkl_set_num_threads_local(0)
!$omp end parallel
call mkl_set_num_threads(X)
As per the documentation of mkl_set_num_threads_local, calling it with a value of 0 resets the calling omp thread's MKL threading settings. That seems to be sufficient to re-enable mkl_set_num_threads, restoring the thread count to a higher value X for the rest of the program.
Edit: Clarified code to underline that the issue occurs with a parallel region containing ~gelss and not ~gelss on its own.
I set up my environment by running:
"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 vs2019 --config="config.txt"
with a config.txt containing:
intelpython=exclude
compiler=2023.1.0
mkl=2023.1.0
mpi=2021.9.0
This triggers the following on my machine:
:: NOTICE: Exclude flag found for "intelpython" component. The "intelpython" env\vars.bat script will not be processed by "setvars.bat".
:: initializing oneAPI environment...
Initializing Visual Studio command-line environment...
Visual Studio version 16.11.7 environment configured.
"c:\apps\MVS16117\"
Visual Studio command-line environment initialized for: 'x64'
:  advisor -- latest
:  compiler -- 2023.1.0
:  inspector -- latest
:  mkl -- 2023.1.0
:  mpi -- 2021.9.0
:  vtune -- latest
:: oneAPI environment initialized ::
I'm assuming you are missing the Visual Studio component.
In that cmd window, after setting the environment, the bat file reproduces the issue on my side.
Could you let us know your machine details, as well as the results you are getting for mkl_get_max_threads() and omp_get_max_threads() after calling proofOfConceptPR8690852(), so we can understand more from our end?
Thanks & Regards,
I can reproduce this on multiple machines. In fact, I have yet to find a machine that does not reproduce.
- One is a Windows machine.
- Two other machines are 12-core VMs whose specific specs I am not allowed to share, but one runs Linux and one runs Windows.
As for the behavior of proofOfConceptPR8690852, that depends on how you use it, as illustrated below:
- The matrix size of 10 fails the mkl_get_max_threads().eq.omp_get_max_threads() check:
- mkl_get_max_threads() is 1 and omp_get_max_threads() is 24 on the Windows machine above
- The matrix size of 200 passes the mkl_get_max_threads().eq.omp_get_max_threads() check:
- mkl_get_max_threads() is 24 and omp_get_max_threads() is 24 on the Windows machine above
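For reference, the pass/fail check referred to above amounts to a simple comparison after the parallel region containing the ~gelss call. This is a sketch of the idea, not the exact code from the reproducer:

```fortran
! Sketch of the check: compare MKL's and OpenMP's reported maximum
! thread counts after the parallel region has ended.
if (mkl_get_max_threads() .eq. omp_get_max_threads()) then
   print *, 'PASS: MKL threading was restored'
else
   print *, 'FAIL: MKL stuck at ', mkl_get_max_threads(), ' threads'
end if
```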
Maybe it only reproduces under a combination of matrix size and number of threads? Maybe only under a specific instruction set (AVX/...)?
Thanks for the reply.
Yes, the issue seems to be specific to dgelss (LAPACK?) up to Intel MKL version 2023.1. Also, the dgemm routine doesn't trigger this behavior.
Could you please try using the latest version of Intel MKL 2023.2 and let us know if you are observing the same behavior?
Thanks & Regards,
This reproduces on MKL 2023.2, yes:
I'm sorry, but I get the impression that you are still not actually reproducing this on your end and that I am getting rather generic troubleshooting suggestions.
I quickly asked a group of coworkers, all with different generations of machines, to recompile and reproduce, and each and every one of them managed to reproduce. That puts us at 100% of machines reproducing this issue. It only takes two cmd lines to reproduce (one of which simply calls the .bat I provided), so I really struggle to understand where the disconnect in our communication comes from.
Sorry for the inconvenience caused to you.
We are able to reproduce your issue. We are working on your issue internally and we will get back to you soon with an update.
Thanks & Regards,
Thanks for your patience, and apologies for the delay in the response.
As discussed internally, we regret to say that we were unable to reproduce this issue in the latest version: it occurs only in Intel MKL version 2023.1 and was resolved in the latest version, Intel MKL 2023.2.
Could you please let us know if you have any other queries?
Thanks & Regards,