Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Performance drop after upgrading from MKL 2017 to MKL 2025

hatdal
Beginner

I recently upgraded a multithreaded computational library to use MKL 2025.0 instead of version 2017.3.210. Unfortunately, I am seeing a notable performance drop in overall computation. We use many BLAS and VML functions on large float arrays.

We run on Windows Server, on both AVX-512 and AVX2 Intel Xeon processors. The processors have 2 NUMA nodes. We usually create as many threads as there are cores (we have tried running on a single NUMA node and across both by adjusting thread affinity). Please note we use the sequential MKL, linking against the sequential DLL.

I cannot find the change in the MKL release notes that might cause this behavior change. Could you please advise?

13 Replies
Fengrui
Moderator

Could you please share sample code that reproduces the performance drop?

hatdal
Beginner

Hi Fengrui,

Our library is a huge C++ library that does a lot of market-finance computation (prices, indicators, ...) across many steps. Honestly, I have not yet found exactly where the performance drop occurs, even using Intel tools; it seems spread over all computation steps. However, I noticed that performance improves when I decrease the number of threads. Of course, I do not call mkl_set_num_threads or any other MKL service routine, since I use the sequential MKL, and I see in the MKL traces that the instruction sets (AVX, ...) are detected properly.

The only difference I noticed between the 2017 and 2025 versions is that for VML functions, MKL 2017 loads mkl_vml_def.dll whereas MKL 2025 loads mkl_vml_avx512.dll (I am on an AVX-512 processor). One last thing: we do a lot of 128-byte-aligned allocation/deallocation using mkl_malloc. All ideas for investigation are welcome. Thanks

Fengrui
Moderator

I would recommend turning on oneMKL's verbose mode, i.e. running with the environment variable MKL_VERBOSE=1, to see whether there is a noticeable performance drop in the BLAS functions. VML functions do not support it, though; for those it might be necessary to create test code with real-case data.

Yang76
Beginner

I've had a similar experience myself. In my tests, I saved 100 matrices of size 500x500 and measured how long it took to invert them (calling dgetrf followed by dgetri). The older version (2015) consistently completed in about 1 second, while the newer version (2025) was slower and less consistent, with times varying between 1 and 3 seconds.

Yang76
Beginner

Even more concerning is the total processor time, which was just 2 seconds in the old version but almost always exceeds 10 seconds in the new version.

Yang76
Beginner

I found that if I disable OMP multithreading by setting the environment variable OMP_NUM_THREADS to 1, the new version becomes consistent and performs slightly better than the old version.

Now, the question is: how can I determine when to enable or disable OMP multithreading?

AndrewC2
Beginner

You need to run any benchmarks twice and only use the timing of the second run. There is significant overhead in setting up OpenMP threading the first time.

Yang76
Beginner

This is exactly what I did. All the results I reported are from the second run, after inverting some random matrices.

 

After some experiments, and with some help from ChatGPT, I think I have a better understanding of the situation. There are two main factors at play here:

1. Matrix size. A 500 x 500 matrix is actually quite small to fully exploit parallelization.

2. Default number of threads. 

 

I am on a virtual machine with 4 sockets/8 virtual processors. The new version uses 8 threads by default (as confirmed by mkl verbose mode), whereas the old version seems to use only 2 threads (inferred from the ratio of processor time to clock time). Using fewer threads actually helps when the matrices are small.

 

When I set the number of threads explicitly, or run the same experiment on a physical computer with 1 socket/10 cores, the new version performs consistently and is slightly better.

AndrewC2
Beginner

OK, well that makes sense. VMs are tricky beasts to run benchmarks on.
So the summary is that you did not find any performance regressions with the new version of MKL, which agrees with my experience as well.

hatdal
Beginner

Hi all, this topic concerns the sequential MKL: my application is multithreaded by itself and does not use MKL's OpenMP threading. I manage the threads myself.

In this case, disabling OMP multithreading probably has no effect. But maybe I'm wrong.

I used Intel VTune to compare against the old MKL and try to find where the performance regressions are, but they seem spread across all calculations.

Maybe there is also an issue with mkl_malloc, since it locks internally.

One other thing: I have hyperthreading enabled; could that have such a huge impact on performance?

@Fengrui, do you have any advice?

As a reminder, the only change I made to my computational library was moving from MKL 2017 to MKL 2025.

Regards

Yang76
Beginner

I believe MKL is multithreaded by default. For MKL 2025, you can set the MKL_VERBOSE environment variable to 1 and see the detailed info in the console output (this feature is not available in MKL 2017).

hatdal
Beginner

I link my C++ software with the sequential MKL, which means MKL itself is not threaded.
