
Performance drop in BLAS dot product function in MKL 2025

mahalex — Wed, 23 Jul 2025 06:39:25 GMT

Hi!

I noticed a performance drop in cblas_ddot. The program basically just calls cblas_ddot for arrays of size 4095 and 4096 (100 000 times in a loop). I measure the calculation time and compare it to the naive implementation which looks like this:

double naive_ddot(int n, double *x, double *y)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
    {
        sum += x[i] * y[i];
    }
    return sum;
}

The program outputs the following:

MKL, n = 4095, it took 45.2412 ms.
MKL, n = 4096, it took 1030.23 ms.
Naive, n = 4095, it took 153.349 ms.
Naive, n = 4096, it took 155.955 ms.

As you can see, when the size is increased from 4095 to 4096, the MKL version becomes more than 20 times slower (about 45 ms vs. 1030 ms). I think this is the size at which it starts using a threading library. These results were obtained with the "Intel threading" layer (which is the default one). If I run

mkl_set_threading_layer(MKL_THREADING_TBB);

beforehand, the results look more reasonable:

MKL, n = 4095, it took 45.232 ms.
MKL, n = 4096, it took 121.682 ms.
Naive, n = 4095, it took 155.876 ms.
Naive, n = 4096, it took 154.11 ms.

(there's still a significant drop, but not 20x).
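
One way to check the guess that the slowdown at n = 4096 comes from threading is to pin MKL to a single thread for the call and time it again. The following is only a minimal sketch: it assumes mkl_set_num_threads_local from MKL's threading-control functions is available in the installed version, ddot_one_thread is a hypothetical helper name, and the plain loop is just a fallback so that the snippet compiles on a machine without MKL.

```cpp
#if defined(__has_include)
#  if __has_include("mkl.h")
#    include "mkl.h"
#    define DDOT_HAVE_MKL 1
#  endif
#endif
#ifndef DDOT_HAVE_MKL
#  define DDOT_HAVE_MKL 0
#endif

// Hypothetical helper: dot product with MKL limited to one thread
// for the duration of the call on the calling thread.
double ddot_one_thread(int n, const double *x, const double *y)
{
#if DDOT_HAVE_MKL
    // Returns the previous thread-local setting (0 means "use the global setting").
    int previous = mkl_set_num_threads_local(1);
    double result = cblas_ddot(n, x, 1, y, 1);
    mkl_set_num_threads_local(previous);  // restore the caller's setting
    return result;
#else
    // Fallback, used only when mkl.h is not on the include path.
    double sum = 0.0;
    for (int i = 0; i < n; i++)
    {
        sum += x[i] * y[i];
    }
    return sum;
#endif
}
```

If the n = 4096 timing with this helper is close to the n = 4095 timing, that would point to thread dispatch as the cause. Setting the MKL_NUM_THREADS environment variable to 1 before running the original program is a quicker check that needs no recompilation.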

I did all of this on the latest available MKL (2025.2.0), Windows 10; the CPU is 13th Gen Intel(R) Core(TM) i7-13850HX 2.10 GHz.

I'm not allowed to attach files anymore, so here's the full text of the .cpp file:

#include <iostream>
#include <iomanip>
#include <chrono>
#include <cstdlib>  // malloc, free
#include "mkl.h"

// Reference implementation: plain loop, no explicit vectorization or threading.
double naive_ddot(int n, double *x, double *y)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
    {
        sum += x[i] * y[i];
    }
    return sum;
}

// Times 100 000 calls to cblas_ddot and prints the elapsed time in ms.
double do_mkl(int n, double *x, double *y)
{
    auto startTime = std::chrono::steady_clock::now();
    double sum = 0.0;
    for (int t = 0; t < 100000; t++)
    {
        double result = cblas_ddot(n, x, 1, y, 1);
        sum += result;
    }
    auto endTime = std::chrono::steady_clock::now();
    auto duration = std::chrono::duration<double>(endTime - startTime);
    std::cout << "MKL, n = " << n << ", it took " << duration.count() * 1000 << " ms." << std::endl;
    return sum;
}

// Times 100 000 calls to naive_ddot and prints the elapsed time in ms.
double do_naive(int n, double *x, double *y)
{
    auto startTime = std::chrono::steady_clock::now();
    double sum = 0.0;
    for (int t = 0; t < 100000; t++)
    {
        double result = naive_ddot(n, x, y);
        sum += result;
    }
    auto endTime = std::chrono::steady_clock::now();
    auto duration = std::chrono::duration<double>(endTime - startTime);
    std::cout << "Naive, n = " << n << ", it took " << duration.count() * 1000 << " ms." << std::endl;
    return sum;
}

int main()
{
    //mkl_set_threading_layer(MKL_THREADING_TBB);
    int n = 4096;
    double *x = (double *)malloc(n * sizeof(double));
    double *y = (double *)malloc(n * sizeof(double));
    for (int i = 0; i < n; i++)
    {
        x[i] = 1.0 + i * 0.1;
        y[i] = 3.0 - i * 0.1;
    }

    // The results are accumulated and printed so the compiler cannot drop the loops.
    double sum = 0.0;
    sum += do_mkl(n - 1, x, y);
    sum += do_mkl(n, x, y);
    sum += do_naive(n - 1, x, y);
    sum += do_naive(n, x, y);
    std::cout << "Total sum is " << sum << std::endl;

    free(x);
    free(y);
}

Re: Performance drop in BLAS dot product function in MKL 2025

AndrewC2 — Fri, 25 Jul 2025 17:37:03 GMT

You need to be careful about this type of benchmark.

There is significant overhead the first time MKL (or any OpenMP code) needs to use multiple threads. You should run the MKL benchmarks twice before doing any timing to get all the threads into a ready state.

MKL, n = 4095, it took 94.3632 ms.
MKL, n = 4096, it took 254.286 ms.
Naive, n = 4095, it took 404.669 ms.
Naive, n = 4096, it took 396.192 ms.
MKL, n = 4095, it took 75.0881 ms.
MKL, n = 4096, it took 216.92 ms.
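
The warm-up advice above can be folded into the timing code itself. This is a minimal sketch, not the code used for the numbers above; time_after_warmup is a hypothetical helper name, and the number of warm-up calls is an assumption to be tuned.

```cpp
#include <chrono>

// Hypothetical helper: calls `work` `warmup` times without timing it,
// then returns the wall-clock time in milliseconds for `reps` timed calls.
template <typename F>
double time_after_warmup(F &&work, int warmup, int reps)
{
    // Untimed calls: give thread pools a chance to start before measuring.
    for (int i = 0; i < warmup; i++)
    {
        work();
    }
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; i++)
    {
        work();
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

For the benchmark in this thread, the callable would be something like a lambda that adds the result of cblas_ddot(n, x, 1, y, 1) to an accumulator, with reps set to 100000.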

Interestingly, I put #pragma omp parallel for around your loops and got the following timings.
You can see that MKL disables its internal use of threads (OpenMP disables nested parallelism by default), so in the second, warmed-up pair of MKL runs the 4096 vs. 4095 timing is basically identical.

MKL, n = 4095, it took 64.1032 ms.
MKL, n = 4096, it took 21.2253 ms.
Naive, n = 4095, it took 52.8654 ms.
Naive, n = 4096, it took 50.9075 ms.
MKL, n = 4095, it took 13.1487 ms.
MKL, n = 4096, it took 13.8093 ms.
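
The modified code is not shown in this reply, so the following is only a sketch of what wrapping the repeat loop might look like. It assumes the pragma went on the 100 000-iteration loop and that a reduction clause was used for the accumulator; dot_serial and repeated_dot_parallel are hypothetical names, and a plain loop stands in for cblas_ddot so the snippet builds without MKL.

```cpp
// Build with OpenMP enabled (e.g. -fopenmp or /openmp). Without it the
// pragma is ignored and the loop simply runs serially.

// Plain serial dot product, standing in for cblas_ddot.
double dot_serial(int n, const double *x, const double *y)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
    {
        sum += x[i] * y[i];
    }
    return sum;
}

// Repeats the dot product `reps` times, spreading the repetitions across threads.
double repeated_dot_parallel(int n, const double *x, const double *y, int reps)
{
    double sum = 0.0;
    // reduction(+ : sum) gives each thread a private partial sum and adds
    // the partial sums together at the end, avoiding a data race on `sum`.
    #pragma omp parallel for reduction(+ : sum)
    for (int t = 0; t < reps; t++)
    {
        sum += dot_serial(n, x, y);
    }
    return sum;
}
```

With this structure each call inside the loop already runs on one thread of an outer parallel region, which matches the observation that the 4095 and 4096 cases then behave the same.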

But it does look like MKL is invoking the use of OpenMP threads at an N that is probably too low.

Xeon(R) W-2145 CPU @ 3.70GHz, 3696 Mhz, 8 Core(s), 16 Logical Processor(s)

Re: Performance drop in BLAS dot product function in MKL 2025

mahalex — Fri, 01 Aug 2025 09:13:43 GMT

I totally agree with the "first time overhead" argument. However, I didn't notice any improvement when I ran it twice in a row. If I call do_mkl() and do_naive() twice with the same arguments, I get the following:

c:\git\mkltest\02 dot product>mkltest.exe
MKL, n = 4095, it took 50.1866 ms.
MKL, n = 4096, it took 1131.65 ms.
Naive, n = 4095, it took 154.97 ms.
Naive, n = 4096, it took 153.885 ms.
MKL, n = 4095, it took 41.0654 ms.
MKL, n = 4096, it took 1118.54 ms.
Naive, n = 4095, it took 155.749 ms.
Naive, n = 4096, it took 158.194 ms.
Total sum is -1.81766e+14

which means no significant changes in the running time.