Intel® oneAPI Math Kernel Library

Performance drop in BLAS dot product function in MKL 2025

mahalex
Novice

Hi!

I noticed a performance drop in cblas_ddot. The program basically just calls cblas_ddot on arrays of size 4095 and 4096 (100,000 times in a loop). I measure the calculation time and compare it against a naive implementation that looks like this:

double naive_ddot(int n, double *x, double *y) {
	double sum = 0.0;
	for (int i = 0; i < n; i++) {
		sum += x[i] * y[i];
	}
	return sum;
}

The program outputs the following:

MKL, n = 4095, it took 45.2412 ms.
MKL, n = 4096, it took 1030.23 ms.
Naive, n = 4095, it took 153.349 ms.
Naive, n = 4096, it took 155.955 ms.

As you can see, when the size is increased from 4095 to 4096, the MKL version becomes about 20 times slower. I think this is the size at which MKL starts using its threading layer. These results are with "Intel threading" (the default). If I call

mkl_set_threading_layer(MKL_THREADING_TBB);

 beforehand, the results look more reasonable:

MKL, n = 4095, it took 45.232 ms.
MKL, n = 4096, it took 121.682 ms.
Naive, n = 4095, it took 155.876 ms.
Naive, n = 4096, it took 154.11 ms.

(there's still a significant drop, but not 20x).
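Another knob I could try (I haven't measured this variant, it's just a sketch) is to keep the default threading layer but cap the MKL thread count with the standard service function:

#include "mkl.h"

int main() {
	// Keep the default (Intel OpenMP) threading layer, but cap MKL at one
	// thread so small dot products are not handed off to the thread pool.
	mkl_set_num_threads(1);
	// ... benchmark code as in the full listing below ...
	return 0;
}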

I did all of this on the latest available MKL (2025.2.0), Windows 10; the CPU is 13th Gen Intel(R) Core(TM) i7-13850HX 2.10 GHz.

I'm not allowed to attach files anymore, so here's the full text of the .cpp file:

#include <iostream>
#include <iomanip>
#include <chrono>
#include <cstdlib> // malloc/free
#include "mkl.h"

double naive_ddot(int n, double *x, double *y) {
	double sum = 0.0;
	for (int i = 0; i < n; i++) {
		sum += x[i] * y[i];
	}

	return sum;
}

// Time 100,000 calls to cblas_ddot and report the total in milliseconds.
double do_mkl(int n, double *x, double *y) {
	auto startTime = std::chrono::steady_clock::now();
	double sum = 0.0;
	for (int t = 0; t < 100000; t++) {
		double result = cblas_ddot(n, x, 1, y, 1);
		sum += result;
	}

	auto endTime = std::chrono::steady_clock::now();
	auto duration = std::chrono::duration<double>(endTime - startTime);
	std::cout << "MKL, n = " << n << ", it took " << duration.count() * 1000 << " ms." << std::endl;
	return sum;
}

// Time 100,000 calls to naive_ddot and report the total in milliseconds.
double do_naive(int n, double *x, double *y) {
	auto startTime = std::chrono::steady_clock::now();
	double sum = 0.0;
	for (int t = 0; t < 100000; t++) {
		double result = naive_ddot(n, x, y);
		sum += result;
	}

	auto endTime = std::chrono::steady_clock::now();
	auto duration = std::chrono::duration<double>(endTime - startTime);
	std::cout << "Naive, n = " << n << ", it took " << duration.count() * 1000 << " ms." << std::endl;
	return sum;
}

int main() {
	//mkl_set_threading_layer(MKL_THREADING_TBB);
	int n = 4096;
	double *x = (double *)malloc(n * sizeof(double));
	double *y = (double *)malloc(n * sizeof(double));
	for (int i = 0; i < n; i++) {
		x[i] = 1.0 + i * 0.1;
		y[i] = 3.0 - i * 0.1;
	}

	double sum = 0.0;
	sum += do_mkl(n - 1, x, y);
	sum += do_mkl(n, x, y);
	sum += do_naive(n - 1, x, y);
	sum += do_naive(n, x, y);

	std::cout << "Total sum is " << sum << std::endl;

	free(x);
	free(y);
}

 

2 Replies
AndrewC2
Beginner

You need to be careful about this type of benchmark.

There is significant overhead the first time MKL (or any OpenMP code) needs to use multiple threads. You should run the MKL benchmark twice and only look at the second pass, so that all the threads are already in a ready state. On my machine, running the MKL part a second time gives:

MKL, n = 4095, it took 94.3632 ms.
MKL, n = 4096, it took 254.286 ms.
Naive, n = 4095, it took 404.669 ms.
Naive, n = 4096, it took 396.192 ms.
MKL, n = 4095, it took 75.0881 ms.
MKL, n = 4096, it took 216.92 ms.
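
A rough sketch of that warm-up change in main(), reusing the do_mkl/do_naive functions from your post (untested as written, just to show the idea):

	// Warm-up pass: the first threaded MKL call pays the one-time cost of
	// creating the OpenMP thread pool, so run it once and ignore the timing.
	do_mkl(n - 1, x, y);
	do_mkl(n, x, y);

	// Timed pass: these runs should now reflect steady-state performance.
	double sum = 0.0;
	sum += do_mkl(n - 1, x, y);
	sum += do_mkl(n, x, y);
	sum += do_naive(n - 1, x, y);
	sum += do_naive(n, x, y);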

Interestingly, I put #pragma omp parallel for around your timing loops (a rough sketch of the change is at the end of this post) and got the following timings.
You can see that MKL then disables its internal threading (OpenMP disables nested parallelism by default), so the 4096 vs. 4095 timings are essentially identical.

MKL, n = 4095, it took 64.1032 ms.
MKL, n = 4096, it took 21.2253 ms.
Naive, n = 4095, it took 52.8654 ms.
Naive, n = 4096, it took 50.9075 ms.
MKL, n = 4095, it took 13.1487 ms.
MKL, n = 4096, it took 13.8093 ms.

But it does look like MKL starts using OpenMP threads at an N that is probably too low.

Xeon(R) W-2145 CPU @ 3.70 GHz, 3696 MHz, 8 Core(s), 16 Logical Processor(s)
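
For reference, the parallelized MKL timing loop looked roughly like this (a drop-in variant of do_mkl() from your file; the name do_mkl_omp is just for illustration, and the file has to be compiled with OpenMP enabled, e.g. /openmp or -fopenmp):

// Spread the 100,000-iteration benchmark loop across OpenMP threads; the
// reduction clause keeps the accumulated sum correct. Inside an active
// parallel region MKL falls back to its sequential kernels, which is why the
// 4095 and 4096 timings end up nearly identical.
double do_mkl_omp(int n, double *x, double *y) {
	auto startTime = std::chrono::steady_clock::now();
	double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
	for (int t = 0; t < 100000; t++) {
		sum += cblas_ddot(n, x, 1, y, 1);
	}
	auto endTime = std::chrono::steady_clock::now();
	auto duration = std::chrono::duration<double>(endTime - startTime);
	std::cout << "MKL, n = " << n << ", it took " << duration.count() * 1000 << " ms." << std::endl;
	return sum;
}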



mahalex
Novice

I totally agree with the "first-time overhead" argument. However, I don't see any improvement when I run it twice in a row. If I call do_mkl() and do_naive() twice with the same arguments, I get the following:

c:\git\mkltest\02 dot product>mkltest.exe
MKL, n = 4095, it took 50.1866 ms.
MKL, n = 4096, it took 1131.65 ms.
Naive, n = 4095, it took 154.97 ms.
Naive, n = 4096, it took 153.885 ms.
MKL, n = 4095, it took 41.0654 ms.
MKL, n = 4096, it took 1118.54 ms.
Naive, n = 4095, it took 155.749 ms.
Naive, n = 4096, it took 158.194 ms.
Total sum is -1.81766e+14

which means there is no significant change in the running time.
