Intel® oneAPI Math Kernel Library

Performance drop in BLAS dot product function in MKL 2025

mahalex
Novice

Hi!

I noticed a performance drop in cblas_ddot. The program basically just calls cblas_ddot on arrays of size 4095 and 4096 (100,000 times in a loop). I measure the calculation time and compare it against a naive implementation that looks like this:

double naive_ddot(int n, double *x, double *y) {
	double sum = 0.0;
	for (int i = 0; i < n; i++) {
		sum += x[i] * y[i];
	}
	return sum;
}

The program outputs the following:

MKL, n = 4095, it took 45.2412 ms.
MKL, n = 4096, it took 1030.23 ms.
Naive, n = 4095, it took 153.349 ms.
Naive, n = 4096, it took 155.955 ms.

As you can see, when the size is increased from 4095 to 4096, the MKL version becomes about 20 times slower. I think this is the size at which MKL starts using its threading layer. These results are with "Intel threading" (the default). If I call

mkl_set_threading_layer(MKL_THREADING_TBB);

 beforehand, the results look more reasonable:

MKL, n = 4095, it took 45.232 ms.
MKL, n = 4096, it took 121.682 ms.
Naive, n = 4095, it took 155.876 ms.
Naive, n = 4096, it took 154.11 ms.

(there's still a significant drop, but not 20x).
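Another knob I could try (I haven't measured this variant, it's just a sketch) is to keep the default threading layer but cap the MKL thread count with the standard service function:

#include "mkl.h"

int main() {
	// Keep the default (Intel OpenMP) threading layer, but cap MKL at one
	// thread so small dot products are not handed off to the thread pool.
	mkl_set_num_threads(1);
	// ... benchmark code as in the full listing below ...
	return 0;
}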

I did all of this on the latest available MKL (2025.2.0), Windows 10; the CPU is 13th Gen Intel(R) Core(TM) i7-13850HX 2.10 GHz.

I'm not allowed to attach files anymore, so here's the full text of the .cpp file:

#include <iostream>
#include <iomanip>
#include <chrono>
#include <cstdlib> // malloc/free
#include "mkl.h"

double naive_ddot(int n, double *x, double *y) {
	double sum = 0.0;
	for (int i = 0; i < n; i++) {
		sum += x[i] * y[i];
	}

	return sum;
}

// Time 100,000 calls to cblas_ddot and report the total in milliseconds.
double do_mkl(int n, double *x, double *y) {
	auto startTime = std::chrono::steady_clock::now();
	double sum = 0.0;
	for (int t = 0; t < 100000; t++) {
		double result = cblas_ddot(n, x, 1, y, 1);
		sum += result;
	}

	auto endTime = std::chrono::steady_clock::now();
	auto duration = std::chrono::duration<double>(endTime - startTime);
	std::cout << "MKL, n = " << n << ", it took " << duration.count() * 1000 << " ms." << std::endl;
	return sum;
}

// Time 100,000 calls to naive_ddot and report the total in milliseconds.
double do_naive(int n, double *x, double *y) {
	auto startTime = std::chrono::steady_clock::now();
	double sum = 0.0;
	for (int t = 0; t < 100000; t++) {
		double result = naive_ddot(n, x, y);
		sum += result;
	}

	auto endTime = std::chrono::steady_clock::now();
	auto duration = std::chrono::duration<double>(endTime - startTime);
	std::cout << "Naive, n = " << n << ", it took " << duration.count() * 1000 << " ms." << std::endl;
	return sum;
}

int main() {
	//mkl_set_threading_layer(MKL_THREADING_TBB);
	int n = 4096;
	double *x = (double *)malloc(n * sizeof(double));
	double *y = (double *)malloc(n * sizeof(double));
	for (int i = 0; i < n; i++) {
		x[i] = 1.0 + i * 0.1;
		y[i] = 3.0 - i * 0.1;
	}

	double sum = 0.0;
	sum += do_mkl(n - 1, x, y);
	sum += do_mkl(n, x, y);
	sum += do_naive(n - 1, x, y);
	sum += do_naive(n, x, y);

	std::cout << "Total sum is " << sum << std::endl;

	free(x);
	free(y);
}

 

2 Replies
AndrewC2
Beginner

You need to be careful about this type of benchmark.

There is significant overhead the first time MKL (or any OpenMP code) needs to use multiple threads. You should run the MKL benchmark twice and only look at the second pass, so that all the threads are already in a ready state. On my machine, running the MKL part a second time gives:

MKL, n = 4095, it took 94.3632 ms.
MKL, n = 4096, it took 254.286 ms.
Naive, n = 4095, it took 404.669 ms.
Naive, n = 4096, it took 396.192 ms.
MKL, n = 4095, it took 75.0881 ms.
MKL, n = 4096, it took 216.92 ms.
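
A rough sketch of that warm-up change in main(), reusing the do_mkl/do_naive functions from your post (untested as written, just to show the idea):

	// Warm-up pass: the first threaded MKL call pays the one-time cost of
	// creating the OpenMP thread pool, so run it once and ignore the timing.
	do_mkl(n - 1, x, y);
	do_mkl(n, x, y);

	// Timed pass: these runs should now reflect steady-state performance.
	double sum = 0.0;
	sum += do_mkl(n - 1, x, y);
	sum += do_mkl(n, x, y);
	sum += do_naive(n - 1, x, y);
	sum += do_naive(n, x, y);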

Interestingly, I put #pragma omp parallel for around your timing loops (a rough sketch of the change is at the end of this post) and got the following timings.
You can see that MKL then disables its internal threading (OpenMP disables nested parallelism by default), so the 4096 vs. 4095 timings are essentially identical.

MKL, n = 4095, it took 64.1032 ms.
MKL, n = 4096, it took 21.2253 ms.
Naive, n = 4095, it took 52.8654 ms.
Naive, n = 4096, it took 50.9075 ms.
MKL, n = 4095, it took 13.1487 ms.
MKL, n = 4096, it took 13.8093 ms.

But it does look like MKL starts using OpenMP threads at an N that is probably too low.

Xeon(R) W-2145 CPU @ 3.70 GHz, 3696 MHz, 8 Core(s), 16 Logical Processor(s)
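
For reference, the parallelized MKL timing loop looked roughly like this (a drop-in variant of do_mkl() from your file; the name do_mkl_omp is just for illustration, and the file has to be compiled with OpenMP enabled, e.g. /openmp or -fopenmp):

// Spread the 100,000-iteration benchmark loop across OpenMP threads; the
// reduction clause keeps the accumulated sum correct. Inside an active
// parallel region MKL falls back to its sequential kernels, which is why the
// 4095 and 4096 timings end up nearly identical.
double do_mkl_omp(int n, double *x, double *y) {
	auto startTime = std::chrono::steady_clock::now();
	double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
	for (int t = 0; t < 100000; t++) {
		sum += cblas_ddot(n, x, 1, y, 1);
	}
	auto endTime = std::chrono::steady_clock::now();
	auto duration = std::chrono::duration<double>(endTime - startTime);
	std::cout << "MKL, n = " << n << ", it took " << duration.count() * 1000 << " ms." << std::endl;
	return sum;
}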



mahalex
Novice

I totally agree with the "first-time overhead" argument. However, I don't see any improvement when I run it twice in a row. If I call do_mkl() and do_naive() twice with the same arguments, I get the following:

c:\git\mkltest\02 dot product>mkltest.exe
MKL, n = 4095, it took 50.1866 ms.
MKL, n = 4096, it took 1131.65 ms.
Naive, n = 4095, it took 154.97 ms.
Naive, n = 4096, it took 153.885 ms.
MKL, n = 4095, it took 41.0654 ms.
MKL, n = 4096, it took 1118.54 ms.
Naive, n = 4095, it took 155.749 ms.
Naive, n = 4096, it took 158.194 ms.
Total sum is -1.81766e+14

which means there is no significant change in the running time.
