Hi!
I noticed a performance drop in cblas_ddot. The program basically just calls cblas_ddot for arrays of size 4095 and 4096 (100 000 times in a loop). I measure the calculation time and compare it to a naive implementation, which looks like this:
double naive_ddot(int n, double *x, double *y) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}
The program outputs the following:
MKL, n = 4095, it took 45.2412 ms.
MKL, n = 4096, it took 1030.23 ms.
Naive, n = 4095, it took 153.349 ms.
Naive, n = 4096, it took 155.955 ms.
As you can see, when the size is increased from 4095 to 4096, the MKL version becomes about 20 times slower. I suspect this is the size at which it starts using a threading library. These results were obtained with the "Intel threading" layer (the default one). If I call
mkl_set_threading_layer(MKL_THREADING_TBB);
beforehand, the results look more reasonable:
MKL, n = 4095, it took 45.232 ms.
MKL, n = 4096, it took 121.682 ms.
Naive, n = 4095, it took 155.876 ms.
Naive, n = 4096, it took 154.11 ms.
(there's still a significant slowdown at n = 4096, but not 20x).
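If the threading layer really is the cause, forcing the sequential layer (or limiting MKL to one thread) should make the difference disappear completely. I haven't run these variants yet; this is just a sketch of the calls I mean:
// Sketch only (not part of my test program below): take threading out of the
// picture before the benchmark starts.
#include "mkl.h"
int main() {
    mkl_set_threading_layer(MKL_THREADING_SEQUENTIAL);  // sequential layer, or:
    // mkl_set_num_threads(1);                          // default layer, one thread
    // ... then run the do_mkl()/do_naive() benchmark exactly as in the full program below.
}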
I did all of this with the latest available MKL (2025.2.0) on Windows 10; the CPU is a 13th Gen Intel(R) Core(TM) i7-13850HX @ 2.10 GHz.
I'm not allowed to attach files anymore, so here's the full text of the .cpp file:
#include <iostream>
#include <iomanip>
#include <chrono>
#include <cstdlib>
#include "mkl.h"

double naive_ddot(int n, double *x, double *y) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}

double do_mkl(int n, double *x, double *y) {
    auto startTime = std::chrono::steady_clock::now();
    double sum = 0.0;
    for (int t = 0; t < 100000; t++) {
        double result = cblas_ddot(n, x, 1, y, 1);
        sum += result;
    }
    auto endTime = std::chrono::steady_clock::now();
    auto duration = std::chrono::duration<double>(endTime - startTime);
    std::cout << "MKL, n = " << n << ", it took " << duration.count() * 1000 << " ms." << std::endl;
    return sum;
}

double do_naive(int n, double *x, double *y) {
    auto startTime = std::chrono::steady_clock::now();
    double sum = 0.0;
    for (int t = 0; t < 100000; t++) {
        double result = naive_ddot(n, x, y);
        sum += result;
    }
    auto endTime = std::chrono::steady_clock::now();
    auto duration = std::chrono::duration<double>(endTime - startTime);
    std::cout << "Naive, n = " << n << ", it took " << duration.count() * 1000 << " ms." << std::endl;
    return sum;
}

int main() {
    //mkl_set_threading_layer(MKL_THREADING_TBB);
    int n = 4096;
    double *x = (double *)malloc(n * sizeof(double));
    double *y = (double *)malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) {
        x[i] = 1.0 + i * 0.1;
        y[i] = 3.0 - i * 0.1;
    }
    double sum = 0.0;
    sum += do_mkl(n - 1, x, y);
    sum += do_mkl(n, x, y);
    sum += do_naive(n - 1, x, y);
    sum += do_naive(n, x, y);
    std::cout << "Total sum is " << sum << std::endl;
    free(x);
    free(y);
}
You need to be careful about this type of benchmark.
There is significant overhead the first time MKL (or any OpenMP code) needs to use multiple threads. You should run the MKL benchmarks twice before doing any timing so that all the threads are in a ready state. Here are my timings, with the MKL benchmarks repeated at the end:
MKL, n = 4095, it took 94.3632 ms.
MKL, n = 4096, it took 254.286 ms.
Naive, n = 4095, it took 404.669 ms.
Naive, n = 4096, it took 396.192 ms.
MKL, n = 4095, it took 75.0881 ms.
MKL, n = 4096, it took 216.92 ms.
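Equivalently, a single untimed cblas_ddot call before the timed loops gets the thread pool created up front. Something like this (a sketch, using the n, x, y from your main()):
#include "mkl.h"
// Sketch: one untimed call so the threading runtime is already initialized
// before do_mkl() starts measuring.
static void warm_up_mkl(int n, double *x, double *y) {
    volatile double r = cblas_ddot(n, x, 1, y, 1);
    (void)r;  // keep the call from being optimized away
}
Call it once in main() right before the first do_mkl() call.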
Interestingly, I put #pragma omp parallel for around your timing loops and got the following timings.
You can see that MKL disables its internal use of threads (OpenMP disables nested parallelism by default), so the 4095 vs. 4096 timings are basically identical.
MKL, n = 4095, it took 64.1032 ms.
MKL, n = 4096, it took 21.2253 ms.
Naive, n = 4095, it took 52.8654 ms.
Naive, n = 4096, it took 50.9075 ms.
MKL, n = 4095, it took 13.1487 ms.
MKL, n = 4096, it took 13.8093 ms.
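For reference, the change amounts to something like this (a sketch; the reduction on sum is needed so the parallel loop still produces the same total), compiled with OpenMP enabled (/openmp or -fopenmp):
#include "mkl.h"
// Sketch of the parallelized MKL timing loop: the 100000 cblas_ddot calls are
// spread across the OpenMP threads and the partial sums are combined by reduction.
double do_mkl_parallel(int n, double *x, double *y) {
    double sum = 0.0;
#pragma omp parallel for reduction(+ : sum)
    for (int t = 0; t < 100000; t++) {
        sum += cblas_ddot(n, x, 1, y, 1);
    }
    return sum;
}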
But it does look like MKL starts invoking OpenMP threads at an N that is probably too low.
Xeon(R) W-2145 CPU @ 3.70GHz, 3696 MHz, 8 Core(s), 16 Logical Processor(s)
I totally agree with the "first time overhead" argument. However, I don't see any improvement when I run it twice in a row. If I call do_mkl() and do_naive() twice with the same arguments, I get the following:
c:\git\mkltest\02 dot product>mkltest.exe
MKL, n = 4095, it took 50.1866 ms.
MKL, n = 4096, it took 1131.65 ms.
Naive, n = 4095, it took 154.97 ms.
Naive, n = 4096, it took 153.885 ms.
MKL, n = 4095, it took 41.0654 ms.
MKL, n = 4096, it took 1118.54 ms.
Naive, n = 4095, it took 155.749 ms.
Naive, n = 4096, it took 158.194 ms.
Total sum is -1.81766e+14
so there is no significant change in the running times between the first and second pass.
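In case it helps narrow this down, here is the kind of check I can add to confirm which MKL build and thread count the benchmark actually sees (a sketch using the mkl_get_version_string() and mkl_get_max_threads() support functions):
#include <iostream>
#include "mkl.h"
// Sketch: print the MKL version string and the number of threads MKL intends
// to use, to rule out a configuration mix-up.
int main() {
    char version[256];
    mkl_get_version_string(version, sizeof(version));
    std::cout << version << std::endl;
    std::cout << "mkl_get_max_threads() = " << mkl_get_max_threads() << std::endl;
}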
