<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Performance drop in BLAS dot product function in MKL 2025 in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-drop-in-BLAS-dot-product-function-in-MKL-2025/m-p/1706905#M37281</link>
    <description>&lt;P&gt;I totally agree with the "first time overhead" argument. However, I didn't notice any improvement when I ran the benchmarks twice in a row. If I call do_mkl() and do_naive() twice with the same arguments, I get the following:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;c:\git\mkltest\02 dot product&amp;gt;mkltest.exe
MKL, n = 4095, it took 50.1866 ms.
MKL, n = 4096, it took 1131.65 ms.
Naive, n = 4095, it took 154.97 ms.
Naive, n = 4096, it took 153.885 ms.
MKL, n = 4095, it took 41.0654 ms.
MKL, n = 4096, it took 1118.54 ms.
Naive, n = 4095, it took 155.749 ms.
Naive, n = 4096, it took 158.194 ms.
Total sum is -1.81766e+14&lt;/LI-CODE&gt;&lt;P&gt;This shows no significant change in the running time.&lt;/P&gt;</description>
    <pubDate>Fri, 01 Aug 2025 09:13:43 GMT</pubDate>
    <dc:creator>mahalex</dc:creator>
    <dc:date>2025-08-01T09:13:43Z</dc:date>
    <item>
      <title>Performance drop in BLAS dot product function in MKL 2025</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-drop-in-BLAS-dot-product-function-in-MKL-2025/m-p/1704817#M37265</link>
      <description>&lt;P&gt;Hi!&lt;/P&gt;&lt;P&gt;I noticed a performance drop in&amp;nbsp;&lt;SPAN&gt;cblas_ddot. My test program simply calls cblas_ddot on arrays of size 4095 and 4096 (100,000 times in a loop). I measure the calculation time and compare it to a naive implementation, which looks like this:&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;double naive_ddot(int n, double *x, double *y) {
	double sum = 0.0;
	for (int i = 0; i &amp;lt; n; i++) {
		sum += x[i] * y[i];
	}
	return sum;
}&lt;/LI-CODE&gt;&lt;P&gt;&lt;SPAN&gt;The program outputs the following:&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;MKL, n = 4095, it took 45.2412 ms.
MKL, n = 4096, it took 1030.23 ms.
Naive, n = 4095, it took 153.349 ms.
Naive, n = 4096, it took 155.955 ms.&lt;/LI-CODE&gt;&lt;P&gt;As you can see, when the size is increased from 4095 to 4096, the MKL version becomes about 20 times slower. I think this is the size at which it starts using a threading library. These results were obtained with "Intel threading" (which is the default). If I run&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;mkl_set_threading_layer(MKL_THREADING_TBB);&lt;/LI-CODE&gt;&lt;P&gt;beforehand, the results look more reasonable:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;MKL, n = 4095, it took 45.232 ms.
MKL, n = 4096, it took 121.682 ms.
Naive, n = 4095, it took 155.876 ms.
Naive, n = 4096, it took 154.11 ms.&lt;/LI-CODE&gt;&lt;P&gt;(There is still a significant slowdown, but not 20x.)&lt;/P&gt;&lt;P&gt;I did all of this with the latest available MKL (2025.2.0) on Windows 10; the CPU is a&amp;nbsp;13th Gen Intel(R) Core(TM) i7-13850HX 2.10 GHz.&lt;/P&gt;&lt;P&gt;I'm not allowed to attach files anymore, so here's the full text of the .cpp file:&lt;/P&gt;&lt;LI-CODE lang="cpp"&gt;#include &amp;lt;iostream&amp;gt;
#include &amp;lt;iomanip&amp;gt;
#include &amp;lt;chrono&amp;gt;
#include &amp;lt;cstdlib&amp;gt;
#include "mkl.h"

double naive_ddot(int n, double *x, double *y) {
	double sum = 0.0;
	for (int i = 0; i &amp;lt; n; i++) {
		sum += x[i] * y[i];
	}

	return sum;
}

double do_mkl(int n, double *x, double * y) {
	auto startTime = std::chrono::steady_clock::now();
	double sum = 0.0;
	for (int t = 0; t &amp;lt; 100000; t++) {
		double result = cblas_ddot(n, x, 1, y, 1);
		sum += result;
	}

	auto endTime = std::chrono::steady_clock::now();
	auto duration = std::chrono::duration&amp;lt;double&amp;gt;(endTime - startTime);
	std::cout &amp;lt;&amp;lt; "MKL, n = " &amp;lt;&amp;lt; n &amp;lt;&amp;lt; ", it took " &amp;lt;&amp;lt; duration.count() * 1000 &amp;lt;&amp;lt; " ms." &amp;lt;&amp;lt; std::endl;
	return sum;
}

double do_naive(int n, double *x, double *y) {
	auto startTime = std::chrono::steady_clock::now();
	double sum = 0.0;
	for (int t = 0; t &amp;lt; 100000; t++) {
		double result = naive_ddot(n, x, y);
		sum += result;
	}

	auto endTime = std::chrono::steady_clock::now();
	auto duration = std::chrono::duration&amp;lt;double&amp;gt;(endTime - startTime);
	std::cout &amp;lt;&amp;lt; "Naive, n = " &amp;lt;&amp;lt; n &amp;lt;&amp;lt; ", it took " &amp;lt;&amp;lt; duration.count() * 1000 &amp;lt;&amp;lt; " ms." &amp;lt;&amp;lt; std::endl;
	return sum;
}

int main() {
	//mkl_set_threading_layer(MKL_THREADING_TBB);
	int n = 4096;
	double *x = (double *)malloc(n * sizeof(double));
	double *y = (double *)malloc(n * sizeof(double));
	for (int i = 0; i &amp;lt; n; i++) {
		x[i] = 1.0 + i * 0.1;
		y[i] = 3.0 - i * 0.1;
	}

	double sum = 0.0;
	sum += do_mkl(n - 1, x, y);
	sum += do_mkl(n, x, y);
	sum += do_naive(n - 1, x, y);
	sum += do_naive(n, x, y);

	std::cout &amp;lt;&amp;lt; "Total sum is " &amp;lt;&amp;lt; sum &amp;lt;&amp;lt; std::endl;

	free(x);
	free(y);
}&lt;/LI-CODE&gt;</description>
      <pubDate>Wed, 23 Jul 2025 06:39:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-drop-in-BLAS-dot-product-function-in-MKL-2025/m-p/1704817#M37265</guid>
      <dc:creator>mahalex</dc:creator>
      <dc:date>2025-07-23T06:39:25Z</dc:date>
    </item>
    <item>
      <title>Re: Performance drop in BLAS dot product function in MKL 2025</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-drop-in-BLAS-dot-product-function-in-MKL-2025/m-p/1705546#M37275</link>
      <description>&lt;P&gt;You need to be careful with this type of benchmark.&lt;BR /&gt;&lt;BR /&gt;There is significant overhead the first time MKL (or any OpenMP code) needs to use multiple threads. You should run the MKL benchmarks twice before doing any timing to get all the threads into a ready state.&lt;BR /&gt;&lt;BR /&gt;MKL, n = 4095, it took 94.3632 ms.&lt;BR /&gt;MKL, n = 4096, it took 254.286 ms.&lt;BR /&gt;Naive, n = 4095, it took 404.669 ms.&lt;BR /&gt;Naive, n = 4096, it took 396.192 ms.&lt;BR /&gt;&lt;STRONG&gt;MKL, n = 4095, it took 75.0881 ms.&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;MKL, n = 4096, it took 216.92 ms.&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;Interestingly, I put #pragma omp parallel for around your loops and got the following timings.&lt;BR /&gt;You can see that MKL disables its internal use of threads (OpenMP disables nested parallelism by default), so the 4096 vs 4095 timings are basically identical.&lt;BR /&gt;&lt;BR /&gt;MKL, n = 4095, it took 64.1032 ms.&lt;BR /&gt;MKL, n = 4096, it took 21.2253 ms.&lt;BR /&gt;Naive, n = 4095, it took 52.8654 ms.&lt;BR /&gt;Naive, n = 4096, it took 50.9075 ms.&lt;BR /&gt;&lt;STRONG&gt;MKL, n = 4095, it took 13.1487 ms.&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;MKL, n = 4096, it took 13.8093 ms.&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;But it does look like MKL starts using OpenMP threads at an N that is probably too low.&lt;BR /&gt;&lt;BR /&gt;Xeon(R) W-2145 CPU @ 3.70GHz, 3696 MHz, 8 Core(s), 16 Logical Processor(s)&lt;/P&gt;</description>
      <pubDate>Fri, 25 Jul 2025 17:37:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-drop-in-BLAS-dot-product-function-in-MKL-2025/m-p/1705546#M37275</guid>
      <dc:creator>AndrewC2</dc:creator>
      <dc:date>2025-07-25T17:37:03Z</dc:date>
    </item>
    <item>
      <title>Re: Performance drop in BLAS dot product function in MKL 2025</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-drop-in-BLAS-dot-product-function-in-MKL-2025/m-p/1706905#M37281</link>
      <description>&lt;P&gt;I totally agree with the "first time overhead" argument. However, I didn't notice any improvement when I ran the benchmarks twice in a row. If I call do_mkl() and do_naive() twice with the same arguments, I get the following:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;c:\git\mkltest\02 dot product&amp;gt;mkltest.exe
MKL, n = 4095, it took 50.1866 ms.
MKL, n = 4096, it took 1131.65 ms.
Naive, n = 4095, it took 154.97 ms.
Naive, n = 4096, it took 153.885 ms.
MKL, n = 4095, it took 41.0654 ms.
MKL, n = 4096, it took 1118.54 ms.
Naive, n = 4095, it took 155.749 ms.
Naive, n = 4096, it took 158.194 ms.
Total sum is -1.81766e+14&lt;/LI-CODE&gt;&lt;P&gt;This shows no significant change in the running time.&lt;/P&gt;</description>
      <pubDate>Fri, 01 Aug 2025 09:13:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-drop-in-BLAS-dot-product-function-in-MKL-2025/m-p/1706905#M37281</guid>
      <dc:creator>mahalex</dc:creator>
      <dc:date>2025-08-01T09:13:43Z</dc:date>
    </item>
  </channel>
</rss>

