<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic If you used Intel C++ both a in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/cblas-dscal-metdod/m-p/963520#M16043</link>
    <description>&lt;P&gt;If you used Intel C++, both a memset and a normal for loop performing the same operation would be replaced by the same optimized function from the Intel C++ library. That function would use streaming stores if it found the array large enough. I don't know whether the Microsoft memset would do so. It appears that you have managed to defeat your compiler's normal optimization.&lt;/P&gt;

&lt;P&gt;dscal doesn't do the same operation as memset. It has to be expected to take longer, even after the MKL initialization is complete, as it has to perform twice as many memory operations (a load and a store per element, where memset only stores), besides the arithmetic.&lt;/P&gt;

&lt;P&gt;Successful threading of the operation would speed it up (though not by 2x) on a dual-CPU platform, but maybe not on a single CPU.&lt;/P&gt;</description>
    <pubDate>Tue, 21 Jan 2014 14:09:00 GMT</pubDate>
    <dc:creator>TimP</dc:creator>
    <dc:date>2014-01-21T14:09:00Z</dc:date>
    <item>
      <title>cblas_dscal method</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/cblas-dscal-metdod/m-p/963518#M16041</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;

&lt;P&gt;I implemented the code segment below to compare "memset", a "for loop", and "cblas_dscal" (Intel MKL).&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;The test results are as:&lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;memset: G0 = 0.000000 6954 ns&lt;/P&gt;

&lt;P&gt;for: G0 = 1.000000 38741 ns&lt;/P&gt;

&lt;P&gt;MKL: G0 = 2.000000 12911294 ns&lt;BR /&gt;
	MKL: G0 = 4.000000 13907 ns&lt;BR /&gt;
	MKL: G0 = 8.000000 10264 ns&lt;BR /&gt;
	MKL: G0 = 16.000000 10265 ns&lt;/P&gt;

&lt;P&gt;As seen above, the first MKL cblas_dscal call takes 12911294 ns, but the second call takes only 13907 ns.&lt;/P&gt;

&lt;P&gt;What makes the first cblas_dscal call take so much time, and how can I make it take less?&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;My code is:&lt;/STRONG&gt;&lt;/P&gt;

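&lt;P&gt;#include "stdafx.h"&lt;BR /&gt;
	#include &amp;lt;stdio.h&amp;gt; //printf, getchar&lt;BR /&gt;
	#include &amp;lt;stdlib.h&amp;gt; //malloc&lt;BR /&gt;
	#include &amp;lt;string.h&amp;gt; //memset&lt;BR /&gt;
	#include "mkl_dfti.h"&lt;BR /&gt;
	#include "mkl.h"&lt;BR /&gt;
	#include &amp;lt;windows.h&amp;gt;&lt;/P&gt;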

&lt;P&gt;static LARGE_INTEGER freq;&lt;/P&gt;

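&lt;P&gt;//returns the current performance-counter reading converted to nanoseconds (64-bit, so the value is not truncated)&lt;BR /&gt;
	unsigned long long getTimeInNanoSec()&lt;BR /&gt;
	{&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp;LARGE_INTEGER counterCurrent;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp;QueryPerformanceCounter(&amp;amp;counterCurrent);&lt;/P&gt;

&lt;P&gt;&amp;nbsp; &amp;nbsp;//freq holds the number of ticks per second; whole seconds and the remainder are converted separately so that multiplying by 1000000000 does not overflow for realistic counter values&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp;unsigned long long ticks = counterCurrent.QuadPart;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp;unsigned long long ticksPerSec = freq.QuadPart;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp;return (ticks / ticksPerSec) * 1000000000ULL + (ticks % ticksPerSec) * 1000000000ULL / ticksPerSec;&lt;BR /&gt;
	}&lt;/P&gt;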

&lt;P&gt;&lt;BR /&gt;
	int _tmain(int argc, _TCHAR* argv[])&lt;BR /&gt;
	{&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;MKL_Complex16 *xxx;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;const int N = 10240;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;unsigned long long start, end; //64-bit, to match the %llu format used below&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;double* G = NULL;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;int idx = 0;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;//set the timer counter frequency&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; QueryPerformanceFrequency(&amp;amp;freq); //ticks per second&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;G = (double*)malloc(N*sizeof(double)); //buffer used by all three tests&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;//1. memset&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; start = getTimeInNanoSec();&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; memset(G,0x0,N*sizeof(double));&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; end = getTimeInNanoSec();&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; printf("memset: G0:%f %llu ns\n\n", G[100], end - start);&lt;/P&gt;

&lt;P&gt;&amp;nbsp; &amp;nbsp; //2. for&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; start = getTimeInNanoSec();&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; for(idx = 0; idx &amp;lt; N; idx += 5) //manually unrolled by 5; N is a multiple of 5&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; {&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;G[idx] = 1.0;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; G[idx+1] = 1.0;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; G[idx+2] = 1.0;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; G[idx+3] = 1.0;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; G[idx+4] = 1.0;&lt;/P&gt;

&lt;P&gt;&amp;nbsp; &amp;nbsp; }&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; end = getTimeInNanoSec();&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; printf("for: G0:%f %llu ns\n\n", G[100], end - start);&lt;/P&gt;

&lt;P&gt;&amp;nbsp; &amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; //3. mkl blas&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; for(idx = 0;idx &amp;lt;4 ;idx++)&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;{&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;start = getTimeInNanoSec();&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;cblas_dscal(N, 2.0 , G , 1 );&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;end = getTimeInNanoSec();&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;printf("MKL: G0:%f %llu ns\n\n", G[100], end - start);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;}&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;getchar();&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; return 0;&lt;BR /&gt;
	}&lt;/P&gt;</description>
      <pubDate>Mon, 20 Jan 2014 07:50:21 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/cblas-dscal-metdod/m-p/963518#M16041</guid>
      <dc:creator>Selin_S_</dc:creator>
      <dc:date>2014-01-20T07:50:21Z</dc:date>
    </item>
    <item>
      <title>Hello,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/cblas-dscal-metdod/m-p/963519#M16042</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;

&lt;P&gt;The first call to an MKL routine performs MKL and OpenMP initialization. The initialization overhead may be significant, especially for Level 1 functions, which have relatively short execution times. One can make a warm-up MKL function call to exclude the initialization cost from the performance measurements (essentially ignoring the time required for your first iteration).&lt;/P&gt;

&lt;P&gt;Unfortunately, there is not much one can do to avoid the initialization cost. If an MKL function is multithreaded, one can try running it sequentially to minimize the OpenMP initialization cost. However, I think MKL dscal is not threaded currently.&lt;/P&gt;
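
&lt;P&gt;The warm-up approach can be sketched roughly as follows. This is only a minimal illustration, not MKL code: scale_in_place is a hypothetical stand-in for a routine such as cblas_dscal so that the sketch builds without MKL, and the helper names and the use of the standard clock() timer are assumptions made for the example. In a real measurement the warm-up call would go to the library routine itself, and a finer timer such as QueryPerformanceCounter would probably be preferable.&lt;/P&gt;

&lt;PRE&gt;
```c
/* Sketch only: scale_in_place is a hypothetical stand-in for a library
   routine such as cblas_dscal, so the example builds without MKL. */
#include "assert.h"
#include "stddef.h"
#include "time.h"

/* Hypothetical stand-in: multiplies n doubles by alpha in place. */
void scale_in_place(size_t n, double alpha, double *x)
{
    size_t i;
    for (i = 0; i != n; ++i)
        x[i] *= alpha;
}

/* Times a single call in seconds with the standard clock() function. */
double time_one_call(size_t n, double alpha, double *x)
{
    clock_t start = clock();
    scale_in_place(n, alpha, x);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Makes one untimed warm-up call, then returns the mean time of
   reps timed calls. The warm-up uses alpha = 1.0, which leaves the
   data unchanged, so only the timed calls modify x. */
double mean_time_after_warmup(size_t n, double alpha, double *x, int reps)
{
    double total = 0.0;
    int r;
    assert(reps > 0);
    scale_in_place(n, 1.0, x);   /* warm-up call, timing discarded */
    for (r = 0; r != reps; ++r)
        total += time_one_call(n, alpha, x);
    return total / reps;
}
```
&lt;/PRE&gt;

&lt;P&gt;Called as mean_time_after_warmup(N, 2.0, G, 4), this would report a figure that leaves out the one-time start-up cost, which is the same effect as ignoring the first iteration in the original loop.&lt;/P&gt;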

&lt;P&gt;Thank you!&lt;/P&gt;</description>
      <pubDate>Tue, 21 Jan 2014 09:46:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/cblas-dscal-metdod/m-p/963519#M16042</guid>
      <dc:creator>Murat_G_Intel</dc:creator>
      <dc:date>2014-01-21T09:46:00Z</dc:date>
    </item>
    <item>
      <title>If you used Intel C++ both a</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/cblas-dscal-metdod/m-p/963520#M16043</link>
      <description>&lt;P&gt;If you used Intel C++, both a memset and a normal for loop performing the same operation would be replaced by the same optimized function from the Intel C++ library. That function would use streaming stores if it found the array large enough. I don't know whether the Microsoft memset would do so. It appears that you have managed to defeat your compiler's normal optimization.&lt;/P&gt;

&lt;P&gt;dscal doesn't do the same operation as memset. It has to be expected to take longer, even after the MKL initialization is complete, as it has to perform twice as many memory operations (a load and a store per element, where memset only stores), besides the arithmetic.&lt;/P&gt;

&lt;P&gt;Successful threading of the operation would speed it up (though not by 2x) on a dual-CPU platform, but maybe not on a single CPU.&lt;/P&gt;</description>
      <pubDate>Tue, 21 Jan 2014 14:09:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/cblas-dscal-metdod/m-p/963520#M16043</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2014-01-21T14:09:00Z</dc:date>
    </item>
  </channel>
</rss>

