<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic VML performance very slow in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-performance-very-slow/m-p/822701#M4912</link>
    <description>Hi mklvml,&lt;BR /&gt;&lt;BR /&gt;Let me try to provide some insight into the test case's execution efficiency.&lt;BR /&gt;&lt;BR /&gt;The first thing I noticed is that you're interested in measuring the performance of f(i)=a(i)*b(i)+c(i)*d(i)+a(i), where i goes from 1 to 50M, in double precision. I couldn't find a reference to the compiler you use for the performance evaluation, but I found that you use MKL 10.2. You also refer to an Intel Core 2 Duo processor. Please correct me if I misinterpreted you.&lt;BR /&gt;&lt;BR /&gt;First, I should say that modern Intel processors can issue multiply and add instructions within the same processor cycle. That is, c(i)*d(i)+a(i) can be issued in one cycle. As soon as the result (let it be tmp(i)) of this operation is ready, the processor can issue a(i)*b(i)+tmp(i) within the same cycle.&lt;BR /&gt;&lt;BR /&gt;Next, if you use a modern optimizing compiler for x86, such as the Intel Fortran or C++ compiler, the compiler is capable of vectorizing the code by using vector SSE2 instructions. As a result, the processor will execute two consecutive loop iterations in parallel, e.g. the i-th and the (i+1)-th. Modern compilers can also unroll the loop and schedule instructions in such a way that the latency of the computation of tmp(i) is hidden by other computations (from other loop iterations).&lt;BR /&gt;&lt;BR /&gt;The point is that if you use a smart enough compiler, then a(i)*b(i)+c(i)*d(i)+a(i) is not executed literally as it is written.&lt;BR /&gt;&lt;BR /&gt;Let us have a look at what happens if you call vdAdd from Intel MKL. MKL vector add executes the add operation on vector elements. That simple fact means that by calling VML add you underutilize the CPU multiply unit for a long time. In the next step you call vector multiply, which underutilizes the CPU add unit. I would recommend looking for other MKL primitives that better balance the use of the add and multiply CPU units, e.g. the dot product functions in MKL or similar ones.&lt;BR /&gt;&lt;BR /&gt;A few notes about the threading efficiency of the vector primitives. Let's have a look at the VML Performance and Accuracy charts available with the MKL documentation:&lt;BR /&gt;&lt;A href="http://software.intel.com/sites/products/documentation/hpc/mkl/vml3/functions/mul.html" target="_blank"&gt;http://software.intel.com/sites/products/documentation/hpc/mkl/vml3/functions/mul.html&lt;/A&gt;&lt;BR /&gt;&lt;A href="http://software.intel.com/sites/products/documentation/hpc/mkl/vml3/functions/add.html" target="_blank"&gt;http://software.intel.com/sites/products/documentation/hpc/mkl/vml3/functions/add.html&lt;/A&gt;&lt;BR /&gt;You can notice a few interesting facts in that data:&lt;BR /&gt;1) Threading adds non-negligible overhead to the function execution time, which is especially noticeable at shorter vector lengths.&lt;BR /&gt;2) Because of those overheads, threading only makes sense when the vector size is big enough, which conflicts with the objective of using shorter vectors to fit into the L2 cache.&lt;BR /&gt;&lt;BR /&gt;Please note again that a modern CPU can issue 2 adds and 2 muls every cycle; these are really tiny performance primitives. Threading is not free; it is typically quite expensive, which is why people tend to do threading at the highest possible level (the application level). So I'm basically not surprised that you're not seeing the performance gains.&lt;BR /&gt;&lt;BR /&gt;Please don't hesitate to contact me if you need more details.&lt;BR /&gt;Regards,&lt;BR /&gt;Sergey</description>
    <pubDate>Fri, 21 Jan 2011 06:49:45 GMT</pubDate>
    <dc:creator>Sergey_M_Intel2</dc:creator>
    <dc:date>2011-01-21T06:49:45Z</dc:date>
    <item>
      <title>VML performance very slow</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-performance-very-slow/m-p/822696#M4907</link>
      <description>&lt;P&gt;I wrote this small subroutine that compares simple vector mathematical functions, performed either with a loop:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;f(i) = a(i) + b(i)&lt;BR /&gt;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;or directly (array syntax):&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;f = a + b&lt;BR /&gt;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;or using Intel MKL VML:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;call vdAdd(n,a,b,f)&lt;BR /&gt;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;The timing results for n=50000000 are:&lt;/P&gt;&lt;P&gt;VML 0.9 sec&lt;/P&gt;&lt;P&gt;direct 0.4 sec&lt;/P&gt;&lt;P&gt;loop 0.4 sec&lt;/P&gt;&lt;P&gt;And I don't understand why VML takes twice as long as the other methods! (The loop is sometimes faster than the direct version.)&lt;/P&gt;&lt;P&gt;I used threaded MKL with 2 or 1 threads on an Intel Core 2 Duo, but the result stays the same.&lt;/P&gt;&lt;P&gt;Flags: /O3 /MT /Qopenmp /heap-arrays0&lt;/P&gt;&lt;P&gt;The subroutine can be found under &lt;A href="http://paste.ideaslabs.com/show/L6dVLdAOIf" rel="nofollow"&gt;http://paste.ideaslabs.com/show/L6dVLdAOIf&lt;/A&gt; and is called via:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;program test&lt;BR /&gt;&lt;BR /&gt;  use vmltests&lt;BR /&gt;  implicit none&lt;BR /&gt;&lt;BR /&gt;  call vmlTest()&lt;BR /&gt;&lt;BR /&gt;end program&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Mon, 17 Jan 2011 17:19:31 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-performance-very-slow/m-p/822696#M4907</guid>
      <dc:creator>mklvml</dc:creator>
      <dc:date>2011-01-17T17:19:31Z</dc:date>
    </item>
    <item>
      <title>VML performance very slow</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-performance-very-slow/m-p/822697#M4908</link>
      <description>What version of MKL are you using?</description>
      <pubDate>Mon, 17 Jan 2011 20:41:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-performance-very-slow/m-p/822697#M4908</guid>
      <dc:creator>Gennady_F_Intel</dc:creator>
      <dc:date>2011-01-17T20:41:08Z</dc:date>
    </item>
    <item>
      <title>VML performance very slow</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-performance-very-slow/m-p/822698#M4909</link>
      <description>&lt;DIV id="_mcePaste"&gt;Vector Math Functions work best when data is in L2 cache. n=50000000 is way out of L2 cache.&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;Your example code is not just a(i)+b(i), it is f(i)=a(i)*b(i)+c(i)*d(i)+a(i), which you replace by several MKL VML calls.MKL VML walks through all this memory for each call, while compiler optimized code groups computation and walks through this memory only once.&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;In order to overcome this limitation you may apply common optimization technique named blocking:&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;Do i=1,50000&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; call vdMul(1000,a(i*1000),b(i*1000),e(i*1000))&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; call vdMul(1000,c(i*1000),d(i*1000),f(i*1000))&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; call vdAdd(1000,f(i*1000),e(i*1000),f(i*1000))&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; call vdAdd(1000,f(i*1000),a(i*1000),f(i*1000))&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;End do&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;Each block will be within L2 cache.&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;When/if you try more complex functions you will come to yet another effect: by default compiler will use less accurate functions than MKL. Use vmlSetMode function to set MKL VML accuracy to the same level:&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;mode=VML_LA&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;mode=VMLSETMODE(mode)&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;And yes, which MKL version are you using?&lt;/DIV&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Tue, 18 Jan 2011 08:23:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-performance-very-slow/m-p/822698#M4909</guid>
      <dc:creator>Ilya_B_Intel</dc:creator>
      <dc:date>2011-01-18T08:23:48Z</dc:date>
    </item>
    <item>
      <title>VML performance very slow</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-performance-very-slow/m-p/822699#M4910</link>
      <description>MKL 10.2.&lt;BR /&gt;&lt;BR /&gt;Thank you for these insights!</description>
      <pubDate>Thu, 20 Jan 2011 15:01:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-performance-very-slow/m-p/822699#M4910</guid>
      <dc:creator>mklvml</dc:creator>
      <dc:date>2011-01-20T15:01:50Z</dc:date>
    </item>
    <item>
      <title>VML performance very slow</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-performance-very-slow/m-p/822700#M4911</link>
      <description>The L2 cache problem does not explain why the MKL functions do not scale at all on a dual-core processor!&lt;BR /&gt;&lt;BR /&gt;If the number of MKL threads is set to 2, the CPU usage simply doubles!</description>
      <pubDate>Thu, 20 Jan 2011 15:05:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-performance-very-slow/m-p/822700#M4911</guid>
      <dc:creator>mklvml</dc:creator>
      <dc:date>2011-01-20T15:05:07Z</dc:date>
    </item>
    <item>
      <title>VML performance very slow</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-performance-very-slow/m-p/822701#M4912</link>
      <description>Hi mklvml,&lt;BR /&gt;&lt;BR /&gt;Let me try to provide some insight into the test case's execution efficiency.&lt;BR /&gt;&lt;BR /&gt;The first thing I noticed is that you're interested in measuring the performance of f(i)=a(i)*b(i)+c(i)*d(i)+a(i), where i goes from 1 to 50M, in double precision. I couldn't find a reference to the compiler you use for the performance evaluation, but I found that you use MKL 10.2. You also refer to an Intel Core 2 Duo processor. Please correct me if I misinterpreted you.&lt;BR /&gt;&lt;BR /&gt;First, I should say that modern Intel processors can issue multiply and add instructions within the same processor cycle. That is, c(i)*d(i)+a(i) can be issued in one cycle. As soon as the result (let it be tmp(i)) of this operation is ready, the processor can issue a(i)*b(i)+tmp(i) within the same cycle.&lt;BR /&gt;&lt;BR /&gt;Next, if you use a modern optimizing compiler for x86, such as the Intel Fortran or C++ compiler, the compiler is capable of vectorizing the code by using vector SSE2 instructions. As a result, the processor will execute two consecutive loop iterations in parallel, e.g. the i-th and the (i+1)-th. Modern compilers can also unroll the loop and schedule instructions in such a way that the latency of the computation of tmp(i) is hidden by other computations (from other loop iterations).&lt;BR /&gt;&lt;BR /&gt;The point is that if you use a smart enough compiler, then a(i)*b(i)+c(i)*d(i)+a(i) is not executed literally as it is written.&lt;BR /&gt;&lt;BR /&gt;Let us have a look at what happens if you call vdAdd from Intel MKL. MKL vector add executes the add operation on vector elements. That simple fact means that by calling VML add you underutilize the CPU multiply unit for a long time. In the next step you call vector multiply, which underutilizes the CPU add unit. I would recommend looking for other MKL primitives that better balance the use of the add and multiply CPU units, e.g. the dot product functions in MKL or similar ones.&lt;BR /&gt;&lt;BR /&gt;A few notes about the threading efficiency of the vector primitives. Let's have a look at the VML Performance and Accuracy charts available with the MKL documentation:&lt;BR /&gt;&lt;A href="http://software.intel.com/sites/products/documentation/hpc/mkl/vml3/functions/mul.html" target="_blank"&gt;http://software.intel.com/sites/products/documentation/hpc/mkl/vml3/functions/mul.html&lt;/A&gt;&lt;BR /&gt;&lt;A href="http://software.intel.com/sites/products/documentation/hpc/mkl/vml3/functions/add.html" target="_blank"&gt;http://software.intel.com/sites/products/documentation/hpc/mkl/vml3/functions/add.html&lt;/A&gt;&lt;BR /&gt;You can notice a few interesting facts in that data:&lt;BR /&gt;1) Threading adds non-negligible overhead to the function execution time, which is especially noticeable at shorter vector lengths.&lt;BR /&gt;2) Because of those overheads, threading only makes sense when the vector size is big enough, which conflicts with the objective of using shorter vectors to fit into the L2 cache.&lt;BR /&gt;&lt;BR /&gt;Please note again that a modern CPU can issue 2 adds and 2 muls every cycle; these are really tiny performance primitives. Threading is not free; it is typically quite expensive, which is why people tend to do threading at the highest possible level (the application level). So I'm basically not surprised that you're not seeing the performance gains.&lt;BR /&gt;&lt;BR /&gt;Please don't hesitate to contact me if you need more details.&lt;BR /&gt;Regards,&lt;BR /&gt;Sergey</description>
      <pubDate>Fri, 21 Jan 2011 06:49:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-performance-very-slow/m-p/822701#M4912</guid>
      <dc:creator>Sergey_M_Intel2</dc:creator>
      <dc:date>2011-01-21T06:49:45Z</dc:date>
    </item>
    <item>
      <title>VML performance very slow</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-performance-very-slow/m-p/822702#M4913</link>
      <description>&lt;P&gt;Additional inputs.&lt;/P&gt;&lt;P&gt;I took your test case and reduced it to the original question: what if we take only one addition? This is the case where no mul+add pairing is possible, and MKL should give results similar to compiler-generated code.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;Call StartTime(time(:,1))&lt;BR /&gt;  call vdAdd(n,c,d,f)&lt;BR /&gt;Call StopTime(time(:,1))&lt;BR /&gt;&lt;BR /&gt;Call StartTime(time(:,2))&lt;BR /&gt;Do i = 1, n&lt;BR /&gt;  f(i)=c(i)+d(i)&lt;BR /&gt;End do&lt;BR /&gt;Call StopTime(time(:,2))&lt;BR /&gt;&lt;BR /&gt;Call StartTime(time(:,3))&lt;BR /&gt;  f=c+d&lt;BR /&gt;Call StopTime(time(:,3))&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Main finding: the slowest time is not the VML call, but whichever variant is measured first. So if you measure the direct call first, it will be the slowest.&lt;/P&gt;&lt;P&gt;The reasons:&lt;/P&gt;&lt;P&gt;1) Your timing routine has first-call initialization, so we were measuring the initialization of the timing routine and not the computations. To remove that effect, we make a fake timing measurement first:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;Call StartTime(time(:,4))&lt;BR /&gt;Call StopTime(time(:,4))&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;2) Your output array is allocated but not yet touched, which brings its own issues (first-touch page faults). For a fair comparison we put something in it:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;call random_number(f)&lt;BR /&gt;call random_number(c)&lt;BR /&gt;call random_number(d)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;3) I am not aware of your memory limits, but 3 such double-precision arrays with 50M elements each is 1.2 GB. In our example we remove the a, b, e array allocations.&lt;/P&gt;&lt;P&gt;Now we run the new example and the timings are very close:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;VML    0.3679440   0.5000000&lt;BR /&gt;Loop   0.3649440   0.5000000&lt;BR /&gt;Direct 0.3659451   0.5000000&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Another question: threading. The vdAdd and vdMul functions are threaded starting with MKL 10.3.&lt;/P&gt;</description>
      <pubDate>Fri, 21 Jan 2011 10:57:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-performance-very-slow/m-p/822702#M4913</guid>
      <dc:creator>Ilya_B_Intel</dc:creator>
      <dc:date>2011-01-21T10:57:23Z</dc:date>
    </item>
    <item>
      <title>Hello ,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-performance-very-slow/m-p/822703#M4914</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;

&lt;P&gt;I am facing the same kind of problem as mklvml mentioned above. I want to use the Intel VML functions in my subroutines, which are written in Fortran 90. I wrote a code to test the timing difference, using multiplication as the operation on arrays generated with the random number generator. The typical array size in my subroutines is 10^6. The results of my code are given below.&lt;/P&gt;

&lt;P&gt;I am asking this again because I cannot find the code attached by mklvml, and it's difficult to follow the comments without having a look at the code. Also, in my case I want to be sure about the timing improvement before applying it to my subroutines.&lt;/P&gt;

&lt;P&gt;So please do share your comments on it.&lt;/P&gt;

&lt;P&gt;The &lt;STRONG&gt;output&lt;/STRONG&gt; that I get is:&lt;/P&gt;

&lt;P&gt;t3 - t2 (Do loop) = 8 sec&lt;/P&gt;

&lt;P&gt;t4 - t3 (VML function) = 49 sec&lt;/P&gt;

&lt;P&gt;As per the output, it seems that do loops are faster than the VML function.&lt;/P&gt;</description>
      <pubDate>Fri, 24 Oct 2014 14:42:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-performance-very-slow/m-p/822703#M4914</guid>
      <dc:creator>Malav_S_</dc:creator>
      <dc:date>2014-10-24T14:42:47Z</dc:date>
    </item>
    <item>
      <title>-msse4.1 appears to improve</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-performance-very-slow/m-p/822704#M4915</link>
      <description>&lt;P&gt;-msse4.1 appears to improve performance on my Westmere, though I don't know why.&lt;/P&gt;

&lt;P&gt;The VML code runs fastest at about 4 threads (out of the default 24), while the in-line code runs in 1 thread with nontemporal stores.&amp;nbsp; Evidently, the case is limited by memory bandwidth and cache issues.&lt;/P&gt;</description>
      <pubDate>Fri, 24 Oct 2014 15:58:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-performance-very-slow/m-p/822704#M4915</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2014-10-24T15:58:16Z</dc:date>
    </item>
    <item>
      <title> </title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-performance-very-slow/m-p/822705#M4916</link>
      <description>&lt;P&gt;Yes, and moreover HT enables two hardware threads to execute at the same time because of the doubled architectural state.&amp;nbsp;The execution units of the CPU (the FP and SIMD stacks) are shared between those two threads, and if there are no instruction interdependencies, only one of those threads can *issue fmul and fadd thread-ID-tagged uops at the same time.&lt;/P&gt;

&lt;P&gt;*issue - the scheduler will issue thread-ID-tagged uops.&lt;/P&gt;</description>
      <pubDate>Mon, 27 Oct 2014 06:18:28 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-performance-very-slow/m-p/822705#M4916</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2014-10-27T06:18:28Z</dc:date>
    </item>
    <item>
      <title>To summarize some of the</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-performance-very-slow/m-p/822706#M4917</link>
      <description>&lt;P&gt;To summarize some of the above:&lt;/P&gt;

&lt;P&gt;Current CPUs aren't like those of 30 years ago, when it made sense to combine library calls like VML in the absence of threading and caching.&amp;nbsp; You can achieve much better performance by allowing a compiler to see more of the picture and eliminate unnecessary memory traffic.&lt;/P&gt;

&lt;P&gt;Sergey suggested you should use (and compile for) an AVX2 CPU.&amp;nbsp; This could increase the margin of performance a compiler can achieve vs. a series of VML calls.&amp;nbsp; If you are using a Core 2 Duo (I missed the hints about that), VML doesn't have much latitude to use too many threads or to choose an ineffective instruction set.&amp;nbsp; You may still want to compile for SSE4.1 if you have one of the later Core 2 Duos that supports it, and you are willing to spend a few minutes looking at compiler options.&lt;/P&gt;</description>
      <pubDate>Tue, 28 Oct 2014 21:36:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-performance-very-slow/m-p/822706#M4917</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2014-10-28T21:36:04Z</dc:date>
    </item>
  </channel>
</rss>

