<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic BLAS performance vs naive c code in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/BLAS-performance-vs-naive-c-code/m-p/763594#M76</link>
    <description>The usual case for daxpy to give a performance advantage is for the reputed 90% of programmers using a compiler without auto-vectorization, or a case of failed vectorization (possibly due to a compiler being concerned about operand overlap). It's not possible for daxpy to out-perform correctly auto-vectorized code, at least not when there is no opportunity to gain by vector parallel execution. Even on the Xeon 7550, to see a significant gain from threading, the structs would have to be much larger and accessed consistently (from first touch) by threads affinitized to the 2 sockets. &lt;BR /&gt;I suppose you'd have to analyze the actual example to try to find out what optimizations are missed with the struct of pointers. As far as I know, it's not a common programming model which would appear in a corpus of code for which the compiler is performance tested. You could experiment by making an explicit local plain pointer copy just ahead of the loop.</description>
    <pubDate>Thu, 14 Apr 2011 13:06:35 GMT</pubDate>
    <dc:creator>TimP</dc:creator>
    <dc:date>2011-04-14T13:06:35Z</dc:date>
    <item>
      <title>BLAS performance vs naive c code</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/BLAS-performance-vs-naive-c-code/m-p/763593#M75</link>
      <description>Hi&lt;BR /&gt;&lt;BR /&gt;I am a new BLAS user, trying to improve C code for solving a time-dependent 2D wave equation (PML absorbing boundaries) by replacing some of my loops with CBLAS functions. Just to get a feel, I started by concentrating on one code block within the program, the block for updating p below.&lt;BR /&gt;&lt;BR /&gt;&lt;P&gt;struct grid {&lt;BR /&gt; double dt;&lt;BR /&gt; int ny;&lt;BR /&gt; int nx;&lt;BR /&gt; /* and other grid info ... */&lt;BR /&gt;};&lt;BR /&gt;&lt;BR /&gt;struct pmlwaves {&lt;BR /&gt; double *p;&lt;BR /&gt; double *pdot;&lt;BR /&gt; /* other pointers to double needed for the PML absorbing boundary method ... */&lt;BR /&gt;};&lt;/P&gt;&lt;BR /&gt;int main(){&lt;BR /&gt;&lt;BR /&gt;struct pmlwaves w;&lt;BR /&gt;struct grid g;&lt;BR /&gt;int k, tstep;&lt;BR /&gt;&lt;BR /&gt;/* different initializations ... */&lt;BR /&gt;&lt;BR /&gt;for(tstep=0; tstep &amp;lt; num_tsteps; tstep++){&lt;BR /&gt;&lt;BR /&gt; /* different calculations to find pdot ... */&lt;BR /&gt;&lt;BR /&gt; /* option 1: naive code */&lt;BR /&gt; for(k=0; k &amp;lt; g.ny*g.nx; k++) w.p[k] = w.p[k] + g.dt * w.pdot[k]; /* advance solution one time step */&lt;BR /&gt;&lt;BR /&gt; /* option 2: CBLAS */&lt;BR /&gt; cblas_daxpy(g.ny*g.nx, g.dt, w.pdot, 1, w.p, 1); /* advance solution one time step */&lt;BR /&gt;&lt;BR /&gt; /* different calculations arising from the PML absorbing boundary method ... */&lt;BR /&gt;&lt;BR /&gt;} /* end time stepping loop */&lt;BR /&gt;&lt;BR /&gt;} /* end main() */&lt;BR /&gt;&lt;BR /&gt;This was compiled with icc on a multi-core (Xeon 7550) machine. The exact compilation command was&lt;BR /&gt;&lt;BR /&gt;icc myfile.c -O2 -mkl -openmp&lt;BR /&gt;&lt;BR /&gt;The values used were g.nx = g.ny = 481, num_tsteps = 7500.&lt;BR /&gt;&lt;BR /&gt;I used the OpenMP function omp_get_wtime() to measure the wall-clock execution time of this code block and accumulated these times throughout the time stepping loop.&lt;BR /&gt;&lt;BR /&gt;The following (surprising?) results were obtained:&lt;BR /&gt;1) Option 1 accumulated time: around 3 sec.&lt;BR /&gt;2) Option 2 accumulated time: around 90 sec!!&lt;BR /&gt;3) When w.p and w.pdot were replaced with "regular" pointers (which are not fields of a structure), the time of option 2 was around 3 sec (just like the naive loop in option 1).&lt;BR /&gt;&lt;BR /&gt;My questions:&lt;BR /&gt;1) Why the incredibly long running time when the pointer arguments to cblas_daxpy were fields in a structure? They just pass an address to daxpy, don't they? Why doesn't cblas_daxpy regard w.p and w.pdot just as pointers to double (as they are)?&lt;BR /&gt;&lt;BR /&gt;2) When comparing items 1 and 3 in the results, why doesn't cblas_daxpy offer any advantage over the naive loop? Did I omit any flags/options to the compiler?&lt;BR /&gt;&lt;BR /&gt;3) There seem to be a whole lot of things to know about properly running BLAS/CBLAS, mainly about compiler options/flags and makefile issues and their compatibility with a specific machine. Where can all this be learned? Is there some good resource/website/book?&lt;BR /&gt;&lt;BR /&gt;Thanks a lot&lt;BR /&gt;&lt;BR /&gt;Jake</description>
      <pubDate>Thu, 14 Apr 2011 06:14:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/BLAS-performance-vs-naive-c-code/m-p/763593#M75</guid>
      <dc:creator>jakerr</dc:creator>
      <dc:date>2011-04-14T06:14:17Z</dc:date>
    </item>
    <item>
      <title>BLAS performance vs naive c code</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/BLAS-performance-vs-naive-c-code/m-p/763594#M76</link>
      <description>The usual case for daxpy to give a performance advantage is for the reputed 90% of programmers using a compiler without auto-vectorization, or a case of failed vectorization (possibly due to a compiler being concerned about operand overlap). It's not possible for daxpy to out-perform correctly auto-vectorized code, at least not when there is no opportunity to gain by vector parallel execution. Even on the Xeon 7550, to see a significant gain from threading, the structs would have to be much larger and accessed consistently (from first touch) by threads affinitized to the 2 sockets. &lt;BR /&gt;I suppose you'd have to analyze the actual example to try to find out what optimizations are missed with the struct of pointers. As far as I know, it's not a common programming model which would appear in a corpus of code for which the compiler is performance tested. You could experiment by making an explicit local plain pointer copy just ahead of the loop.</description>
      <pubDate>Thu, 14 Apr 2011 13:06:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/BLAS-performance-vs-naive-c-code/m-p/763594#M76</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2011-04-14T13:06:35Z</dc:date>
    </item>
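    <!--
    The experiment suggested in the reply above (copy the struct's pointers into
    plain local pointers just ahead of the loop) might look roughly like the sketch
    below. The struct is a cut-down, hypothetical stand-in for the poster's pmlwaves,
    the function names are made up for illustration, and the restrict qualifiers are
    an added assumption that p and pdot never overlap; none of this is taken from
    the poster's actual program.

```c
#include <stddef.h>

/* Hypothetical stand-in for the poster's struct: only the two fields used here. */
struct pmlwaves {
    double *p;
    double *pdot;
};

/* Update written against the struct fields, as in "option 1" of the question. */
void advance_fields(struct pmlwaves *w, double dt, size_t n)
{
    for (size_t k = 0; k < n; k++)
        w->p[k] = w->p[k] + dt * w->pdot[k];
}

/* Same update, with plain local pointer copies made just ahead of the loop.
   restrict is an assumption: it promises the compiler that p and pdot do not
   overlap, which removes one common reason for failed vectorization. */
void advance_locals(struct pmlwaves *w, double dt, size_t n)
{
    double *restrict p = w->p;
    const double *restrict pdot = w->pdot;

    for (size_t k = 0; k < n; k++)
        p[k] += dt * pdot[k];
}
```

    Both functions compute the same result; timing them side by side (and checking
    the compiler's vectorization report, if available) is one way to see whether
    the struct access is what the compiler trips over.
    -->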
    <item>
      <title>BLAS performance vs naive c code</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/BLAS-performance-vs-naive-c-code/m-p/763595#M77</link>
      <description>&lt;P&gt;Hi Jake,&lt;BR /&gt;&lt;BR /&gt;As you mentioned, one would expect that using pointers or pointers in structs would not make a dramatic performance difference. I verified this using your code. For me, there was no performance difference between pointers and pointers in structs. One thing to pay attention to here is how the pointers are allocated/initialized. Proper alignment of the pointer addresses usually improves performance. You could try using mkl_malloc to allocate aligned memory.&lt;BR /&gt;&lt;BR /&gt;The MKL daxpy binaries are compiled with the optimal compiler flags. Therefore, the compiler flags you use in your program should not make a dramatic difference to MKL daxpy performance.&lt;BR /&gt;&lt;BR /&gt;The Intel Math Kernel Library for Linux* OS User's Guide provides guidelines for using MKL and summarizes factors that affect performance.&lt;BR /&gt;&lt;BR /&gt;Thanks,&lt;BR /&gt;&lt;BR /&gt;Efe&lt;/P&gt;</description>
      <pubDate>Thu, 14 Apr 2011 16:51:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/BLAS-performance-vs-naive-c-code/m-p/763595#M77</guid>
      <dc:creator>Murat_G_Intel</dc:creator>
      <dc:date>2011-04-14T16:51:47Z</dc:date>
    </item>
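    <!--
    The alignment advice in the reply above can be tried without MKL as well. The
    sketch below uses the standard C11 aligned_alloc as a portable stand-in for
    mkl_malloc; the 64-byte alignment and the helper names are assumptions chosen
    for illustration, not values given in the thread.

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate room for n doubles on a 64-byte boundary (alignment is an assumption).
   Release the result with free(). */
double *alloc_aligned_doubles(size_t n)
{
    const size_t alignment = 64;
    size_t bytes = n * sizeof(double);
    /* C11 aligned_alloc expects the size to be a multiple of the alignment,
       so round the request up to the next multiple. */
    size_t padded = (bytes + alignment - 1) / alignment * alignment;

    return aligned_alloc(alignment, padded);
}

/* Return nonzero if ptr sits on the given byte boundary. */
int is_aligned(const void *ptr, size_t alignment)
{
    return ((uintptr_t)ptr % alignment) == 0;
}
```

    With MKL linked in, the corresponding calls would be mkl_malloc(bytes, 64)
    paired with mkl_free. For the grid in the question, n would be
    g.nx * g.ny = 481 * 481 for each of p and pdot.
    -->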
  </channel>
</rss>

