<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic In addition to the in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/cblas-dnrm2-much-slower-than-cblas-ddot/m-p/1022226#M19738</link>
    <description>&lt;P&gt;In addition to the implementation difference that mecej4 mentioned, dnrm2 is currently not threaded in MKL, whereas ddot is multithreaded. We are looking into improving dnrm2 serial and multithreaded performance for upcoming MKL releases.&lt;/P&gt;</description>
    <pubDate>Mon, 27 Jul 2015 18:08:26 GMT</pubDate>
    <dc:creator>Murat_G_Intel</dc:creator>
    <dc:date>2015-07-27T18:08:26Z</dc:date>
    <item>
      <title>cblas_dnrm2 much slower than cblas_ddot</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/cblas-dnrm2-much-slower-than-cblas-ddot/m-p/1022221#M19733</link>
      <description>&lt;P&gt;Dear all,&lt;/P&gt;

&lt;P&gt;I ran benchmarks on a Sandy Bridge Intel processor (E5-4620) using Intel MKL 11.1 and found that cblas_dnrm2 is significantly slower (3.4 s) than the corresponding cblas_ddot call (0.5 s) using one thread. This is very surprising to me, because computing the 2-norm with cblas_ddot is faster (0.3 s) than calling cblas_dnrm2.&lt;/P&gt;

&lt;P&gt;I compiled with gcc-4.8.3 using the following flags:&lt;/P&gt;

&lt;P&gt;CXXFLAGS += -O3 -I${MKLROOT}/include&lt;/P&gt;

&lt;P&gt;LDLIBS += -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${INTEL_CC_HOME}/compiler/lib/intel64/libiomp5.a -Wl,--end-group -ldl -lpthread -lm&lt;/P&gt;

&lt;P&gt;The files are attached. Is there any known issue for nrm2?&lt;/P&gt;

&lt;P&gt;Best regards,&lt;/P&gt;

&lt;P&gt;Bernd&lt;/P&gt;</description>
      <pubDate>Wed, 22 Jul 2015 12:39:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/cblas-dnrm2-much-slower-than-cblas-ddot/m-p/1022221#M19733</guid>
      <dc:creator>Bernd_Doser</dc:creator>
      <dc:date>2015-07-22T12:39:50Z</dc:date>
    </item>
    <item>
      <title>Your conclusions hold as long</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/cblas-dnrm2-much-slower-than-cblas-ddot/m-p/1022222#M19734</link>
      <description>&lt;P&gt;Your conclusions hold as long as the vector does not simultaneously contain numbers of O(1) and very large/small numbers (I have not looked at your code, just your comments). DNRM2 is designed to handle those pathological cases and the extra code needed to detect and manage such cases slows down DNRM2. Consider, for example, that if the result value from DNRM2 is 1D200 the value returned by DDOT would be +INF, and if the result from DNRM2 is 1D-200 the value returned by DDOT would be 0.&lt;/P&gt;

&lt;P&gt;The comparison is between "robust but slow" and "fast but only correct most of the time".&lt;/P&gt;</description>
      <pubDate>Wed, 22 Jul 2015 14:49:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/cblas-dnrm2-much-slower-than-cblas-ddot/m-p/1022222#M19734</guid>
      <dc:creator>mecej4</dc:creator>
      <dc:date>2015-07-22T14:49:00Z</dc:date>
    </item>
    <item>
      <title>Dear mecej4,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/cblas-dnrm2-much-slower-than-cblas-ddot/m-p/1022223#M19735</link>
      <description>&lt;P&gt;Dear mecej4,&lt;/P&gt;

&lt;P&gt;Thank you very much for your answer. That would mean that nrm2 is more than 10 times slower than dot in order to handle pathological cases. Splitting the routines into a robust algorithm and a fast one does not sound like a good solution to me, and I have found nothing about this in the documentation.&lt;/P&gt;

&lt;P&gt;Thanks again and best regards,&lt;/P&gt;

&lt;P&gt;Bernd&lt;/P&gt;</description>
      <pubDate>Thu, 23 Jul 2015 07:39:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/cblas-dnrm2-much-slower-than-cblas-ddot/m-p/1022223#M19735</guid>
      <dc:creator>Bernd_Doser</dc:creator>
      <dc:date>2015-07-23T07:39:34Z</dc:date>
    </item>
    <item>
      <title>You can see the details in</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/cblas-dnrm2-much-slower-than-cblas-ddot/m-p/1022224#M19736</link>
      <description>&lt;P&gt;You can see the details in the source code for DNRM2: http://www.netlib.org/lapack/explore-3.1.1-html/dnrm2.f.html or the f2c translation result, &lt;A href="http://www.netlib.org/clapack/cblas/dnrm2.c" target="_blank"&gt;http://www.netlib.org/clapack/cblas/dnrm2.c&lt;/A&gt;. Note that processing each non-zero element of the vector requires two comparisons and one division. In other words, each such element has to be scaled before multiplication and accumulation, and the scale factor may itself need to be updated.&lt;/P&gt;
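
&lt;P&gt;As a rough illustration of that per-element work, here is a Python sketch of the one-pass scaled sum of squares, written to follow the loop in the reference Fortran source linked above. The function name nrm2_one_pass is hypothetical, and this says nothing about how MKL implements the routine internally.&lt;/P&gt;

&lt;PRE&gt;
```python
import math

def nrm2_one_pass(x):
    # Invariant after each element: scale**2 * ssq equals the sum of
    # squares of the elements seen so far (hypothetical helper name)
    scale = 0.0
    ssq = 1.0
    for v in x:
        if v != 0.0:              # comparison 1: zeros are skipped
            a = abs(v)
            if a > scale:         # comparison 2: new largest magnitude
                r = scale / a     # the one division for this element
                ssq = 1.0 + ssq * r * r   # rescale the running sum
                scale = a
            else:
                r = a / scale     # the one division for this element
                ssq += r * r
    return scale * math.sqrt(ssq)

print(nrm2_one_pass([3.0, 4.0]))
print(nrm2_one_pass([1e200, 1e200]))
```
&lt;/PRE&gt;

&lt;P&gt;The branch and the division sit on the critical path of every element, which is why this loop is so much harder to make fast than the plain multiply-and-add in a dot product.&lt;/P&gt;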

&lt;P&gt;Fortran 2008 provides a new intrinsic function, NORM2.&lt;/P&gt;</description>
      <pubDate>Thu, 23 Jul 2015 11:29:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/cblas-dnrm2-much-slower-than-cblas-ddot/m-p/1022224#M19736</guid>
      <dc:creator>mecej4</dc:creator>
      <dc:date>2015-07-23T11:29:00Z</dc:date>
    </item>
    <item>
      <title>Ok! I see it. Looking into</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/cblas-dnrm2-much-slower-than-cblas-ddot/m-p/1022225#M19737</link>
      <description>&lt;P&gt;OK, I see it now. Looking into the source code is always a good idea ;)&lt;/P&gt;

&lt;P&gt;Thanks again very much. Your answer was very helpful.&lt;/P&gt;</description>
      <pubDate>Thu, 23 Jul 2015 11:35:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/cblas-dnrm2-much-slower-than-cblas-ddot/m-p/1022225#M19737</guid>
      <dc:creator>Bernd_Doser</dc:creator>
      <dc:date>2015-07-23T11:35:32Z</dc:date>
    </item>
    <item>
      <title>In addition to the</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/cblas-dnrm2-much-slower-than-cblas-ddot/m-p/1022226#M19738</link>
      <description>&lt;P&gt;In addition to the implementation difference that mecej4 mentioned, dnrm2 is currently not threaded in MKL, whereas ddot is multithreaded. We are looking into improving dnrm2 serial and multithreaded performance for upcoming MKL releases.&lt;/P&gt;</description>
      <pubDate>Mon, 27 Jul 2015 18:08:26 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/cblas-dnrm2-much-slower-than-cblas-ddot/m-p/1022226#M19738</guid>
      <dc:creator>Murat_G_Intel</dc:creator>
      <dc:date>2015-07-27T18:08:26Z</dc:date>
    </item>
  </channel>
</rss>

