<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Performance issue in vdSqrt (Math Kernel Library 10.3) in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-issue-in-vdSqrt-Math-Kernel-Library-10-3/m-p/826449#M5144</link>
    <description>It looks like vdSqrt does not work correctly in multi-threaded mode. It is significantly slower than a simple loop of sqrt() calls. Moreover, it ignores OMP_NUM_THREADS completely.&lt;BR /&gt;&lt;BR /&gt;I have made the attached test program to reproduce the bug, which had a dramatic impact and was hard to spot in a more complex application. This simple program computes the square root of a vector of 10000 elements 5000 times. To prevent the compiler from removing code, some randomly selected items are added to a variable which is later printed to the standard output.&lt;BR /&gt;&lt;BR /&gt;Three different ways of computing the square root are tested:&lt;BR /&gt;1) vdSqrt&lt;BR /&gt;2) loop of sqrt()&lt;BR /&gt;3) OpenMP parallel loop of sqrt()&lt;BR /&gt;&lt;BR /&gt;Since I'm on a dual-core machine (Core i5 430M), I ran the code with OMP_NUM_THREADS=1 and OMP_NUM_THREADS=2. These measurements were obtained in a VMware virtual machine, but the same behavior was observed on physical machines as well. The issue was reproduced on Debian 5, Ubuntu 8.04 and Ubuntu 10.10.&lt;BR /&gt;&lt;BR /&gt; OMP_NUM_THREADS=1 OMP_NUM_THREADS=2&lt;BR /&gt;1) 0.99 s   0.96 s&lt;BR /&gt;2) 0.33 s   0.33 s&lt;BR /&gt;3) 0.34 s   0.21 s&lt;BR /&gt;&lt;BR /&gt;If I force the code to run on fewer cores by setting processor affinity with taskset, the results are vastly different for vdSqrt:&lt;BR /&gt; taskset -c 0  taskset -c 0,2&lt;BR /&gt;1) 0.33 s   1.19 s&lt;BR /&gt;&lt;BR /&gt;In a more complex program, replacing vdSqrt with a loop of sqrt() yielded a 9x speedup.&lt;BR /&gt;&lt;BR /&gt;Performing a concurrency analysis with VTune gives better insight into the problem. While the third method and other MKL routines such as dgemv and dgemm only spawn a single worker thread with OMP_NUM_THREADS=2, vdSqrt spawns about 20 worker threads. The attached pictures highlight the issue.</description>
    <pubDate>Tue, 11 Jan 2011 08:18:21 GMT</pubDate>
    <dc:creator>msolus</dc:creator>
    <dc:date>2011-01-11T08:18:21Z</dc:date>
    <item>
      <title>Performance issue in vdSqrt (Math Kernel Library 10.3)</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-issue-in-vdSqrt-Math-Kernel-Library-10-3/m-p/826449#M5144</link>
      <description>It looks like vdSqrt does not work correctly in multi-threaded mode. It is significantly slower than a simple loop of sqrt() calls. Moreover, it ignores OMP_NUM_THREADS completely.&lt;BR /&gt;&lt;BR /&gt;I have made the attached test program to reproduce the bug, which had a dramatic impact and was hard to spot in a more complex application. This simple program computes the square root of a vector of 10000 elements 5000 times. To prevent the compiler from removing code, some randomly selected items are added to a variable which is later printed to the standard output.&lt;BR /&gt;&lt;BR /&gt;Three different ways of computing the square root are tested:&lt;BR /&gt;1) vdSqrt&lt;BR /&gt;2) loop of sqrt()&lt;BR /&gt;3) OpenMP parallel loop of sqrt()&lt;BR /&gt;&lt;BR /&gt;Since I'm on a dual-core machine (Core i5 430M), I ran the code with OMP_NUM_THREADS=1 and OMP_NUM_THREADS=2. These measurements were obtained in a VMware virtual machine, but the same behavior was observed on physical machines as well. The issue was reproduced on Debian 5, Ubuntu 8.04 and Ubuntu 10.10.&lt;BR /&gt;&lt;BR /&gt; OMP_NUM_THREADS=1 OMP_NUM_THREADS=2&lt;BR /&gt;1) 0.99 s   0.96 s&lt;BR /&gt;2) 0.33 s   0.33 s&lt;BR /&gt;3) 0.34 s   0.21 s&lt;BR /&gt;&lt;BR /&gt;If I force the code to run on fewer cores by setting processor affinity with taskset, the results are vastly different for vdSqrt:&lt;BR /&gt; taskset -c 0  taskset -c 0,2&lt;BR /&gt;1) 0.33 s   1.19 s&lt;BR /&gt;&lt;BR /&gt;In a more complex program, replacing vdSqrt with a loop of sqrt() yielded a 9x speedup.&lt;BR /&gt;&lt;BR /&gt;Performing a concurrency analysis with VTune gives better insight into the problem. While the third method and other MKL routines such as dgemv and dgemm only spawn a single worker thread with OMP_NUM_THREADS=2, vdSqrt spawns about 20 worker threads. The attached pictures highlight the issue.</description>
      <pubDate>Tue, 11 Jan 2011 08:18:21 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-issue-in-vdSqrt-Math-Kernel-Library-10-3/m-p/826449#M5144</guid>
      <dc:creator>msolus</dc:creator>
      <dc:date>2011-01-11T08:18:21Z</dc:date>
    </item>
    <item>
      <title>Performance issue in vdSqrt (Math Kernel Library 10.3)</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-issue-in-vdSqrt-Math-Kernel-Library-10-3/m-p/826450#M5145</link>
      <description>It seems like the problem size is too small, and that's the reason you don't see much speedup in vdSqrt. Can you increase the size of the vector and check?&lt;BR /&gt;&lt;BR /&gt;--Vipin</description>
      <pubDate>Tue, 11 Jan 2011 09:02:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-issue-in-vdSqrt-Math-Kernel-Library-10-3/m-p/826450#M5145</guid>
      <dc:creator>VipinKumar_E_Intel</dc:creator>
      <dc:date>2011-01-11T09:02:46Z</dc:date>
    </item>
    <item>
      <title>Performance issue in vdSqrt (Math Kernel Library 10.3)</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-issue-in-vdSqrt-Math-Kernel-Library-10-3/m-p/826451#M5146</link>
      <description>This is a known issue with MKL 10.3 and has been fixed in MKL 10.3 update 1.&lt;BR /&gt;Please refer to this KB article for more details.&lt;BR /&gt;&lt;BR /&gt;&lt;A href="http://software.intel.com/en-us/articles/vml-routines-shows-a-noticeable-decrease-in-the-rate-of-this-function-in-case-of-multi-threading" target="_blank"&gt;http://software.intel.com/en-us/articles/vml-routines-shows-a-noticeable-decrease-in-the-rate-of-this-function-in-case-of-multi-threading&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;--Vipin</description>
      <pubDate>Tue, 11 Jan 2011 09:17:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-issue-in-vdSqrt-Math-Kernel-Library-10-3/m-p/826451#M5146</guid>
      <dc:creator>VipinKumar_E_Intel</dc:creator>
      <dc:date>2011-01-11T09:17:22Z</dc:date>
    </item>
  </channel>
</rss>

