<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic VML does not use all available threads in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-does-not-use-all-available-threads/m-p/912897#M12290</link>
    <description>Hi,&lt;BR /&gt;&lt;BR /&gt;I wanted to compare the speed of the VML function vzExp with the standard complex&amp;lt;double&amp;gt; exponential function from libm.&lt;BR /&gt;&lt;BR /&gt;I wrote a simple timing program that compares them for different array lengths and for different numbers of threads.&lt;BR /&gt;&lt;BR /&gt;I ran it on our cluster, which has two dual-core Xeon processors per node, so each node has 4 cores and can run 4 threads.&lt;BR /&gt;&lt;BR /&gt;For some strange reason, VML uses only 2 of the 4 threads, even for arrays of 1,000,000 elements. What could be the reason for this?&lt;BR /&gt;&lt;BR /&gt;Could it be because the master node of our cluster has only one Xeon processor, and I installed MKL on that machine in the /opt directory and later shared that directory with the nodes?&lt;BR /&gt;&lt;BR /&gt;The program is below.&lt;BR /&gt;&lt;BR /&gt;This is how I compiled it:&lt;BR /&gt;icpc -o timing -O3 -openmp timing.cpp -lvml&lt;BR /&gt;&lt;BR /&gt;This is how I ran it:&lt;BR /&gt;export OMP_NUM_THREADS=4&lt;BR /&gt;./timing&lt;BR /&gt;&lt;BR /&gt;You can clearly see where VML starts to use the second thread, but it never uses the other two.&lt;BR /&gt;&lt;BR /&gt;Thanks in advance,&lt;BR /&gt;klaas&lt;BR /&gt;&lt;BR /&gt;&lt;PRE&gt;#include &amp;lt;iostream&amp;gt;&lt;BR /&gt;#include &amp;lt;complex&amp;gt;&lt;BR /&gt;#include &amp;lt;cstdlib&amp;gt;&lt;BR /&gt;#include &amp;lt;omp.h&amp;gt;&lt;BR /&gt;&lt;BR /&gt;#include &amp;lt;mkl.h&amp;gt;&lt;BR /&gt;&lt;BR /&gt;int main(void) {&lt;BR /&gt;&lt;BR /&gt;  const unsigned long N = 1000000;&lt;BR /&gt;  unsigned long M;&lt;BR /&gt;&lt;BR /&gt;  // input and output arrays of N complex doubles&lt;BR /&gt;  std::complex&amp;lt;double&amp;gt; *c = new std::complex&amp;lt;double&amp;gt;[N];&lt;BR /&gt;  std::complex&amp;lt;double&amp;gt; *z = new std::complex&amp;lt;double&amp;gt;[N];&lt;BR /&gt;&lt;BR /&gt;  double t1, t2;&lt;BR /&gt;  double T;&lt;BR /&gt;&lt;BR /&gt;  // fill the input with random values in [0,1] x [0,1]&lt;BR /&gt;  for (unsigned long k = 0; k &amp;lt; N; ++k)&lt;BR /&gt;    c[k] = std::complex&amp;lt;double&amp;gt;(rand()/double(RAND_MAX), rand()/double(RAND_MAX));&lt;BR /&gt;&lt;BR /&gt;  unsigned long delta = 1;&lt;BR /&gt;&lt;BR /&gt;  // array lengths k = 1..9, 10..90, 100..900, ..., N&lt;BR /&gt;  for (unsigned long k = 1; k &amp;lt;= N; k += delta) {&lt;BR /&gt;    if (k == 10*delta) delta *= 10;&lt;BR /&gt;    M = N*10/k;  // repetitions, so every length does roughly 10*N element evaluations&lt;BR /&gt;&lt;BR /&gt;    // the libm version (serial)&lt;BR /&gt;    t1 = omp_get_wtime();&lt;BR /&gt;    for (unsigned long i = 0; i &amp;lt; M; ++i)&lt;BR /&gt;      for (unsigned long l = 0; l &amp;lt; k; ++l)&lt;BR /&gt;        z[l] = std::exp(c[l]);&lt;BR /&gt;    t2 = omp_get_wtime();&lt;BR /&gt;&lt;BR /&gt;    T = (t2 - t1)/M;  // time per call; T/k is the time per element&lt;BR /&gt;&lt;BR /&gt;    std::cout &amp;lt;&amp;lt; k &amp;lt;&amp;lt; "\t" &amp;lt;&amp;lt; T &amp;lt;&amp;lt; "\t" &amp;lt;&amp;lt; T/k &amp;lt;&amp;lt; "\t";&lt;BR /&gt;&lt;BR /&gt;    for (int omp = 1; omp &amp;lt;= 4; omp *= 2) {&lt;BR /&gt;      omp_set_num_threads(omp);&lt;BR /&gt;&lt;BR /&gt;      // the libm version, parallelized with OpenMP&lt;BR /&gt;      t1 = omp_get_wtime();&lt;BR /&gt;      for (unsigned long i = 0; i &amp;lt; M; ++i)&lt;BR /&gt;#pragma omp parallel for&lt;BR /&gt;        for (long l = 0; l &amp;lt; (long) k; ++l)&lt;BR /&gt;          z[l] = std::exp(c[l]);&lt;BR /&gt;      t2 = omp_get_wtime();&lt;BR /&gt;      T = (t2 - t1)/M;&lt;BR /&gt;&lt;BR /&gt;      std::cout &amp;lt;&amp;lt; T &amp;lt;&amp;lt; "\t" &amp;lt;&amp;lt; T/k &amp;lt;&amp;lt; "\t";&lt;BR /&gt;&lt;BR /&gt;      // the MKL version, high accuracy&lt;BR /&gt;      vmlSetMode(VML_HA);&lt;BR /&gt;      t1 = omp_get_wtime();&lt;BR /&gt;      for (unsigned long i = 0; i &amp;lt; M; ++i)&lt;BR /&gt;        vzExp(k, (MKL_Complex16 *) c, (MKL_Complex16 *) z);&lt;BR /&gt;      t2 = omp_get_wtime();&lt;BR /&gt;      T = (t2 - t1)/M;&lt;BR /&gt;&lt;BR /&gt;      std::cout &amp;lt;&amp;lt; T &amp;lt;&amp;lt; "\t" &amp;lt;&amp;lt; T/k &amp;lt;&amp;lt; "\t";&lt;BR /&gt;&lt;BR /&gt;      // the MKL version, low accuracy&lt;BR /&gt;      vmlSetMode(VML_LA);&lt;BR /&gt;      t1 = omp_get_wtime();&lt;BR /&gt;      for (unsigned long i = 0; i &amp;lt; M; ++i)&lt;BR /&gt;        vzExp(k, (MKL_Complex16 *) c, (MKL_Complex16 *) z);&lt;BR /&gt;      t2 = omp_get_wtime();&lt;BR /&gt;      T = (t2 - t1)/M;&lt;BR /&gt;&lt;BR /&gt;      std::cout &amp;lt;&amp;lt; T &amp;lt;&amp;lt; "\t" &amp;lt;&amp;lt; T/k &amp;lt;&amp;lt; "\t";&lt;BR /&gt;    }&lt;BR /&gt;&lt;BR /&gt;    std::cout &amp;lt;&amp;lt; std::endl;&lt;BR /&gt;  }&lt;BR /&gt;&lt;BR /&gt;  delete[] c;&lt;BR /&gt;  delete[] z;&lt;BR /&gt;  return 0;&lt;BR /&gt;}&lt;BR /&gt;&lt;/PRE&gt;</description>
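<!--
Here is a minimal sketch of the measurement used in the timing program, assuming only standard C++. The helper `seconds_per_call` is hypothetical and is not part of MKL or OpenMP. It runs a callable M times and returns T = (t2 - t1)/M, the average time per call. Dividing that by the array length k gives the per-element time T/k that the program prints. The sketch uses `std::chrono::steady_clock` in place of `omp_get_wtime()` so that it compiles without OpenMP.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>

// Hypothetical helper (not an MKL or OpenMP API): calls `body` M times and
// returns the average wall-clock time per call in seconds, T = (t2 - t1) / M.
// Dividing the result by the array length k gives the per-element time T / k.
double seconds_per_call(std::size_t M, const std::function<void()>& body) {
    assert(M > 0 && "need at least one repetition");
    const auto t1 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < M; ++i) {
        body();
    }
    const auto t2 = std::chrono::steady_clock::now();
    // duration<double> converts the elapsed ticks to seconds as a double
    return std::chrono::duration<double>(t2 - t1).count() / static_cast<double>(M);
}
```

Repeating the call M times before dividing spreads the timer overhead across many calls, which matters for the short arrays where a single call takes only microseconds.
-->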
    <pubDate>Wed, 15 Aug 2007 14:45:22 GMT</pubDate>
    <dc:creator>kvtournh1</dc:creator>
    <dc:date>2007-08-15T14:45:22Z</dc:date>
    <item>
      <title>VML does not use all available threads</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-does-not-use-all-available-threads/m-p/912897#M12290</link>
      <description>Hi,&lt;BR /&gt;&lt;BR /&gt;I wanted to compare the speed of the VML function vzExp with the standard complex&amp;lt;double&amp;gt; exponential function from libm.&lt;BR /&gt;&lt;BR /&gt;I wrote a simple timing program that compares them for different array lengths and for different numbers of threads.&lt;BR /&gt;&lt;BR /&gt;I ran it on our cluster, which has two dual-core Xeon processors per node, so each node has 4 cores and can run 4 threads.&lt;BR /&gt;&lt;BR /&gt;For some strange reason, VML uses only 2 of the 4 threads, even for arrays of 1,000,000 elements. What could be the reason for this?&lt;BR /&gt;&lt;BR /&gt;Could it be because the master node of our cluster has only one Xeon processor, and I installed MKL on that machine in the /opt directory and later shared that directory with the nodes?&lt;BR /&gt;&lt;BR /&gt;The program is below.&lt;BR /&gt;&lt;BR /&gt;This is how I compiled it:&lt;BR /&gt;icpc -o timing -O3 -openmp timing.cpp -lvml&lt;BR /&gt;&lt;BR /&gt;This is how I ran it:&lt;BR /&gt;export OMP_NUM_THREADS=4&lt;BR /&gt;./timing&lt;BR /&gt;&lt;BR /&gt;You can clearly see where VML starts to use the second thread, but it never uses the other two.&lt;BR /&gt;&lt;BR /&gt;Thanks in advance,&lt;BR /&gt;klaas&lt;BR /&gt;&lt;BR /&gt;&lt;PRE&gt;#include &amp;lt;iostream&amp;gt;&lt;BR /&gt;#include &amp;lt;complex&amp;gt;&lt;BR /&gt;#include &amp;lt;cstdlib&amp;gt;&lt;BR /&gt;#include &amp;lt;omp.h&amp;gt;&lt;BR /&gt;&lt;BR /&gt;#include &amp;lt;mkl.h&amp;gt;&lt;BR /&gt;&lt;BR /&gt;int main(void) {&lt;BR /&gt;&lt;BR /&gt;  const unsigned long N = 1000000;&lt;BR /&gt;  unsigned long M;&lt;BR /&gt;&lt;BR /&gt;  // input and output arrays of N complex doubles&lt;BR /&gt;  std::complex&amp;lt;double&amp;gt; *c = new std::complex&amp;lt;double&amp;gt;[N];&lt;BR /&gt;  std::complex&amp;lt;double&amp;gt; *z = new std::complex&amp;lt;double&amp;gt;[N];&lt;BR /&gt;&lt;BR /&gt;  double t1, t2;&lt;BR /&gt;  double T;&lt;BR /&gt;&lt;BR /&gt;  // fill the input with random values in [0,1] x [0,1]&lt;BR /&gt;  for (unsigned long k = 0; k &amp;lt; N; ++k)&lt;BR /&gt;    c[k] = std::complex&amp;lt;double&amp;gt;(rand()/double(RAND_MAX), rand()/double(RAND_MAX));&lt;BR /&gt;&lt;BR /&gt;  unsigned long delta = 1;&lt;BR /&gt;&lt;BR /&gt;  // array lengths k = 1..9, 10..90, 100..900, ..., N&lt;BR /&gt;  for (unsigned long k = 1; k &amp;lt;= N; k += delta) {&lt;BR /&gt;    if (k == 10*delta) delta *= 10;&lt;BR /&gt;    M = N*10/k;  // repetitions, so every length does roughly 10*N element evaluations&lt;BR /&gt;&lt;BR /&gt;    // the libm version (serial)&lt;BR /&gt;    t1 = omp_get_wtime();&lt;BR /&gt;    for (unsigned long i = 0; i &amp;lt; M; ++i)&lt;BR /&gt;      for (unsigned long l = 0; l &amp;lt; k; ++l)&lt;BR /&gt;        z[l] = std::exp(c[l]);&lt;BR /&gt;    t2 = omp_get_wtime();&lt;BR /&gt;&lt;BR /&gt;    T = (t2 - t1)/M;  // time per call; T/k is the time per element&lt;BR /&gt;&lt;BR /&gt;    std::cout &amp;lt;&amp;lt; k &amp;lt;&amp;lt; "\t" &amp;lt;&amp;lt; T &amp;lt;&amp;lt; "\t" &amp;lt;&amp;lt; T/k &amp;lt;&amp;lt; "\t";&lt;BR /&gt;&lt;BR /&gt;    for (int omp = 1; omp &amp;lt;= 4; omp *= 2) {&lt;BR /&gt;      omp_set_num_threads(omp);&lt;BR /&gt;&lt;BR /&gt;      // the libm version, parallelized with OpenMP&lt;BR /&gt;      t1 = omp_get_wtime();&lt;BR /&gt;      for (unsigned long i = 0; i &amp;lt; M; ++i)&lt;BR /&gt;#pragma omp parallel for&lt;BR /&gt;        for (long l = 0; l &amp;lt; (long) k; ++l)&lt;BR /&gt;          z[l] = std::exp(c[l]);&lt;BR /&gt;      t2 = omp_get_wtime();&lt;BR /&gt;      T = (t2 - t1)/M;&lt;BR /&gt;&lt;BR /&gt;      std::cout &amp;lt;&amp;lt; T &amp;lt;&amp;lt; "\t" &amp;lt;&amp;lt; T/k &amp;lt;&amp;lt; "\t";&lt;BR /&gt;&lt;BR /&gt;      // the MKL version, high accuracy&lt;BR /&gt;      vmlSetMode(VML_HA);&lt;BR /&gt;      t1 = omp_get_wtime();&lt;BR /&gt;      for (unsigned long i = 0; i &amp;lt; M; ++i)&lt;BR /&gt;        vzExp(k, (MKL_Complex16 *) c, (MKL_Complex16 *) z);&lt;BR /&gt;      t2 = omp_get_wtime();&lt;BR /&gt;      T = (t2 - t1)/M;&lt;BR /&gt;&lt;BR /&gt;      std::cout &amp;lt;&amp;lt; T &amp;lt;&amp;lt; "\t" &amp;lt;&amp;lt; T/k &amp;lt;&amp;lt; "\t";&lt;BR /&gt;&lt;BR /&gt;      // the MKL version, low accuracy&lt;BR /&gt;      vmlSetMode(VML_LA);&lt;BR /&gt;      t1 = omp_get_wtime();&lt;BR /&gt;      for (unsigned long i = 0; i &amp;lt; M; ++i)&lt;BR /&gt;        vzExp(k, (MKL_Complex16 *) c, (MKL_Complex16 *) z);&lt;BR /&gt;      t2 = omp_get_wtime();&lt;BR /&gt;      T = (t2 - t1)/M;&lt;BR /&gt;&lt;BR /&gt;      std::cout &amp;lt;&amp;lt; T &amp;lt;&amp;lt; "\t" &amp;lt;&amp;lt; T/k &amp;lt;&amp;lt; "\t";&lt;BR /&gt;    }&lt;BR /&gt;&lt;BR /&gt;    std::cout &amp;lt;&amp;lt; std::endl;&lt;BR /&gt;  }&lt;BR /&gt;&lt;BR /&gt;  delete[] c;&lt;BR /&gt;  delete[] z;&lt;BR /&gt;  return 0;&lt;BR /&gt;}&lt;BR /&gt;&lt;/PRE&gt;</description>
      <pubDate>Wed, 15 Aug 2007 14:45:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-does-not-use-all-available-threads/m-p/912897#M12290</guid>
      <dc:creator>kvtournh1</dc:creator>
      <dc:date>2007-08-15T14:45:22Z</dc:date>
    </item>
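<!--
To illustrate what "using all 4 threads" means for an elementwise function like vzExp, here is a minimal sketch that uses plain C++11 std::thread rather than OpenMP or any MKL internals. The function `parallel_complex_exp` is hypothetical. It splits the index range into one contiguous chunk per thread, so with 4 threads each core evaluates about n/4 exponentials. This is not how VML is implemented internally. It only shows the kind of work division a correctly threaded library would be expected to perform.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (not an MKL API): computes y[i] = exp(a[i]) for i in [0, n)
// by splitting the index range into contiguous chunks, one std::thread per chunk.
void parallel_complex_exp(std::size_t n, const std::complex<double>* a,
                          std::complex<double>* y, unsigned num_threads) {
    assert(num_threads > 0 && "need at least one thread");
    // ceiling division, so num_threads chunks always cover all n elements
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = t * chunk;
        if (begin >= n) break;  // fewer elements than threads: stop early
        const std::size_t end = (begin + chunk < n) ? begin + chunk : n;
        // elements are independent, so each thread writes only its own slice of y
        workers.emplace_back([a, y, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                y[i] = std::exp(a[i]);
            }
        });
    }
    for (std::thread& w : workers) {
        w.join();  // wait for every chunk before returning
    }
}
```

Each element is independent, so the chunks need no synchronization beyond the final join(). The result matches a serial loop over the same array.
-->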
    <item>
      <title>Re: VML does not use all available threads</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-does-not-use-all-available-threads/m-p/912898#M12291</link>
      <description>&lt;P&gt;VML threading in MKL 9 (the version you are presumably using) contains a bug; the fix will be included in the next release of the library.&lt;/P&gt;&lt;P&gt;Thanks, Andrey&lt;/P&gt;</description>
      <pubDate>Thu, 16 Aug 2007 06:37:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-does-not-use-all-available-threads/m-p/912898#M12291</guid>
      <dc:creator>Andrey_N_Intel</dc:creator>
      <dc:date>2007-08-16T06:37:09Z</dc:date>
    </item>
    <item>
      <title>Re: VML does not use all available threads</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-does-not-use-all-available-threads/m-p/912899#M12292</link>
      <description>Thanks for the response.&lt;BR /&gt;When do you think we can expect this bug fix?&lt;BR /&gt;</description>
      <pubDate>Sat, 18 Aug 2007 13:59:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-does-not-use-all-available-threads/m-p/912899#M12292</guid>
      <dc:creator>kvtournh1</dc:creator>
      <dc:date>2007-08-18T13:59:07Z</dc:date>
    </item>
    <item>
      <title>Re: VML does not use all available threads</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-does-not-use-all-available-threads/m-p/912900#M12293</link>
      <description>This fix will be available starting with MKL 9.1.1 and in the MKL 10.0 beta.</description>
      <pubDate>Mon, 20 Aug 2007 08:40:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/VML-does-not-use-all-available-threads/m-p/912900#M12293</guid>
      <dc:creator>Andrey_G_Intel2</dc:creator>
      <dc:date>2007-08-20T08:40:53Z</dc:date>
    </item>
  </channel>
</rss>

