<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic How does l_mklb work when running across a cluster? in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-does-l-mklb-work-when-running-across-a-cluster/m-p/1143290#M26459</link>
    <description>&lt;P&gt;Hi all,&lt;/P&gt;

&lt;P&gt;I'm currently trying to run l_mklb across a 110-node cluster, but I don't understand the best syntax to launch it with.&lt;/P&gt;

&lt;P&gt;Relevant items:&lt;/P&gt;

&lt;P&gt;P=20, Q=22, NB=192, N=1237056&lt;/P&gt;

&lt;P&gt;Inside the runme_intel64_static I set:&lt;/P&gt;

&lt;P&gt;export MPI_PROC_NUM=440&lt;/P&gt;

&lt;P&gt;export MPI_PER_NODE=4&lt;/P&gt;

&lt;P&gt;#mpirun -perhost ${MPI_PER_NODE} -np ${MPI_PROC_NUM} ./runme_intel64_prv "$@" | tee -a $OUT &amp;lt;-- This was the original command&lt;/P&gt;

&lt;P&gt;mpirun -np ${MPI_PROC_NUM} -machinefile hostlist /mnt/shared/benchmarks/runme_intel64_prv "$@" | tee -a $OUT&lt;/P&gt;

&lt;P&gt;Right now, on a 110-node cluster with 128 GB of RAM per node and Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz processors, I'm seeing initial numbers of around 150 TFlops.&lt;/P&gt;

&lt;P&gt;I would expect to see more, so my question is:&lt;/P&gt;

&lt;P&gt;What are the best settings for runme_intel64_static?&lt;/P&gt;

&lt;P&gt;On a normal HPL run I'd set the number of processes to the actual number of cores in the system, but if I do that with runme_intel64_static I heavily oversubscribe the nodes and performance drops through the floor.&lt;/P&gt;

&lt;P&gt;If someone can explain what each variable does inside the script so I can work out how to saturate the cluster efficiently, that would be great.&lt;/P&gt;</description>
    <pubDate>Sun, 09 Jul 2017 17:06:05 GMT</pubDate>
    <dc:creator>Chris_C_2</dc:creator>
    <dc:date>2017-07-09T17:06:05Z</dc:date>
    <item>
      <title>How does l_mklb work when running across a cluster?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-does-l-mklb-work-when-running-across-a-cluster/m-p/1143290#M26459</link>
      <description>&lt;P&gt;Hi all,&lt;/P&gt;

&lt;P&gt;I'm currently trying to run l_mklb across a 110-node cluster, but I don't understand the best syntax to launch it with.&lt;/P&gt;

&lt;P&gt;Relevant items:&lt;/P&gt;

&lt;P&gt;P=20, Q=22, NB=192, N=1237056&lt;/P&gt;
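
&lt;P&gt;For reference, a common rule of thumb sizes N so that the N x N matrix of 8-byte doubles fills roughly 80% of total memory, rounded down to a multiple of NB. A minimal sketch of that arithmetic follows. The helper name hpl_problem_size and the 0.80 fraction are illustrative assumptions, not part of l_mklb, and 128GB per node is treated as 128 GiB.&lt;/P&gt;

```shell
#!/bin/sh
# Rule-of-thumb HPL problem size (hypothetical helper, not part of l_mklb).
# Arguments: node count, GiB of RAM per node, block size NB, memory fraction.
hpl_problem_size() {
    awk -v nodes="$1" -v gib="$2" -v nb="$3" -v frac="$4" 'BEGIN {
        bytes = nodes * gib * 1024 * 1024 * 1024   # total cluster memory in bytes
        n = int(sqrt(frac * bytes / 8))            # largest N whose matrix of doubles fits
        printf "%d\n", int(n / nb) * nb            # round down to a multiple of NB
    }'
}

# 110 nodes, 128 GiB each, NB=192, 80% of memory (the fraction is an assumption)
hpl_problem_size 110 128 192 0.80
```

&lt;P&gt;With these inputs the result lands slightly below the 1237056 used here, so the chosen N looks to be in the usual range.&lt;/P&gt;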

&lt;P&gt;Inside the runme_intel64_static I set:&lt;/P&gt;

&lt;P&gt;export MPI_PROC_NUM=440&lt;/P&gt;

&lt;P&gt;export MPI_PER_NODE=4&lt;/P&gt;

&lt;P&gt;#mpirun -perhost ${MPI_PER_NODE} -np ${MPI_PROC_NUM} ./runme_intel64_prv "$@" | tee -a $OUT &amp;lt;-- This was the original command&lt;/P&gt;

&lt;P&gt;mpirun -np ${MPI_PROC_NUM} -machinefile hostlist /mnt/shared/benchmarks/runme_intel64_prv "$@" | tee -a $OUT&lt;/P&gt;

&lt;P&gt;Right now, on a 110-node cluster with 128 GB of RAM per node and Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz processors, I'm seeing initial numbers of around 150 TFlops.&lt;/P&gt;

&lt;P&gt;I would expect to see more, so my question is:&lt;/P&gt;

&lt;P&gt;What are the best settings for runme_intel64_static?&lt;/P&gt;

&lt;P&gt;On a normal HPL run I'd set the number of processes to the actual number of cores in the system, but if I do that with runme_intel64_static I heavily oversubscribe the nodes and performance drops through the floor.&lt;/P&gt;
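
&lt;P&gt;To make the oversubscription concrete: if each MPI rank also runs several threads, then ranks per node times threads per rank should not exceed the cores on a node. A small sketch of that arithmetic follows. CORES_PER_NODE=44 assumes two 22-core sockets per node, which is my assumption about this hardware, and whether the benchmark binary honors OMP_NUM_THREADS is also an assumption.&lt;/P&gt;

```shell
#!/bin/sh
# Threads each MPI rank can use so that ranks x threads matches the core count.
# CORES_PER_NODE=44 is an assumption (two 22-core E5-2699 v4 sockets per node).
CORES_PER_NODE=44
MPI_PER_NODE=4

# Hypothetical helper: cores on a node divided evenly among its ranks.
threads_per_rank() {
    echo $(( $1 / $2 ))
}

OMP_NUM_THREADS=$(threads_per_rank "$CORES_PER_NODE" "$MPI_PER_NODE")
export OMP_NUM_THREADS
echo "ranks per node: $MPI_PER_NODE, threads per rank: $OMP_NUM_THREADS"
```

&lt;P&gt;Under those assumptions, one rank per core would leave no cores for the extra threads, which would be consistent with the slowdown described above.&lt;/P&gt;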

&lt;P&gt;If someone can explain what each variable does inside the script so I can work out how to saturate the cluster efficiently, that would be great.&lt;/P&gt;</description>
      <pubDate>Sun, 09 Jul 2017 17:06:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-does-l-mklb-work-when-running-across-a-cluster/m-p/1143290#M26459</guid>
      <dc:creator>Chris_C_2</dc:creator>
      <dc:date>2017-07-09T17:06:05Z</dc:date>
    </item>
    <item>
      <title>Hi Chris,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-does-l-mklb-work-when-running-across-a-cluster/m-p/1143291#M26460</link>
      <description>&lt;P&gt;Hi Chris,&lt;/P&gt;

&lt;P&gt;How about running only one MPI rank per node and letting OpenMP use its default number of threads within each node?&lt;/P&gt;

&lt;P&gt;export MPI_PROC_NUM=(the number of physical servers, which must equal P x Q; probably 110 here)&lt;/P&gt;

&lt;P&gt;export MPI_PER_NODE=1&lt;/P&gt;

&lt;P&gt;Other configuration details are covered in:&lt;/P&gt;

&lt;P&gt;&lt;A href="https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/605789"&gt;https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/605789&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;&lt;A href="https://software.intel.com/en-us/articles/performance-tools-for-software-developers-hpl-application-note"&gt;https://software.intel.com/en-us/articles/performance-tools-for-software-developers-hpl-application-note&lt;/A&gt;&lt;/P&gt;
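
&lt;P&gt;As a rough sketch of the one-rank-per-node setup described above: with 110 ranks, P and Q have to be re-chosen so that P x Q = 110. The helper best_grid below is hypothetical (not part of l_mklb) and simply picks the most nearly square factor pair with P no larger than Q, which is a common HPL guideline rather than a guarantee of best performance.&lt;/P&gt;

```shell
#!/bin/sh
# Hypothetical helper: most nearly square P x Q grid with P no larger than Q.
best_grid() {
    ranks=$1
    p=1
    best=1
    # Try every candidate P up to sqrt(ranks) and keep the largest divisor found.
    while [ $(( p * p )) -le "$ranks" ]; do
        if [ $(( ranks % p )) -eq 0 ]; then
            best=$p
        fi
        p=$(( p + 1 ))
    done
    echo "$best $(( ranks / best ))"
}

# One MPI rank per node on 110 nodes, as suggested in this reply.
MPI_PER_NODE=1
MPI_PROC_NUM=110
export MPI_PER_NODE MPI_PROC_NUM
echo "P Q for $MPI_PROC_NUM ranks: $(best_grid "$MPI_PROC_NUM")"

# The launch line from the original script would then be used unchanged:
# mpirun -perhost ${MPI_PER_NODE} -np ${MPI_PROC_NUM} ./runme_intel64_prv "$@" | tee -a $OUT
```

&lt;P&gt;For 440 ranks the same helper gives the 20 x 22 grid from the original post.&lt;/P&gt;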

&lt;P&gt;Best Regards,&lt;/P&gt;

&lt;P&gt;Ying&lt;/P&gt;</description>
      <pubDate>Tue, 11 Jul 2017 08:20:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-does-l-mklb-work-when-running-across-a-cluster/m-p/1143291#M26460</guid>
      <dc:creator>Ying_H_Intel</dc:creator>
      <dc:date>2017-07-11T08:20:00Z</dc:date>
    </item>
  </channel>
</rss>

